Quick takeaways
- Event-driven architecture (EDA) is powerful but tricky – it’s great for scaling and decoupling, but has many hidden traps.
- Observability is essential – debugging async systems without tracing, logs, and correlation IDs is almost impossible.
- Use the outbox pattern – it’s the safest way to publish events without losing data.
- Design events carefully – large, generic events can lead to tight coupling and painful refactors.
- Avoid over-engineering – sometimes synchronous systems or simple monoliths are just better.
- Start with sync if unsure – it’s easier to migrate from a well-structured synchronous system to async later than the other way around.
Introduction
In this episode of No Silver Bullet, we dive deep into the real-world challenges of working with event-driven architecture.
We share hard-learned lessons from building distributed systems with events, covering pitfalls like dropped messages, debugging, eventual consistency, and designing events.
We talk about the trade-offs, share practical advice, and help you decide when EDA is actually worth the complexity.
Show Notes
- Go Event-Driven Training
- Go in One Evening Training
- Watermill
- Duplicator Middleware
- Blog posts
- Wild Workouts example project
- pq - poison queue CLI
- watermill-sql
- Outbox example
- Debezium
- RabbitMQ
- Kafka
Quotes
If you have a big event that 10 teams depend on, you can’t just say, okay guys, we will replace this event with another one or with 10 other ones.
If you feel like you need distributed transactions between three services, maybe you should just merge them into one.
The good news here is that message brokers are usually designed so that you don’t lose the message, because it’s quite critical for you if you use EDA.
I heard horror stories about sending 10 million SMSes to one person within 10 minutes.
The point here is that we don’t accept inconsistency, we just accept that it happens a bit later. And it can be also difficult to explain to your product peers or your boss or whoever.
Unfortunately, you cannot test every edge case that you can have in your application. I mean, you can try, but in reality, there’s never enough time, as long as you’re not building really critical services.
Timestamps
- 00:00:00 - Introduction
- 00:02:53 - Events and messages
- 00:05:30 - Debugging and observability
- 00:11:11 - Missing events
- 00:19:40 - Eventual consistency
- 00:27:28 - Outbox pattern
- 00:31:37 - Designing events
- 00:39:14 - At-least-once delivery
- 00:47:47 - Dead letter queues and alerting
- 00:59:39 - Distributed transactions and sagas
- 01:08:49 - When not to use EDA
- 01:18:48 - Q&A
Transcript
Miłosz [0:00]: Event-driven architecture sounds great. You get better scaling, loose coupling, resilient systems.
Miłosz [0:06]: But it’s difficult to get right and can turn into over-engineering. Today we talk about the tough parts that make you wonder if going async was actually a good idea. I’m Miłosz.
Robert [0:18]: And I’m Robert, and this is No Silver Bullet Live Podcast, where we discuss mindful backend engineering. We’ve spent almost 20 years working together across different projects and teams. And we learned that following advice like always do X or never do Y doesn’t work and can limit your growth. In this show, we share multiple perspectives that will help you make smart choices and grow toward the principal engineer level.
Miłosz [0:45]: If you have any questions or comments, you can leave them in the chat. We’ll pick them up during the discussion, or at the end we will also have a Q&A
Miłosz [0:55]: session where we can discuss all of it.
Robert [0:57]: Exactly. And today’s episode is kind of a continuation of the previous one, so if you didn’t have a chance to listen to it, it’s highly recommended, but not required. In the previous episode, we discussed the differences between async and sync architecture. Today, we’ll focus on one implementation of async architecture: event-driven architecture. And we will take a look at the challenges of applying event-driven architecture to your applications. We’ve had a chance to work with multiple event-driven applications, so we know when it works nicely and when it doesn’t. It should be helpful for you to decide if event-driven architecture is for you.
Miłosz [1:45]: In the previous episode, we assumed synchronous architecture is the default approach in many projects, and we focused on some tips for when moving to async is a good idea. Today we focus on the challenges, because there are many with EDA. So maybe I’ll start with a very quick TLDR of what event-driven architecture is, which won’t be a complete definition, but just so we are on the same page. Event-driven architecture is based on events, obviously, and an event is some state change in your system. The idea is that your systems publish events after something happens, something changes in the system, and then other services or systems react to it and can publish their own events. This is how they communicate, in contrast to calling each other in more direct ways. This architecture is often built on top of messages, and that’s what we talk about today.
Robert [3:00]: It’s probably also worth mentioning the difference between a message and an event: a message is the way you transport events. So in most cases, the event is the payload inside the message. If you are writing Go, for example, the message is something like an http.Request, and the event is part of the body, let’s say. That’s the analogy here.
Miłosz [3:26]: So an event is a kind of message, a very specific message that says that something already happened, some fact happened, and then whoever listens is free to react as they see fit.
Robert [3:42]: Okay, but let’s get to the challenges, because I think every nice technique has some challenges. Yeah, this is the No Silver Bullet podcast, after all.
Miłosz [3:56]: There are definitely many challenges when it comes to event-driven architecture. The first one can be debugging issues and the fear of losing events. I actually have a quote from a Reddit comment we received under the previous episode. The commenter said: Stuff comes in, doesn’t come out. Things appear that you cannot explain where they came from. The amount of telemetry you have to add on your own to make those things sane is absolutely ridiculous in 2025.
Robert [4:31]: Yeah, it sounds like a person who has maybe seen some people misusing this kind of architecture.
Miłosz [4:37]: You can clearly see someone who has struggled with this kind of approach before. But I would also say this critique is kind of fair. This can definitely happen. So it’s good to be aware of it and prepare.
Robert [4:55]: And I think it’s a common theme. When we were discussing different techniques in previous episodes, people were always complaining about some of them, often in places where they were misused. Using some technique for some problems may just be too sophisticated, let’s say. And if you are not solving a real problem with a technique, you will just have more problems in the end.
Miłosz [5:27]: But even if you pick the right tool for the job, you can still have this issue. If you just call an HTTP endpoint, it will end up with some status code. If the endpoint doesn’t exist, you will get a 404 and right away know what’s going on. With events, this is the hard part: if something goes wrong, you publish an event, and the other end just doesn’t react to it. You don’t know what happened.
Robert [6:01]: It’s kind of similar to using CI and tests. If you’re using CI and tests, you may complain that you need to write tests, that they’re flaky, and that you have to maintain them. And yeah, it’s true. But maybe the question is whether you have the right tests; maybe you should remove some of them. It’s all about cost versus return on investment, basically.
Miłosz [6:25]: Yeah. The good news here is that message brokers are usually designed so that you don’t lose messages, because that’s quite critical if you use EDA. I guess you could compare it to databases: you don’t see people worried about an SQL database losing data, because it was designed to keep it, right?
Robert [6:55]: And usually, if somebody is complaining that a message broker is losing messages, it’s probably not a problem with the message broker. It’s similar to people complaining, oh, I think it’s a kernel bug or a programming language bug. In 99% of cases it’s not. Sometimes it is, but I would not start by assuming that’s the case.
Miłosz [7:19]: This reminds me of the chapter in The Pragmatic Programmer called “select” isn’t broken. It’s exactly about this issue. So yeah, if you use the message broker the right way, you shouldn’t have this problem. Of course, it’s not always trivial to configure everything, but the good news is it’s designed to help you not lose any messages. Still, some weird stuff happens. I have a quite recent example of this, where we had been running services on staging and some events went missing, just randomly. It was super frustrating, because we had proper telemetry, tracing, and an event log, and we just couldn’t figure out why a single event didn’t arrive at the destination. And yeah, the intuition here is to blame the software, right? We were using RabbitMQ at the time, and we started wondering how it could just randomly reject some messages.
Miłosz [8:34]: And of course, this wasn’t the reason. The reason was that our frontend team was running our services locally, and they used the staging RabbitMQ address to do it. So, unknowingly, they were consuming our messages, and the handlers were making changes in their local databases instead of the staging database.
Robert [8:57]: So, lesson learned: having access to a non-development environment from a local environment is always a great idea.
Miłosz [9:05]: Yeah, that’s one lesson learned for sure. And the other is: RabbitMQ probably works. So if you see some very weird issue with an event not arriving, look into your code first, or, in this case, into who uses your code. Still, it doesn’t need to be super complex to have this observability in place. If you have a tracing setup, that’s probably the best, because you can see the entire flow through the system. But I would say even a simple correlation ID in logs can help a lot. You can just trace which services received the event you published.
Robert [9:53]: And I assume that if you are thinking about using event-driven architecture, you probably have some kind of distributed architecture under the hood. And if you do, you should already have some observability in place for it: some tracing, or at least proper support for correlation IDs. If you already support it for other transports like HTTP or gRPC, adding it for messages shouldn’t be hard.
Miłosz [10:28]: It’s also often skipped. Observability is not a super interesting topic; if you’re building a product, it’s often an afterthought. And then you’re already in trouble if you have no tooling in place when you start using events and need to debug something. That would be a terrible position to be in. So, definitely start with this.
Robert [10:57]: There is actually a question on the chat. Just a reminder: if you have any questions, please drop them in the chat. If not now, we’ll answer all of them at the end. But I think this one is relevant: how do you build your testing to catch dropped events like this?
Robert [11:15]: Basically, in this case, I would say that what’s most helpful for us is using component tests. They’re a bit like end-to-end tests, but just on the level of your service, and they’re pretty useful for that. In other words, you call some endpoint, you assume that something is processed, and at the end there should be some output from the system showing that it was processed. The bad thing is that usually when you lose events, there’s only a small chance of it happening: it’s caused by some misconfiguration, some edge case, or a lost connection, so it’s a bit harder to test. In this case, what’s useful is using a library that already has tests for this and ensures that no events are missed. Of course, you can still have some misconfiguration, and in most cases, unfortunately, you might only notice it on production and need to fix it afterwards. But, for example, in Watermill, the library that we’re developing, we have a lot of tests that check edge cases like disconnects or strange negative acknowledgements, to ensure that no message is lost. And so far it works, because we just have a lot of tests for that.
Miłosz [12:42]: Regarding component tests, it’s important to use the same configuration you use in production, the exact same setup. It may be tempting to have something much simpler in tests, so it’s easier to start. But then you don’t test the entire topology, which can sometimes be quite complex, especially if you have something like a single event topic that routes events to a data lake or an event log. It can be useful to do it that way for all services, but you should do it the same way in the tests; otherwise you’re not sure you are testing the right thing.
Robert [13:30]: But in general, if you are using a library like Watermill, it should abstract this away from you, and it’s hard to lose a message, as long as you don’t do things like acknowledging a message before it’s actually processed and just moving on. I would say it’s mostly about being careful about where you acknowledge the message you are processing.
Miłosz [13:51]: Yeah, but testing is also not everything. You need tooling to debug production issues, because for an issue like another team consuming your events by mistake, there’s no way you can write a test for it locally or in CI.
Robert [14:08]: I think it’s not only the case for asynchronous architecture, because it can happen with anything, basically. If someone is interacting with your staging, there’s no way you can test for that, and this is the reality.
Miłosz [14:21]: Yeah, but I think it’s easier to see HTTP requests, right? Because you usually have logs on both sides. In this case, you see the event is published, but you don’t see the other end at all. So it might be a bit more difficult to debug.
Robert [14:41]: That’s true, that’s true. But yeah, TLDR: unfortunately, you cannot test every edge case that you can have in your application. I mean, you can try, but in reality, there’s never enough time, as long as you’re not building really critical services. You can spend an insane amount of time on ensuring that. But, yeah, it’s…
Miłosz [15:05]: Yeah, so to reiterate: use observability from the start. Add a simple correlation ID to logs, and tracing if you can. That will help a lot in debugging. And also, don’t reinvent the wheel when it comes to brokers. For sure, don’t implement your own queue from scratch; probably not a good idea, similar to not developing your own database. Use proven patterns, brokers, and libraries that are tested and work in production.
Robert [15:47]: And since we’re mentioning patterns, today we’ll also cover the outbox pattern. In this context it’s probably the most important one, because not using the outbox pattern is a super common issue, and it’s probably one of the simplest ways to miss events. But we’ll get to that.
Miłosz [16:06]: Good event design can also help here. The more complex your workflow, the more difficult it will be to debug when anything goes wrong. So if you can keep your design simple and easy to grasp, it will also be much easier to debug any issues. Let’s say you have a single event that’s consumed by tens of subscribers, instead of smaller events with fewer consumers each; it might be more difficult to understand why some issues are happening.
Robert [16:55]: And you also mentioned that some people are kind of afraid of losing messages. I think it’s also important that if you are building an asynchronous architecture, and it doesn’t need to be event-driven, it can also be message-driven, you should be able to trust it. I mean, you should trust that if you publish some event, the event is later consumed. Because I’ve heard multiple times that people didn’t really trust the infrastructure they had for that. They feel like, okay, I’m emitting some event, but I’m not really sure if it will be consumed later. I would say that this is a red flag. It should be as stable as your database: when you’re doing an SQL query, you assume the database basically works, and this should also be the case for your event-driven or message-driven applications.
Miłosz [17:50]: I wouldn’t like to work with a system that no one trusts to get the job done.
Robert [17:57]: So, the next thing that I would say is also a pretty big challenge is changing the mindset of how you build your system. Because if you build your system in a synchronous way, as somebody mentioned in the chat, it’s also kind of more straightforward. You do some operation, you store something in the database, some external service is called, and you know that it’s done. When you’re building an event-driven application, there are some trade-offs.
Robert [18:32]: There’s a use case that we discussed on the previous slide. In this use case, for example, we are registering a user and sending some data to an external system. User registration is usually a critical part of the application, and if we are sending data to an external system, we cannot always trust that the system will be up, especially if we have a lot of traffic, for example during some promotion. Let’s imagine that we have 1,000 users registering in one minute, or even more, and it doesn’t work because some third-party system cannot handle that. It can happen even at not-so-big scale. On the other hand, it may also be important to send the data to this external system. So a nice solution may be using event-driven architecture here, because you can send this data asynchronously. But the downside is that the data that goes to that system may not always be immediately consistent. It may sometimes happen that there is some bug, and instead of the data being stored in that system synchronously, you’ll need to wait a while before it’s stored there.
Miłosz [19:39]: And at first it seems like a big issue, because how can I live without transactions? What do I do now?
Robert [19:46]: But I would say that it’s important to think about the trade-offs here. Because what would the trade-off be? If we did it within the request and it’s delayed, it probably means that something is wrong.
Robert [19:57]: And if something is wrong, the request will probably just not work. So it’s much better to have the request working: your customers will not experience any degradation of your service, everybody will be happy, and the data will just be delayed. This is the trade-off. And this is also something we often hear when discussing it with business people. If you ask them whether this data needs to be consistent, I’m sure that in 90% of cases the person will say: yes, the data needs to be always consistent, because why shouldn’t it be? What will happen if it’s delayed?
Miłosz [20:35]: Yeah, it’s like asking if it should be high quality.
Robert [20:38]: Of course Yes.
Miłosz [20:39]: It needs to be.
Robert [20:40]: Yeah: boss, can we make bugs?
Miłosz [20:42]: What am I paying for?
Robert [20:44]: And, yeah, obviously, like you said, with quality, we can always write the highest quality code. But is it worth it? Writing the highest quality code will cost more time. Maybe we don’t have that time. Maybe it’s fine to cut some corners and deliver faster. And it’s the same with consistency: sometimes it’s better for something to be delayed than to not happen at all.
Miłosz [21:11]: So this is the eventual consistency concept from event-driven architecture. The point here is that we don’t accept inconsistency, we just accept that consistency happens a bit later.
Robert [21:25]: Eventually.
Miłosz [21:26]: Eventually, yeah. So it’s an important point, and it can also be difficult to explain to your product peers or your boss or whoever. Normally this inconsistency window is super small, smaller than a second, but sometimes it can be longer. What’s surprising is that very often it’s more acceptable than you would think. You can have inconsistent systems in reality, sometimes even for hours, and it’s completely fine. So this is more a product decision than a purely technical decision. Something you have to discuss with your stakeholders, basically.
Robert [22:15]: Yeah, and I think it’s always good to show the trade-offs. Okay, it can be delayed, but what if, for example, the system is down? What do we prefer? Maybe it’s fine that it’s a bit delayed, if it doesn’t stop the entire user journey. You can also look at how it works on bigger social media platforms. If you post something to X, Facebook, Instagram, whatever, you see that the post was posted, but it probably takes a couple of seconds before it shows up on other people’s walls. That’s because adding it everywhere synchronously would just be super slow; it probably needs to happen in multiple regions, and doing it synchronously would sometimes fail. It’s super rare that you add a post somewhere and see an error; maybe errors happen under the hood. For example, maybe in some region it’s added after five minutes, maybe even after some hours, you’d never know. You don’t see it in the UI, because it’s just eventually consistent.
Miłosz [23:21]: It’s similar when changing your avatar or something like this. You can often see that some components on the website are updated and others aren’t. And not only in the technical sense, you can see it in real life, in how some systems work. Even bank transfers, which seem like something that’s always consistent, right, because you have the ledger. Traditional transfers are not immediate. So it’s a very similar concept.
Robert [24:00]: Yeah, but it’s important to keep in mind. If you are discussing it with some product owner, some product person, it’s important to show the trade-off and the UX implications. Because if you add something to your system, it may not be visible immediately for users. So it’s important to keep that in mind, and sometimes it’s worth asking whether it’s really fine for it to be eventually consistent. We already mentioned it in the previous episode, but we are big fans of server-sent events; it’s a really cool technology for handling this kind of update. We’ll also link in the episode materials our blog post showing how to implement it in a production-grade environment.
Miłosz [24:52]: So maybe another challenge here is that you have to adjust your UI for these kinds of errors. Something we also mentioned in the previous episode.
Robert [25:03]: The good thing is that there is also a middle ground here. Let’s imagine that we are creating a system that stores some data and maybe needs to synchronize it with external systems. There is, fortunately, a middle ground that people often miss. I have an idea why: to do it properly you need to know a couple of techniques, but on the other hand, it’s not rocket science, especially once you’ve done it before. Because when you’re making some changes, you can still store them in your database locally, in a transaction, and also emit an event. It has one challenge, but it’s nice from the UX perspective, because you can write your data to your source-of-truth database, for example, the database of your service, and also emit an event. And it’s nice because if the operation succeeded, you can show it to the user immediately, and you don’t need to do any magic with server-sent events. It will just be there.
Robert [26:05]: But there is one problem with this approach, because, as I said, it assumes that we store something in the database and then emit an event. What if something happens in the middle? For example, we stored our data in the database, and our service died before sending the event. And before somebody says, oh no, the chance of that is super small: trust me, even for low-scale systems, it happens more often than you’d think. You may also say, oh, we have graceful shutdowns. Yes, as long as, for example, your service doesn’t go out of memory, or you don’t have some bug that prevents it from shutting down gracefully.
Miłosz [26:49]: Or the network dies, which is super common.
Robert [26:53]: And you cannot do much about that. So it’s actually an interesting challenge to think about. Should we store the data in the database first and then emit the event to the message broker, or vice versa: emit the event first and then store the data in the database?
Miłosz [27:10]: Or call the publisher inside the transaction.
Robert [27:14]: Or maybe in a separate thread, and try to synchronize that.
Miłosz [27:18]: Yeah, so basically there is almost no good way out here.
Robert [27:23]: In every situation, we’ll end up with inconsistency, basically.
Miłosz [27:28]: Yeah, but there’s an outbox pattern that you mentioned before, which is a nice way out of this.
Robert [27:35]: Yeah, and it’s also pretty cool in its simplicity. The idea of the outbox pattern is that the event you would like to emit is stored in the same database where you store your data. For example, let’s imagine that your source-of-truth database is Postgres. Within one transaction, you store both the data and the event that you would like to emit. And later, you just stream those events from the events table to your message broker. That’s it, basically.
Miłosz [28:18]: And the streaming happens in the background, so it’s outside the main transaction.
Robert [28:22]: So it’s still eventually consistent, but, well, like everything with event-driven architecture, everything is eventually consistent. It’s obviously a bit slower, because there is one extra operation in the middle: you need to stream the events from your Postgres to your message broker. But, well, trade-offs.
Miłosz [28:43]: The difficult part is that streaming from databases is not the easiest thing to do. There are some patterns for how to do it, but still, it’s easy to make a mistake. For example, Watermill SQL supports it, but it took us three major versions to get it right and avoid some bugs. So it’s another thing you probably don’t want to write from scratch. Probably better to use something that’s proven.
Robert [29:17]: If you are using Go, use Watermill for this, because Watermill supports an outbox implementation out of the box.
Miłosz [29:23]: There’s also a tool called Debezium that we used some time ago. It’s also quite difficult to configure, so there are no easy answers here.
Robert [29:34]: You put it in nice words. It was a nightmare.
Miłosz [29:41]: So once again, it’s something you don’t want to reinvent, probably, unless you have a lot of time and can write lots of tests and test it on production properly.
Robert [29:53]: And there are two alternative approaches here that I have in mind. The first one is that in some cases you can actually store the data within the transaction and emit the event afterwards, but only when it’s not a problem to lose the occasional event. For example, when we are storing some telemetry or other continuous data.
Miłosz [30:18]: Yeah, why you don’t care about losing the message?
Robert [30:21]: Yeah, so in the worst case, you may lose one data point, and it’s fine, because a couple of seconds later you’ll have another one. And if you care about performance, that can be totally fine. The second interesting approach, if you’d like to avoid streaming from your database to the Pub/Sub, is to just use your database as the Pub/Sub. For example, in Watermill we have Pub/Sub implementations for SQL (MySQL, Postgres) and Firestore. So it can be done, but it’s also a kind of specific implementation. I would say that by default you should use the outbox and stream to your message broker. And I think we can link the source code of Watermill SQL in the episode notes. If you’re not writing in Go, you can take inspiration. Ask AI: please rewrite this code in a way that it will not be copyrighted.
Miłosz [31:29]: If you don’t get scared of these SQL queries.
Robert [31:33]: Yeah, but don’t do it yourself. So there are a couple of queries there.
Miłosz [31:38]: Okay, let’s move to the next challenge, which is designing events.
Miłosz [31:44]: This is something that looks quite easy, or seems easy, because it’s more soft than implementing some technical stuff. But it’s also quite difficult to get right, especially because there are no good universal answers here. For example, we often get asked what’s better: should I have one big event or several smaller events? Like the difference between user updated and having something like user password changed, user email changed, and so on. And there is no good answer to this, because there are many trade-offs either way you go. So small events versus big events, or generic versus specific: in this kind of discussion, there’s just no one-size-fits-all solution.
Robert [32:45]: But I think there’s one case where it’s a pretty bad idea: having an event where you just put an entire entity from the database. We’ve seen it in one system, and it was actually pretty interesting archaeology. When we joined the project, the event already had like 50 fields. It was crazy big, and it was super hard to mock and super hard to change. But a couple of years earlier, when I checked the Git history, it was a pretty small event. It had maybe four fields or something like that. But it was the same model as was stored in the database, and it kept growing, growing, growing. Like a boiling frog, basically.
Miłosz [33:25]: So you can imagine something like user updated with 50 fields in the event with all the details of the user.
Robert [33:33]: And even if you feel that at the beginning it’s not a big issue, maybe. I mean, you should also watch out not to overcomplicate things from day one.
Miłosz [33:41]: Exactly, that’s the hard part, right? Because we say this is an anti-pattern, but let’s say someone has a small project they write themselves. Then having an event like this may be good enough.
Robert [33:56]: Yeah, but later, let’s imagine that the company is still there, and you have like 10 teams, 50 people, and you have no idea who’s using that event, or you would like to remove some of the fields: almost impossible. And again, it just started with four fields and putting the database model directly into the event. So try to find some balance, but watch out for the extremes it can…
Miłosz [34:25]: So this is what I mentioned before regarding debugging on production. If we have this one huge event that everyone consumes, maybe it’s more difficult to trace it and debug it, in contrast to smaller events. But this is just one heuristic. So there’s a question on the chat: what do you prefer, granular or more composed? I’m not sure I can give one answer here.
Robert [34:57]: Yeah, but I think we can risk the statement that we usually have more problems with the bigger, less granular ones. It’s probably easier to join events later than to split them, basically.
Miłosz [35:18]: Yeah, because it’s easier to get tight coupling with bigger events: many systems start depending on them, and then you can just... Yeah, you talk about a refactor, and it’s not always that easy, exactly because of this reason. If you have a big event that 10 teams depend on, you can’t just say, okay guys, we will replace this event with another one or with 10 other ones.
Robert [35:51]: And it’s even worse if one of the teams that depends on it is the data science team. Then it’s impossible. Sorry.
Miłosz [36:02]: Yeah, a refactor is easy if this is an internal event your service uses. That’s cool. But if it becomes a contract with other teams, yeah, that’s much more difficult. So probably if you have more teams, it makes sense to start smaller, so at least you have a small contract. It’s maybe similar to HTTP API design: you probably want smaller methods rather than one huge update method. It also depends on the API, but in terms of the contract, it’s easier to track who is using each method than how people use this one golden method that does everything.
Robert [36:52]: And I think a good heuristic to see if your events are too granular is when you need to do some kind of aggregation. So, to do something, you need to listen to two events and do some magic: okay, I finally received both events and now I can do the operation. That may be a sign that it’s too granular. So that’s the challenge.
Miłosz [37:19]: I would say the challenge is this lack of good universal answers. Also, it’s not something very technical; you can’t just get better at it by reading a book. Most often you want to have discussions or event-storming sessions with other people in the team or in the company, with your stakeholders, to understand how to model it. So that can also be challenging.
Robert [37:48]: Also, watch out not to discuss it for weeks. At some point it would probably be good to record an episode about trust in engineering, because spending weeks discussing something is another extreme, one that doesn’t help build trust in engineering.
Miłosz [38:10]: That’s also a good point. It’s easy to become a purist here as well and just decide: we are not moving forward until we decide exactly how this event should be shaped. That’s probably also an extreme you want to avoid.
Robert [38:29]: To summarize, I think one of the biggest challenges here is decoupling. Decoupling is also a downside, because everything is so decoupled with event-driven architecture that sometimes it’s hard to change something downstream.
Miłosz [38:46]: Or check how it’s used.
Robert [38:48]: Yeah.
Miłosz [38:49]: Especially... This is what you mentioned before, in contrast with HTTP, where you see the logs and you know who called it. If you use events, it might be more difficult if you don’t have the tooling in place.
Robert [39:06]: Agreed. Okay, so the next very common challenge that you can have when you are building event-driven systems, especially when you haven’t worked with them before, is at-least-once delivery. This is, again, connected to how most message brokers are built. The TL;DR is that most message brokers offer you at-least-once delivery semantics, which means that you will receive every message at least once. It’s not possible that you won’t receive the message, but you can receive it multiple times. This is connected to the problem that we were discussing a bit earlier with the outbox. What happens if your consumer loses the network connection in the middle of processing a message? It’s processing the message, the connection drops, somebody cut the fiber cable between data centers. Or DNS.
Miłosz [40:13]: It’s always a DNS.
Robert [40:16]: And what should the message broker do? It doesn’t know if you processed the message: maybe you already stored the data and committed the transaction, or maybe not yet. And in most cases the assumption is that you didn’t process it, so the message is redelivered.
Miłosz [40:35]: Remember, message brokers are designed so you don’t lose the message, which is actually a good thing.
Robert [40:39]: Yeah, the downside is that you need to keep it in mind. If messages can be redelivered, you need to handle the situation when a message is delivered twice. The good thing is that it’s not rocket science. For example, if you’re creating a user, you just check if the user you are creating already exists. It’s good to have some identity that you can match and deduplicate by. Or sometimes you can have some deduplication ID. But you need to keep it in mind.
Miłosz [41:10]: Maybe let’s just reiterate why this is an issue. The event that creates the user gets duplicated, so let’s say it’s executed twice, right? The first message will create the user, and the second one, the second delivery of the same message, tries to insert the user into the database and ends up with some kind of duplicate key error on the SQL side. And this error will cause the message to go back to the queue, and it effectively blocks your queue.
Robert [41:47]: And that’s probably the best case, because you can notice this error and fix it, maybe just by adding ON CONFLICT DO NOTHING to your query, and it’s fine.
Miłosz [41:58]: Yeah, sometimes it can have worse consequences. The message can be delivered over and over again, usually within milliseconds. So if the handler does something like sending the user a notification or a text message, and for some reason it’s not stopped, you can do it thousands of times.
Robert [42:26]: Yeah, I heard horror stories about sending 10 millions of SMSs to one person within 10 minutes.
Miłosz [42:35]: Yeah. So that can be tricky. But more often you get an error you can detect and fix.
Robert [42:44]: I think there is also one case that’s worse and hard to detect: when you have some counters or something like that. When you have a counter and a message is redelivered, it’s a pretty subtle error. It’s hard to detect, which connects to the question on the chat earlier about how to detect such things. When the counter is not implemented properly, maybe only at the end someone will notice: I’m charged much more than I should be, for example.
Miłosz [43:20]: So this is a worse scenario where you don’t know that an issue occurred. You’ll discover it much later when it’s too late.
Robert [43:29]: But for that we actually have a pretty good tactic. When you have a system that is prone to this kind of issue, you can use, for example in Watermill, the duplicator middleware. This is a middleware that delivers every message twice. So every message that you receive in your local environment will be delivered twice. And it’s fine when your system is not super heavy and you are able to run tests with that, because then you are just 100% sure that everything is processed properly, because this is basically how it should work: your system should work exactly the same whether a message is delivered once or twice.
Miłosz [44:10]: It’s a bit like chaos engineering, where you introduce some issues into the system on purpose.
Robert [44:18]: The downside is that if your operations are heavier, you probably need to have more runners in your CI or more powerful laptops.
Miłosz [44:27]: Or you could just do it in the tests.
Robert [44:31]: Or you do it just for some handlers, that’s also an option. It’s probably not worth doing it for everything; it depends on the system, of course, but doing it for everything may be overkill. But if you’re working in some financial domain, like we did, you just have this middleware enabled and you’re handling every message twice. If you’re crazy, you can even do it on production.
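The duplicator idea can be sketched in plain Go. To keep it self-contained, the `Message` and `HandlerFunc` types below are simplified stand-ins, not Watermill’s actual API; the real middleware wraps Watermill’s handler type, but the mechanism is the same: every message is handed to the handler twice, so non-idempotent handlers fail fast in tests.

```go
package main

import "fmt"

// Message is a minimal stand-in for a broker message (not Watermill's real type).
type Message struct {
	ID      string
	Payload []byte
}

// HandlerFunc processes a single message.
type HandlerFunc func(msg Message) error

// Duplicator delivers every message to the wrapped handler twice,
// simulating at-least-once redelivery on every single message.
func Duplicator(h HandlerFunc) HandlerFunc {
	return func(msg Message) error {
		if err := h(msg); err != nil {
			return err
		}
		// Second, deliberate redelivery of the same message.
		return h(msg)
	}
}

func main() {
	calls := 0
	handler := Duplicator(func(msg Message) error {
		calls++
		return nil
	})

	if err := handler(Message{ID: "msg-1"}); err != nil {
		panic(err)
	}
	fmt.Println(calls) // 2: the handler ran twice for one delivery
}
```

If a handler is idempotent, running it under this middleware changes nothing observable; if it isn’t, the double delivery surfaces the bug locally instead of on production.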
Miłosz [44:57]: Yeah, it should actually be safe to do. So the solution is idempotent handlers: handlers that you can safely execute twice or more times with the same input, and they won’t fail. In this example of the user being added, you just add ON CONFLICT DO NOTHING, or you somehow ignore the error. And I know many people, when they hear about this for the first time, try very hard to figure out another solution, because the idea of writing handlers like this is, I don’t know, annoying, maybe? They don’t like to do it. But the truth is, it’s not that complicated most of the time, right?
Robert [45:51]: Yeah, if you’re just inserting stuff, it’s easy. It’s a bit harder with counters, because sometimes you need a table with the deduplication ID, so you store some message ID or operation ID and deduplicate by that. One scenario where it’s harder is when you are calling some external system, for example when you are charging customers. But in most cases, when you are using some sane system, it should have some idempotency key or something like that. In this case, you just send this idempotency key and it’s deduplicated on the other side. I think it shows nicely that even a system that is not event-driven often offers some kind of deduplication, because it’s a very similar situation.
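The deduplication-table idea for counters can be sketched like this. In a real system the "seen" set would be a database table with a unique constraint on the message ID, written in the same transaction as the counter update; the in-memory set and the names here are just illustrative stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduplicator tracks which message IDs were already processed.
// In production this would be a DB table with a unique constraint,
// updated in the same transaction as the handler's own changes.
type Deduplicator struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDeduplicator() *Deduplicator {
	return &Deduplicator{seen: map[string]bool{}}
}

// SeenBefore records the message ID and reports whether it was already processed.
func (d *Deduplicator) SeenBefore(messageID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[messageID] {
		return true
	}
	d.seen[messageID] = true
	return false
}

func main() {
	dedup := NewDeduplicator()
	counter := 0

	increment := func(messageID string) {
		if dedup.SeenBefore(messageID) {
			return // redelivery: skip, so the counter stays correct
		}
		counter++
	}

	increment("msg-1")
	increment("msg-1") // redelivered message is ignored
	increment("msg-2")
	fmt.Println(counter) // 2, not 3
}
```

Without the deduplication step, the redelivered "msg-1" would bump the counter a second time, which is exactly the subtle over-charging bug described above.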
Miłosz [46:39]: Yeah, payment systems try to work like this most often.
Robert [46:43]: Yeah, because it’s the same situation as with message brokers. You have an HTTP connection, it was interrupted before the HTTP status code was sent to you, and you don’t know whether the operation on the other side succeeded. It’s especially important in cases when you are, for example, charging people. Nobody would like to be charged twice. As long as you are not the person doing the charging.
Miłosz [47:08]: You can also see this idea in API design sometimes, if you have an HTTP handler that returns 200 or 201. You can differentiate by the response whether the entity has just been created or everything was already fine and there was nothing to do. It can also be helpful for handling this.
Robert [47:33]: That’s true.
Miłosz [47:38]: Okay. A related challenge here is messages that can’t be processed, similar to this error with the duplicated message. You can have some message that has been published by mistake, or can’t be unmarshaled because of some issue. Basically, it’s broken, and whatever you do, it won’t be processed. And most often it will block your queue, or this part of it.
Robert [48:15]: And I think it shows nicely the trade-off that we are making with asynchronous, event-driven architecture. With a normal HTTP endpoint it would not be an issue: it would just return 500 and that’s it, but you lose the resilience. Here resilience is important, but you need to do something about this message.
Miłosz [48:37]: Yeah. And most often you want to move it out of the queue, so acknowledge it. But you probably also don’t want to lose it entirely, because it might contain some important details you don’t want to lose. So you can move it to a separate queue, sometimes called a dead-letter queue or poison queue, where it just sits and waits for investigation. It sounds simple, but it’s also not trivial sometimes. I mean, publishing to a separate queue is very easy, but you need some tooling to review those messages and move them back to the main queue if needed. And I think there’s no universal tooling here, it depends on the Pub/Sub. Some Pub/Subs have an admin interface, like RabbitMQ, that lets you inspect messages. That’s one way to do it. But it can also be tricky. You can also build your own tooling for it. We recently added a CLI like this for Watermill, for the SQL queue. So it can be useful to have even some simple tooling that lets you review what’s there and has actions like delete or move back to the main queue.
Robert [50:04]: It’s also worth mentioning, because we’ve mentioned Watermill a couple of times, but if you are with us for the first time, you might not know: it’s not our product, it’s an open source library that we are sharing for free, and there are no enterprise editions. Everything that we mentioned is just free and open source. There was one episode, I think two episodes ago, where we shared more details about Watermill. So it’s definitely worth checking if you are thinking about creating your own open source library, because Watermill is pretty successful in the Go community. To give you some numbers, it has more than 8000 stars on GitHub and a lot of projects use it. So again, if you’re thinking about creating your own open source project that is useful for other people, it’s definitely worth checking.
Miłosz [50:55]: Yeah, and it can help you not reinvent the wheel, like this CLI, for example. It’s much easier to use something that’s ready than to figure out your own. I mean, maybe you can vibe code one, but I wouldn’t like to vibe code anything message-related. It’s kind of a risky area. If you lose a message, it might be permanent.
Robert [51:23]: Fortunately, we have a lot of tests, so it may be a pretty good case for that. Maybe some agent that runs our tests in the background. But it will have a lot of logs to process, so it will be expensive.
Miłosz [51:38]: Gabriel mentioned that it’s important to regularly check the dead-letter queue and analyze the reasons for the issues. One more challenge here is to have some alerting in place. You might have the dead-letter queue set up, but it won’t help if you don’t know that messages arrive there.
Robert [51:59]: I would actually say that I don’t fully agree, because I think you shouldn’t be checking your metrics; the alerting should let you know, because it’s just hard to check hundreds of metrics in the end.
Miłosz [52:13]: I mean, alerting, right?
Robert [52:16]: No, I mean, you should have alerting, but it frees you from checking the metrics.
Miłosz [52:26]: So what don’t you agree with?
Robert [52:29]: I would say that you shouldn’t check metrics. I mean, alerting should notify you about… Oh.
Miłosz [52:35]: Yeah, yeah. Yeah, sure. Basically, you need a way to know that there’s something in the dead-letter queue.
Robert [52:42]: Yeah, so it’s more that you shouldn’t be looking at your metrics like, oh, everything looks right, because at some point you just may have too many metrics.
Miłosz [52:52]: Ah, okay. I see now. You meant the regular check part. Yeah, yeah, okay. I see.
Robert [52:59]: Because probably someday it will happen that no one looks at those metrics and something slips through. So it’s better to have something that notifies you. And I assume that, yeah, Gabriel had this in mind. But I think it’s important to distinguish, because I’ve seen components that had metrics but didn’t have alerting. It’s better than nothing, but still, at some point you just may miss something.
Miłosz [53:25]: Unless you have someone looking at dashboards.
Robert [53:28]: You’re paying somebody just to sit with his head fixed, looking at the metrics all day.
Miłosz [53:34]: I know one company like that.
Robert [53:37]: What’s this company? Tell me later. All right. But I think there is one thing that can help us with these dead-letter queues and poison queues. Because, for example, I’m personally not that big a fan of poison queues and dead-letter queues: you need to move those messages back, and that’s a bit problematic sometimes. One solution here may be proper message ordering. If you do proper segregation of the messages, in other words, a bug for one entity will not block all other entities, you don’t need to put the message in a poison queue, because it’s isolated and it will just spin for maybe one person, one customer, some small part of the system. But if you’re ordering all the messages in one big queue, one spinning event can actually block the entire system. And this is a pretty bad situation; in that case you should have some kind of poison queue. But the problem with ordering is that it’s also about finding a good balance. You shouldn’t have one queue for everything, because throughput will be small.
Robert [54:59]: Because one message can block everything. On the other hand, the good thing is that if you just have one big queue, you don’t need to care about the proper order of events: you just process each event one by one. If you had everything unordered instead, it would be cool because nothing would block anything else. But on the other hand, if you have events like addUser and removeUser and they arrive out of order, the read models that you have may not be right.
Miłosz [55:30]: Or you need some special design to make it work.
Robert [55:33]: Yeah.
Miłosz [55:33]: Like versioning or whatever.
Robert [55:37]: Yeah. Again, we covered it more in the previous episode, but in my opinion, the most important thing here is to balance and try to have as narrow ordering as you can. So not one super big queue, but also not everything unordered. It’s hard to give one hint on how to do that, but you probably need to find the smallest unit that should be ordered and just have ordering within that. I think it’s also nicely compatible with the aggregate idea from Domain-Driven Design. At some point we’ll probably also do an episode about that, but basically the aggregate pattern from Domain-Driven Design is all about this: how to find the smallest transactional boundary. So it’s kind of connected.
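The "smallest unit that should be ordered" idea usually maps to a partition key. A sketch, assuming the entity ID is that unit: hashing it to pick a partition means all events for one entity stay ordered relative to each other, while other entities are processed independently. This mirrors how Kafka routes messages by message key; the function name here is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor picks a partition from the entity's ID. All events for the
// same entity land on the same partition, so they stay ordered relative
// to each other, while a "spinning" message for one entity only blocks
// its own partition, not the whole system.
func partitionFor(entityID string, partitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(entityID))
	return h.Sum32() % partitions
}

func main() {
	// Events for one user always map to the same partition...
	fmt.Println(partitionFor("user-42", 8) == partitionFor("user-42", 8)) // true

	// ...while other users may land on different partitions and are
	// processed independently.
	fmt.Println(partitionFor("user-42", 8), partitionFor("user-7", 8))
}
```

Choosing the key too broadly (e.g. one key for everything) recreates the single blocking queue; choosing it per event loses ordering entirely, so the key should match the aggregate-like boundary discussed above.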
Miłosz [56:33]: One challenge here is also to realize you need ordering at all. Because if you’re just starting out, you might not think about it, and you might just assume the order in which you publish messages is the same as the order in which you receive them, which is sometimes, but not always, the case. Okay, the next topic here is transactions again: distributed transactions.
Robert [57:03]: Our favorite ones.
Miłosz [57:05]: So, yeah, many people, when they start with distributed systems or event-driven architecture and split their operations into a few services, maybe initially using events, then realize they actually need this to be in a transaction. And the solution seems obvious: we need a transaction that spans multiple services. You just have to orchestrate those handlers to work together, and if one fails, you have to roll back the rest of the pipeline.
Robert [57:49]: Sounds like a great smart plan. What can go wrong?
Miłosz [57:52]: Exactly. If you are a smart engineer, you should be eager to implement this right away; it’s a sign you are a good programmer. But most of the time you should probably reconsider, because the complexity will be tragic in the long run. I don’t even know where to start, but maybe the easiest case to consider: let’s say you have three services that communicate through events, and you create this kind of distributed transaction or saga on top of them. And then the third one fails, so you roll back everything that happened before, but then the rollback fails. What do you do now?
Robert [58:43]: I have PTSD already.
Miłosz [58:44]: So you’re stuck with this inconsistency in the system that can’t be fixed automatically. You have to go manually into the database, or whatever storage you have, and fix it somehow.
Robert [58:59]: And later implement compensating transactions for everything. You spend an enormous amount of time on it.
Miłosz [59:07]: So this is something that can be useful in some very specific cases, but most of the time you probably don’t want to deal with it.
Robert [59:15]: Very specific probably means a super, super complicated system and many big teams.
Miłosz [59:22]: Probably it would be easier to just merge your three services into one and run a database transaction on top of it, rather than this thing. Or just embrace eventual consistency, like we said before, and just accept that the transaction doesn’t need to be here. Yeah.
Robert [59:42]: And again, it’s a different case when, for example, it’s some very complicated business process orchestrated between, say, three teams where every team is 20 people. In that case, okay, it totally makes sense: it’s a complicated process, each service is probably very complex, and you have some orchestration at a very high level. But if your team is 5 or 10 people and you are doing such a complicated distributed transaction, maybe ask yourself: could it just be one service? And I know that it’s not sexy to have fewer microservices, but at least it will save a lot of time on complexity.
Robert [1:00:27]: And, for example, we have a pretty nice exercise about that in our Go Event-Driven training. When we introduce sagas, we don’t start with implementing a saga. We actually give an exercise with services that already have a saga implementation, and the task is to remove the saga, just to show how much it can simplify everything, and how much complexity was added there just because somebody decided to do it as a distributed transaction. So I think when you’re using this kind of tool, like a saga, that can create a lot of complexity, it’s important to first know how you can hurt yourself with it, and only then implement it. Not the other way around, because it’s very easy to start using it, but implementing it properly takes a lot of time, and it needs to pay off.
Miłosz [1:01:25]: Yeah, so watch out for too complex architecture.
Robert [1:01:33]: We’ve already covered the topic of testing a bit, so obviously it’s a bit different than testing synchronous endpoints. We already mentioned that you need to take eventual consistency into consideration, and keep in mind that messages might be delivered more than once. The good thing is that you can have, for example, the duplicator middleware that duplicates everything, but it also makes testing a bit harder. It’s probably more similar to the problems you have on the frontend: you click something and you need to check in some interval whether something happened. Here it’s similar: you call some endpoint, you get a response that it was accepted, and you need to poll for the data that is changing under the hood. Sometimes you can listen for events instead, but that’s still subscribing and waiting until it happens. Obviously, it adds some complexity. But it’s not rocket science; once you’ve done it, you can repeat it.
Miłosz [1:02:39]: Yeah, experience helps. And also, if you create some helper, a test framework or test library that makes it easy to publish and subscribe to events, it’s much more fun to work with.
Robert [1:02:54]: From our experience, component tests are nicely compatible with this approach. We have an article about how you can do component testing, so we’ll link it in the materials of this episode. Because we know that a lot of people don’t know that something like component tests exists. TL;DR, it’s something between end-to-end and integration tests, and it fits this case pretty nicely.
Miłosz [1:03:19]: And it lets you test the broker in the same way you run it on production, which is important. Sometimes you don’t have access to the exact same broker you run in production, and that can be challenging. If you use a cloud-based Pub/Sub, you just can’t run it locally, right? You can use emulators, which are often good enough, but some edge cases might just not be supported. That’s the hard part here.
Robert [1:03:52]: But maybe you don’t need those edge cases, on the other hand.
Miłosz [1:03:55]: Yeah, if not, then it’s fine. But in general, it’s one more moving part in the local and CI environments, so it can be a bit tricky. On the other hand, message brokers are usually not super heavyweight, maybe except for Kafka, which is often fun to set up in Docker.
Robert [1:04:20]: And you need a lot of memory for the JVM heap.
Miłosz [1:04:24]: Yeah, but if you use something more lightweight, Postgres maybe, or RabbitMQ is pretty lightweight too, Redis is super lightweight. So it doesn’t have to be painful.
Robert [1:04:39]: Also, when you are creating this kind of test, unfortunately it’s easier to create a flaky test, because things don’t happen immediately. If you’re polling for something, there’s a bigger chance of flaky tests, basically. In this case, it’s important to have good observability in place. The things we discussed at the beginning, tracing and logging, are super important, and so is ensuring that correlation IDs are propagated properly. It’s also worth logging the correlation ID in tests, so it’s easier to correlate a test failure with the logs you have from Docker containers.
Miłosz [1:05:24]: Oh, yeah. Debugging those tests can be fun as well.
Robert [1:05:28]: But if you have good logs and correlation IDs are propagated properly, it’s much, much easier. A life tip: sometimes it also works to just take all those logs, paste them into some AI, and ask. It can really help when you have a super big wall of text.
Miłosz [1:05:47]: Yeah, we need to integrate it as an agent.
Robert [1:05:51]: Enter Text…
Miłosz [1:05:54]: You don’t see the logs in the output, just a single summary. It will just cost you a lot.
Robert [1:06:00]: I run me off.
Miłosz [1:06:04]: Coming back to observability, maybe one more thing: we talked about alerting a bit for the dead-letter queue, but another super important thing to monitor and have alerts on is the number of unacknowledged messages, or the queue length. Because sometimes one handler might just quietly keep spinning, and you don’t see the impact anywhere in the system. And then later you look at the logs and, oh my god, there are 200,000 messages waiting to be processed.
Robert [1:06:42]: And also Gabriel put the follow-up comment that he hates systems which are like a fridge.
Miłosz [1:06:49]: I need this on a t-shirt. For context, this is about a system that you need to open to check it. So it makes a lot of sense. But I like the end result: I hate systems which are like a fridge. I hate systems which are cold inside. Cool. Should we move on to the summary?
Robert [1:07:27]: Yes, but maybe one more thing also.
Miłosz [1:07:31]: There are a
Robert [1:07:34]: Couple of challenges that we’ve covered there. Like every technique, as we mentioned, most techniques require some effort and extra cost to implement, and it’s no different for event-driven architecture. It requires some special techniques, and it’s important to use the right tool for the right problem. If you use event-driven architecture for a system that doesn’t require it, it will just be pure over-engineering. You’ll pay for all the challenges that we just mentioned for nothing. You’ll just have one more problem in your system.
Miłosz [1:08:21]: Your system is just synchronous by nature, and that’s fine.
Robert [1:08:25]: Yeah, so if your team is not super big, like in the example with sagas, you don’t have super big scalability requirements, you are not integrating with external systems, you don’t have super big traffic spikes: in this case, maybe doing the system synchronously will be just enough. And if you’re in doubt, you can always start with synchronous and migrate to asynchronous later. Like we mentioned in the previous episode, if you’re using some kind of abstraction over interactions with your system, like clean architecture, it’s much easier, because you have a layer that allows you to interact with your system, and it doesn’t matter whether that’s from an HTTP request or from a message. It makes the migration much, much easier. And it’s a common thing, when you’re making decisions, to ask yourself: how can I reverse this decision? If it’s hard, ask yourself, okay, maybe there is some way to make this decision easier to reverse. It’s worth it, of course, and it can be helpful with tough decisions.
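The abstraction layer that makes the sync-to-async migration cheap can be sketched like this. The names (`CreateUser`, `UserService`, the two entry-point functions) are illustrative, not from any specific codebase: the point is that both transports are thin adapters over the same application-layer use case.

```go
package main

import "fmt"

// CreateUser is the application-layer command. Transports below are thin
// adapters over it, so swapping a synchronous HTTP entry point for an
// asynchronous message consumer doesn't touch the business logic.
type CreateUser struct{ Email string }

// UserService holds the use-case logic (state kept in memory for the demo).
type UserService struct{ created []string }

func (s *UserService) Handle(cmd CreateUser) error {
	s.created = append(s.created, cmd.Email)
	return nil
}

// Synchronous entry point, e.g. called from an HTTP handler.
func httpCreateUser(s *UserService, email string) error {
	return s.Handle(CreateUser{Email: email})
}

// Asynchronous entry point, e.g. called from a message subscriber.
func onUserCreateMessage(s *UserService, payload string) error {
	return s.Handle(CreateUser{Email: payload})
}

func main() {
	s := &UserService{}
	_ = httpCreateUser(s, "sync@example.com")
	_ = onUserCreateMessage(s, "async@example.com")
	fmt.Println(len(s.created)) // 2: same logic behind both transports
}
```

Migrating later then means adding the message-consumer adapter and removing the HTTP one, while `UserService.Handle` and everything behind it stays untouched.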
Miłosz [1:09:36]: Yeah, but if you are not sure, you can just start with sync approach and later migrate.
Robert [1:09:46]: We mentioned many traps today. The good news is that there aren’t many more traps beyond those. So I think it’s important to know about them and not reinvent the wheel, because all the challenges that you may have with event-driven or asynchronous architecture are solved problems. Obviously, you can try to solve them yourself, but I’m quite sure that you will not do it better than what programmers have figured out over the last 50 years.
Miłosz [1:10:21]: It’s hardest when you’re just starting out. You don’t know what you don’t know, so you might try some naive approaches or just do it yourself. You read about event-driven architecture and then you decide, okay, I will use events and I will send them via HTTP, or I will have a long-lived HTTP connection between services, or something like this. Sure, this could work, but maybe just use a message broker, which was designed for exactly this thing, and which thousands of companies have used on production for 10 years. So it’s very likely it will work out for you as well.
Robert [1:11:03]: And if you think you can do it better, I would recommend starting by learning what’s already there. Maybe you can build something better, but to do that, you need to actually know the current state of the art. I think it’s still possible to improve things, but you need a very strong base. And that’s only for the case when you have some idea how to improve things. When you’re normally working on a product, you probably don’t have time for reinventing the wheel, and what’s already there should be more than enough.
Miłosz [1:11:41]: That’s the less cool part, and the difficult thing to do. It’s exactly like with sagas: you can convince yourself that it’s a good idea to create a distributed transaction framework for your systems, but maybe what you need to do is improve your product a bit, to create something people want to buy. That might not be as interesting in a technical sense, but it’s probably the right thing to do for your company.
Robert [1:12:12]: But on the other hand, I think it’s also kind of a trap to think that doing some complex stuff like sagas or whatever may be interesting. Maybe at the beginning, but later you’ll need to maintain it. And I’m pretty sure, don’t ask me from where, from experience, that later you’ll need to maintain it. And it will not be interesting.
Miłosz [1:12:32]: Yeah, you know what, you need to build a framework and then leave the company.
Robert [1:12:37]: And it will be no longer a problem.
Miłosz [1:12:39]: Problem solved.
Robert [1:12:40]: Unfortunately, it’s often the case. But trust me, it’s much more fun to work with a system that is fairly simple, where you can iterate quickly and add features quickly. And again, from our experience, using boring, already well-known techniques and tools is a lot of fun in the end, because maintaining them is much, much easier, and you can still do a lot of stuff.
Miłosz [1:13:08]: Moving fast is fun.
Robert [1:13:11]: For sure. Yeah, seeing how people are using your product and are happy about it is also pretty cool. And I know that it’s not super technical, but trust me, try it, it’s much...
Miłosz [1:13:22]: Much nicer.
Robert [1:13:23]: And I think it’s also nice later when you’re looking for a job and you go to an interview, and the interviewer sees that, okay, you’re a pragmatic person, because you were not doing CV-driven development, but instead using the right tool for the right problem. I think it’s often a really valuable skill, and when we were interviewing people, it was a big bonus point to see it. Even if a person was using very simple tech, they used it in the proper context and had deep knowledge of those simple tools. It was always great to see.
Miłosz [1:13:59]: There’s no silver bullet.
Robert [1:14:02]: Yeah. Sure. So, probably at the end we should say: is it worth using event-driven architecture? And, well, unfortunately, it’s not that easy. As I said, there is no silver bullet, and there are many parameters here. So we believe it’s most important to give you a lot of data points, so you can decide on your own whether using event-driven architecture is good for you or not, be prepared for the price you need to pay, and see if it fits properly and if it will pay off, basically.
Miłosz [1:14:42]: Yeah, there are probably no shortcuts here, no easy answers. You should know all those challenges before you start, so you are prepared for things like the at-least-once delivery or eventual consistency we mentioned. It’s just a bit of a different mindset than the synchronous approach. So if you have the basic theory in mind, and you’ve spent some time working with projects in a synchronous way, you will be ready to handle it.
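[Editor's note: at-least-once delivery, mentioned above, means a handler may receive the same message more than once. One common mitigation is an idempotent handler that remembers processed message IDs. This is a minimal illustrative sketch, not code from the episode; all names are hypothetical, and in production the `processed` set would live in a database, not in memory.]

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a minimal stand-in for a broker message.
type Message struct {
	ID      string
	Payload string
}

// IdempotentHandler wraps a handler so that redelivered
// messages (same ID) are processed only once.
type IdempotentHandler struct {
	mu        sync.Mutex
	processed map[string]bool // in production: a DB table with unique message IDs
	handle    func(Message)
}

func (h *IdempotentHandler) Handle(msg Message) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.processed[msg.ID] {
		return // duplicate delivery: skip
	}
	h.processed[msg.ID] = true
	h.handle(msg)
}

func main() {
	count := 0
	h := &IdempotentHandler{
		processed: make(map[string]bool),
		handle:    func(Message) { count++ },
	}
	msg := Message{ID: "msg-1", Payload: "hello"}
	h.Handle(msg)
	h.Handle(msg) // simulated redelivery: ignored
	fmt.Println(count) // 1
}
```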
Robert [1:15:24]: I think the only silver bullet here is knowing that it’s not like it never works or it always works. It’s important to know what tools you have, so you’re not the worker who only knows how to use a hammer, because that’s not the best worker. You should know what tools are available and be prepared to use them in the proper context. Using synchronous architecture everywhere will not work. Using asynchronous architecture everywhere will also not work. It’s about finding the right tool for the right problem, because neither extreme is the best.
Miłosz [1:16:07]: Yeah, so if you want to learn more about event-driven architecture in general, or in Go specifically, our Go Event-Driven training is on sale for one more week. It’s a place where you learn hands-on with exercises, and we’re doing it for the first time in a cohort-based way, so everyone learns together. You can still join us for one more week, and then we will start an intensive month of learning together.
Robert [1:16:40]: The next one will probably be in half a year or so. So if you’re listening to this now, you’re super lucky. If you’re listening to this episode later, sorry. You can sign up for the waitlist.
Miłosz [1:16:58]: Okay. Before we move to the Q&A, it’s probably time you hit that subscribe button and leave a comment, or a review if you’re listening in a podcast application. And as Robert mentioned, you can subscribe to our newsletter on our blog, and we will let you know about the next episodes. Actually, we’re not sharing the topic for the next episode this time.
Robert [1:17:21]: It’s a secret.
Miłosz [1:17:22]: It’s a surprise.
Robert [1:17:24]: Yeah, but one thing we already know is that we’ll be changing the form of this live podcast a bit. Basically, we’ll do two more live episodes, and later we’ll switch to only sometimes doing live episodes. Most of the time it will be pre-recorded, because of a couple of technical challenges of doing it live. Like some construction workers who are super loud. Fortunately, our mics are pretty good at noise canceling, but it’s super loud here.
Miłosz [1:17:57]: Fifteen minutes before we started, they started drilling, so it’s super fun. We also want to make the sessions a bit more interactive. When we do live sessions, we will try to figure out something more interesting for you to follow.
Robert [1:18:09]: I think we’ll definitely be running some Q&A sessions between the pre-recorded ones. We’ve also seen that most of you listen to this after the recording, so we’ll try to balance it a bit. And I think in the end the content will also be a bit better, because trust us, it’s not the easiest thing to go through the episode’s scenario live, since we need to just keep the flow going. But again, we’ll still have live sessions, so no worries, it will be just a slight change. And we’ll have two more live episodes, so don’t forget to join us.
Miłosz [1:18:44]: And then a short summer break, and then we’re back.
Robert [1:18:50]: Correct. Alright, so time for Q&A, I think.
Miłosz [1:18:58]: I think we discussed most of the comments already, because they were relevant to what we were discussing at the moment. There’s a recent question from Pedro about starting with a monolith, breaking it down into smaller services, and then adding events.
Robert [1:19:23]: AKA monolith first.
Miłosz [1:19:25]: And this is something we did once... well, more than once, but one time specifically. We had this project where we started with quite a small team, but there was a perspective that it would be much bigger in the future. It didn’t happen.
Robert [1:19:46]: At least we didn’t lose time for doing microservices.
Miłosz [1:19:49]: We wanted to be ready to scale, but we also didn’t want to create distributed transactions at the beginning. So we started with a monolith and a quite simple setup, where we have modules that are completely separated and connected with code, basically, with interfaces between each other. You can think of it like this: imagine microservices that expose an HTTP API. It’s the same idea, but instead you have modules that expose interfaces in code. As you can see, that’s pretty much the same thing, just a different transport. In this case, there is no transport, because we don’t need it. But we tried to do it in a way that, if needed, we could take any module out, and it would be quite simple to deploy it separately.
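[Editor's note: a minimal sketch of the modular-monolith idea Miłosz describes. The module and interface names are hypothetical, not from the actual project: each module exposes a small interface, other modules depend only on that interface, so an in-process implementation can later be swapped for a network client without touching callers.]

```go
package main

import "fmt"

// UserProvider is the "public API" of the users module.
// Other modules depend on this interface, never on the
// users module's internals.
type UserProvider interface {
	UserEmail(userID string) (string, error)
}

// UsersModule is the in-process implementation of UserProvider.
type UsersModule struct {
	emails map[string]string
}

func (u *UsersModule) UserEmail(userID string) (string, error) {
	email, ok := u.emails[userID]
	if !ok {
		return "", fmt.Errorf("user %s not found", userID)
	}
	return email, nil
}

// NotificationsModule depends only on the interface, so it
// doesn't care whether users live in-process or behind HTTP.
type NotificationsModule struct {
	users UserProvider
}

func (n *NotificationsModule) NotifyUser(userID, message string) error {
	email, err := n.users.UserEmail(userID)
	if err != nil {
		return err
	}
	fmt.Printf("sending %q to %s\n", message, email)
	return nil
}

func main() {
	users := &UsersModule{emails: map[string]string{"42": "user@example.com"}}
	notifications := &NotificationsModule{users: users}
	_ = notifications.NotifyUser("42", "hello")
}
```

Extracting the users module into a separate service would then only mean writing a new `UserProvider` implementation backed by an HTTP or gRPC client.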
Robert [1:20:47]: But on the other hand, if your team is not that big, like 10-15 people, maybe 20, it shouldn’t be that big of a problem, and you should be able to split it out of almost anything. Okay, it depends on the time frame, basically.
Miłosz [1:21:04]: But we like to think of the monolith versus microservices question as a question of deployment, not of project structure.
Robert [1:21:15]: And also as a kind of social concept, or a concept of the team. Basically, when you need microservices, you probably need them because of how big your team is. So you have one project and, for example, 50 people committing to one repository and deploying it.
Miłosz [1:21:34]: Yeah, maybe they block each other, maybe the CI becomes complicated.
Robert [1:21:39]: Somebody introduces some flaky tests and 50 people are not able to deploy for an entire day. So yeah, this is a place where you need to split. But if your team is five, it’s not an issue. Anyway, if you’d like to see the implementation of the project Miłosz mentioned and how we did it, we have blog posts about that, so we’ll include them in the episode materials tomorrow. And maybe one more interesting thing: we also used to do the opposite. We had microservices, and we actually connected multiple microservices into one monolith, because it was just too granular and too complicated to communicate between them. Having one service with everything, so a similar situation to what we had with the saga, just made everything much easier.
Miłosz [1:22:34]: And Gabriel mentioned that onion architecture can help when you use DDD from day one. So onion, or clean, or hexagonal, or whatever: layered architecture in general can help, especially for bigger projects. Also not a silver bullet; we have a separate episode about this. In the setup I mentioned, it’s exactly this idea: following hexagonal terminology, let’s say, the modules communicate via adapters, and that’s it. It’s very similar to communication over any RPC, just with a non-network connection.
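[Editor's note: in hexagonal terms, the module boundary is a port (an interface), and whether the other side is in-process or remote is an adapter detail. A hypothetical sketch, with illustrative names and an assumed JSON endpoint shape, showing "same idea, different transport":]

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// OrdersPort is the port: callers don't know the transport.
type OrdersPort interface {
	OrderTotal(orderID string) (int, error)
}

// InProcessOrders is an adapter that calls the orders module directly.
type InProcessOrders struct {
	totals map[string]int
}

func (o InProcessOrders) OrderTotal(orderID string) (int, error) {
	total, ok := o.totals[orderID]
	if !ok {
		return 0, fmt.Errorf("order %s not found", orderID)
	}
	return total, nil
}

// HTTPOrders is an adapter that calls a remote orders service.
// Swapping it in doesn't change any caller code, only the wiring.
type HTTPOrders struct {
	baseURL string
}

func (o HTTPOrders) OrderTotal(orderID string) (int, error) {
	resp, err := http.Get(o.baseURL + "/orders/" + orderID)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct{ Total int }
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Total, nil
}

func main() {
	// Today: in-process adapter. Tomorrow: HTTPOrders{baseURL: "..."}.
	var orders OrdersPort = InProcessOrders{totals: map[string]int{"a1": 100}}
	total, _ := orders.OrderTotal("a1")
	fmt.Println(total) // 100
}
```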
Robert [1:23:18]: Also worth mentioning: you don’t need to use DDD for onion architecture, and you don’t always need a domain layer. That’s also totally fine. Often people combine all of this and sometimes it creates a monster, and then people say, oh, this architecture is overcomplicated. But often it’s exactly because of that, because you don’t always need to introduce a domain layer.
Miłosz [1:23:42]: Yeah, and each module can have a separate structure. One can use clean architecture, one can use DDD, and some can be one file.
Robert [1:23:51]: More about that in the episode about clean architecture.
Miłosz [1:23:56]: Yeah, so Gabriel mentioned you can move faster with a monolith. If there’s a small team,
Robert [1:24:02]: Then definitely.
Miłosz [1:24:03]: If there’s a small team, and especially if you’re a single developer. If you’re a single developer and you have microservices, something is wrong.
Robert [1:24:12]: Unless it’s your fun project after hours.
Miłosz [1:24:16]: Yeah, we love those. But we also worked with teams where there were more microservices than people on the team, and it’s not fun at all.
Robert [1:24:29]: It was probably fun at the beginning, but later it was the opposite.
Miłosz [1:24:35]: In the end, DDD is the best way to go, says Pedro. It depends, as always.
Robert [1:24:43]: I’d say it probably depends on which part of Domain-Driven Design, because many people don’t split Domain-Driven Design into its two important parts. Domain-Driven Design has strategic patterns and tactical patterns. Most people talk about the tactical ones, the ones in the code. But there is also a very important part of Domain-Driven Design called strategic patterns, and it’s actually not connected to the code at all. It’s more about gathering requirements, talking with the business, ensuring that the problem you are solving is the right problem. And you should probably always do that, even if you are not naming it Domain-Driven Design; Domain-Driven Design just gives you some tools for it. So if we are talking about strategic patterns: totally, you should do that. The tactical patterns? Maybe. Sometimes.
Miłosz [1:25:36]: It all depends on the service or the module and what it does. If there’s a service that lets you upload your avatar, you probably don’t need to involve your stakeholders in discussing what file formats are supported or something like that.
Robert [1:25:53]: And have a separate service for that. We’ve seen it once.
Miłosz [1:25:57]: So, yeah. Once again, balance is important. And if you have this modular monolith with separate, independent modules, it helps, because you can have one module that is super technical, where you don’t care about the design, and another that’s pure business rules, where you spend a lot of time on design. So, modularization is the way to go. Probably not in the form of microservices, but some kind of separation is great to have.
Robert [1:26:39]: In general, there are cases for starting with microservices, but they are probably the cases when you are starting with a pretty big development team, or for some reason you know that your development team will be big. And it’s not like, yeah, yeah, totally, our team will be big and the product will be super scalable. Every startup thinks this, and probably five percent really end up there.
Miłosz [1:27:07]: It’s also not that easy to design a system that would scale like that
Robert [1:27:11]: And be profitable. 90% of them spend their time setting up Kubernetes, burning all the money, and failing later. Another 5% fail because the product is bad, and the other 5% are lucky, and didn’t spend most of their time setting up Kubernetes and Istio.
Miłosz [1:27:33]: You mentioned this in the first episode as well. It’s very hard to design a system that will be ready for all the future challenges. It’s better to evolve over time. Because you just can’t foresee how things will change.
Robert [1:27:48]: Easier said than done, but again, check the first episode. We have a couple of nice ideas there on how to do it.
Miłosz [1:27:56]: I think there are no more questions. Pedro says we are doing a great job. Thank you very much.
Robert [1:28:03]: We are doing our best.
Miłosz [1:28:04]: Thanks for joining us. Thank you for all the questions. We will see you in the next episode about a secret topic.
Robert [1:28:14]: Exactly. In two weeks. We’ll let you know in our newsletter. Just a reminder: you’ll be able to join Go Event-Driven for one more week; the next edition will be in half a year or so.
Miłosz [1:28:28]: If you are listening later, you can join the waitlist.
Robert [1:28:33]: Or Go in One Evening, if you don’t know Go yet, because maybe a lot of you don’t know Go. And yeah, we’ll
Miłosz [1:28:42]: Leave links in the description. Yep. All right, thank you everyone for joining us and for the questions. See you in two weeks.
Robert [1:28:52]: Thank you, Milosz. Thank you, everyone.
Miłosz [1:28:54]: Bye.