Quick takeaways

  • Start with synchronous architecture by default - it’s simpler to understand, debug, and maintain for most use cases
  • Async architecture improves scalability and resilience - message queues and events help handle traffic spikes and failures
  • Design matters more than the technology choice - tight coupling creates the same problems in both sync and async approaches
  • Consider team experience - async architecture requires more experienced teams and better tooling to handle new challenges
  • Adjust as your system grows - external APIs, heavy operations, or the need to handle failures gracefully are good use cases for async
  • Hybrid approach - use both sync and async where they fit best, rather than forcing one over the other

Introduction

In this episode, we discuss when to choose synchronous versus asynchronous architecture for backend systems.

We talk about the trade-offs between simple, predictable sync communication and the complexity but resilience of async approaches using message queues and event-driven architecture.

Instead of picking one approach over another, we focus on understanding when each makes sense and how to avoid common pitfalls like distributed monoliths and over-engineering.

Show Notes

  • Go Event-Driven training
  • Watermill - our open-source Go library for working with message streams
  • Event Storming - a design technique with a great unfinished ebook by Alberto Brandolini
  • CQRS (Command Query Responsibility Segregation) - a pattern that works well with both sync and async approaches
  • Our CQRS article and Server-Sent Events post
  • Message brokers mentioned: RabbitMQ, Kafka, NATS, Google Cloud Pub/Sub
  • Event schemas: Protobuf, Avro, CloudEvents
  • Event Sourcing - a pattern mentioned in context of recreating state from events
  • Clean Architecture/Hexagonal Architecture - architectural patterns mentioned for making sync/async migration easier

Quotes

What’s the key difference here is the fact that when you’re publishing some message, you basically don’t care how it will be processed, by whom it will be processed. And it’s nice because it’s decoupling you from the consumers.

Robert

The bigger the system, the worse it becomes. For example, if many services depend on this one, which is a bottleneck, it can kill your entire platform.

Miłosz

Decoupling is not always good. It’s similar like with Don’t Repeat Yourself. Often, don’t repeat yourself may be a good idea, but not always. If you go into extreme, it may be problematic.

Robert

The worst case here is when you ignore the issue. You assume the database commit never fails. And it might be true most of the time, but usually if something can happen, it will happen on production sooner or later.

Miłosz

Event is information about the fact that happened. So it’s not something that will happen, may happen. No, it’s information about something that happened and it cannot fail.

Robert

It’s just counterintuitive, because you think that splitting things makes things decoupled. But it’s not really the case. It depends on how you communicate, really.

Miłosz

Transcript

Miłosz [0:00]: Should we make this async? It’s one of the questions that seems simple, but can make a huge difference in the long run. If you get it wrong, you’ll spend months fixing the mess. How to decide? You should not just guess. I am Miłosz.

Robert [0:13]: And I’m Robert. And this is No Silver Bullet Live podcast, where we discuss mindful backend engineering. We spent almost 20 years working together across different projects and teams, and we learned that following advice like “always do X” or “never do Y” doesn’t work and can limit your growth. In this show, we share multiple perspectives that will help you make smart choices and grow to the principal engineer level.

Miłosz [0:38]: And if you have any follow-up questions, you can leave them in the chat. We can pick them up during the discussion if they are relevant. Otherwise, we will have a Q&A session at the end. So we will go through all the questions.

Robert [0:54]: And to be on the same page, today we’ll discuss async in the context of architecture, not async in the context of your code. So we won’t discuss threads or goroutines; we’re not discussing code-level async. We’ll discuss the higher-level architecture, so things related to Pub/Subs, queues, and similar systems.

Miłosz [1:21]: Yeah, maybe the context would be we have two services, microservices or programs in general, two or more, and we want them to talk to each other over the network in some way. That’s what we want to focus on today. And usually, it works like one of the services asks for something, makes a request, and then waits for the result. And you have to handle the error in some way, if it happens, or basically check the status code, which is the main difference between sync and async.

Miłosz [2:08]: And yeah, maybe one specific case is where you fire and forget: you make the request and then you don’t care about what happens later in the other service. But I think we can agree not to talk about this case today, because it’s pretty rare that you don’t care about what happened on the other end. If you lose the status code, the response, you don’t know if you can continue, or whether you should let the user know what happened. Let’s skip this one for now.

Robert [2:55]: If we are talking about synchronous architecture, in contrast, in most cases we are talking about gRPC or HTTP architecture. In other words, you’re calling one service from one place, and within the request context, some operation is done. So maybe something is written to the database, maybe some transaction is done, maybe some money is moved, or something like that, so basically…

Miłosz [3:27]: Yeah, transferred to another account. Something happens in the other service.

Robert [3:35]: Yeah, and basically, if this request returned some 200-like HTTP code or a successful gRPC status code, you know that it succeeded. And it’s nice because it’s the typical architecture that we are building. So if you are learning how to code, usually you are using synchronous architecture. You are not using event-driven architecture or a message broker, because it’s probably something more complex for somebody who’s learning. I guess, Miłosz, when you were learning how to code, you started with building some synchronous APIs, probably.

Miłosz [4:16]: Yeah, I started with no APIs at the very beginning.

Robert [4:20]: When you were learning programming, there were no APIs yet.

Miłosz [4:24]: Well, I didn’t learn microservices first, but it’s for sure the first method of communication you see while learning. It’s probably the most natural. That’s one of the upsides of using synchronous architecture: it’s quite predictable in the flow. It’s kind of like a method call that goes over the network, and when you read the code, you don’t really need to know what happens underneath. I think there are some frameworks that try to do it for you, so that you can’t even distinguish between them. The RPC calls basically look like normal function calls. That’s a separate topic. But the point is, that’s one good thing about sync: it’s pretty easy to understand what’s going on in your function.

Robert [5:25]: And it’s natural for you, so the entry point for that is much smaller.

Miłosz [5:30]: If an error happens, or an exception is thrown, or whatever, you know it will exit the function, just like any other function call. That’s one of the pros.

Robert [5:44]: If you have worked as a software engineer for longer, you may prefer simple approaches, because a simple approach doesn’t bring any additional complexity. But, as usual, it doesn’t cover all the possible use cases. Unfortunately, there are some cases when the simple approach of synchronous communication is not enough. For example, you are calling some external service, it can be another team or another company, and for some reason it is working slowly or doesn’t work at all. It’s not the best situation, but this is where we often are. So we need to call some other team’s services or some external services, and doing it synchronously may be hard. I remember when we were working with, for example, some invoicing software that was outsourced. Sometimes it worked, sometimes it didn’t, but we were not able to change that because there was a contract and we were obligated to use it.

Miłosz [6:47]: It might be even an external API completely out of your company. There’s some third-party service you use, and you need to use it, but you have to deal with the API, whatever it takes.

Robert [7:02]: Yeah, and another real-world example: sometimes you may be integrated with some external marketing systems. And for some reason that I don’t know, those systems are often not that stable. Sometimes it’s part of the customer registration flow: you need to send some data when you’re registering a customer. And, well, you cannot risk that it will break the customer registration flow, because it’s part of the critical flow. So it’s also pretty problematic, and in a synchronous way it will be hard to handle.

Miłosz [7:41]: Yeah, it’s one of the limits. I think it’s a question people often ask us about designing the architecture. They recognize that they have a database transaction that does something like saving the user in the database, let’s say, and does a few other things. And they also need to call this, let’s say, marketing software via API. And they are confused about how to model it: should the API call be inside the transaction or outside the transaction? And basically, there’s no good way to handle this in the synchronous approach.

Robert [8:24]: Yeah, you can basically choose what’s less important. For example: okay, we’ve failed to send some information to the marketing system, and we accept that. Or maybe it’s something critical, and we should interrupt our registration flow. It depends, unfortunately. But when you are doing it in a synchronous way, you need to make those trade-offs.

Miłosz [8:48]: The worst case here is when you ignore the issue. Like, you assume, okay, but database commit never fails.

Robert [8:56]: Or the system will always work.

Miłosz [8:58]: Yeah, and it might be true most of the time, but usually if something can happen, it will happen on production sooner or later. And what I really don’t like about ignoring this challenge is that you end up with inconsistency in the system. And fixing this is super hard. First of all, you need to figure out if you have inconsistency. And if you have no proper monitoring in place, you might never know. You can see it weeks later or months later.

Robert [9:37]: Or even worse, let’s imagine that we assume this external system of another company is stable and always working. You have some big promotion, big marketing, hundreds of people trying to register, and the system is down. I guess a lot of people may be pretty upset, because somebody spent some horrible amount of money on marketing and preparing that. And because of this assumption, it’s not working. And often it’s such an important integration that if you disable it, you’ll find out that it’s hard to track what’s happening with customers, and it’s a super big problem for everybody.

Miłosz [10:17]: Yeah, so it can easily become a bottleneck.

Robert [10:20]: Yeah, and if we are talking about bottlenecks, the other problem that you often have with synchronous architecture is scalability. Often you have some operations that are heavy, and they work fine when you don’t have super big traffic. But hopefully the product that you’re working on will at some point have some bigger spikes of traffic and more customers. And if it’s not designed in a proper way, and everything is just done synchronously, at some point people will start to see errors, because the operations they are trying to do cannot be done: you just don’t have enough resources. And if you’re doing it synchronously, you cannot do much about that if it’s a critical part of the functionality.

Miłosz [11:08]: And the bigger the system, the worse it becomes. For example, if many services depend on this one, which is a bottleneck, it can kill your entire platform.

Robert [11:24]: I’ve seen many systems trying to do some workarounds for that, like cron jobs or, even worse, database triggers. It’s some kind of workaround, but from our experience, it doesn’t scale in the end. Not only in terms of resources, but also in terms of maintenance. At some level of complexity, it’s basically a mess.

Miłosz [11:56]: So, the other approach is asynchronous architecture, and we don’t talk about the code level. So, stuff like callbacks, or promises, or the actor model, whatever you use in your code, is out of scope for now. Coming back to the context of two services that talk to each other, I would say that the goal of asynchronous architecture in this context is that we want to execute this request, whatever it is, in the background, but we also don’t want to lose the information about the result. Even if the result is empty and there is no payload, the status message, or the fact that an error happened or not, is important. So we want to handle it sometime in the future, but we don’t want to wait for it to complete in the current function.

Robert [13:00]: And again, some poor man’s approach may be putting it in a table and running some batch jobs. While it can work, it’s not a perfect approach, because at some point, if you have more jobs to process, processing may end up taking more time than the entire day has. So parallelizing that may be a challenge. And it’s often also delaying when things are processed, basically.

Miłosz [13:33]: So Gabriel mentioned in the chat that it’s easier to debug. I think this was about the synchronous approach. Yeah, that’s a super important point. If you have a classical function where the steps go one after the other, it’s quite easy to understand what’s wrong and debug it. With async, it’s a bit more complicated when you get into that. So, what solutions do we have in terms of the architecture? I think what many people may be familiar with is task queues: usually a simple FIFO (first in, first out) queue of tasks that some worker processes pick up from. It’s a pretty easy solution and can work well for simple use cases, when you just want something to be done in the background and forget about it. The downside might be that if you have a more complex setup, it’s probably not enough to handle it well, but it can work.
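
To make the task queue idea concrete, here is a minimal sketch in Go. It is only an in-process stand-in: a buffered channel plays the role of the FIFO queue and goroutines play the role of worker processes, whereas a real setup would use a separate queue service. The names `Task` and `runWorkers` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of background work.
type Task struct {
	ID int
}

// runWorkers puts all tasks on a FIFO queue (a buffered channel), starts n
// workers that pull from it, and returns the IDs of the processed tasks once
// the queue is drained.
func runWorkers(n int, tasks []Task) []int {
	queue := make(chan Task, len(tasks))
	for _, t := range tasks {
		queue <- t
	}
	close(queue)

	var (
		mu        sync.Mutex
		processed []int
		wg        sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker takes the next available task until the queue is empty.
			for t := range queue {
				mu.Lock()
				processed = append(processed, t.ID)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return processed
}

func main() {
	done := runWorkers(3, []Task{{ID: 1}, {ID: 2}, {ID: 3}, {ID: 4}})
	fmt.Println("processed tasks:", len(done))
}
```

Tasks leave the queue in FIFO order, but with several workers the completion order is not guaranteed, which is one reason this simple model gets harder to reason about in more complex setups.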

Robert [14:50]: The two cases that I have in mind when queues may not be helpful, compared to event-driven architecture: the first is when you have more teams, because queues don’t help you with organizational scalability. Basically, you have one queue, and it’s kind of similar to sharing a database. If multiple teams are sending messages to the queue and one message is spinning, it’s blocking other teams. So it’s not the best experience. And the second thing is that if it’s an ordered queue, even if you have multiple queues with some high-priority and low-priority ones, it’s also pretty limited in terms of performance.

Miłosz [15:36]: Yeah, but for a simpler architecture, it might be a pretty good solution. For sure better than doing it in a synchronous way and hoping it won’t break. But a step forward from this is messages in general, often with a message broker or a Pub/Sub in the middle. This is a similar approach, but instead of having workers pulling tasks and doing one thing, we connect the services using messages. There are usually many ways you can build this topology, like who receives the message, and it also depends on the infrastructure you use. You might be familiar with message brokers like RabbitMQ or Kafka, or the cloud-based ones, and they all have a slightly different configuration. But the point is, it allows you to communicate between processes using those messages.

Robert [16:45]: Yeah, and the key difference here is that when you’re publishing some message, you basically don’t care how it will be processed or by whom. You basically emit something, usually an event, and this event may be processed or not, it may be spinning for a while, and you don’t care. And it’s nice because it’s decoupling you from the consumers.

Robert [17:13]: And it’s adding the thing that I mentioned is not handled by queues: it creates a way that helps with scaling multiple teams. If you are decoupled from the consumers of your events, basically tens or hundreds of teams can listen to your events, and you don’t care if they fail or not.
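
The decoupling described here can be sketched with a toy in-memory bus. This is an illustration under assumptions, not how RabbitMQ, Kafka, or Watermill are implemented: `Bus`, `Subscribe`, and `Publish` are hypothetical names, and delivery is a plain function call, while a real broker delivers asynchronously and persists messages. The point is only that the publisher never refers to any consumer.

```go
package main

import "fmt"

// Event is a hypothetical message payload.
type Event struct {
	Name    string
	Payload string
}

// Bus is a toy in-memory stand-in for a message broker.
type Bus struct {
	subscribers map[string][]func(Event)
}

// NewBus creates an empty bus.
func NewBus() *Bus {
	return &Bus{subscribers: make(map[string][]func(Event))}
}

// Subscribe registers a handler for a topic. The publisher is not involved.
func (b *Bus) Subscribe(topic string, handler func(Event)) {
	b.subscribers[topic] = append(b.subscribers[topic], handler)
}

// Publish hands the event to every subscriber of the topic and returns how
// many handlers received it. The caller does not know who they are.
func (b *Bus) Publish(topic string, e Event) int {
	for _, handler := range b.subscribers[topic] {
		handler(e)
	}
	return len(b.subscribers[topic])
}

func main() {
	bus := NewBus()
	// Two teams subscribe independently of each other and of the publisher.
	bus.Subscribe("orders", func(e Event) { fmt.Println("billing got", e.Name) })
	bus.Subscribe("orders", func(e Event) { fmt.Println("email got", e.Name) })

	n := bus.Publish("orders", Event{Name: "OrderPlaced", Payload: "order-42"})
	fmt.Println("delivered to", n, "subscribers")
}
```

Adding a third consumer is one more `Subscribe` call in that team’s own code; the publishing side stays unchanged.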

Miłosz [17:40]: So, this is if you use events, right? An event is a kind of message, and it states a fact that already happened. I think this is more of a mindset shift than a technical thing. Because you can easily use messages, but if you send, for example, RPC-like requests as messages, it doesn’t give you the same decoupling benefits.

Robert [18:16]: Yeah, I think it’s nice how it works out of the box with events. I mean, how nicely it fits this model. An event is information about a fact that happened. So it’s not something that will happen or may happen. No, it’s information about something that happened, and it cannot fail because it already happened. You can’t say, “oh no, it didn’t happen.” It happened. And it’s nice because it’s in some way immutable, so everybody can listen to it and they cannot say it didn’t happen.

Miłosz [18:48]: They can’t disagree, yeah. I think it often simplifies the design, right? Because let’s say you have this process after, I don’t know, someone placing an order. Instead of this one process figuring out all the other things that need to happen in the entire system, like send a receipt, send an email, notify the sales team, whatever you can imagine, you can just send a single “order placed” event from this system, and then other systems react to it.

Robert [19:27]: We’ll go to anti-patterns a bit later, but I think one very often seen anti-pattern is the passive-aggressive events anti-pattern. Usually, when you are creating an event, its name should be a verb in the past tense. So, “order placed”, but it shouldn’t be an event like “order should be placed”. Because in this case it’s more like a command, and commands fit the queue system better. So you’re sending it to a queue, one consumer is subscribing to it, and maybe the order will be placed. But maybe it will not be placed, because you don’t have enough items in the inventory, for example.
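
The naming distinction can be shown with two Go types. This is a sketch with hypothetical names (`PlaceOrder`, `OrderPlaced`, `handlePlaceOrder`); it is not the Watermill API. The command is a request that one handler may reject, and the event is only produced once the fact has actually happened.

```go
package main

import (
	"errors"
	"fmt"
)

// PlaceOrder is a command: a request in the imperative mood that may be rejected.
type PlaceOrder struct {
	OrderID  string
	Quantity int
}

// OrderPlaced is an event: a fact named in the past tense that cannot fail.
type OrderPlaced struct {
	OrderID string
}

// ErrOutOfStock is returned when the command cannot be fulfilled.
var ErrOutOfStock = errors.New("not enough items in inventory")

// handlePlaceOrder is the single handler of the command. It returns the event
// to publish only when the order was really placed.
func handlePlaceOrder(cmd PlaceOrder, inStock int) (OrderPlaced, error) {
	if cmd.Quantity > inStock {
		return OrderPlaced{}, ErrOutOfStock
	}
	return OrderPlaced{OrderID: cmd.OrderID}, nil
}

func main() {
	if evt, err := handlePlaceOrder(PlaceOrder{OrderID: "o-1", Quantity: 1}, 5); err == nil {
		fmt.Println("event to publish:", evt.OrderID)
	}
	if _, err := handlePlaceOrder(PlaceOrder{OrderID: "o-2", Quantity: 9}, 5); err != nil {
		fmt.Println("command rejected:", err)
	}
}
```

An event named like “order should be placed” would blur this line: consumers could not tell whether they are reacting to a fact or being asked to do something.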

Miłosz [20:11]: It’s a funny thing, because it’s mostly about naming things. There’s nothing much different about an event than a regular message. The difference is just the payload and how you call it. But if you pick the right names, the design becomes much easier to follow.

Robert [20:29]: I think it’s interesting because many people are looking at that like, ah, naming doesn’t matter, whatever, name it XYZ and it will be fine. But I will not agree after seeing how it can help, or it can not help if it’s not properly designed.

Miłosz [20:46]: Yeah, it’s about mental load, pretty much. How easy it is to understand the design. And then it’s also easier to debug issues or extend it.

Robert [20:57]: And I remember that some people asked about that in Watermill, the Go library that we’re maintaining: what’s the difference between a command and an event? Because basically the API is very similar, but there are some really small differences. And yeah, basically you can send a command as an event, but again, it’s just practical to follow this distinction because it will be easier later.

Miłosz [21:25]: Maybe let’s come back a bit to messages in general, not just events. I think one thing that’s pretty different in how you approach things the asynchronous way is error handling.

Robert [21:40]: Yeah, and I think it’s a cool thing that you get out of the box, we can say, with event-driven or many asynchronous approaches that are based on a message broker. Because if you are doing operations in the background just within a process, instead of emitting an event or sending a message, you never know if this process will die. Because, for example, Kubernetes will say, “you should be dead.” You never know. It can happen. And I know that in some smaller systems that are not mission critical, it doesn’t matter. You lose some operation, maybe somebody from the operations team will be upset, but that’s it. But if you’re building systems where consistency really matters, and I would say that in most cases it does, then if you are not storing the things you would like to process in a message broker, in some cases it will just happen that your process dies and some operation is not processed. And when you’re sending this event or message to your message broker, it’s up to the message broker to ensure that it’s processed. Only when you confirm that it was processed will the message be deleted.

Robert [23:07]: And it’s especially useful when you have some temporary outage. Imagine that some external API was down for a couple of minutes, or your database was down for a couple of minutes. When you are subscribing to the messages in a message broker, it’s not an issue. Messages will accumulate over time, and it will just auto-resolve after a while. And I think it’s pretty cool: we worked with some systems where outages of external systems were happening, but it was not a big deal. We sometimes received a non-critical alert like, “oh, we have some messages that are stuck,” but they were just stuck for a couple of minutes, and later you could see the chart going down. It’s always pretty pleasant to see that.
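
The “message stays until you confirm it” behaviour can be sketched with a toy broker. Everything here is hypothetical (`Broker`, `Deliver`, `errAPIDown`); real brokers differ in how they acknowledge, redeliver, and order messages. The sketch only shows why a temporary outage resolves itself: failed messages remain pending and are offered again later.

```go
package main

import (
	"errors"
	"fmt"
)

// errAPIDown simulates a temporary outage of an external dependency.
var errAPIDown = errors.New("external API is down")

// Broker is a toy stand-in that keeps messages until a handler confirms them.
type Broker struct {
	pending []string
}

// Publish stores the message; nothing is processed yet.
func (b *Broker) Publish(msg string) {
	b.pending = append(b.pending, msg)
}

// Deliver offers every pending message to the handler once. A nil error acts
// as an acknowledgement and removes the message; an error keeps it pending so
// it can be redelivered later.
func (b *Broker) Deliver(handler func(string) error) {
	var remaining []string
	for _, msg := range b.pending {
		if err := handler(msg); err != nil {
			remaining = append(remaining, msg)
		}
	}
	b.pending = remaining
}

func main() {
	b := &Broker{}
	b.Publish("send-invoice-1")
	b.Publish("send-invoice-2")

	apiUp := false
	callExternalAPI := func(msg string) error {
		if !apiUp {
			return errAPIDown
		}
		return nil
	}

	b.Deliver(callExternalAPI)
	fmt.Println("pending during outage:", len(b.pending))

	apiUp = true // the outage ends, or a fix is deployed
	b.Deliver(callExternalAPI)
	fmt.Println("pending after recovery:", len(b.pending))
}
```

The same mechanism covers a bug in the consumer: messages keep failing until the fix is deployed, and are then processed without anyone re-sending them.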

Miłosz [23:52]: It can also be a bug in your code. You can mess something up, and then the direct requests fail. In the synchronous approach, all services calling this broken service don’t have much choice about what to do. The best they can do is probably return the error up the stack to the previous caller, and then up to the user, to show it in the UI, right? But if you use messages for this, then yeah, this thing won’t be done in the moment, but you can deploy a fix for the code issue, and as I said, all the messages can be processed again. Auto-healing is pretty cool to have. I think it’s also another factor that simplifies the mental model, because you don’t need to care about error handling in each scenario: how to do it, how to make it nice for the user, how to report it or get it over with. Instead, you know that you can just retry, and most of the time it will work fine.

Robert [25:14]: And another similar situation is when you are receiving bigger traffic, for whatever reason: many people went to your website, or sometimes you are under attack. It’s a similar situation, because normally you would not be able to process all those requests, but if you just receive a lot of messages, you can accumulate them and process them at the pace you can handle. And it’s also helping you to scale, because if you have some way of autoscaling your application, you can configure it in multiple ways. When the autoscaler detects that, I don’t know, you are using a lot of CPU or there are more messages to be processed, you can just scale up, process those messages, and scale your application back down. And it’s almost out of the box, we can say.
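
As a rough illustration of scaling on queue depth, the function below computes a replica count from the number of pending messages. It is a simplified sketch, not the algorithm of any particular autoscaler: `desiredReplicas` and its parameters (how many messages one replica can handle, and the minimum and maximum replica counts) are assumptions made for this example.

```go
package main

import "fmt"

// desiredReplicas returns how many consumer replicas to run for a given
// backlog, assuming each replica can work through perReplica pending messages
// in an acceptable time. The result is clamped to [minReplicas, maxReplicas].
func desiredReplicas(pending, perReplica, minReplicas, maxReplicas int) int {
	if perReplica <= 0 {
		return minReplicas
	}
	// Integer ceiling of pending / perReplica.
	n := (pending + perReplica - 1) / perReplica
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	for _, pending := range []int{0, 250, 5000} {
		fmt.Printf("pending=%d -> replicas=%d\n", pending, desiredReplicas(pending, 100, 1, 10))
	}
}
```

Once the backlog is drained, the same calculation brings the replica count back down to the minimum.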

Miłosz [26:08]: Yeah, not always ideal. It depends if you need this thing done right away or not. But if you can wait a while, it’s much simpler than scaling up to handle all incoming requests.

Robert [26:22]: Yeah, and again, I think many people are mentioning event-driven or in-general asynchronous architectures in terms of technical and performance scalability, but we have also the perspective of organizational scalability. So you can basically integrate more teams over events, and thanks to the decoupling, you can add more teams and they can just work more independently.

Miłosz [26:53]: If you design them well. That’s a prerequisite, obviously, because there are also lots of downsides of async architecture. Maybe not downsides, but challenges you need to be aware of, right? We mentioned a couple of times already that the design is important, so that’s one thing. It’s more complex in general, and you need to understand the rules well. Ideally, the entire team would be familiar with how the stuff works. So the more experienced the team, the better for this kind of architecture.

Robert [27:35]: Yeah, but as usual, so you have more complex problems to solve, it’s better to have more experienced team to handle that.

Miłosz [27:44]: Yeah, because for some platforms you can’t just use only a synchronous approach and be done with it. At some point you just need to move at least parts of it to async. And there are lots of new issues you need to deal with, like eventual consistency, race conditions, or debugging, for example. Local development can be a bit more complex. You need new infrastructure, the message broker, which is also something you need to be familiar with and maintain.

Robert [28:25]: Yeah, testing is also a bit more challenging, because it no longer works like this: you do an HTTP request, and after the request is done, you know that the operation is done. No longer. You need to do some kind of polling. If you have worked with frontend a bit, it will be familiar to you, because in frontend things are not that synchronous, and sometimes you need to just poll for some operation that you expect to happen. It will eventually happen, but maybe it will not.
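
The polling mentioned here is often wrapped in a small helper. Below is a minimal sketch using only the standard library; `eventually` is a hypothetical name, and the goroutine in `main` merely simulates an asynchronous consumer finishing its work a bit later.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every interval until it returns true or the timeout
// passes. It reports whether the condition was met in time.
func eventually(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	var processed int32

	// Simulates a message handler that completes asynchronously.
	go func() {
		time.Sleep(20 * time.Millisecond)
		atomic.StoreInt32(&processed, 1)
	}()

	ok := eventually(func() bool {
		return atomic.LoadInt32(&processed) == 1
	}, time.Second, 5*time.Millisecond)
	fmt.Println("operation observed:", ok)
}
```

The timeout is the trade-off: too short and the test is flaky, too long and a genuinely broken flow makes the suite slow to fail.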

Miłosz [28:59]: Yes, but speaking of front-end, the UI also needs to be adjusted sometimes.

Robert [29:03]: Yeah, I think Gabriel even mentioned that it can be more difficult to handle errors in the UI in an async system. So you need to use either WebSockets or Server-Sent Events to push async errors, like in case of a failed sync operation. Yeah, we totally agree that this is the challenging part.

Miłosz [29:25]: Exactly, yeah. This is it. With the synchronous approach, you can just return the error up the stack to the UI and then show a big 500 error and something like “try again or contact support”. With async approaches, the error is somewhere in the system, and the user doesn’t know if something happened or not, if it was successful or not. You need this extra setup to show the status of the request.
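
One way to push such a status to the UI is Server-Sent Events. The sketch below uses only `net/http` from the standard library and is not taken from the article mentioned in the episode. The event name `order-status`, the `updates` channel, and both function names are assumptions for illustration; the payload is assumed to be a single line, since multi-line data needs one `data:` line per line.

```go
package main

import (
	"fmt"
	"net/http"
)

// formatSSE renders one Server-Sent Events frame: an event name, a single
// data line, and the blank line that terminates the frame.
func formatSSE(event, data string) string {
	return fmt.Sprintf("event: %s\ndata: %s\n\n", event, data)
}

// statusHandler streams status updates from a hypothetical channel to the
// browser until the client disconnects or the channel is closed.
func statusHandler(updates <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		for {
			select {
			case <-r.Context().Done():
				return
			case status, open := <-updates:
				if !open {
					return
				}
				fmt.Fprint(w, formatSSE("order-status", status))
				flusher.Flush() // send the frame immediately instead of buffering
			}
		}
	}
}

func main() {
	// Prints a single frame as it would appear on the wire.
	fmt.Print(formatSSE("order-status", "failed"))
}
```

In a running service, the handler would be registered with something like `http.Handle("/status", statusHandler(updates))`, with the channel fed by whatever consumes the asynchronous result.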

Robert [30:04]: In other words, trade-offs. It really depends on what the system is doing under the hood. Because again, if you’re storing stuff in the database and it’s your database and it’s stable, it probably wouldn’t be a problem. But if you’re doing some stuff in the background, it’s starting to be tricky. And it’s also easy to create a poor UX design that will just be misleading for people who are using it.

Miłosz [30:36]: Which is also counterintuitive because it seems like something specific to backend. And here you need to consider it in the entire flow, even in the UI.

Robert [30:47]: Yeah, and again, about the thing that Gabriel mentioned: we are also big advocates of Server-Sent Events. If you are interested in how to implement them properly in Go, we have a pretty nice article about that, which Miłosz wrote on our blog, with a super nice example. We’ll link it in the podcast materials if you are interested.

Miłosz [31:13]: Any other downsides? What about observability?

Robert [31:18]: Yeah, it’s also a bit tough topic because when you are building your REST API, you can basically monitor your HTTP load balancer.

Miłosz [31:31]: You can open the console in the browser and most of the time you will see some error when it happened.

Robert [31:39]: Yeah, or just check your load balancer: if you have 500s on your load balancer, you’ll see that something is wrong and you need to do something about it. With an asynchronous approach, it’s not that easy, because you should probably monitor how long the queue is and what the age of the oldest message is. It’s also not simple, because it depends on the queue, topic, or subscription. Sometimes having an event sitting in the queue for five minutes is fine. But sometimes, if it’s delayed by 30 seconds, it may be a problem. Sometimes having 10 messages spinning is fine, sometimes it may be a critical issue. So I’m not saying that it’s not possible to implement, but it requires more thinking, and it requires more of what you said at the beginning: you need to spend a lot of time on designing and understanding. Okay, maybe not a lot of time, but you need to have it in the back of your head.
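
The per-topic nature of these checks can be sketched as a small rule: each topic gets its own limits for backlog size and for the age of the oldest message. The types and the function below (`TopicStats`, `Thresholds`, `shouldAlert`) are hypothetical, and the numbers are only examples in the spirit of the ones mentioned above; where the stats come from depends on the broker.

```go
package main

import (
	"fmt"
	"time"
)

// TopicStats is a hypothetical snapshot of one topic or subscription.
type TopicStats struct {
	Pending       int           // messages waiting to be processed
	OldestMessage time.Duration // age of the oldest unprocessed message
}

// Thresholds holds the limits that are acceptable for one specific topic.
type Thresholds struct {
	MaxPending int
	MaxAge     time.Duration
}

// shouldAlert reports whether the topic exceeds either of its own limits.
func shouldAlert(s TopicStats, t Thresholds) bool {
	return s.Pending > t.MaxPending || s.OldestMessage > t.MaxAge
}

func main() {
	// The same backlog can be fine for one topic and critical for another.
	stats := TopicStats{Pending: 10, OldestMessage: 2 * time.Minute}

	critical := Thresholds{MaxPending: 0, MaxAge: 30 * time.Second}
	relaxed := Thresholds{MaxPending: 100, MaxAge: 5 * time.Minute}

	fmt.Println("critical topic alert:", shouldAlert(stats, critical))
	fmt.Println("relaxed topic alert:", shouldAlert(stats, relaxed))
}
```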

Miłosz [32:49]: Have some tooling in place, like tracing at least, to see what’s going on, to understand where the issue is.

Robert [32:57]: Yeah, and some metrics. I know that not every team has them, but in this kind of architecture, you should. And I know that not all message brokers expose metrics about, for example, the oldest message or how many messages are spinning. So sometimes, unfortunately, you need to write it by hand and export it by hand. So, obviously, it’s some cost. On the other hand, it’s not an unsolved problem. I mean, it’s a solved problem, and you can find a well-known solution for it. It’s not rocket science at the end of the day. But it’s important to keep it in mind, and to keep in mind that every operation is different.

Robert [33:45]: And there’s also one challenge that I think not a lot of people mention, because it may be a bit counterintuitive: the problem of decoupling. Why is it interesting? Because usually we’re talking about decoupling in the context of good things, but decoupling may also be challenging when it’s, let’s say, extreme decoupling. Like with events. Let’s imagine that you’re emitting some event, like “order was placed”, and you may actually have no idea who is listening to this event. When you have your API, you basically see that somebody is calling this API. You can probably have some tracing and use it to correlate where the calls come from. But if somebody is listening to your event and the service doesn’t have good observability, you have no way to understand who is doing that. In the worst case, you don’t even know if anybody is doing that. And it’s starting to be problematic when you need to change something. We’ll get into that, but it may be challenging.

Robert [34:46]: Especially if you have a bigger system with more teams, it may take months to deprecate some events.

Miłosz [34:53]: Yeah, you want to get rid of some code that produces some old events, but you don’t know if a team is using them. In larger organizations, that can be challenging.

Robert [35:05]: So yeah, in other words, decoupling is not always good. But I think it’s also similar like with Don’t Repeat Yourself. Often, don’t repeat yourself may be a good idea, but not always. And it’s a similar situation, because don’t repeat yourself. It’s trying to kill the coupling, but also, if you go into extreme, it may be problematic. It’s, I think, interesting how you can apply some patterns and anti-patterns from level of code to level of architecture in some way.

Miłosz [35:41]: Yeah, because it may promote teams that are very focused on what they do. They just publish some events and then they don’t care what anyone is doing with those events. Yeah, it’s harder to see than with API calls. I think that there are some tools for this. I’m not sure if we used any of them, but it probably also depends on the approach you take, because there’s no one standard for how to define events. I mean, there are some, but it probably depends on the company, the message broker, and the technology used.

Robert [36:24]: Yeah, you reminded me about a case that is pretty interesting. I remember that in one place we had an event that was the integration between two teams, but we had some data quality issues there. And what we ended up doing was getting rid of this event and instead having a synchronous endpoint that the other team needed to call. And we were able to basically do validation.

Miłosz [36:52]: Oh, that’s interesting.

Robert [36:53]: Yeah. And in this case, I would say that this extreme decoupling was bad because, again, there was a problem of responsibility: somebody could emit an event that was totally not valid, and it was hard to enforce, because if this event was emitted, we were noticing only after a while in our systems that we were starting to receive some invalid events. And it was like, oh, we don’t care.

Miłosz [37:23]: We did our job, we published the event. Yeah, that’s an interesting thing. It’s like on the boundaries of a team, when you integrate two teams, two services that belong to different teams. And the API becomes super important. And yeah, maybe sometimes synchronous is better in this case. Yep, that’s a good take.

Miłosz [37:50]: Let’s move to common anti-patterns.

Robert [37:54]: So you say that it’s not perfect and it has some pitfalls and something can go wrong with asynchronous architectures.

Miłosz [38:03]: Looks like there’s no silver bullet. So I would start with two naive takes on background processing. I’ve heard of systems, or seen some, where people use something like a cron job to process the background tasks. It’s similar to a task queue, but more like in-house. And again, similar to a task queue, it can probably work well for some cases, except you reinvent the wheel in this case, so it’s a bit worse, I would say. And the main issue is that if you have any scenario more complex than a simple “please do this later”, then it breaks. It won’t help you much. You don’t get all those guarantees of PubSubs and so on, and…

Robert [39:05]: It’s also often processed at some magic hour, so it’s basically creating a delay if you are doing batch processing with your cron job.

Miłosz [39:15]: Yeah.

Robert [39:17]: But I would even say that it’s still better than database triggers because, I mean, it’s choosing between two evils, but at least I would say that it’s probably a bit easier to maintain.

Miłosz [39:32]: Yeah, logic in the database is a controversial topic. I wouldn’t want to work with a system that does it. And of course there is probably this one specific case where it makes sense. I’m sure there is, but in general it’s tough to update and manage this.

Robert [39:52]: And test and debug.

Miłosz [39:55]: Yeah. If I had to use a cron job like this, I would just do it in the code. For example, in Go it’s trivial to spawn a new goroutine. So that’s a perfect scenario. Have a loop, put a sleep in it, and do something every, I don’t know, minute or something.

Robert [40:14]: It can be a good starting point that you can migrate a bit later to a more proper solution.

Miłosz [40:21]: My main issue with this is that you reinvent something that has been solved. And many people put years of work into figuring out edge cases and everything. So maybe it’s better to do something that exists.

Robert [40:38]: Especially since, for example, with Watermill in Go, you can pretty easily use Postgres or MySQL as a PubSub. So basically, you don’t need to add extra infrastructure there. And you can start with something simple and, without many changes, change it to some real PubSub later. I know that it’s not available in all languages, but in general, it may also be a nice starting point, and it’s just less to migrate later when it starts to be problematic.

Miłosz [41:10]: There’s one worse kind of cron I’ve heard about: you write a cron to check for inconsistencies in the system.

Robert [41:22]: Oh, no.

Miłosz [41:24]: So let’s take this example of calling this marketing API after a user signs up. You realize this can sometimes fail, and you write a cron job that does a check between the database and the marketing API to see if anything is missing, and…

Robert [41:46]: Basically you need to implement everything twice.

Miłosz [41:49]: Yeah, and my worst worry here is that you kind of promote ignoring issues. Whatever issue like this you have in the future, a challenge like that, you will just write another consistency check. And this is more code you have to maintain. And basically you just sweep the issue under the rug instead of fixing the root cause. So yeah, I think that’s too naive a take to work in a large-scale system.

Robert [42:30]: I can recall one case when it makes sense, but I would say it’s rather an exception. When you are working with some, let’s say, critical systems, you are quite sure that everything is fine, but you would like to do some double-check.

Miłosz [42:51]: Consistency checks in general, I wouldn’t say they are a bad thing. But if you are doing them because you anticipate inconsistencies to happen, when you could design a better API instead, that’s the issue. I would create a consistency check in places where I’m sure they won’t happen, but I want to be extra careful anyway because it’s so critical.

Robert [43:19]: I think the big difference is whether you are implementing some compensating operation in this check. For example, if you are working in some financial system, it’s typical to do some reconciliation reports at some intervals, but it’s making reports. It’s not doing the compensating operations. I mean, your logic should be valid and you should just do validation at the end to make sure. But again, it’s just for critical systems where you need to be sure. For example, in your company or whatever, if $100 came in, $100 should come out at the end. But again, it’s not about doing compensation. It’s all about double-checking the logic that you have.
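
To make the report-versus-compensation distinction concrete, here is a hedged sketch of a report-only reconciliation. The `Transfer` and `Mismatch` types and the `Reconcile` function are hypothetical names invented for this illustration, not something from the episode or from any library. The function only lists accounts where incoming and outgoing totals differ; it deliberately changes nothing.

```go
package main

import (
	"fmt"
	"sort"
)

// Transfer is a hypothetical ledger record in cents.
type Transfer struct {
	AccountID   string
	AmountCents int64
}

// Mismatch is one line of the reconciliation report.
type Mismatch struct {
	AccountID string
	InCents   int64
	OutCents  int64
}

// Reconcile sums both sides per account and reports the accounts where
// the totals differ. It performs no compensating writes.
func Reconcile(incoming, outgoing []Transfer) []Mismatch {
	in := map[string]int64{}
	out := map[string]int64{}
	seen := map[string]bool{}
	var ids []string

	for _, t := range incoming {
		in[t.AccountID] += t.AmountCents
		if !seen[t.AccountID] {
			seen[t.AccountID] = true
			ids = append(ids, t.AccountID)
		}
	}
	for _, t := range outgoing {
		out[t.AccountID] += t.AmountCents
		if !seen[t.AccountID] {
			seen[t.AccountID] = true
			ids = append(ids, t.AccountID)
		}
	}
	sort.Strings(ids) // stable report order

	var report []Mismatch
	for _, id := range ids {
		if in[id] != out[id] {
			report = append(report, Mismatch{AccountID: id, InCents: in[id], OutCents: out[id]})
		}
	}
	return report
}

func main() {
	incoming := []Transfer{{AccountID: "a", AmountCents: 10000}, {AccountID: "b", AmountCents: 5000}}
	outgoing := []Transfer{{AccountID: "a", AmountCents: 10000}, {AccountID: "b", AmountCents: 4000}}

	for _, m := range Reconcile(incoming, outgoing) {
		fmt.Printf("account %s: in=%d out=%d\n", m.AccountID, m.InCents, m.OutCents)
	}
}
```

A non-empty report is a signal to investigate the business logic, not a trigger for automatic repair.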

Miłosz [44:06]: Right. Especially if you have a few of those in the most critical areas, that should be fine. But if in every API call you need to do this check and compensating logic, then it won’t be maintainable. After a while you will go crazy just trying to cover all these cases, and then any API change or whatever will also break this, so you have more code to maintain. Again, I would fix the root cause if you know it can happen.

Robert [44:42]: Agreed. Thank you.

Miłosz [44:44]: And one more regarding this naive approach is, you know, when you use messaging and use kind of events, like you said, those passive-aggressive events, or just something that’s more like technical, not in the domain or business sense.

Miłosz [45:03]: It can seem like smart design, but that’s not always the case. It doesn’t automatically give you good results. I worked once with a system where there were two applications, an old one and a new one, and they shared some entities. And there was a system that, for every change in the entity, published an event that it changed; the other system consumed that and updated the entity, and the other way around as well. So, you know, with a naive setup like this, you ignore many edge cases that can happen. There are some invariants on the entities, they have some rules, but if you just do a naive update like this, even if it’s asynchronous, it doesn’t give you anything, and it can be even worse, because there can be some inconsistencies and you won’t even know. Because maybe some message arrives out of order, or is overwritten by the other one.

Miłosz [46:16]: So this comes back again to the event design topic, right? This needs to be a proper design. You need to spend some time and understand what you are doing.

Robert [46:30]: And I think it’s probably time to give a big shout-out to Event Storming. It was usually pretty useful for designing event-driven systems. But not only that, because the name may suggest that Event Storming is for designing event-driven or async architectures, and it’s not. I mean, Event Storming is good for designing basically any kind of system, and it’s pretty useful that you have a notation that can be mapped to your software really directly, so I recommend checking out Event Storming. If you’re looking for some materials on how to do that, there is an e-book that has probably been unfinished for the last 5 or 10 years. But no worries.

Miłosz [47:14]: It’s great anyway.

Robert [47:15]: I think 5 or 10 years earlier it was already very good. So Alberto, really good job.

Robert [47:29]: Another downside or trap related to asynchronous architectures is creating distributed monoliths. Because, like many useful techniques, they help with many things, but often they also open the gates of hell.

Miłosz [47:44]: It’s the new normal. Instead of monoliths, we have distributed monoliths.

Robert [47:51]: Let’s wait for distributed monorepos.

Miłosz [47:54]: We’ve seen it too many times.

Robert [47:57]: Yeah, but the usual scenario is like, okay, microservices are great, we need to be scalable, etc. And the team starts to build microservices, and you start to have more microservices than people, and at some point you notice that communication between them is hard, because if one microservice in the middle doesn’t work for a while because our Kubernetes cluster is not that stable yet, all our requests in the platform are failing because of cascading failures, and we need to do something about that. And unfortunately, reversing the decision to go into microservices is often not on the table, because come on.

Miłosz [48:41]: Too late for that.

Robert [48:42]: Yeah, we invested that much in a lot of stuff that we don’t need. So, often the solution is, well, let’s use event-driven architecture. And, well, it’s making all the problems that the team had earlier deeper and deeper, because debugging everything is harder, understanding everything is harder, everything works slower. Okay, the one upside may be that at least it’s kind of more stable, with the caveat that usually there is the thing that Gabriel mentioned about UX: if something is processing in the background, it’s a bit harder to show it in the UI. So compared to showing an error directly to the user like earlier, now something is happening and the user has no idea what is happening. And yeah, it’s a pretty problematic thing that we’ve seen many times. So it’s important that asynchronous architecture can solve, in this case, some problems. But again, don’t be afraid to do the thing that we like to recommend, and that we’ve done multiple times, and it’s de-microservicization. So basically taking your services and making fewer services. Again, having more microservices than people in a team is terrible.

Miłosz [50:04]: But the interesting part about this pattern or anti-pattern is that it comes from good intentions.

Miłosz [50:12]: It starts with this idea that we have a small team now, but we will grow and we need to be able to separate some code easily. And then you have those separate services and you figure out, okay, let’s now just connect them like we did with function calls before. And then it breaks, but it comes with this good intention of separating the concerns.

Robert [50:44]: Oh, I think it’s problematic with good intentions, because I think all the bad systems that we worked with were built on good intentions. Because, you know, it’s rarely the case that somebody had a bad intention. Like, let’s build the worst system that we can. Yay, hooray, it will be such a great idea. Usually, the intention is good. But I also understand that sometimes it’s hard to reverse this decision. On the other hand, it’s also sometimes leading to systems that are totally unmaintainable. And, well, I can even say that sometimes rewriting is the only way to redo something.

Miłosz [51:27]: It’s just counterintuitive, because you think that splitting things makes things decoupled. But it’s not really the case. It depends on how you communicate, really.

Robert [51:38]: Yeah, so basically it’s all about splitting things that are cohesive. If you are splitting things that are cohesive, you are just ending up with a lot of accidental complexity. And it’s pretty bad. I think it’s also partially the reason why some people see asynchronous or event-driven architecture as an anti-pattern and over-engineering. Because if you’ve seen many or a couple of projects using some patterns and you’ve always seen that they were overcomplicated, well, it’s logical that the next time somebody suggests “let’s do it asynchronously” or “let’s use event-driven architecture”, you will be pessimistic, because you’ve seen that it was always over-engineered. But again, it probably wasn’t a problem of event-driven architecture. It was probably a problem of applying it in the wrong project.

Robert [52:30]: It’s important to keep it in mind. So if you’ve seen some technique or tool in many places and it was always overcomplicated, maybe it’s not a problem of this tool. Maybe it’s a problem of the person using this tool, or maybe this person didn’t understand it properly. It’s a thing that we’ve seen in many places, and sometimes it requires a lot of effort to convince people that no, no, it’s not because of this tool, it’s because of applying it in the wrong place. But again, if you are not convincing somebody, it’s not our issue at the end.

Robert [53:09]: I mean…

Miłosz [53:12]: There’s also not really a strict definition of all those techniques. And depending on the language you use and other technology, it will be a bit different. That’s why it’s hard to judge. Wrapping up distributed monoliths, the coupling is the issue here, and… Asynchronous is not really decoupled by default, because events are also a form of coupling, as we mentioned before, right? So… On the one hand, you publish an event and you don’t care what happens later. On the other hand, it’s also a kind of contract between you and the consumers. Let’s say you have this distributed monolith and you now replace the RPC calls with messages. Is it decoupled? No, not really. It’s the same issue, just with a different transport. So it’s more about the design of communication, not really the async or sync here.

Robert [54:27]: Because, yeah, basically you can implement RPC over a queue and it will be basically the same. I mean, it doesn’t change much compared to calling something directly over HTTP, if you are expecting to have a response from your operation.

Miłosz [54:44]: Yeah, I think Gabriel mentioned this before, that there is one interesting pattern used by NATS. The call seems to be synchronous, but it’s asynchronous under the hood. I’m not sure about this exact one in NATS.

Robert [55:01]: But yeah, I guess that it may be about something that Celery supports. So basically, it’s doing RPC over a queue.

Miłosz [55:11]: Yeah, this is a weird hybrid approach where you can do synchronous calls over a message broker. It kind of works.

Robert [55:20]: It has some use cases, obviously, but it’s not really asynchronous. We also have it in Watermill, so we have the Request-Reply component. But again, it’s not really fully asynchronous. It’s nice because you can have some response to know if something was successful, and if you are no longer listening, it will be processed anyway. Not like with a request, because if you have, let’s say, an originating request that is calling everything later, and this request is cancelled, then if you are handling context cancellation properly, everything else should be killed, basically. So it should be kind of rolled back in all requests. But if you are doing RPC over a queue, it will be done everywhere anyway. There are some cases when it makes sense, but…

Miłosz [56:07]: Maybe one promise here is that you use one platform for all communication. So it sounds good on paper. We don’t need RPC because you can do everything over messages. Maybe it’s a good idea. I don’t know.

Robert [56:22]: But it’s tricky. Let’s say you have a request and you would like to, for example, insert something in the database, and your request was cancelled. This cancellation is not propagated to the queue, so it can lead to some strange inconsistencies. Because you killed the request, you expect that it should not be written, but it’s somewhere later, going through the queue, so it will be done anyway. And it can lead to some strange state where you don’t know what really happened.

Miłosz [56:54]: Yeah, I would summarize it as: messages behave differently from requests most of the time. So, if you use one as the other, it can work, but you can have some weird edge cases down the line, because it’s just a different method of communication. So I would be careful with this. Okay, and any more anti-patterns?

Robert [57:24]: So I have two connected anti-patterns here. The first one is using no message ordering. It’s maybe not an anti-pattern, but it’s something to watch out for. Because in many systems, you need to have some way of ordering your messages. So let’s imagine that you have events for “subscriber subscribed” and “subscriber unsubscribed”. And the problem is, if you don’t order them properly and build some read model based on them, and they are received out of order, you may unsubscribe a person that was not subscribed yet, if the unsubscribed event arrives first, because it can if it’s not ordered, and later you will have a subscribed event that should be…

Miłosz [58:13]: Yeah, that’s one of the things you learn when you first start using messages. Those weird behaviors in things that seem simple at first.

Robert [58:24]: Yeah, and the simple solution may be, let’s just order every message and have everything ordered. And yeah, it sounds like a pretty good idea, and sometimes it may work. But it also has at least one big downside. I would say two big downsides. The first one is performance. Basically, the performance of processing those messages will be limited to how fast you are able to process them, because they will be processed one by one. And if operations are slower, messages will just start to pile up and you will not be able to process them.

Miłosz [59:01]: Or block all processing of all messages.

Robert [59:05]: And the second thing, which I believe is worse, is that if any of the messages spins, it will block all other messages. And it’s pretty bad because it’s killing all the advantages of asynchronous architecture in terms of stability.

Miłosz [59:26]: Yeah, on one hand it can help you not care about ordering, which seems like a nice idea. But if you make it too wide, then it can also be an issue. There are a lot of things to consider here, so it’s hard to give someone a solution, right? So much depends on the domain and the system design. Sometimes you can get away with no ordering, but you can use something like a version. I remember one system where we didn’t have ordering at all, because all events updated different parts of a read model.

Robert [1:00:12]: And I think it was also the case that we were using Google Cloud Pub/Sub and it didn’t support ordering at that point.

Miłosz [1:00:20]: Yeah, we could kind of get away with it, but I remember we actually used timestamps for solving some conflicts, so it’s also not ideal. Depending on timestamps is controversial.

Robert [1:00:33]: It’s hard to synchronize. I think in some places where it was possible, we’ve been using versions. So basically every event within an aggregate, let’s say within the transactional boundary where we need to keep ordering, had an incremental version. So if we received an event that was out of order, we were rejecting it, or in some cases we saw that, okay, we received an older price, for example, and we already stored the newer one, so we can just ignore the older one. But the downside is that you need to support it in a custom way for everything.
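
A minimal sketch of that version check, assuming a hypothetical `PriceChanged` event and an in-memory read model (all names here are invented for illustration). Each product acts as the ordering boundary: an event is applied only if its version is newer than what is already stored, so a late-arriving older price is ignored.

```go
package main

import "fmt"

// PriceChanged is a hypothetical event with an incremental version
// that is scoped to a single product.
type PriceChanged struct {
	ProductID string
	Version   int
	Cents     int64
}

type priceRow struct {
	version int
	cents   int64
}

// PriceReadModel keeps the latest known price per product.
type PriceReadModel struct {
	rows map[string]priceRow
}

func NewPriceReadModel() *PriceReadModel {
	return &PriceReadModel{rows: map[string]priceRow{}}
}

// Apply stores the event only if it is newer than the stored version.
// It reports whether the event was applied.
func (m *PriceReadModel) Apply(e PriceChanged) bool {
	if cur, ok := m.rows[e.ProductID]; ok && e.Version <= cur.version {
		return false // stale or duplicate event, safe to ignore
	}
	m.rows[e.ProductID] = priceRow{version: e.Version, cents: e.Cents}
	return true
}

// Price returns the current price and whether the product is known.
func (m *PriceReadModel) Price(productID string) (int64, bool) {
	row, ok := m.rows[productID]
	return row.cents, ok
}

func main() {
	m := NewPriceReadModel()

	// Version 2 arrives before version 1.
	fmt.Println("applied v2:", m.Apply(PriceChanged{ProductID: "p-1", Version: 2, Cents: 1200}))
	fmt.Println("applied v1:", m.Apply(PriceChanged{ProductID: "p-1", Version: 1, Cents: 1000}))

	price, _ := m.Price("p-1")
	fmt.Println("stored price:", price)
}
```

Ignoring stale events works for "last value wins" data such as a price. Events that each carry a required state change would need rejection and redelivery instead, which is the custom per-case handling mentioned above.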

Miłosz [1:01:17]: Not super complex, but something to consider.

Robert [1:01:20]: But it’s nice to have it out of the box. For example, Google Cloud Pub/Sub has ordering keys. Kafka has partitioning. So there are message brokers that just have it out of the box. And you don’t need to think about that later, basically.

Miłosz [1:01:33]: Maybe we can talk a bit more about this in the next episode.

Robert [1:01:39]: Yeah, I think ordering is probably something that could be an entire episode, because it’s pretty… Maybe not complex, but it has many different use cases that are good to know about. Because in many cases, you should approach it differently. Often, you don’t need to care that much. You can choose something simple and it’s fine. But it depends a lot on scalability requirements and, let’s say, resilience.

Miłosz [1:02:10]: The balance is difficult. It’s easy to do no ordering or ordering of every message. For something in the middle, you have to consider many factors.

Robert [1:02:20]: Yeah, but TL;DR, I would say: try to do ordering as narrow as you can. So, for example, try to order some subset of events for one customer, maybe. You need to basically analyze where this ordering matters and, based on that, do really narrow ordering, and it will work, basically.
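
One way to picture narrow ordering, as a sketch with in-memory channels rather than any real broker: messages are routed to a partition by hashing an ordering key (here a hypothetical customer ID), and each partition has exactly one worker. Events for the same key stay in order, while different keys can be processed in parallel. `Message`, `partitionFor`, and `ProcessPartitioned` are names made up for this example, and `partitions` is assumed to be greater than zero.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Message carries an ordering key, e.g. a customer ID.
type Message struct {
	OrderingKey string
	Payload     string
}

// partitionFor maps a key to a partition, so the same key always
// lands on the same worker.
func partitionFor(key string, partitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitions))
}

// ProcessPartitioned runs one worker per partition and returns the
// payloads handled for each key, in the order they were handled.
func ProcessPartitioned(msgs []Message, partitions int) map[string][]string {
	queues := make([]chan Message, partitions)
	for i := range queues {
		queues[i] = make(chan Message, len(msgs))
	}

	var mu sync.Mutex
	handled := map[string][]string{}

	var wg sync.WaitGroup
	for _, q := range queues {
		wg.Add(1)
		go func(q <-chan Message) {
			defer wg.Done()
			for m := range q { // one worker per partition: sequential
				mu.Lock()
				handled[m.OrderingKey] = append(handled[m.OrderingKey], m.Payload)
				mu.Unlock()
			}
		}(q)
	}

	for _, m := range msgs {
		queues[partitionFor(m.OrderingKey, partitions)] <- m
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	return handled
}

func main() {
	msgs := []Message{
		{OrderingKey: "alice", Payload: "subscribed"},
		{OrderingKey: "bob", Payload: "subscribed"},
		{OrderingKey: "alice", Payload: "unsubscribed"},
	}
	handled := ProcessPartitioned(msgs, 4)
	fmt.Println("alice:", handled["alice"])
	fmt.Println("bob:", handled["bob"])
}
```

A stuck message in this model blocks only the keys that share its partition, not the whole stream, which is the stability argument for keeping the ordering scope small.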

Miłosz [1:02:48]: There’s one thing that’s similar between async and sync. I think it’s API design. So with the synchronous approach, you have some kind of contract for how your API works. If it’s a REST API, you can have some OpenAPI document; with gRPC, probably a Protobuf schema.

Robert [1:03:14]: Yes, I see that Gabriel asked about that. What we are usually using is Protobuf, because you can marshal it later to at least two formats. Usually we are using JSON as a transport format, because it’s easier to debug. But if you care about performance and the size of the message, you can use the Protobuf binary format, so it will be a bit smaller and faster. You can also generate code for multiple programming languages, and versioning of fields is nicely supported.

Miłosz [1:03:51]: There’s the CloudEvents schema, I think. We didn’t use that yet. Protobuf usually does the job. The point is you need to follow similar ideas for both sync and async here. So your event schema or your API schema is your contract. You have to make sure there are no breaking changes and that the API is stable. And probably do some versioning if you need to make breaking changes. We sometimes have events with a version in the name, like v1, just to anticipate it. Not always needed, but sometimes useful.
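
As an illustration of versioned event names, here is a sketch that uses plain JSON from the standard library instead of Protobuf, purely to stay self-contained. The envelope shape, the `OrderPlaced_v1` and `OrderPlaced_v2` names, and the `Decode` function are assumptions made for this example, not the hosts' actual schema. The consumer picks the struct based on the versioned name, so a breaking change ships as a new version while old messages remain readable.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical wrapper: the name carries the version.
type Envelope struct {
	Name    string          `json:"name"`
	Payload json.RawMessage `json:"payload"`
}

// OrderPlacedV1 is the original contract.
type OrderPlacedV1 struct {
	OrderID    string `json:"order_id"`
	TotalCents int64  `json:"total_cents"`
}

// OrderPlacedV2 is a breaking change: the total became a structured value.
type OrderPlacedV2 struct {
	OrderID string `json:"order_id"`
	Total   struct {
		Cents    int64  `json:"cents"`
		Currency string `json:"currency"`
	} `json:"total"`
}

// Decode picks the payload type from the versioned event name.
func Decode(raw []byte) (any, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	switch env.Name {
	case "OrderPlaced_v1":
		var e OrderPlacedV1
		if err := json.Unmarshal(env.Payload, &e); err != nil {
			return nil, err
		}
		return e, nil
	case "OrderPlaced_v2":
		var e OrderPlacedV2
		if err := json.Unmarshal(env.Payload, &e); err != nil {
			return nil, err
		}
		return e, nil
	default:
		return nil, fmt.Errorf("unknown event %q", env.Name)
	}
}

func main() {
	raw := []byte(`{"name":"OrderPlaced_v1","payload":{"order_id":"o-1","total_cents":1500}}`)
	event, err := Decode(raw)
	fmt.Printf("%+v %v\n", event, err)
}
```

Non-breaking additions, such as a new optional field, would not need a new version in this scheme; only changes that old consumers cannot read do.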

Robert [1:04:44]: Yeah, and it’s also worth mentioning that many message brokers support some kind of schema registry out of the box. So those message brokers can validate the schema for you. For example, Kafka has it. I think Google Cloud Pub/Sub has it. I think it’s also worth mentioning that some companies are using Avro for that. We didn’t use it, but I know that it’s often used in some companies, I guess more in the Java world. I think in the Go world, more gRPC is used, at least from our… Protobuf, yeah. Protobuf is more used in our bubble.

Miłosz [1:05:29]: It’s pretty nice. And as a plus, generated code is… It’s quite a popular choice and it works pretty well with Protobuf as well.

Miłosz [1:05:49]: Shall we move to the final question? What to choose?

Robert [1:05:54]: Unfortunately, it’s as you heard. There are really many factors to consider. Yes.

Miłosz [1:06:05]: And the balance is the hardest part, as always. There is no silver bullet, so you have to figure out what balance means for you. It’s a bit like the CAP theorem in databases, right? So you have consistency versus availability, and you have to pick one.

Robert [1:06:26]: Yeah, so in async system you basically are sacrificing consistency, but you have better availability.

Miłosz [1:06:36]: You have eventual consistency, but that’s it.

Robert [1:06:40]: But it’s eventual. Cool.

Miłosz [1:06:43]: So what should we choose, do you think?

Robert [1:06:48]: I think I can give you some simple algorithm, maybe. If you’re in doubt, you can always start with synchronous. I would say that in many cases, it will be a good starting point. In many cases, using async by default is maybe not over-engineering, but it may not be needed at the beginning. And if you have the infrastructure already in place, switching something from sync to async may not be that big a difference, as long as you didn’t do any strange thing. And it’s especially simple if you are using Clean Architecture, because you should have some application layer entry point to your application. So calling it from RPC or an event handler or something can be a pretty simple change. Of course, it depends a lot on the use case, but again, this is if you’re in doubt. In some cases, you may just be sure: you’re calling some external system and you know that it’s not stable, do it asynchronously. You have some critical flow that you know needs to be working, and you need to send some external data, do it asynchronously. Again, I’m talking more about cases where you are not sure.

Miłosz [1:08:10]: Sometimes it’s a product decision as well, whether something can happen in the background. And sometimes it’s tricky, because “in the background” can mean a second later or, you know, 10 seconds later. So it also takes some skill to be able to talk to whoever your stakeholder is, whoever is responsible for the product, to understand if it’s fine to do it in an asynchronous way. Which means most often it will show up in the UI right after, but sometimes only after a few minutes, maybe. That’s also counter-intuitive. Why would you talk with your product owner about sync versus async architecture? Sometimes it makes sense.

Miłosz [1:09:07]: Probably it’s good to watch out for the extremes, as always, right? So, not everything can be handled by synchronous architecture, as you mentioned. There are some hard limits, where more complex systems just won’t be able to do it. But as you said, sometimes doing everything over messages can also be over-engineering. So if you ever consider doing everything one way, it’s usually a clear sign that you probably need some balance.

Robert [1:09:45]: And if you maybe didn’t work in any system that required building asynchronous architecture, it’s fine. But anyway, I would recommend checking out how you can do that, because it’s good to just know what tools you have available. So if you have a problem that requires an asynchronous architecture approach, you’ll just know, okay, I have a tool for that, it fits here, and I can use it. So you will not need to learn it when you have this problem, because it might be just too late.

Miłosz [1:10:17]: Sometimes a hybrid approach may be fine. One is what we mentioned before, the RPC over messages. A bit unusual, but maybe it fits some use cases. Something else that comes to my mind is CQRS. So, TL;DR, it’s a pattern where you split your controller functions, let’s say, into commands that change the system and queries that fetch the data. Super short definition. Not to dive too deep into it, but the interesting thing about it is that it gives you this extra structure in your code. Commands and queries are separate. And then it can make mixing them and how you call them a bit more intuitive. Because queries are usually synchronous. Someone is waiting for the data, that makes sense. And commands can be either. So it can give you this mental framework to decide how to mix sync and async in one project.

Robert [1:11:33]: I think the CQRS example is interesting, because it’s a good example of a technique that people really misunderstood. For many people, if they hear CQRS, it’s immediately asynchronous, but no. It’s a total misunderstanding of this pattern. Again, we’ll add our article about CQRS to the materials, because we’ve been covering that. Miłosz also did a really nice presentation about that, so I think we can also put it in the materials. It’s important to mention that CQRS is not about doing things asynchronously. That’s one implementation, and it works nicely, but you can do CQRS by just doing everything synchronously. And it’s nice because if you are using CQRS in a proper way, so when commands are not really returning stuff, you can convert it to asynchronous architecture really easily, because it should not really depend that much on having the feedback from the command.

Miłosz [1:12:40]: It helps if you don’t have these mixed functions that change the state and also return data. It can be more tricky to decide what to do with them.

Robert [1:12:49]: And it also doesn’t require some super big overhead, because this is also something that we heard from many people, like, oh, CQRS requires so much effort. But no, it’s just a matter of, instead of having some super big application services with everything, having multiple little services with commands instead of arguments, and commands that don’t return stuff. And that’s it. And you have the basic CQRS implementation in your product. You can put it into your CV.

Miłosz [1:13:24]: Um…

Robert [1:13:27]: So yeah, I already mentioned the migration. CQRS helps with that. But you can also take some hybrid approach, because often at the beginning of the project some endpoint can be synchronous, but with time, more and more stuff is added to it.

Robert [1:13:45]: We actually have a good example in our training platform. We have webhooks from our payment provider with information that somebody bought our training. At the beginning it was pretty simple, just adding, ah, somebody has the training. But now it has a lot of stuff, and we actually need to convert it to asynchronous, to using events. Now it’s probably doing five things, and sometimes some of those things are failing because it’s just doing a lot. But, again, at the beginning, it didn’t make sense to do it asynchronously because it was pretty lightweight, and with time, you can convert that. You can even start with still having a synchronous endpoint that emits some internal event or command, and just cut it part by part. If it’s some internal logic, you can, at the end, maybe just remove this endpoint and tell the team that is calling this endpoint, you can now just emit an event. Or the opposite. So, for example, if you have a problem with data quality and the other team is not taking responsibility for that, introduce an endpoint instead of an event and say, if we don’t respond that this is valid, sorry, you need to provide valid data, and just agree what data should be in this endpoint to make everybody happy.

Miłosz [1:15:04]: And speaking of code structure, maybe one more thing that can help is layered architecture, or Clean Architecture, or Hexagonal, or whatever you want to call it. If you place the logic in some isolated place in your code, and then have two entry points that are either messages or RPC or HTTP, it’s super easy to switch between how you call the handler, because you have just two entry points that do nothing else, they just map the request to your internal structure. And it makes it super easy to migrate as well.

Robert [1:15:53]: And it’s happening, I would say. It’s an often-heard argument about clean architecture: okay, in clean architecture you’re separating the database so you can switch it, or whatever, and that doesn’t happen often. But changing things between synchronous and asynchronous actually can happen quite often.

Miłosz [1:16:11]: Or just both at the same time. You can have some logic that can be executed synchronously or asynchronously by message. And if you use the layered approach, it’s super easy to do.
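
The layered approach described above can be sketched roughly as follows: one piece of isolated logic and two thin entry points that only map input to the internal structure. The `GrantAccess` command, `AccessService`, and both entry-point functions are hypothetical names chosen for this example, and the message entry point is a plain function rather than any particular broker's handler signature.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// GrantAccess is the internal structure both entry points map to.
type GrantAccess struct {
	UserID     string `json:"user_id"`
	TrainingID string `json:"training_id"`
}

// AccessService holds the isolated logic. It knows nothing about HTTP or messages.
type AccessService struct {
	mu      sync.Mutex
	granted map[string]bool
}

func NewAccessService() *AccessService {
	return &AccessService{granted: map[string]bool{}}
}

// Grant validates the command and records the access.
func (s *AccessService) Grant(cmd GrantAccess) error {
	if cmd.UserID == "" || cmd.TrainingID == "" {
		return errors.New("user_id and training_id are required")
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.granted[cmd.UserID+"/"+cmd.TrainingID] = true
	return nil
}

// HasAccess reports whether access was granted earlier.
func (s *AccessService) HasAccess(userID, trainingID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.granted[userID+"/"+trainingID]
}

// HTTPEntryPoint is the synchronous entry point: decode, call the logic, respond.
func HTTPEntryPoint(svc *AccessService) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var cmd GrantAccess
		if err := json.NewDecoder(r.Body).Decode(&cmd); err != nil {
			http.Error(w, "invalid JSON", http.StatusBadRequest)
			return
		}
		if err := svc.Grant(cmd); err != nil {
			http.Error(w, err.Error(), http.StatusUnprocessableEntity)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// MessageEntryPoint is the asynchronous entry point: decode, call the same logic.
// Returning an error is assumed to tell the broker to redeliver the message.
func MessageEntryPoint(svc *AccessService) func(payload []byte) error {
	return func(payload []byte) error {
		var cmd GrantAccess
		if err := json.Unmarshal(payload, &cmd); err != nil {
			return fmt.Errorf("decode message: %w", err)
		}
		return svc.Grant(cmd)
	}
}

func main() {
	svc := NewAccessService()

	// Synchronous path: register the HTTP entry point (server start omitted).
	mux := http.NewServeMux()
	mux.Handle("/access", HTTPEntryPoint(svc))

	// Asynchronous path: a broker subscriber would call this for each message.
	handle := MessageEntryPoint(svc)
	if err := handle([]byte(`{"user_id":"u1","training_id":"go-event-driven"}`)); err != nil {
		fmt.Println("message failed:", err)
	}
	fmt.Println("has access:", svc.HasAccess("u1", "go-event-driven"))
}
```

Because `Grant` is the only place with business rules, switching a caller from HTTP to messages (or supporting both at once) means adding or removing an entry point, not touching the logic.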

Robert [1:16:22]: Yeah, and it happens pretty often. The database, again, you are not changing very often, but changing things from synchronous to asynchronous or vice versa happens often; the first, I think, more often. Because, again, some operations just get bigger and bigger. You start to have more and more teams. Your product is growing. This is, hopefully, what’s happening. And in this case, it’s just making your job easier and saving you time. So it’s, I think, a great investment that is not mentioned that often.

Miłosz [1:16:56]: Yeah. Okay, maybe as a finishing point, we can talk about reducing the accidental complexity that comes with starting with messaging. Once you start, there are a lot of topics to grasp. We talked about some today, but it’s not really an episode about event-driven architecture, only the comparison of async and sync. So one idea might be to start with a Pub/Sub you already have, which is probably an SQL database. You probably have some Postgres or MySQL running around, and you can actually have a pretty simple Pub/Sub running on it. That should be good enough for most use cases. Or you can use a cloud-based Pub/Sub. It means you don’t need to maintain the infrastructure, which can also be a complex task sometimes, with high availability in place and all this.
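
To make the "Pub/Sub on top of a SQL database" idea more concrete, here is a rough model of one common shape for it: an append-only messages table with an increasing offset, plus a stored offset per consumer group. This is only a sketch under stated assumptions. The table layout is hypothetical, the tables are plain in-memory Go values rather than real SQL, and it is not a description of how Watermill's SQL implementation works. A real version would also need transactions and row locking for concurrent consumers.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrHandler is a hypothetical error used to simulate a failing handler.
var ErrHandler = errors.New("handler failed")

// Row mirrors one row of a hypothetical `messages` table:
// an auto-incrementing offset plus the payload.
type Row struct {
	Offset  int
	Payload string
}

// Table stands in for two SQL tables: `messages` (append-only) and
// `offsets` (last acknowledged offset per consumer group).
type Table struct {
	messages []Row
	offsets  map[string]int
}

func NewTable() *Table {
	return &Table{offsets: map[string]int{}}
}

// Publish is the equivalent of an INSERT into `messages`.
func (tb *Table) Publish(payload string) {
	tb.messages = append(tb.messages, Row{Offset: len(tb.messages) + 1, Payload: payload})
}

// Consume is the equivalent of selecting rows with an offset greater than the
// group's stored offset, in order, and updating the stored offset only after
// the handler succeeds. A failed message is retried on the next call, which
// gives at-least-once delivery.
func (tb *Table) Consume(group string, handle func(Row) error) error {
	for _, row := range tb.messages {
		if row.Offset <= tb.offsets[group] {
			continue // already acknowledged by this consumer group
		}
		if err := handle(row); err != nil {
			return fmt.Errorf("offset %d: %w", row.Offset, err)
		}
		tb.offsets[group] = row.Offset
	}
	return nil
}

func main() {
	tbl := NewTable()
	tbl.Publish("order-1")
	tbl.Publish("order-2")

	failOnce := true
	handler := func(r Row) error {
		if r.Payload == "order-2" && failOnce {
			failOnce = false
			return ErrHandler
		}
		fmt.Println("handled", r.Offset, r.Payload)
		return nil
	}
	fmt.Println("first pass:", tbl.Consume("emails", handler))
	fmt.Println("second pass:", tbl.Consume("emails", handler))
}
```

Each consumer group keeps its own offset, so a second group reading the same table receives every message independently of the first.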

Robert [1:18:10]: I think Postgres is a great starting point, and it can handle much more than you think. I actually have an idea for an article where maybe we can show how far we can push a Postgres-based Pub/Sub, because I’m pretty sure that we can push it pretty far, but let’s see.

Miłosz [1:18:29]: Yeah, unless you want some complex topology. For starting out, it’s a great way to start. Another way may be using a high-level library that abstracts away some of the concepts for you. So for example, Robert mentioned earlier Watermill, which we maintain in Go. I’m not sure about other languages. I think there’s something similar in Python. Probably in every other language there should be something like that. Because you probably don’t want to deal with the protocol of the infrastructure you use and all the low-level details, so you can focus on the high-level publishing and subscribing to messages.

Robert [1:19:21]: It’s pretty easy to mess something up there: lose a message or introduce some subtle bugs. We’ve been building Watermill for a couple of years already, and you’d be surprised how many little quirks you can find in every Pub/Sub. It’s very hard to debug in the end, because you may be losing just one message per one million messages, and that’s very problematic in systems. It’s good if somebody already did it for you. If, for example, you don’t use Go, well, we recommend using Go, but if you can’t use Go, you can always look at our SQL implementation and get inspired by our queries.

Miłosz [1:20:06]: Just use Go.

Robert [1:20:07]: It’s the best idea.

Miłosz [1:20:11]: It’s not that complex in the end. But there are also some quirks, so it’s better if you don’t reinvent the wheel. Because it took us some time and, I think, four major versions to arrive at where the SQL implementation is right now. So many edge cases.

Miłosz [1:20:33]: So, if you’d like to learn more about async architecture and event-driven architecture in particular, we have an event-driven training, with sales open for the next two weeks, starting from today. So you can learn those concepts in a more hands-on way, solving exercises.

Robert [1:21:00]: If you are interested in learning, though, you are actually pretty lucky, because we’re running this training twice a year. So if you are listening to this later, it may not be open anymore, but again, it will be open next year.

Miłosz [1:21:17]: If you listen to it later, you can join the waiting list and wait for the next sale. We run it as a cohort, so everyone learns together. And yeah, you can just Google Go event-driven or we will put a link in the notes.

Robert [1:21:32]: We have pretty good SEO, so it should be at the top of the list when you Google Go event-driven.

Miłosz [1:21:40]: Okay. Before we move to the Q&A.

Robert [1:21:45]: Yeah, definitely. Okay, so Dima mentioned that naming always matters. Yeah, so definitely it matters more often than you would think, but…

Miłosz [1:22:07]: Okay, so Gabriel shared more about events being facts that already happened in the system, which is what we talked about. It provides many advantages, as state can be recreated from events. Yeah, that’s one of the things we didn’t mention. It’s super useful: you can have this event log somewhere in your system and then replay those events to do whatever you want.
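
As a small illustration of recreating state from an event log, the sketch below replays the same list of facts into two different views. The event names (`AccessGranted`, `AccessRevoked`) and both projections are hypothetical and exist only for this example.

```go
package main

import "fmt"

// Hypothetical event types: facts that already happened in the system.
const (
	AccessGranted = "AccessGranted"
	AccessRevoked = "AccessRevoked"
)

// Event is one entry in the append-only event log.
type Event struct {
	Type       string
	UserID     string
	TrainingID string
}

// Replay folds the whole log, oldest first, into the set of users who
// currently have access to each training.
func Replay(log []Event) map[string]map[string]bool {
	state := map[string]map[string]bool{}
	for _, e := range log {
		switch e.Type {
		case AccessGranted:
			if state[e.TrainingID] == nil {
				state[e.TrainingID] = map[string]bool{}
			}
			state[e.TrainingID][e.UserID] = true
		case AccessRevoked:
			delete(state[e.TrainingID], e.UserID) // no-op if nothing was granted
		}
	}
	return state
}

// StudentsPerTraining is a second view built from the same log:
// the number of users with access to each training.
func StudentsPerTraining(log []Event) map[string]int {
	counts := map[string]int{}
	for training, users := range Replay(log) {
		counts[training] = len(users)
	}
	return counts
}

func main() {
	log := []Event{
		{Type: AccessGranted, UserID: "u1", TrainingID: "go-event-driven"},
		{Type: AccessGranted, UserID: "u2", TrainingID: "go-event-driven"},
		{Type: AccessRevoked, UserID: "u1", TrainingID: "go-event-driven"},
	}
	fmt.Println(Replay(log))
	fmt.Println(StudentsPerTraining(log))
}
```

Since the log is never modified, a new view can be added later and built by replaying the events that were already recorded.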

Robert [1:22:31]: And in other words, you have event sourcing, but event sourcing is not asynchronous. I mean, event sourcing can be asynchronous, but it can be synchronous as well. So I would say that it’s synchronous by default. And it’s probably a nice topic for another episode. We already mentioned financial systems in a couple of places, and event sourcing is actually great for this kind of system: when you need to have an audit by law, or you need an audit because you would like to understand how you ended up with some state.

Miłosz [1:23:10]: Gabriel also mentioned that a dependency matrix of events based on types could help an organization. I think I said something like, I wonder if there is ready-made software for this. I think Event Catalog or something like that, but we didn’t use anything that we could use across teams. I mean a place where you can just go and see which team uses what. It has to integrate with your code, I guess, ideally, if you don’t want to keep it up to date by hand over time, which can be a manual effort. So it sounds like a challenge to do it for any language you use. And if we find something like that, we’ll let you know.

Robert [1:24:15]: So Gabriel mentioned that there is a nice chapter about event storming in Learning Domain-Driven Design from O’Reilly. I didn’t read this one, I’m not sure if you had a chance.

Miłosz [1:24:25]: I have it on my shelf. Still waiting.

Robert [1:24:29]: TBD. Maybe this summer.

Miłosz [1:24:32]: But everything about event storming, I think you can recommend right away.

Robert [1:24:38]: Yep, definitely. I think event storming was one of the biggest game changers for working with some more complex projects and collaborating with some multiple stakeholders.

Miłosz [1:24:51]: Also not super complex to learn, but you have to be willing to talk to people. Can be controversial in tech teams, sadly. Ordering is probably not needed if you have a good state machine, which guarantees the state of aggregates. It sounds to me connected to event sourcing again.

Robert [1:25:17]: Or the thing that we have with versioning. That’s true, you can implement that. But from the other side, if ordering is done properly from the message broker perspective, it can also simplify a lot, because you can just listen to an event and not care about that. We’ve been able to compare migrating to that with relying on ordering, and ordering, at least in our case, really helped to simplify things a bit. I think it’s good because, if it’s done properly, it works out of the box. Well, kind of out of the box, let’s say. And it’s harder to make a mistake than when you need to remember about something. You can also implement a universal state machine that can handle that, but universal things that you implement yourself often may have some challenges.
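
One way to read the versioning alternative to broker-side ordering is sketched below: every event carries a version of the entity, and the consumer ignores anything that is not newer than what it already applied. This is an assumption about what "versioning" means here, not a description of the hosts' implementation, and the `StatusChanged` event and `ReadModel` type are hypothetical. It also only works when each event carries the full state it describes rather than a delta.

```go
package main

import "fmt"

// StatusChanged is a hypothetical event that carries the entity's version.
type StatusChanged struct {
	EntityID string
	Version  int
	Status   string
}

// ReadModel keeps the latest applied event per entity.
type ReadModel struct {
	current map[string]StatusChanged
}

func NewReadModel() *ReadModel {
	return &ReadModel{current: map[string]StatusChanged{}}
}

// Apply stores the event only if it is newer than the one already applied.
// It returns false for stale or duplicated events, which makes the consumer
// tolerant of out-of-order and repeated delivery.
func (m *ReadModel) Apply(e StatusChanged) bool {
	if last, ok := m.current[e.EntityID]; ok && e.Version <= last.Version {
		return false
	}
	m.current[e.EntityID] = e
	return true
}

// Status returns the latest known status, or an empty string if unknown.
func (m *ReadModel) Status(entityID string) string {
	return m.current[entityID].Status
}

func main() {
	m := NewReadModel()
	m.Apply(StatusChanged{EntityID: "order-1", Version: 1, Status: "pending"})
	m.Apply(StatusChanged{EntityID: "order-1", Version: 3, Status: "shipped"})
	// Version 2 arrives late and is ignored.
	applied := m.Apply(StatusChanged{EntityID: "order-1", Version: 2, Status: "paid"})
	fmt.Println(applied, m.Status("order-1"))
}
```

The trade-off matches the discussion: this check has to be remembered in every consumer, whereas ordering guaranteed by the broker removes the need for it.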

Miłosz [1:26:07]: But you can see how many approaches there are to ordering. It’s a pretty long topic. You have to consider many approaches. So thanks for mentioning this one.

Robert [1:26:21]: So Gabriel will have a monopoly on our chat today.

Miłosz [1:26:25]: Thank you for contributing. You are the third author today.

Robert [1:26:31]: I think at some point we should also try to share the link to join us. That would be cool. Risky but cool. Gabriel also recommended the Reactive Manifesto, which is a nice definition of reactive systems. I remember reading that probably a long, long time ago.

Miłosz [1:26:52]: Yeah, same.

Robert [1:26:53]: Recommend checking, yeah?

Miłosz [1:26:55]: I feel ignorant at the moment, sorry.

Robert [1:27:00]: In Scala, there is the Lagom framework, which was nice. Akka is under the hood. Maybe. We are not Scala guys.

Miłosz [1:27:12]: Yeah, so it’s good that there is similar software, so you don’t need to understand everything that’s going on, which can also be tricky, because sometimes you may not understand something correctly, like the configuration of some Pub/Sub. But to start out, it’s super important to have these high-level concepts in place. You just publish a message, subscribe to a message, and don’t care about what the bytes do underneath.

Robert [1:27:41]: And Gabriel also mentioned that it’s under the hood, so it would actually be interesting to see how the actor model shows up there. I mean, if it’s visible or not. Personally, I’m not a big fan of the actor model. I mean, I don’t like the mental model of it. Event-driven fits in my head much better, because you can go to event storming and it maps one-to-one in some way, while the actor model for me is a bit more like a magic mental mapping that you need to do. I don’t know what’s your opinion on that, but…

Miłosz [1:28:13]: Yeah, I didn’t use it much, except for playing a bit with Elixir, so I can’t say much. Maybe we can do another episode sometime about the comparison.

Robert [1:28:29]: Fortunately, Gabriel added: but I love Go much more.

Miłosz [1:28:34]: Way to go!

Robert [1:28:43]: Okay. Cool, so a couple more seconds for more questions.

Miłosz [1:28:54]: While we wait, it’s the time you should hit the subscribe button and rate our show.

Robert [1:29:02]: Let’s wait. Please click this button. You didn’t click yet, I see your screen. Okay, thank you. So, also, now a 5-star review. Let’s write something nice. We’ll wait here. We have time.

Miłosz [1:29:23]: Yeah, and instead of the YouTube subscribe, there’s an even better subscribe button on our website, where you can join our newsletter, so then you won’t miss the updates from us.

Robert [1:29:34]: Plus, you will also receive notifications about the episode notes. There is an entire transcript. We’re linking to all the sources that we’re mentioning here, like Watermill, like the ebook about event storming, and a couple more things that we mentioned and I don’t remember now, but it will all be in the materials for this episode. Plus, if we release any new articles on the blog, you’ll be notified. So I think it’s pretty cool, because I mentioned, for example, the article about SSE that we have on our blog. So we’re trying to share a lot of quality stuff that will be useful for you. Anything more, Miłosz?

Miłosz [1:30:19]: I think we have no more questions. Thank you, everyone.

Robert [1:30:22]: Just a reminder. So, Go Event-Driven, available for the next two weeks. Later, you need to wait half a year. So, don’t wait for the last hour, because…

Miłosz [1:30:34]: Better to join now.

Robert [1:30:35]: We are starting, and you will need to wait. Cool.

Miłosz [1:30:41]: Thank you, everyone, for joining us. Thank you, Robert, for joining me.

Robert [1:30:46]: Thank you, everyone. Thank you, Miłosz. And see you in two weeks. And we’ll talk about…

Miłosz [1:30:52]: Event-driven architecture more in-depth than today.

Robert [1:30:55]: Sounds like a good plan. So, see you in two weeks. Thank you very much.

Miłosz [1:30:58]: Bye-bye.

Robert [1:30:59]: Bye.

Last update:
  • May 28, 2025