The case for Axon Server over MongoDB for your event store
Often, when architects and engineers are investigating new patterns, frameworks, or technologies, they look for the quickest way to get up and running to demonstrate business value. So when one of these frameworks depends on a piece of infrastructure (database, middleware, etc.), there’s a tendency to keep the project moving by reaching for what’s familiar or what appears to be the “easy” path.
But making the quick decision to use what’s “easy” can, at times, lead to more complexity down the road.
In this post, we'll examine one such instance (before my time with AxonIQ) when I chose MongoDB as the event store for a project using Axon Framework, a choice that caused the team to encounter a series of challenges that could’ve been avoided.
Process
Initially, we chose MongoDB for the project because it was the database the team used in another service – Axon Server wasn't available then. The team became comfortable with MongoDB and stopped using RDBMS options for transactional services. We stuck with it, even after Axon Server's release, because it seemed “easier” than migrating to a new tool.
However, that proved to be a mistake. Although MongoDB provided great initial delivery speed, we ran into issue after issue, forcing us to spend far more time managing complexity than should have been necessary.
The Issues
The first issue we ran into revolved around the fact that MongoDB does not have the concept of a global incrementing index that can maintain the order of events. To get around this, the axon-mongo extension uses a timestamp stored in a string field. The timestamp is populated upon event creation in the application, not MongoDB.
This sounds like a nice “easy” solution to allow using MongoDB as an event store. But because the date is stored as a string, sorting on that value can cause events to be handled out of order (see the corresponding issue in the axon-mongo repo). This happened to us on several occasions, causing us to chase our tails trying to figure out what had gone wrong.
Only after finding that issue in the axon-mongo repo could we understand what was happening. And while this particular issue has been fixed, it’s not the end of the problems a timestamp can cause. There is still an issue, open since 2020, that describes the challenges of performing an event replay: the tracking token becomes too large due to the size of timestamps. If we’d made the move to Axon Server (which uses a global index) sooner, we most likely would have avoided the complexity that the timestamp field placed upon us.
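To illustrate how sorting timestamps as strings can go wrong, here is a minimal Python sketch. The timestamp format is an assumption (ISO-8601 strings that omit fractional seconds when they are zero); the format the extension actually wrote may differ, and the event names are hypothetical.

```python
from datetime import datetime

def parse(ts: str) -> datetime:
    # Convert the trailing "Z" into an explicit UTC offset before parsing.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

# Two hypothetical events; e1 happened 250 ms before e2.
events = [
    ("e1", "2023-05-01T10:00:00Z"),
    ("e2", "2023-05-01T10:00:00.250Z"),
]

# Lexicographic sort: "." sorts before "Z", so e2 lands ahead of e1.
by_string = [name for name, ts in sorted(events, key=lambda e: e[1])]

# Chronological sort on the parsed value keeps the real order.
by_time = [name for name, ts in sorted(events, key=lambda e: parse(e[1]))]

print(by_string)  # ['e2', 'e1']
print(by_time)    # ['e1', 'e2']
```

A numeric, monotonically increasing global index avoids this class of problem because ordering no longer depends on how a timestamp is rendered as text.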
The next issue arises from the polling model that Axon Framework employs when MongoDB is the event store. When polling for events that need to be processed, it uses a “lookback” window of one second based on the timestamp of an event.
This lookback approach is in place because events can be inserted in parallel by different applications, but in a Mongo cluster, there is no way of telling how these inserted documents become visible for applications connected to different nodes in that cluster. As our system began handling significant loads, we noticed that certain events were being missed. We discovered the problem during an audit of our daily processing.
Digging deeper, we discovered a delay of slightly over one second between the time some events were created and the time they were written to our events collection. Ouch! Upon investigation, we determined that this occurred within command handlers that generate multiple events. We used the document-per-event storage strategy, which requires multiple writes to complete the transaction, and those writes were performing poorly. The poor write performance was a direct consequence of the ever-growing size of our domain events collection and MongoDB’s inability to keep up under load (discussed below).
Consequently, when combined with the default lookback window of the MongoDB extension, we randomly missed events. To address this, we implemented a temporary solution by increasing the lookback window to two seconds.
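The interaction between write delay and the lookback window can be sketched with a small simulation. This is a simplified model under stated assumptions, not the extension’s actual code: the `Event` type, the `poll` function, and all timings are hypothetical, and the tracker is assumed to query only for events whose timestamp is no older than the newest handled timestamp minus the lookback.

```python
from typing import List, NamedTuple

class Event(NamedTuple):
    name: str
    created_ms: int   # timestamp assigned by the application
    visible_ms: int   # when the document becomes readable in the collection

def poll(events: List[Event], poll_times_ms: List[int], lookback_ms: int) -> List[str]:
    handled: List[str] = []
    last_ts = 0  # highest event timestamp handled so far
    for now in poll_times_ms:
        window_start = last_ts - lookback_ms
        batch = sorted(
            (e for e in events
             if e.visible_ms <= now              # already written
             and e.created_ms >= window_start    # inside the lookback window
             and e.name not in handled),
            key=lambda e: e.created_ms,
        )
        for e in batch:
            handled.append(e.name)
            last_ts = max(last_ts, e.created_ms)
    return handled

events = [
    Event("A", created_ms=0, visible_ms=100),
    Event("B", created_ms=800, visible_ms=2100),   # 1.3 s write delay
    Event("C", created_ms=1900, visible_ms=1950),
]
polls = [500, 1000, 1500, 2000, 2500]

missed = poll(events, polls, lookback_ms=1000)
caught = poll(events, polls, lookback_ms=2000)
print(missed)  # ['A', 'C']
print(caught)  # ['A', 'C', 'B']
```

With a one-second window, event B becomes visible only after the tracker has moved past it and is never handled. A two-second window picks it up, although later than C, and every poll now has to consider a wider range of timestamps.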
While our increased lookback window resolved the missing event problem, it had the unintended consequence of negatively impacting the performance of our event processors. This performance degradation was due to the query to return events from MongoDB now taking longer to execute. This entire issue wouldn’t have existed had we been using Axon Server, as it utilizes a push model for delivering events as opposed to the aforementioned polling model.
The long write times mentioned above correlated with our metrics, which showed a drop in both the read and write performance of our MongoDB cluster. This degradation started as our domain events collection surpassed 100M documents. We noticed that disk IOPS and read/write latency were growing weekly by double-digit percentages as our system handled an ever-growing load. Our indexes and working set had become so large that they could no longer be held in memory and had to be loaded from disk.
We addressed this by adding memory the next few times these metrics showed a drop in performance. After doing this several times, we determined that, based on our existing load, this would become massively expensive, and that we needed a permanent solution for managing the size of our event store.
Our experience with this severe drop in performance aligned with the findings of a benchmark comparing Axon Server to other databases used as an event store. The main findings of the benchmark highlight how performance diminishes as the event store grows when using a database not explicitly built for event storage.
We started utilizing snapshotting in an effort to reduce the read/write load on the domain events collection and associated indexes. This helped for existing aggregates, but the rate at which we were creating aggregates was constantly increasing. As a result, our snapshots did not solve the issue but just worked to increase the time between hardware increases.
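As a generic illustration of why snapshots reduce read load, here is a minimal sketch of loading an aggregate with and without a snapshot. It is plain Python, not Axon Framework’s API; the `Snapshot` type, the `load_balance` function, and the account-balance example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Snapshot:
    sequence: int  # sequence number of the last event folded into the snapshot
    balance: int   # aggregate state at that point

def load_balance(events: List[Tuple[int, int]],
                 snapshot: Optional[Snapshot] = None) -> Tuple[int, int]:
    """Rebuild state from (sequence, amount) events.

    Returns (balance, number_of_events_read).
    """
    balance = snapshot.balance if snapshot else 0
    start = snapshot.sequence + 1 if snapshot else 0
    # Only events after the snapshot need to be read and applied.
    tail = [amount for seq, amount in events if seq >= start]
    return balance + sum(tail), len(tail)

events = [(seq, 10) for seq in range(1000)]

full = load_balance(events)                       # replays every event
snap = Snapshot(sequence=899, balance=900 * 10)   # state after events 0..899
fast = load_balance(events, snap)                 # replays only events 900..999

print(full)  # (10000, 1000)
print(fast)  # (10000, 100)
```

The snapshot shortens the replay for an existing aggregate, but every new aggregate still adds its own events and index entries to the collection, which is why snapshots alone only bought us time.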
On top of the performance issues, we were incurring costs for storing older events that were read only a handful of times a month. This led us to constantly weigh the cost of storing these events against the effort of building an event archival process to keep our hardware costs under control.
This would not have been a big deal if all we had to do was remove the older events from the collection and write them to object storage. But our analytics group was also consuming these events to build out new reporting structures on a very regular basis, so any changes we made would impact their ability to read these older events. Again, this scenario could have been easily solved with Axon Server using either a backup or a secondary node (tiered storage was not a feature at the time).
After all this, we took a step back and evaluated our choice of MongoDB as an event store. We had invested significant time and energy in performance tuning and troubleshooting some rather elusive issues. And we still needed an acceptable approach to managing aging events. Moreover, the many hours dedicated to these challenges could have been better allocated to developing new features that would enhance the value of our product.
Looking back at the issues we faced, it became clear that the performance and maintenance work needed to run MongoDB successfully as an event store was very different from the transactional profile we felt comfortable with. Add to this the nuances of the timestamp field, and it became apparent we needed a better solution for our event store.
When we finally switched to Axon Server, the migration tool made it easy. On top of solving our event storage and processing issues, the switch also gave us access to both the distributed command and query buses. These two features opened new opportunities to expand and scale our Axon Framework-based services without adding more infrastructure.
Conclusion
Looking back on this, I wish we’d spent more time evaluating a move to Axon Server after we put out each fire we ran into using MongoDB as our event store. By continuing to stick with the “easy” choice, we never really admitted to the complexities we had placed upon ourselves as a result.