- High Availability concerns the “art” of making a system deal with local failures in a way that enables it to limit the impact of the failures, and prevent business and users from being impacted.
- AxonServer Enterprise Edition clusters use the Raft consensus algorithm to continue normal operations in the case of individual node failures.
- Replication-group node roles can be used to further optimize the usage of the underlying storage infrastructure, and provide off-site backups for disaster recovery scenarios.
- Extending the cluster’s fault tolerance to include public cloud regional unavailability is possible, but will require more complex deployment patterns and have a definite performance cost
Starting to build applications based on the Axon Framework is mostly influenced by a choice for the Command and Query Responsibility Segregation software architecture, or its support for Event Sourcing. At the same time, we have often argued that using the Axon Framework provides a good way to support a path towards a microservices architecture, as it promotes Event-Driven architectures which dramatically lowers coupling. As a bonus, we can more easily divide large applications across multiple deployments, which in turn allows us to easily scale those components that benefit most from the increased responsiveness of parallel computation. With the increase in “moving parts” however, we also get an increase in possible failure scenarios that we need to take into account.
Application availability is not just a matter of ensuring that all parts of the system are “up”, just as cutting a large monolithic application into a large collection of microservices doesn’t necessarily increase its likelihood of failure. It all has to do with the impact of individual failures on the overall system. Where a failure in a monolith can more easily bring the complete system crashing down, a well-designed system composed of multiple independent components is more likely able to provide alternative flows compensating for those failures. Also, actions needed to restore the system to normal operation are more likely to be localized, while the rest of the system keeps working around it. That said, a team transitioning from the first to the second architectural approach is suddenly confronted with a lot more up-front decisions, that a monolith more easily can bundle or hide.
In this blog, we’ll take a look at some of the choices you have that are provided by AxonFramework and it’s infrastructural companion AxonServer.
Basic Concepts and Definitions
In an AxonServer EE cluster we talk about primary and secondary nodes, but also the terms “active-backup” and “passive-backup” are used, and they tend to cause a bit of confusion due to overlap with common terminology for High-Availability clustering. So let’s start this discussion with some terms and how we use them.
Errors, Faults, and Failures
I’ll acknowledge that most of this will be common knowledge, but these three are pretty basic:
- Errors are the things that are wrong, such as typing mistakes. I’m not talking about the ones the compiler catches, but mistakes in (configured or coded) values, forgotten validations or validations that test the wrong things, or even design mistakes. They can also be more physical things like disk or memory errors, or those concerning networks. (We all know the “Fallacies of distributed computing,” right?)
- Faults are what happens because of errors, with possibly the most famous one happening in the “AE-35 unit” of the HAL-9000. In the context of a Java application, they often take the form of an exception, if you’re lucky, but it might just as well be nothing more than a log message and the use of some default value.
- Failures are the result of faults, and can be generalized as “abnormal behavior.” It could be the (application-) component returns an incorrect or no value, or maybe the component hangs and never returns a response. Probably the most noticeable response would be a crash, but it is entirely dependent on the situation if, and how, you’ll notice that.
In the system as a whole, we additionally need to be prepared for cascading failures, as the failure we notice may be classified as symptoms of the original fault. As a consequence, we should be careful to go off hunting for the error that we appear to have, and instead start with root-cause analysis.
Now, to continue with our narrative, we need to look at how we can deal with failures, to make our system fault-tolerant.
Scaling and Failover
To enable our system to deal with the load we put on it, we need to provide it with the resources needed to pick up the work. Traditionally, this potentially meant adding more “iron”, in the sense that we pick a beefier server; more CPUs, more memory, more disks, or more capable versions of those components. For virtual hardware, this doesn’t necessarily change and may even involve migrating the application component to another physical system. This is called “vertical scaling” and has a big problem: we need to pre-scale for the expected maximum load or choose hardware that can be dynamically resized, and that is exactly what makes this approach very expensive. The bigger problem is that we still have a single point of failure. We can keep throwing money at it and buy more expensive hardware that contains redundant components, but the cost-benefit graph is a nice exponential line, where we end up buying minimal improvements for ridiculous amounts of money.
With disks, the turnaround came with Redundant Arrays of Inexpensive Disks, or RAID arrays, and the same can be done for computing power: We can rework our system to allow multiple copies of components to provide that same service, so when one is unresponsive, the other can be queried instead. This is called “horizontal scaling”, and we can apply it to both hardware and software. Taking the architectural helicopter, you can sometimes notice that the one is used to effectively build the other: if we create a cluster for a central database, have we now used horizontal scaling to ensure the database is always on, or vertical scaling because we still have a single, be it more capable, database? To prevent us from falling into this trap, we need to implement services, as much as possible, as a collaborative effort.
Applications built using AxonFramework fully support this collaborative approach: we can provide multiple handlers for Commands, Queries, and Events, and they use different strategies to make use of the parallelism. With Commands, a single handler is needed for any particular instance, and the framework’s runtime (optionally supported by AxonServer) ensures that one is chosen from the available handlers to process it. If that handler is no longer available, or currently busy, another handler will be chosen instead. QueryHandlers can operate in the same way, but you can also use a divide-and-conquer strategy, or more to-the-point a scatter-gather strategy, to let several handlers collaborate in producing the wanted result. With Events, we do not really care how many handlers will pick it up, as long as the runtime ensures that anyone interested in that particular event type gets it.
High Availability of AxonServer
AxonServer is an infrastructural component that sits like a spider in the web of our communications, ensuring all events are stored for future replays and every client application gets the messages it needs. This central role means we need to look at its status as a single point of failure.
AxonServer Standard Edition
AxonServer Standard Edition has no intrinsic capabilities to support automated failover because it has no concept of a clustered operation. We can improve the situation a lot by using platform features that support some kind of automatic restart based on a health probe. AxonServer has such a feature on its REST endpoint. We can even use some properties to expose it on its own port, making the health probe independent of the “standard” endpoint:
This has the additional benefit that it won’t be exposed publicly if the 8024 port is, so you can keep it unsecured.
Given this check, we now need to choose a platform that provides for fast startup times, so any downtime gets minimalized. For performance and security under pressure however, we’re left with vertical scaling: choose a beefy VM with a fast disk. That is not optimal, however.
AxonServer Enterprise Edition
As far as security and performance are concerned, our best options are with the horizontal scaling options we get from a cluster of AxonServer EE instances. To start with the most obvious aspect: a cluster of AxonServer EE nodes is no longer a single point of failure, as the nodes use the Raft consensus algorithm to coordinate any changes. For every replication group active in the cluster, an election is held to decide a leader, and all nodes acknowledge transactions to it. This is then acknowledged back to the client as soon as a majority of the nodes have committed the change. If a node becomes unresponsive this will keep working as long as a majority is available. If the unresponsive node happens to be the leader, a new election is started and the client either loses its connection (if to the leader) or is informed that the transaction failed. It is then up to the client to re-establish the connection to the cluster and retry, which behavior is partly implemented in the Axon Framework, and for the rest supported through the configuration of retry policies.
To bootstrap clients, the choice was made not to introduce a service registry, but instead give each client a complete list of all gRPC endpoints of the AxonServer nodes in the cluster. Using extra tooling this can easily be added externally to the client during startup if required. Using the list, the client can iterate through a known list of endpoints until it finds one that responds, and the initial response from that node may actually be a request to connect to yet another node depending on load distribution and a possible better match in server tags. A further improvement since release 4.4 is that for reading events, clients are no longer dependent on the leader as the single source of truth, and can read from “followers”. This reduces the dependency on the current leader, at the cost of a slight delay to ensure any last events have been transferred. Naturally, this behavior can be reverted to always read from the leader using the “force-read-from-leader” setting. In a distributed setup, depending on where your major concerns lie, you can focus on performance and keep all nodes close to the clients, or go for availability and do a physical distribution across geographical regions to reduce the chance of a catastrophic loss of the whole cluster.
Raft and Cluster Availability
The Raft algorithm is aimed at providing consensus on values, and AxonServer uses this to consider any data change that is supported by this consensus, as committed. Effectively, for the event store this means that at least half the nodes in the cluster must confirm it, and this process can continue correctly as long as a majority of the nodes are available. A cluster of three AxonServer EE nodes will continue to work and accept events, as long as two of those nodes are still available, meaning running and able to communicate with each other. This requirement is imposed per replication group in the cluster, so if you have a five node cluster with three nodes down, a replication group served by three nodes of which two are the ones remaining available is still fully functional. The Raft algorithm is also used for the cluster’s own configuration data, using a special context named “_admin” in a replication group with the same name. A node servicing this replication group is called an “admin node” and has full visibility on the cluster’s structure.
Contexts and Replication Groups
CQRS and Event Sourcing are strongly connected to Domain-Driven Design, and in the case of the Axon Framework in combination with AxonServer, it has led to the concept of an AxonServer context to provide the boundary around the domain. An SE instance supports a single context named “default”, whereas EE clusters can support multiple contexts. When a client application connects to AxonServer, it specifies the context it belongs to (with “default” being the default) and all its messages and aggregates are strictly local to this context. EE cluster nodes share information on the client applications, registered message handlers, and the (optional) event store, and one or more contexts are combined into replication groups. Replication groups allow you to reserve nodes based on their hardware capabilities or geographical location and apply these to contexts with comparable requirements.
The simplest node role is that of a messaging-only node. This is the only role that disables the event store functionality and it is mainly used in situations where you have an alternative event store. Please see this blog about why we think you really should use AxonServer for that. It can also be used to increase the availability of the cluster for messaging, with the limitation of not really helping for events.
Primary and Secondary Nodes
Nodes that do have an event store are split into primary and non-primary nodes, with the difference being that only primary nodes are candidates for leadership, and the right to vote is restricted to them and active backup nodes. (more on that later) Originally, all nodes with an event store were what we now call primary nodes, and AxonServer SE nodes can be viewed as such also. Naturally, calling this role “primary” kind of implies there are also “secondary” nodes, and with the introduction of version 4.4, this is so. If a secondary node is present in the replication group, and a retention time defined on a context in it, primary nodes will remove event store segments that fully lie beyond this limit, as we have guaranteed access to the events and snapshots that are removed from another source. From an availability perspective, this means that we can now differentiate in storage options, giving the primary nodes smaller and faster devices, while the secondary nodes can use slower alternatives, which also tend to be cheaper. We do however want to ensure that we have multiple copies, so preferably we want at least two secondary nodes.
From a Raft perspective, the secondary nodes can be ignored for leadership elections, as they are not the preferential nodes for replays, for which we want to use primaries. This also means that adding secondary nodes will not affect the number needed for a majority. Event transactions however do involve the secondary nodes, requiring at least one secondary node to have confirmed storage.
The final group of roles concerns the “backup” roles, which are not aimed at failover, but rather disaster recovery scenarios. Active backup nodes are (groups of) nodes that join in the event store transactions, where at least one active backup needs to acknowledge storage (apart from a majority of the primary nodes) for the transaction to be committed. Passive backup nodes receive the data asynchronously, so they are not on the critical path for the transaction, but lose the guarantee of up-to-dateness. From a Raft perspective, the active backup nodes do add to the size of the voter population, as they contribute to the data availability, even though they are not available for leadership.
Backup nodes tend to be most valuable for disaster recovery when located in a different geographical region, providing a similar function as offline backups of the disk storage. Choosing between the two alternatives is a matter of the requirements for up-to-dateness. If you need the guarantee that all committed events and snapshots are available in the backup, use active-backup nodes. Maintenance, or a further level of backups using offline disk copies, can be supported, as only one node needs to stay online for the cluster to be able to function.
If the requirements for backup up-to-dateness is less strict, you can use a passive backup node (or a set of them), because they can receive events and snapshots after the transaction has been confirmed to the clients. This effectively means the stream of data follows at “best-effort” speed, which in practice won’t lag too far behind. Bringing a passive backup node down will not affect the availability of the cluster, nor will temporary connectivity problems. The streaming uses the same strategy used at startup of all node roles, or after other forms of connectivity loss, where the node checks the event store to determine whether to ask for the remaining data or a data file snapshot from the master. As soon as the backlog has been transferred, it goes back to regular follower mode.
The Wider Picture: Going Multi-Region
Naturally, the availability of AxonServer is greatly influenced by the platform on which it runs, not only in the performance-related choices available for hardware but also with respect to the support for monitoring and automatic restarts, as well as regional distribution and redundancy. As discussed in the “Running AxonServer” installment on Virtual Machines, Kubernetes is nearly unbeatable in its ease of use and options for deployment automation, at the cost of a single-region deployment limitation. If your requirements do not include a multi-region deployment and all AxonServer nodes are the same in their configuration of the supported replication groups, then a single StatefulSet will do admirably. Still within a single region, having a StatefulSet per AxonServer node solves any configuration variations. Combined with custom storage-classes, you can even differentiate between disk types and profit from multi-tier storage and backup nodes. Having the cluster and clients in a single availability zone allows for optimal network performance, while a multi-zone deployment increases availability at a relatively small cost in performance.
Multi-region Deployments and Kubernetes
Increasing the scope we can use a single outlier deployment in a different region for a passive backup node, but this will require some networking configuration to ensure the communication between all nodes to work predictably. The choice between a second Kubernetes cluster in a different region, or VMs as the outlier, is highly dependent on the support available for this networking magic, and this is where the choice between Kubernetes and VMs comes up more strongly. Kubernetes itself is not designed as a transparent layer between the applications and the data center in which it runs, as it only standardizes LoadBalancers and Ingresses for HTTP, and we have a gRPC port that needs to be shared. The default Nginx Ingress controller that most providers use can be configured for gRPC traffic, but the picture quickly becomes complicated if we want to use that for a private communication channel, while another Ingress controller needs to regulate public traffic towards the HTTP based UI and REST endpoints. Going for VMs as the only platform solves many of these problems, as we now have a wider range of products available for setting up our traffic lanes, at the cost of losing the ease of deployment that Kubernetes offers. A multi-region multi-Kubernetes-cluster setup is possible with all cloud providers, but the networking products available to connect the clusters differ.
Multi-region and multi-zone deployments can cause performance issues if the load-balancing of clients causes them to be sent to less favorite instances. AxonServer node tagging has been around since the first release of version 4 and can be used to direct client applications to the nearest part of the cluster. The tags are set by adding entries to the property file under “axoniq.axonserver.tags”, for example:
The name of the tag needs to be matched with the tags on the client, for example:
If more than one match is available, the load-balancing strategy is used to choose between the available matches. If no match can be made, the normal (load-balancing) strategy is applied over the whole set of available nodes. As a consequence, if a node goes down, any clients connected to it will reconnect using the same strategies. If you have multiple contexts and the clients for a specific context are in a specific region, it makes sense to locate enough nodes in that region to be able to set up a replication group for it in that region, so all traffic will stay local. If this is too costly, a single tagged node with the default “read from followers” strategy will allow Event Processors to read from the closest node even if they are not context leader, while published events, commands, and queries, may have to take a longer route.
Please note that tag names, because they can be sourced from the names of environment variables, are matched without matching case. This allows “AXON_TAGS_REGION” to be used as well as the above-shown property. If you’re using YAML files for properties, the syntax is probably a bit easier to understand:
As always with this subject, there is no single solution that fits all needs, but I hope you have seen that AxonServer provides a lot of features that can help you improve the availability of your application. Depending on how high you set the bar for being able to recover from increasingly less likely events, you can build clusters, automate real-time backups, and support multi-regional deployments. If you have any questions concerning specific scenarios or would like to discuss deployment plans, do not hesitate to contact the AxonIQ team.