In my previous Blog on running a containerized Axon Server, I spent just a little bit of text on the challenges of using a single StatefulSet scaled to the number of nodes we want in the cluster. Also, I completely skipped the subject of using Helm to manage the deployment. Both subjects do merit more attention, so let’s do just that.

The Kubernetes Deployment Model

Kubernetes is a great platform for managing large amounts of small apps, especially if those apps can be run in multiple instances, with minimal or no state. In short, it is perfectly suited for running microservices and small components like “Functions” or “Lambdas”. However, saying it is only for those types of workloads is selling it short, because you can deploy many traditional stateful applications as well.

A typical deployment in Kubernetes will use a “Controller” to monitor the Pod’s health and restart it automatically as needed. It can also take care of migrating the deployment to a different ClusterNode, if the current one needs to free up some space, or is being vacated for maintenance. The controller is also able to scale the deployment to use multiple copies of the Pod, depending on configured criteria or an explicit target. Surrounding the Pod and its controller, you can add other (optional) components, that can provide network access points and persistence.

Communication to the outside world will use one or more Services, which expose ports to other parts of the cluster, or even the outside world. These services can do this for an individual Pod, but also for a scaled set of them, optionally integrating with Load Balancers and Ingress Controllers.

Given that a Pod is a transient thing, its disk storage will be lost when it is stopped or replaced. This need not be a problem as it can connect to a database for long-living data, but what if the Pod is a database itself? Persistent Volumes are the generalized interface to disk storage and they can be used to connect to (provider dependent) long-living disks.

With that last point, we immediately see where Axon Server may need some special attention when we run it in Kubernetes, as we will be adding Axon Framework-based applications (microservices) that depend on it being available. Additionally, for Event Sourcing to work, we’ll need to be able to guarantee that the data is indeed persistent, so the Pods running Axon Server need not just any available volume, but a volume belonging to the specific Axon Server node running in that Pod. This is what a StatefulSet adds to a Deployment: Pod/container hostnames are made predictable (name of the set plus a number starting from zero) and volume claims that will link the volume to that specific Pod instance. This link will hold when the Pod is e.g. migrated to another worker node, which leads to potentially complex interplays with the provider’s infrastructure when the other node is in a different Availability Zone.

So does this solve all our concerns with respect to Axon Server? Can we cluster it by just scaling the StatefulSet to the number of nodes we want to have?

Pod Health Monitoring and Restarting (updated)

As I noted in the previous section, Kubernetes employs a Controller that can automatically restart a Pod when it is unhealthy, start extra copies if stressed, or reduce the number of running copies if there is not enough work. Several standard implementations of such controllers are available that take care of one or more of these aspects. For this to work, however, we must provide it with some indication of health or load.

Axon Server is built using the Spring-boot toolset, which includes support for “actuators,” which are REST endpoints that can provide such data. For example, the health endpoint for Axon Server Enterprise Editions delivers a comprehensive view of the complete cluster’s status, which means it will return a “WARN” status (and a non-200 HTTP response code) if any of the nodes has a problem. However, using a cluster-wide status for a response interpreted as a local indication presents a problem for Kubernetes. It would prompt a restart of all nodes even when there is a temporary problem on only a single node. Additionally, suppose there is a problem in one out of several replication groups. In that case, the node could still be considered healthy, and other actions could be indicated to restore it to normal operation. For this reason, we advise against using the health actuator as an indicator for “terminal” problems. A better approach is to use the “Info” actuator (at REST endpoint “/actuator/info”), which will always return an “OK” response as long as the application can serve requests.

A part of the monitoring done by Kubernetes concerns its “liveness”. If it appears that the Pod is no longer functioning correctly, “is dead,” so to speak, the Pod will be killed and replaced with a freshly started one. This is done using a “liveness probe” and can be implemented using the Info actuator mentioned in the previous paragraph. The problem is that you need to get it to account for a possibly long startup time. When Axon Server starts, it will first perform several consistency checks on the event store. For example, if the event store was copied from a backup, and no indices were found, it will generate those before declaring itself ready for service. Even if those kinds of activities aren’t needed, a large number of contexts can cause a significant wait while it checks the last few segments for each. If, at the same time, we have a liveness probe that requires some kind of response within 5 seconds, Kubernetes may decide the node isn’t healthy and replace it with yet another new one. For this, there is a second type of probe named the “Startup” probe. This one can be given a much more generous limit to account for any possible startup activities.

The third probe is the “Readiness” probe, which is probably the most unexpected one in its implications. This probe is used during the entire lifetime of the Pod and determines if it is really “open for business.” If not, Services will not route requests to it, and (especially important in the case of a StatefulSet) DNS entries will not be active. If the readiness probe for an Axon Server node fails, other nodes will get failed hostname lookups when they try to contact it, and a potential load-balancer on the Service for the HTTP port will skip it. If you are directing clients to nodes using a service, this will also fail. From an Axon Server perspective, this is particularly annoying because Kubernetes is in effect taking decisions that should be left to Axon Server’s implementation of the cluster availability algorithms.

To conclude this discussion of the probes, it is best to define a generous enough Startup probe to allow for any potential startup activities until the Info actuator becomes available. From that moment on, the Liveness probe can verify that Axon Server is indeed up and running. If Axon Server has any problems, this should not be taken as a blanket authorization to kill nodes and restart them. The problem may very well be localized to a single context or replication group, and restarting the node could then cause additional service interruptions for other contexts or replication groups. Similarly, the Readiness probe is best left pointing at the same Info actuator, leaving it up to Axon Server itself to determine if it can provide the requested service to a client.

The final point we need to look at is concerned with the behavior at shutdown. When a Pod is deleted, the Pod’s main command is sent a SIGQUIT signal, after which the Controller will wait a configurable amount of seconds before it decides the process is unwilling (or unable) to perform a clean shutdown. The second attempt uses a signal named SIGKILL, which is precisely as destructive as it sounds. For Axon Server, a SIGQUIT is taken as a request to shutdown cleanly, causing it to flush all its buffers to disk. A SIGKILL will prevent this from happening and result in inconsistent data in ControlDB, data files, or indices. These could then cause trouble at the next start or even later when it tries to continue storing data where it left off. For this reason, it is better to use a larger value than the default of 30 seconds, or at the very least monitor actual shutdown performance, to get a notification when a larger setting becomes necessary.

Kubernetes Scaling and Axon Server’s “Statefulness”

As stated above, Kubernetes controllers can scale a Pod to get several, initially identical, copies of our application. From a Kubernetes perspective, these multiple copies should serve the same purpose, so it doesn’t matter for the “application” which of these copies picks up a request. I put the word “application” between quotes because it doesn’t really matter who is providing the work that these Pods will perform. It may be some external application that sends in requests through an API, or the Pods themselves may be part of an application that requires their services. If the load on the Pods increases, you could simply add new copies (scaling up), or when the load decreases you can shut a few down (scaling down). If the Pods use Persistent Volumes (PV for short) this may help to reduce the startup time, allowing them to continue where they left off. When using PV Claim (or PVC for short) Templates with StatefulSets, you provide Kubernetes with a template for the claim on the amount of space that volume needs to provide, so it can create actual claims when the Pod is first instantiated, and the Kubernetes implementation will take care of finding some way to fulfill that claim. Because the name of the Pod is predictable, and the PVC will be linked to that particular Pod-name, it will stay around even when the set is deleted, ready to be reused as soon as it appears again.

For Axon Server, a cluster is built from nodes that share the same application, but their states may turn out to be wildly different. If you have a cluster with five nodes, you may have different sets of three nodes (out of those five) sharing a replication group. This means that, if you shut down a random node, some replication groups may be downgraded from three to two nodes, while others still have all three nodes available. If you shut down a second node, you could have replication groups with only a single node left, which would render them unable to reach a consensus on the data to be stored. The Kubernetes model on the other hand assumes that having a certain number of Pods available is enough, not a selection of specific ones. Also, if for whatever reason Kubernetes needs to vacate a worker node, let’s say that this node needs to receive a software upgrade, then all Axon Server Pods that happen to run there will need to be migrated, and we need to ensure this does not influence the availability of the Axon Server cluster. 

The final problem with the Kubernetes scaling model has to do with being able to shut down specific Axon Server nodes for maintenance. Even though a lot of effort is put into ensuring that no data is permanently lost, reality forces us to keep in mind we may be forced to help recovery along with some manual work, say in the event the PV of a single instance is lost due to hardware problems. A StatefulSet can be scaled down until the Pod we want to stop has been stopped, but that also stops all Pods with a higher number, even if there was nothing wrong with them.

For Axon Server in Kubernetes, the best approach is to use “Singleton StatefulSets”, which means that each node gets its own StatefulSet, and it is never scaled above 1. This way we still get all the advantages of predictable names and linked storage but are free to give each node a unique identity and state. This may sound as if we throw away the advantage of dynamic availability management that Kubernetes provides. In the case of Axon Server, however, this happens to be what we want because we want the Axon Server cluster to take care of this. If a new client app connects, and the node it connects to is already serving a number of clients, then that Axon Server node will have a full overview of the load on each node in the targeted replication group and can redirect the client to the ideal alternative.

Persistent Data

Axon Server provides CQRS and Event Sourcing applications with an event store that plugs seamlessly into the Axon Framework. Event Sourcing changes the task of maintaining the state of aggregates, from storing and updating the state itself in a regular database, to storing and replaying the events that led to this state in an event store. This also changes the storage requirements from needing enough space to only store the space for all aggregates, to needing space to store the history of all aggregates, even when their lifespan has ended. This means we gain an immutable history of our aggregate’s life, at the cost of an ever-increasing need for storage space. More importantly, we gain a simpler access pattern: we replace an updatable store with an append-only one.

As discussed in the previous section, we need to ensure that Axon Server will use Persistent Volumes for practically all of its data, but the event store can benefit from a specialized underlying infrastructure component, given its eventual size. Cloud providers will generally allow you to specify which of their storage solutions need to be used for a specific volume using a StorageClass specification, but you have to be careful because it can also influence access patterns. If you simply use the default StorageClass, you may lose the option of having multiple Pods connect to the same volume, limiting your options for creating backups. 

Another aspect of the storage class might be the accessibility across geographical locations. Shared file-based storage commonly adds a layer of redundancy to enhance availability, but it also tends to put restrictions on the number of IOPS because it is a shared medium.

Backups

Using an Axon Server Enterprise Edition cluster, the event store will be replicated among the nodes, which also helps increase its availability. However, if it is deployed to a single Kubernetes cluster, all nodes could still be physically located in the same Availability Zone or Region. This means that a failure that results in an entire region being lost, even though unlikely, could still lose your data. Even if the data is not lost, you won’t be able to access it to build a temporary replacement cluster in another region. So we need to look at backups, preferably in another region.

For backups, there are basically two approaches: copying files and using backup nodes.

Copying Files and File System Backups

The simplest form of event store backups, are those where you make a copy of the files. Axon Server provides REST endpoints that can help you in this process:

  • The “/createControlDbBackup” endpoint will cause a ZIP file of the Control DB to be stored at a configurable location, which you can then retrieve and store. For Axon Server EE the “/v1/backup/createControlDbBackup” endpoint will do the same on all nodes of the cluster.
  • The “/v1/backup/filenames” endpoint returns filenames of the event store segments that are closed, meaning they will no longer change. This can be requested for any context, both for events and snapshots. The current segment can be copied also, but it will only be partially filled, so you’d preferably also keep a copy of the Control DB from at most the same age to know where to continue.
  • The “/v1/backup/log/filenames” endpoint provides a similar list of files in the replication log of a specified replication group.

Using these endpoints you can make incremental backups of the event store and Control DBs, but you’ll need access to the volumes. 

An alternative solution can be to simply use snapshots of the disks, using your provider’s facilities to store these in another region. Procedures like a partial restore may be a bit more complicated, but converting the backups into usable volumes is a lot easier. You have to be careful though and ensure this uses an incremental strategy, or you’ll end up with increasingly longer backup times.

Backup Nodes

A very simple approach to the problem of creating backups is to use Axon Server’s own replication mechanisms. You can add a remote instance of Axon Server to a replication group in the role of a Passive or Active Backup node

A Passive Backup Node is a passive follower, in the sense that it receives all data for the event store, but is not a part of the storage transaction. The immediate advantage is that client applications won’t have to wait for the remote site to get a copy of the data. The delay is typically measured in milliseconds, depending on the speed of the network connection to it. If you want to make proper disk backups and shut this remote node down while you do this, the cluster will keep functioning normally. When it comes back online, the current leader will send any data it missed.

If your requirements for having an up-to-date backup however are strict, you may want to accept the delay and use a pair of Active Backup Nodes. If a replication group contains one or more of these, at least one must acknowledge storage of data to be able to confirm the commit back to the client. This “at least one” allows us to ensure we have a full backup, while still enabling us to shut one down for a file-based backup or to perform upgrades.

Multi-region Clusters

As touched upon several times now, it can be really advantageous to have a cluster spread across multiple regions. Not only for performance, when we have users in those different regions, but also improved availability and disaster recovery. Kubernetes is however not really designed to have location-aware apps. I’ll admit you could actually do this, using tags in a deployment’s metadata section, but all major cloud providers restrict their managed Kubernetes solutions to a single region. 

There is some movement though, seeing that (for example) Google announced that GKE can have Services span across clusters, which could help achieve this. The amount of “handwork” needed to achieve this is, however, still significant and starts with VPC and firewall configurations. So, how to achieve this, without sacrificing Kubernetes as your platform, will require work outside of the Kubernetes platform. However, this work is mainly to allow for the network connections between the two to work seamlessly.

The Question of a Helm Chart

In the “Running Axon Server” repository on GitHub, I put an example of a Helm Chart for Axon Server SE. So naturally, the question has come up of putting it in an official Helm repository. As the above discussion hopefully made clear, there are a lot of knobs to turn for your deployment, and they are often quite tightly linked to the products of your cluster’s provider. I have seen a number of charts that solve this by allowing you to provide a lot of details in the values, at the cost of very complex “template programming”. For a cluster of Axon Server EE nodes, where these nodes are each configured as a singleton-StatefulSet, even a “simple” chart will require quite a bit of Go Template language use, with several loops and lots of choices.

I would like to start with the following observation about Helm itself: it is a tool for streamlining the installation of Kubernetes hosted applications, where the application consists of multiple deployable components. These components (Pods, Services, StatefulSets, etc) need to have synchronized customizations, while the customizations can be driven by a relatively simple set of values. You want to do this in such a way that you can upgrade the application by “simply” applying updated deployment descriptors. In this view, the Helm Chart is a tool that captures a set of choices, resulting in a “Good Enough” deployment for most situations. The more choices are left open, the more complex the chart becomes both to use and to maintain. This becomes exponentially worse if the Chart has to allow for more than just upgrades to a new version of the application, but also has to take into account that you may want to redo the installation with different choices, while still retaining the data. And then we apply “lessons learned” to upgrade the Chart itself with more and different choices…

I hope the point I wanted to make here comes across. Tools like Helm and Bosh, but also Terraform and many, many others, are great for getting an automated deployment process. However, once we reach production, the most important thing we want for our infrastructure, and from the perspective of an Axon Framework user we definitely consider Axon Server a part of that, is a stable situation. Upgrading Axon Server should take into account our regional deployment of nodes, and respect the replication groups and their need for being able to continue operations while the change is executing. 

This means that these tools are most valuable when we initially deploy an application or are in the process of restoring an installation as part of a disaster recovery scenario. They are even more valuable if they can capture standards across installations, allowing us to quickly create templates for new installations based on how these need to fit in with the rest. For upgrade scenarios, we want to minimize the changes needed and may even need to add special steps that only apply to that particular upgrade. If we want to change installation details, so a full rebuild of the cluster results in something different from what we have now, the change will need even more care to prevent us from thinking we need a rollback procedure. And nobody wants that, right? We always “fix forward”!

So, “It Depends” Wins Again?

The most feared reply from the standpoint of a customer and the most often true one for the consultant is “It depends.” Sure, there are lessons learned and best practices, but they almost never result in a detailed and standardized playbook for every situation, if only because no organization that only wants the off-the-shelf solutions is ever called “Acme Inc.”, and no product details page starts with the phrase “Lorem ipsum”. 

My personal preference is a set of tools or procedures that allows me to automatically replace an Axon Server instance with a freshly built one, and do this once per node, every step checking all nodes are happy before we move on to the next one. I don’t mind having to do a manual change to ensure necessary conversions or updates have been applied first, as long as I can trigger an automated rebuild immediately afterward. This will also allow for updates on the base underneath Axon Server, as in the case of an upgrade of the Java Runtime.

A Public Image for Axon Server SE and EE

For some time now we have offered a public Docker image for Axon Server SE on Docker Hub. The first image was created “by hand” and used a Dockerfile that started Axon Server using a shell script. Environment variables were defined to easily adjust the configuration but resulted in an unnecessarily complex startup. Since version 4.1.5 the build has changed to use Google’s Jib plugin for Maven, which forced us to abandon the startup shell script because the plugin uses an “exploded” installation and starts the application directly. Environment variables can still be used because Spring Boot allows you to use an all-uppercase version of Java property names, also exchanging the dot for an underscore. Another change was that we explicitly declared two volumes, one for the event store at “/eventdata” and the other at “/data” for the other files and directories that Axon Server wants to write to.

Although easy to build and use, this caused a few (potential) issues:

  1. Axon Server is still run as user “root”, which is not just bad from a security point of view, but a real problem when you want to use it on OpenShift, which has non-root containers as a policy. Also, the installation directory (in “/”) is not ideal in this respect, as the (virtual) filesystem should conform to Linux standards.
  2. The (default) distroless images do not provide a shell, which can be a problem in development environments where you want to “go inside” the container to see what is happening.
  3. Extensions (a feature of Spring Boot executable JAR files) no longer worked. This feature allows you to put jar files in a directory named “exts”, which will then be added to the classpath. We use this in version 4.5 for external authentication/authorization add-ons in Axon Server EE.
  4. Although we always warned people that the image was for convenience, but not intended for production deployments, it was used as such anyway. The 4.1.4 to 4.1.5 change in the image immediately caused trouble for those who used “:latest”, or upgraded without reading all of the README.

Since the release of Axon Server 4.5, we revamped the build of our internal EE image, and when we feel confident enough with it, we’ll risk a second change to the public image to fix some of the issues above. The new SE image will run in “/axonserver”, and we’ll build versions based on all 4 distroless variations which will add “debug” (for the shell), “nonroot” (for an “axonserver” user), and “debug-nonroot” (for both) versions. It will also add “/axonserver/exts” to the classpath and declare a volume for it, as well as adding a volume for “/axonserver/config” so you can place a property file in it.

Let’s keep in touch

If you’re not already a member, join the community at “discuss.axoniq.io”. Updates and releases are announced there, and you can ask your questions and help others with theirs. You might also be interested in the PodCasts that are available on the “Exploring Axon” PodBean.

Subscribe to blog notifications