Replacing StatefulSets with a custom Kubernetes operator in our Postgres cloud platform

Host:

  • Bart Farrell

Guest:

  • Andrew Charlton

Discover why standard Kubernetes StatefulSets might not be sufficient for your database workloads and how custom operators can provide better solutions for stateful applications.

Andrew Charlton, Staff Software Engineer at Timescale, explains how they replaced Kubernetes StatefulSets with a custom operator called Popper for their PostgreSQL Cloud Platform. He details the technical limitations they encountered with StatefulSets and how their custom approach provides more intelligent management of database clusters.

You will learn:

  • Why StatefulSets fall short for managing high-availability PostgreSQL clusters, particularly around pod ordering and volume management

  • How Timescale's instance matching approach solves complex reconciliation challenges when managing heterogeneous database workloads

  • The benefits of implementing discrete, idempotent actions rather than workflows in Kubernetes operators

  • Real-world examples of operations that became possible with their custom operator, including volume downsizing and availability zone consolidation

Transcription

Bart: In this episode of KubeFM, I spoke with Andy Charlton from Timescale about the technical limitations of using Kubernetes StatefulSets for managing Postgres workloads, and why Timescale decided to build a custom operator instead.

In this episode, Andy and I discuss the following topics:

  • Why StatefulSets fall short for managing high-availability Postgres clusters

  • Challenges around pod ordering, volume resizing, and immutable storage templates

  • The role of Kubernetes operators in encoding operational context and reducing disruption

We talked about how their in-house tool, Popper, matches instances to specs to handle complex reconciliation logic. If you're working with stateful applications on Kubernetes or hitting the limits of what StatefulSets can do, this conversation will be helpful.

Today's episode of KubeFM is sponsored by MetalBear's mirrord. If you're a cloud developer, you know the pain of spending more time setting up environments than writing code, only to test in conditions that don't match production. mirrord lets you run microservices locally with seamless access to cloud resources, staging services, databases, and more, all from your IDE. No more waiting on CI or fighting for staging. Just install the CLI or IDE plugin and start testing from your local machine. Try it free at mirrord.dev.

Now, let's get into the episode. Welcome to KubeFM. First and foremost, what are three emerging Kubernetes tools that you are keeping an eye on?

Andrew: I'll be totally honest. I'm quite a slow tool adopter. I'm still rocking Vim as my editor. We're talking about going to multi-cloud here at Timescale. Things like Control Plane, Cluster API are what we're looking into at the moment. I won't promise we're going multi-AZ or multi-cloud before we release it—I'll get in trouble at work. We've recently adopted Node Problem Detector because we've had a few issues with nodes and are trying to work out exactly what's going on. Node Problem Detector has been a huge help.

Bart: Very good. As a follow-up to that, we've asked folks in the past, what do they think is harder to learn, multi-cloud or surfing?

Andrew: I'm a big fan of the cloud. All my career has basically been in the cloud. But I will also say, having worked with AWS, Azure, and Google Cloud, they're all terrible in their own ways. They're all great in many ways, and they're all terrible in their own ways. Multi-cloud is viable if you can identify where things are terrible and work around them. I think it's okay, but it's hard. You often have to go to the lowest common denominator.

Bart: For people who don't know you, what do you do, and where do you work?

Andrew: We are a time series database company, but really a PostgreSQL on cloud company. We're moving into the AI space as well. I'm a staff software engineer working on the platform team. Prior to this, I worked at Influx, again on the software storage team, in a very similar role. That's why orchestration is the niche I seem to have fallen into. I'm writing operators and working out how to run databases on the cloud.

Bart: And how did you get into Cloud Native? What were you doing before Cloud Native?

Andrew: My history is a bit chaotic. I left university and went into teaching. I was a school teacher for over 10 years, teaching secondary school (high school) math in the UK. My degree was in electrical engineering, and I had done some programming. I started teaching computer science and moved up into management, where a lot of my work involved data analysis.

I hated repeating tasks like writing spreadsheets, so I started writing Python for data analysis. There was a lovely local Go community, and I began writing Go and creating dashboards for data processing. Data analysis is quite significant in UK schools, and I had progressed to leadership roles where this was a key part of my job.

Eventually, I realized I was enjoying writing code more than teaching. I moved to a local email marketing company, which was a great grounding. It was entirely cloud-based on AWS, using Lambda, Batch, SQS, and other AWS services. From the start, my professional career focused on leveraging cloud scaling capabilities.

I then moved to a fintech, also running on AWS. A few years into my career, I joined Influx, which was my first interaction with Kubernetes. With my mathematical background, I'm particularly interested in scheduling problems and provisioning. Moving to a database company doing Kubernetes aligned perfectly with my interests.

After a couple of years at Influx, I moved to Timescale, continuing similar work. That's my background.

Bart: The Kubernetes ecosystem moves quite quickly. What's your method for staying up to date with all the changes?

Andrew: It is quick, isn't it? It's so hard. I'll be honest. When working at a database company, people typically choose a database because of its features and data model. Many come to Timescale because it's Postgres, with great support, history, and stability, plus excellent features.

Once a customer chooses a database, it tends to be sticky—you don't move unless it becomes unstable or too expensive. My baseline is stability, and I'll admit I'm a bit conservative. We don't jump to alpha features but wait for them to mature.

Release notes are obviously the best way to understand what's coming up. However, it's challenging to keep up and understand what will impact your business and what opportunities exist. I think that's the hardest part of the cloud journey: understanding what's coming, how to leverage it, and identifying potential problems on the horizon.

I know this isn't a great answer. I've essentially just told you it's hard, and I'm a little behind the curve.

Bart: We've done five seasons of this podcast, and I think all of our guests would agree that it is hard. There is no silver bullet, no magic solution, no shortcut. Everything will involve trial and error and patience. As we mentioned, boring is better.

When we celebrated Kubernetes' 10th birthday last year, I was thinking about how the next 10 years will look. Many people hope it becomes something boring that blurs into the background. Kelsey Hightower calls it a platform for building platforms. We've had podcast guests tracing the history of Linux and looking at Kubernetes side by side. Linux is obviously over 20 years older than Kubernetes. But there's nothing wrong with something being boring—we know it just works, and we don't have to worry about it so much.

If you could go back in time and share one career tip with your younger self, what would it be?

Andrew: One thing that's really important, especially as a young developer, is taking things seriously and not just superficially. The big thing for me is responsibility and taking ownership. If I'm doing something, I'm going to own it and see it through from conception to production, monitoring it afterwards.

There's an element of professional pride in ensuring my work doesn't break. Whatever you're working on, own it and make sure you understand it well. If you're just cranking through Jira tickets without really understanding what's going on, that's not a great place to be learning.

To improve and get better at your craft, you've got to own things, dig in, understand them deeply, and be responsible. When the on-call alert pings, you've got to fix things. My tip to younger developers: Own your stuff.

Bart: The next question is about an article you wrote called "Replacing StatefulSets with a Custom Kubernetes Operator in our PostgreSQL Cloud Platform". Today we're discussing how Timescale replaced Kubernetes StatefulSets with a custom operator for the PostgreSQL Cloud Platform. Can you start by explaining the role StatefulSets play in Kubernetes and why you need something different?

Andrew: StatefulSets are great. When you first start developing any stateful application, StatefulSets are an excellent early tool. They let you have persistent storage. If you're setting up an Elasticsearch cluster or something you want to run locally to store logs, StatefulSets are great. You want persistent volumes and a predictable upgrade process.

The trouble is that StatefulSets only know about pods and PVCs, without any broader context. As you grow and develop your stateful application, you'll realize there's much more complexity than just pods and PVCs. In a PostgreSQL setup with high availability, you have a primary and replicas, and it becomes important to know which is the primary. Other context, like backups and snapshots, becomes relevant when provisioning pods and PVCs.

StatefulSets are great for homogeneous workloads with simple persistent storage needs. But eventually, you'll want your application to become more sophisticated. You'll want to make more intelligent decisions and utilize the additional context to improve your application. At that point, you'll start to find limitations with StatefulSets.

Bart: Let's dig into Kubernetes operators. For our audience who might be familiar with standard Kubernetes controllers, could you explain how operators extend the Kubernetes control plane and why they're particularly valuable for stateful applications?

Andrew: When talking about context and making business or operational decisions, that's where a Kubernetes operator comes in. If you've written a StatefulSet with a spec and template, a custom operator does almost exactly the same thing. We've got a spec, a Custom Resource Definition (CRD) that defines the spec, and a custom resource that tells us what we want—likely pods, PVCs, and other resources.

The operator has the spec and custom resource, knowing what the pods and PVCs look like, but it also has context. It knows which PostgreSQL pod is the primary, the backup situation, and details like which availability zones these resources are placed in.

When you want to make more intelligent decisions about scheduling pods, provisioning, deletion order, or updating images, an operator helps. A basic example is that a StatefulSet has no idea which pod is the primary. Most of the time, it might kill the primary first and trigger a failover, which you want to avoid. You only want to restart the primary once if possible. That's where an operator comes in.
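The primary-last restart ordering Andy describes can be sketched in a few lines of Go. This is an illustrative toy, not Popper's actual code: the `Instance` type and `restartOrder` function are invented for the example. It simply orders restarts so every replica goes first and the primary is touched exactly once, at the end.

```go
package main

import (
	"fmt"
	"sort"
)

// Instance models a database pod with the context a StatefulSet lacks:
// whether Patroni has marked it as the current primary.
type Instance struct {
	Name    string
	Primary bool
}

// restartOrder returns instances in the order an operator might restart
// them: all replicas first, the primary last, so clients see at most one
// failover instead of repeated primary restarts.
func restartOrder(instances []Instance) []Instance {
	ordered := append([]Instance(nil), instances...)
	sort.SliceStable(ordered, func(i, j int) bool {
		// Non-primaries sort before the primary; stable sort keeps
		// the replicas' relative order unchanged.
		return !ordered[i].Primary && ordered[j].Primary
	})
	return ordered
}

func main() {
	cluster := []Instance{
		{Name: "pg-0", Primary: true},
		{Name: "pg-1"},
		{Name: "pg-2"},
	}
	for _, inst := range restartOrder(cluster) {
		fmt.Println(inst.Name) // replicas first, then pg-0
	}
}
```

A StatefulSet's rolling update, by contrast, always walks the ordinals in reverse, regardless of which pod holds the primary role.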

Bart: The article details several specific limitations with StatefulSets in a PostgreSQL context. Could you elaborate on the most problematic Kubernetes behaviors you encountered, particularly around volume management and pod ordering?

Andrew: These were the two big issues: pod ordering and volume management. Let's consider a PostgreSQL cluster with two pods (0 and 1), where pod 1 happens to be the primary, and we want to update the image.

The StatefulSet controller would kill pod 1 first, but there's a potential problem: it doesn't know the condition of pod 0 or whether it's ready to take over. It kills pod 1, hoping for a correct failover, but this might not always happen. We can try to improve this with more intelligent health checks, but the fundamental issue remains.

The process typically involves killing pod 1, then pod 0 becomes primary, then killing pod 0 and bringing it back up. This creates a disruptive cycle where you're always terminating the primary first, which can be significantly problematic for some customers.

Volume management in StatefulSets also presents challenges. The volume template is immutable, so if you initially provision a 50GB database and later need to increase storage, you can't simply edit the StatefulSet template. You must use external methods to patch the PersistentVolumeClaim (PVC) and increase its size.

Adding a replica becomes complicated because the new replica might get an incorrectly sized volume based on the original StatefulSet template. This forces you to use external services to provision PVCs, essentially working around the StatefulSet for volume management.

At Timescale, we often see customers' databases grow, then shrink through compression and data tiering. However, downsizing a volume in place is impossible because you can't shrink a file system. Creating a new replica with a smaller disk becomes a complex orchestration problem, especially when you want to keep a specific pod that isn't pod 0.

The StatefulSet's rigid pod ordering and the requirement that the first pod must always be pod 0 create significant operational challenges. Ultimately, we found ourselves spending more effort working against StatefulSet constraints than managing our databases effectively.

Bart: Understanding the reconciliation model is crucial when designing a custom Kubernetes operator. What were your key design goals for Patroni sets, and how do they address the limitations of the standard Kubernetes reconciliation approach? It might also be helpful if you provide some context for listeners who are unfamiliar with Patroni, the project from Zalando.

Andrew: Patroni is a tool for managing PostgreSQL in the cloud. It handles which pod is the primary, manages switchovers, streaming to replicas, and other related tasks. When we spin up a PostgreSQL pod, PID 1 is always Patroni, which launches and manages PostgreSQL within the pod. Patroni annotates its own pods to indicate their status, such as whether they are the master/primary or a replica.

Regarding the reconciliation loop, for those unfamiliar with controllers or operators, it's about achieving a desired state. The cluster's current state (pods, PVCs) is compared to the desired state defined in the Custom Resource Definition (CRD).

Our operator's design goals were to handle all intermediate states when transitioning the cluster. For example, when downsizing, instead of manually managing multiple steps, the operator can intelligently handle the process. If you want to reduce a volume from a terabyte to 400 gigabytes, the operator will:

  1. Spin up a new replica with the smaller disk

  2. Perform a switchover

  3. Remove the old replica

As a database company, availability is crucial. We aim to minimize disruption by making intelligent decisions about actions that might interrupt database connections. Unlike traditional StatefulSets with strict pod ordering, our approach prioritizes keeping the primary pod regardless of its ordinal number.
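The three-step downsize above can be expressed as a next-action decision rather than a workflow, which is the style the rest of the conversation describes. Here is a minimal Go sketch with invented types standing in for real cluster state; it is not Timescale's actual implementation:

```go
package main

import "fmt"

// ClusterState captures just enough context for the volume-downsize
// example: whether the smaller replica exists, whether it has become
// primary, and whether the old oversized instance is gone.
type ClusterState struct {
	SmallReplicaReady bool
	SmallIsPrimary    bool
	OldInstanceGone   bool
}

// nextAction returns the single step to take on this reconciliation
// loop. Re-running it against updated state eventually converges, and
// a failure at any step just means the same action is retried later.
func nextAction(s ClusterState) string {
	switch {
	case !s.SmallReplicaReady:
		return "create replica with smaller volume"
	case !s.SmallIsPrimary:
		return "switch over to the smaller replica"
	case !s.OldInstanceGone:
		return "delete the old oversized instance"
	default:
		return "done"
	}
}

func main() {
	s := ClusterState{}
	fmt.Println(nextAction(s)) // first loop: create the replica
	s.SmallReplicaReady = true
	fmt.Println(nextAction(s)) // next loop: switch over
	s.SmallIsPrimary = true
	fmt.Println(nextAction(s)) // finally: remove the old instance
}
```

Because each call inspects the current state, an interrupted transition simply resumes from wherever it left off on the next loop.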

The key objectives are to:

  • Minimize disruption

  • Reduce the number of actions required

  • Focus on transitioning to the desired state with minimal intervention

Bart: The instance matching concept seems central to your implementation. Can you explain this pattern in the context of Kubernetes reconciliation loops and how it helps manage StatefulSet workloads more effectively?

Andrew: This is a consequence of removing pod ordering. With a StatefulSet, everything is homogeneous with pods 0, 1, 2. If you want to remove something, you just remove pod 2, then pod 1, then pod 0. This works fine because everything is the same.

Once you add context about which pod is the primary, you're in a different situation. We also wanted to move beyond homogeneous workloads. We want the ability to have a potentially smaller, cheaper replica that can keep up and serve some queries. This means we might have heterogeneous pods and PVCs.

When you remove the ordering and homogeneous nature of your setup, it becomes more difficult to define which pod must be zero or one. Perhaps pod 1 is the primary, and we want that to match the first replica in our CRD. Working out how the cluster matches the specification becomes a more complex problem.

You have a spec that defines the number of replicas, the nodes, and the specifications for each pod. Then you have the actual cluster, which might be out of spec. The primary pod adds another layer of complexity.

Different changes have different levels of disruption. If a PVC is too small, you can patch and resize it seamlessly. Changing the image requires restarting the pod. Downsizing a PVC that's larger than specified might require replacing the entire thing.

The core challenge is how to match the desired state to the current state while minimizing actions and disruption. Our Patroni set operator, Popper, focuses on this exact problem—matching what we want with what we have. Determining the primary pod plays a crucial role in this process. If you have two pods in spec but only want to keep the primary, you'll remove the replica.
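A toy version of this instance matching can be sketched in Go. The types, names, and cost scale here are invented for illustration (Popper's real matching is surely richer): each existing instance is paired with its cheapest unused spec slot, and the primary chooses first so the most disruptive changes land on replicas when possible.

```go
package main

import "fmt"

// Spec is the desired shape of one instance; Instance is what actually
// exists in the cluster. Both are illustrative, not Timescale's API.
type Spec struct{ VolumeGB int }

type Instance struct {
	Name     string
	VolumeGB int
	Primary  bool
}

// cost ranks how disruptive it would be to make instance i satisfy
// spec s: 0 = already matches, 1 = volume can be grown in place,
// 2 = instance must be replaced (volumes cannot shrink in place).
func cost(i Instance, s Spec) int {
	switch {
	case i.VolumeGB == s.VolumeGB:
		return 0
	case i.VolumeGB < s.VolumeGB:
		return 1
	default:
		return 2
	}
}

// match greedily assigns each instance the cheapest unused spec slot,
// letting the primary pick first to minimize primary disruption.
// It returns instance name -> index of the matched spec.
func match(instances []Instance, specs []Spec) map[string]int {
	order := make([]Instance, 0, len(instances))
	for _, inst := range instances {
		if inst.Primary {
			order = append([]Instance{inst}, order...)
		} else {
			order = append(order, inst)
		}
	}
	assigned := map[string]int{}
	used := make([]bool, len(specs))
	for _, inst := range order {
		best, bestCost := -1, 3 // 3 exceeds the maximum cost of 2
		for j, s := range specs {
			if !used[j] && cost(inst, s) < bestCost {
				best, bestCost = j, cost(inst, s)
			}
		}
		if best >= 0 {
			assigned[inst.Name] = best
			used[best] = true
		}
	}
	return assigned
}

func main() {
	instances := []Instance{
		{Name: "tsdb-a", VolumeGB: 50},
		{Name: "tsdb-b", VolumeGB: 100, Primary: true},
	}
	specs := []Spec{{VolumeGB: 50}, {VolumeGB: 100}}
	fmt.Println(match(instances, specs))
}
```

Notice there is no ordinal anywhere: the 100 GB primary keeps the 100 GB slot even though a StatefulSet would have forced the mapping by pod number.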

Bart: You took an interesting approach to implementing actions in your operator using discrete actions rather than workflows. This seems to align with Kubernetes' own controller pattern. Could you elaborate on the design choice and how it made your operator more resilient?

Andrew: There's always a temptation, as software engineers, to say, "I could just do that." Let's take a simple example: I've got a cluster up and running, and I want to add a replica. I go through my reconciliation loop, thinking about adding a PVC, creating a pod, and going through the process.

I've seen two kinds of design choices when creating an operator. One approach is to look at the spec and current state, identify what's wrong or out of spec, and fix everything in one go. I'll create a config map, a missing secret, and update things because I can.

The challenge comes when, for example, adding a replica. If creating a pod fails after successfully creating a PVC, your workflow needs to be resilient. You must handle scenarios where the PVC already exists and create the pod accordingly.

Our process for creating a replica is more complicated because replaying write-ahead logs (WAL) is slow. We want to minimize that time, so when spinning up a replica, we first take a new backup—recovering from a fresh backup is much quicker. This becomes a drawn-out process where you must be tolerant at each stage and consider what happens if something fails.

With stateful workloads, especially those with large data sets, you can't rely on a single reconciliation loop to do everything. If spinning up a PostgreSQL replica with 10 terabytes of data takes 10 hours, you can't block the entire process.

Our approach breaks down potential actions: creating a PVC, patching a PVC, creating a pod, deleting a pod, requesting a backup. During reconciliation, we iterate through potential actions, asking two key questions for each: "Do I want to do this?" and "Can I do this?"

For each action, we define specific conditions. For instance, when creating a PVC for a new replica, we check: Do we need a new replica? Is there a recent backup? This prevents unnecessarily replaying WAL.

We also incorporate maintenance windows, allowing customers to specify when certain actions can occur, like performing a switchover or upgrading PostgreSQL versions.

This approach makes debugging easier. By considering each action individually and making them idempotent, we can clearly understand why an action was or wasn't taken.

In our testing, each reconciliation loop performs a single action. We verify the sequence: performing a switchover, then deleting the old primary pod, then deleting the old primary PVC. This ensures precise, controlled stateful workload management.
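The "Do I want to do this?" / "Can I do this?" pattern lends itself to a small Go sketch. Again, the `Action` type and the sample conditions are hypothetical stand-ins for what the transcript describes, not Popper's real code:

```go
package main

import "fmt"

// Action models one discrete, idempotent step the operator can take.
// Want reports whether the cluster needs it; Can reports whether it is
// currently safe and allowed (a backup exists, we're inside the
// customer's maintenance window, and so on).
type Action struct {
	Name string
	Want func() bool
	Can  func() bool
}

// reconcile walks the actions in priority order and performs at most
// one per loop, mirroring the one-action-per-reconcile style above.
// It returns the name of the action taken, or "" if none applied.
func reconcile(actions []Action) string {
	for _, a := range actions {
		if a.Want() && a.Can() {
			// A real operator would mutate cluster state here;
			// this sketch just reports the chosen action.
			return a.Name
		}
	}
	return ""
}

func main() {
	haveRecentBackup := false
	inMaintenanceWindow := true
	actions := []Action{
		{
			Name: "request backup",
			Want: func() bool { return !haveRecentBackup },
			Can:  func() bool { return true },
		},
		{
			Name: "create replica PVC",
			Want: func() bool { return true },
			// Restoring from a fresh backup avoids replaying WAL.
			Can: func() bool { return haveRecentBackup },
		},
		{
			Name: "switch over",
			Want: func() bool { return false },
			Can:  func() bool { return inMaintenanceWindow },
		},
	}
	fmt.Println(reconcile(actions)) // first loop: request backup
}
```

Because every action is a self-contained want/can check, a test can drive the loop repeatedly and assert the exact sequence of actions, which is the debugging benefit Andy highlights.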

Bart: Let's talk about practical results. Could you share some specific examples of Kubernetes operations or maintenance tasks that become possible with Patroni sets that were difficult or impossible with StatefulSets?

Andrew: So, downsizing volumes has been a big focus. Our Kubernetes operator is availability zone (AZ) aware. Back in May last year, we did a big AZ consolidation. We were split over five AZs in some regions at one point, which was terrible for node packing efficiency. We were able to consolidate down to two or three AZs depending on the region, maintaining the same availability. We could simply configure Popper's config to specify AZs 1a and 1b, and it would automatically handle replica placement and switching during maintenance events without any noticeable disruption.

One of the key challenges we address is minimizing downtime. Customers often resize their database instances, moving from a 4 CPU to an 8 CPU instance. Previously, this could be problematic if there wasn't room for the 8 CPUs on the existing node, potentially requiring a new node provisioning that could take several minutes.

During maintenance events like Kubernetes upgrades, we want to retire existing nodes and create new ones lazily, scaling up nodes as required rather than provisioning a large number of instances upfront. We've developed strategies to manage this, such as applying taints to retiring nodes and creating a "scout pod" with a low-priority class.

When resizing an instance, we create an over-provisioner pod with a PVC to check if the pod can be placed on a node. This pod has a very low priority, allowing the actual database pod to evict it when restarting. This approach ensures minimal downtime, as we verify placement before restarting the database pod.

We also migrated from the EXT4 to the XFS file system, which showed better performance for PostgreSQL in our testing. With Popper, we can change the file system in the custom resource, and it handles spinning up new replicas just before the customer's maintenance window.

These optimizations—involving intermediate states and complex placement strategies—would have been impossible with traditional StatefulSets. Our approach allows for more flexible and efficient database management in Kubernetes.

Bart: Now, looking back at this entire process of designing and implementing Patroni sets, what would you do differently if you were starting this project today?

Andrew: Whenever you build any project, there's always an element of iterative design. One of the key things we got to later was how we tested the operator. We went between different approaches. Getting the testing locked in is so important.

I love working in the Popper code base now because the test suite is so good. We use envtest for testing. We don't run kind or anything because I don't want to test that PostgreSQL will come up and do its thing. But we've got really good testing around the sequencing and how the actions interact with Patroni and our backup clients.

Really locking in how you test has been the biggest productivity gain. I'm a huge fan of test-driven development, but when you first start hacking and iterating, it feels like writing tests seems wasteful because you don't know what your final API will look like.

Looking back, I would have tried to accelerate locking in the test framework because it saves your life. Being able to run my test suite and know that everything is still working, that I haven't broken things, is invaluable.

Bart: Andy, I've been having conversations with people about running stateful workloads on Kubernetes since 2020. Five years later, there are still folks in the ecosystem who simply say that Kubernetes isn't ready for stateful workloads or that it's still a really bad idea to do so. What's your answer to people who still have that opinion?

Andrew: Let's be honest. Running stateless applications on Kubernetes is easier. There's no doubt about that. You don't have to worry about data sequences or the weight of data. You can simply create more replicas. That doesn't mean it's impossible.

When I think about alternatives, if I'm going to run a database in the cloud, people want databases as a service. Do I want to run this on AWS bare metal or rent specific EC2 nodes? No. Kubernetes provides a wonderful abstraction. I'm a huge fan of Kubernetes. It's not perfect, but it has all the necessary abstractions. PVCs have everything you need to run things well.

There are challenges. We run our own clusters and don't rely on managed Kubernetes because it can become problematic. If AWS is deciding when to restart nodes during a Kubernetes upgrade, that becomes an issue. You can't just provision an EKS cluster and throw a database on there, expecting 100% uptime.

You've got to be more considerate about when your database can go down, understanding when that is acceptable and under what conditions. It's harder, but that's half the fun. Most engineers get into this problem space because we like solving challenges. I've found my niche, and I enjoy it.

Bart: Speaking of roles, niches, and solving challenges, would you care to comment on the wall of miniatures behind you? It seems like quite a challenge to paint them. Have a lot of patience, trial and error. Tell me about that.

Andrew: I'm a complete Warhammer nerd. This is my peaceful time, just sitting and painting and playing Warhammer, but actually, it's a deeply frustrating hobby. I'm trying to paint these things to look good, and it doesn't always quite work. It's just more stress to my life.

Bart: Much like running a database on Kubernetes, we might say.

Andrew: Database on Kubernetes is much easier.

Bart: And maybe less expensive in some ways, depending on your collection.

Andrew: Absolutely.

Bart: That's cool though. What's next for Timescale?

Andrew: Next things here at Timescale are potentially multi-cloud, looking at other cloud providers like Azure and maybe Google Cloud in the future. I've been doing a lot of work on volume scaling. I go where I'm required, and it looks like Azure is up next to see how that works out.

Bart: What's the best way for people to get in touch with you?

Andrew: There's a Timescale community Slack that I'm on, or LinkedIn. If people want to message me, that's fine.

Bart: Well, Andy, it's been a pleasure having you with us. We've had a couple of other guests speaking about Postgres operators. I really enjoyed the depth you brought to the conversation and look forward to crossing paths with you again in the future.

Andrew: Thank you very much. It's been an absolute pleasure.