Which Kubernetes PostgreSQL operator should you choose?

Host:

  • Bart Farrell

Guest:

  • David Pech

Are you running PostgreSQL on Kubernetes and need to choose the right operator? In this episode, David Pech, Staff Cloud Ops Engineer, shares his experience implementing database platforms on Kubernetes and guides teams through operator selection and platform requirements.

You will learn:

  • The core requirements for a PostgreSQL platform on Kubernetes, including autopilot capabilities, security practices, and observability

  • How to evaluate PostgreSQL operators based on their architecture — from single-instance deployments to cloud-native implementations

  • What teams should consider before building their own database-as-a-service and common pitfalls to avoid

  • The distinction between being production-ready (running single instances) versus platform-ready (operating at scale with proper tooling)

Transcription

Bart: Plenty of folks are running stateful workloads on Kubernetes, but how are they doing it, particularly with popular databases like Postgres? Turns out that operators can be pretty important, but which one is going to work best for you? In this episode of KubeFM, our guest David Pech shared his insights about running databases on Kubernetes, specifically Postgres, including the fears of data loss and the need for operators to provide best practices and robust security features. David also highlights the growing importance of self-service capabilities for Postgres management and distinguishes between production readiness and platform readiness, emphasizing scalability and support. If something about navigating the complexities of database as a service or seeking to optimize Postgres for Kubernetes catches your attention, this episode is packed with actionable insights you won't want to miss. This episode of KubeFM is brought to you by LearnK8s. LearnK8s has been providing Kubernetes training to organizations worldwide since 2017. Courses are instructor-led, with 60% practical and 40% theoretical material, and are given online or in-person, in groups or individually. You also have access to the course material for life. For more information, check out LearnK8s.io. Now, let's get into the episode. David, welcome to KubeFM. Can you tell us a little bit about what you do and who you work for?

David: My name is David Pech, and I currently work at Wrike, a software as a service company that provides a platform for managing projects. My role is in infrastructure, where I help manage three data centers, two on-premises and one in the cloud. My responsibilities also overlap slightly with backend development, particularly in the reality sector, which I find interesting. In addition to my work at Wrike, I have a small startup called Sestra Emmy, which translates to Nurse Emmy in English. Our goal is to connect patients with their general practitioners through a secure channel, promoting e-health in the Czech Republic. We also have a presence in the Slovak market.

Bart: Fantastic. How did you get into cloud native in the first place? What was that story like?

David: With my startup, we have a track record on AWS, mostly running on AWS Lambda. I like the technology very much, particularly serverless from the start, which is very interesting. Our stack is based on Scala, a JVM language, and we had to pioneer a lot of things to have our serverless application running smoothly with minimal cold startup. We also use other AWS tools like AWS Aurora. Cloud native was very natural for me, and I have always wanted to transfer the applications I write on VMs to containers. When I was finally able to do it, I understood that we could use Docker Compose to run them, but it was not enough. We moved to Docker Swarm, and the last step was, of course, Kubernetes. This was kind of challenging because I think I discouraged everyone from Kubernetes for at least two years due to its vastness and difficulty to understand. I bought several books on the subject, and after reading about five of them, I started to consider that Kubernetes might be a way for us. So, cloud native was slightly slow or even painful for me, but currently, I'm all in for cloud native. We also run some stateful workloads, especially Postgres in Kubernetes, and we might share some details later.

Bart: Fantastic. That's great. Now, prior to this, what were you doing before becoming cloud native?

David: Well, the best description of me when I started would be a Java enterprise developer working on business applications. I also had a detour as a manager for a while, which was not very welcome for me, but it was still a very interesting work experience. Currently, I would call myself an infrastructure engineer, and from my perspective, this is the best value I can offer on the market. I have some experience with mainframes, especially IBM's z/OS, which I worked with about two years ago for one of our customers. I also brought an Oracle product called Forms to a container, which was a very interesting two-week operation. I have tried to push cloud-native boundaries everywhere I could, even for products that were not meant to be cloud-native. Technically, it is possible, and with this application, it was possible to some extent. In my opinion, infrastructure is much more challenging nowadays than business development, which I used to do. As business development became routine for me, I slowly but steadily transitioned into this field, which I very much like. One thing that I thought about before cloud-native was autoscaling, which I could never do well. Typically, we over-provisioned and had many spare resources just to be ready for rush hours. Now, I am able to autoscale my applications quite well, not perfectly, but quite well.

Bart: Just on the topic of autoscaling really quickly, do you have any experience working with the CNCF project KEDA?

David: Yes, also with Knative. If you would like to talk about this a little more, we can. There are several options for autoscaling, and they are all very interesting because this is not only about the pods or creating more replicas of your pod. This is also about provisioning your nodes, for example with Karpenter or other tools, which include challenges.

Bart: Now, it's no secret that the Kubernetes ecosystem moves very quickly. How do you stay up to date? I know you mentioned some resources previously. What works best for you? Books, blogs, videos, or podcasts?

David: Personally, I try to stay up to date with Kubernetes by reading change notes for major releases. Although this is getting difficult with each version, as the topic is very vast and it's easy to miss something important. For the top projects, I try to be thorough. My most trusted resource is typically conferences and their videos, where someone can show you the basic use case, explain the decision-making process, and discuss what is useful. I also like to listen to several podcasts when I'm running or strolling with my daughter, as this is a useful tool for getting general knowledge and understanding new trends. This helps me understand if there's a new hype that I should check out.

Bart: And I saw that you mentioned you're a Kubestronaut. I think you were the first Kubestronaut we've had on KubeFM. First of all, congratulations. But for some people who don't know what that is, can you explain what it means to be a Kubestronaut?

David: The CNCF is trying to push the certification part and has created around 10 different exams. Five of these exams are crucial, and if you take them, you will gain hands-on experience with Kubernetes. Passing these five exams will earn you the title of Kubestronaut, bringing you fame and recognition.

Bart: Congratulations, a lot of hard work there. Now, if you could go back in time and share one career tip with your younger self.

David: I may disappoint a few folks, but I didn't like the manager experience, especially because I was working for a large Czech company and it was very challenging mentally for me. Although it brought unique skill sets, such as soft skills and project management, I would probably advise my younger self not to do it and just be the technical guy.

Bart: As part of our monthly content discovery, we found this article series that you wrote, DBaaS in 2024, about selecting a Postgres operator for Kubernetes for your platform. You have been consulting with various organizations about Postgres on Kubernetes. What recurring challenges do you see teams facing?

David: I think that for many areas around Kubernetes, there is no single winner. Maybe we have Cilium for networking, or Vitess in the MySQL world. However, in many areas, there is no clear winner, especially in the Kubernetes Postgres area. Postgres itself is not a product, but rather something like the Linux kernel, requiring a lot of tooling around it. The Postgres community is doing a terrific job extending it, especially in newer releases. Nevertheless, this is why it is difficult to choose, as there are many tools for running Postgres, even on VMs, and different mindsets around it. The best practices differ widely, and there are different schools of thought on how to run Postgres properly, which can lead to disagreements.

Typically, teams start by having a database in Kubernetes, but not as the first workload. This is a good approach. However, companies are often afraid to put serious data into Kubernetes. They might use Redis for caching or offload secondary storage, but they are cautious with primary data. This slows down the adoption of Kubernetes for databases.

When companies do put a database in Kubernetes, it can go one of two ways. Either the developers or DevOps department take the lead and say it's easy to do, or there's a conservative DBA who likes their VMs and high-level tools around Postgres and is skeptical about Kubernetes. The first approach can lead to problems at scale, while the second approach might mean that the company never adopts Kubernetes for their main database.

I blame companies for not having proper testing tools for their infrastructure setup, especially for Postgres. Developers are good at testing applications, but when it comes to infrastructure, testing is limited. Companies can run dozens of small databases with developers or DevOps, but they might not have any Postgres or DBA expertise. They might say they're fine with it and even run it in production until they encounter scalability issues.

On the other hand, some companies use a monolithic database with many databases on a single Postgres cluster, sharing resources. The DBA might claim this is the most cost-efficient approach, but it's an approach that's stuck in the past. Teams worry about their data and don't want to lose it or put it in jeopardy. This is why they're cautious about adopting new approaches like Kubernetes.

Bart: Given these challenges, there is a need to define clear requirements. What should teams consider when building a Postgres platform?

David: From my perspective, companies need an operator for their Postgres database because running it is not very company-specific. At the operator level, it should have some kind of autopilot that helps with provisioning the base cluster. This autopilot should incorporate best practices, including security and configuration of the cluster, such as limits on the number of clients. Networking should also be solved, either by using Kubernetes-native networking with Services or with alternatives like PgBouncer or Pgpool-II.

Typically, people need to use extensions, such as pgvector, which is useful for machine learning tasks. Another important aspect is self-service, where developer teams can easily provision and edit databases by editing YAML. This should be a straightforward process, similar to what hyperscalers like AWS offer, such as provisioning a database clone next to the original one with minimal downtime.

Teams also want to know what's happening inside the database, especially when it's failing. Observability is crucial in Postgres, given the various levels of potential issues, from pod perspective to Postgres itself. It's essential to provide a clear overview of what's happening without requiring teams to have extensive Postgres experience.

Ideally, the same level of service and tools should be offered for development, staging, and production environments. However, achieving this can be challenging. The last aspect to consider is the maturity of the approach, ensuring the operator is production-ready. It's not worth investing time in tooling or platforms that won't make it to production. Incremental development is key, starting with a minimum viable product that can be improved upon based on feedback from the first team on board.

Bart: You mentioned operator features as one of the core requirements. What specific capabilities matter for teams that are evaluating operators, particularly those related to Custom Resource Definitions (CRD)?

David: Typically, teams look at the YAML and the Custom Resource Definitions (CRD) that operators offer. All operators offer some level of CRD nowadays, and they are different, offering different capabilities. For bootstrapping a simple cluster, every YAML I have seen is easy to understand at first sight. However, when you get to advanced topics, such as provisioning a database as a clone of the production environment, but only bringing a part of the users and databases, and then running some anonymization script, it gets complicated.

To evaluate operators, it's essential to look beyond basic "hello world" examples and consider more difficult use cases. High availability (HA) is a crucial aspect to evaluate, including what happens when the primary pod dies, when a Persistent Volume Claim (PVC) is lost, and how the operator recovers from these events. It's useful to have these metrics measured end-to-end from the perspective of the application, so you understand what will happen to your customers and what Service Level Objective (SLO) you can offer.
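As a rough illustration of measuring failover end-to-end from the application's side, here is a minimal Python sketch. It assumes you already have a log of periodic probes (for example, a trivial write through the database endpoint once per second) recorded while you kill the primary pod. The `Probe` type, the one-second interval, and the sample data are hypothetical and do not come from any particular operator.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One application-level check, e.g. a trivial write through the service endpoint."""
    timestamp: float  # seconds since the start of the test
    ok: bool          # True if the check succeeded


def outage_windows(probes: list[Probe]) -> list[tuple[float, float]]:
    """Return (start, end) pairs: first failed probe up to the next successful one."""
    ordered = sorted(probes, key=lambda p: p.timestamp)
    windows: list[tuple[float, float]] = []
    start = None
    for p in ordered:
        if not p.ok and start is None:
            start = p.timestamp          # outage begins
        elif p.ok and start is not None:
            windows.append((start, p.timestamp))  # outage ends
            start = None
    if start is not None and ordered:
        # Still failing when the test ended: count up to the last probe.
        windows.append((start, ordered[-1].timestamp))
    return windows


def total_downtime(probes: list[Probe]) -> float:
    """Sum of all outage windows, in seconds, as seen by the application."""
    return sum(end - start for start, end in outage_windows(probes))


# Hypothetical run: one probe per second, failing from t=3 to t=5.
sample = [Probe(float(t), ok=not (3 <= t <= 5)) for t in range(10)]
print(outage_windows(sample), total_downtime(sample))
```

Repeating the same measurement for each failure scenario (primary pod killed, PVC lost, node drained) gives numbers you can compare across operators and turn into an SLO.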

Many operators also offer good security practices, such as creating users and storing user credential information as secrets in Kubernetes. They can provision TLS certificates, allowing you to switch to TLS by default, and rotate certificates automatically. Observability is a different topic, with two parts: Postgres itself and the operator. The Postgres Exporter works well and is standardized, but the operator exporter is different, and you need to understand how the operator works, what conditions it checks, and what timeouts it has.

To set alerts, you need to understand what you are measuring and how the operator itself works. Postgres has different "schools" of thought on what to check and how the dashboard should look. Unfortunately, this is a vast topic. What I'm currently missing are best practices for Postgres configuration that the operator would offer. Maybe we are getting to the topic of autopilot, as many operators offer some level of autopilot. Technically, a lot of these operators can create a primary pod and replicas, and reprovision them if they fail.

They can also fail over automatically if the primary pod fails. This level of autopilot is high-quality in all operators. However, from the perspective of Postgres, it's completely different. The Postgres community is conservative, and they don't bump up configuration values with each new version. Many defaults for Postgres are set in a way that is not useful for running production. For example, the memory settings are not adjusted based on the available RAM, and the default settings can degrade performance easily.
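To make the point about memory defaults concrete, here is a minimal sketch of the kind of sizing an autopilot could do. The ratios used (about 25% of RAM for `shared_buffers`, about 75% for `effective_cache_size`) are widely quoted rules of thumb, assumed here for illustration. They are not a recommendation from the episode, and the function name is hypothetical.

```python
def suggest_memory_settings(ram_mib: int) -> dict[str, str]:
    """Derive two Postgres memory settings from the RAM available to the pod.

    Assumption: rule-of-thumb ratios only; real tuning depends on the workload.
    """
    if ram_mib <= 0:
        raise ValueError("ram_mib must be positive")
    shared_buffers = ram_mib // 4            # roughly 25% of RAM
    effective_cache_size = ram_mib * 3 // 4  # roughly 75% of RAM
    return {
        "shared_buffers": f"{shared_buffers}MB",
        "effective_cache_size": f"{effective_cache_size}MB",
    }


# Hypothetical pod with 8 GiB of memory.
print(suggest_memory_settings(8192))
```

For comparison, the stock `shared_buffers` default is 128MB regardless of how much memory the machine or pod has, which is the gap described above.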

When you use an operator, you may not be aware of these facts, and it can bite you significantly. I have seen consultancies promote operators as virtual DBAs, which is interesting because operators can do many things, such as installing the database, backing it up, or restoring it. However, when you get to real Postgres problems, such as a large table with bloat that needs to be vacuumed, the operator may not handle it well. There are many topics that are not currently handled well by operators, and I would like to see some evolution here. However, the Postgres community is against it, so we will see where this might go.

Bart: Now, the other pillar that you mentioned, I know you talked about this a little bit already, but the other pillar that you mentioned was self-service. Many teams struggle with this particular concept and practice. To you, what does proper self-service look like for Postgres on Kubernetes?

David: Well, I would describe it as a very easy conversation. Something like, "Hey, I want a highly available Postgres of t-shirt size L with 100 gigs of storage. Just make it happen. I don't care about anything." So now it's done, and here is your endpoint, here are your secrets, certificates, etc. This is the self-service for me. Of course, there is much more under the hood, for example, how to connect to the cluster with your application, etc. But let's say that for basic operations like cluster creation, reprovisioning of the nodes, backup and restore, basic disaster recovery, you should not need any more Postgres expertise than just adjusting the YAML. I would say something like this would be very nice. For production, it should look the same. It should not mean that you use some other tool just for production, or that some of the DBAs are using psql to adjust something for production that is not available in your development cluster. This is very important because it really pushes you to try to express everything possible through the YAML and not do any manual interventions that are not scalable. If there is some UI, typically a web UI, that allows you to control the YAML, this is very welcome. Of course, even though this is a simple form, you can use Backstage for this, or something like it, if you are fancy. If you are not, I have even seen a company running 800 clusters that is using Jenkins with a custom form for their users to adjust it. So anything that your customers or developers are accustomed to is okay here, as long as they are able to make changes.
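The "t-shirt size L with 100 gigs" request can be sketched as a small translation layer from a developer-facing request to a manifest. Everything specific below is an assumption made for illustration: the `dbaas.example.com/v1alpha1` API group, the `PostgresInstance` kind, the field names, and the CPU and memory values per size are hypothetical and would be replaced by whatever custom resource your chosen operator actually defines.

```python
import json

# Hypothetical size catalogue; the numbers are placeholders, not recommendations.
TSHIRT_SIZES = {
    "S": {"cpu": "1", "memory": "2Gi"},
    "M": {"cpu": "2", "memory": "8Gi"},
    "L": {"cpu": "4", "memory": "16Gi"},
}


def render_cluster(name: str, size: str, storage_gib: int,
                   highly_available: bool = True) -> dict:
    """Turn a self-service request into a manifest for a hypothetical custom resource."""
    if size not in TSHIRT_SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {sorted(TSHIRT_SIZES)}")
    resources = TSHIRT_SIZES[size]
    return {
        "apiVersion": "dbaas.example.com/v1alpha1",  # hypothetical API group
        "kind": "PostgresInstance",                  # hypothetical kind
        "metadata": {"name": name},
        "spec": {
            # One primary plus two replicas when HA is requested.
            "instances": 3 if highly_available else 1,
            "storage": {"size": f"{storage_gib}Gi"},
            "resources": {"requests": dict(resources), "limits": dict(resources)},
        },
    }


manifest = render_cluster("orders-db", size="L", storage_gib=100)
print(json.dumps(manifest, indent=2))
```

The same function would be used for development, staging, and production, which is what keeps the three environments on one code path; a web form, Backstage template, or Jenkins job would only collect the three inputs.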

Bart: Understanding these requirements is one thing, but implementing them is probably another. Before teams commit to building their own database as a service, what critical questions should they be asking themselves?

David: I often say that there is a curse in the world where developers are eager to build their own database without thinking about the requirements and the bigger picture. At least, that's what I've seen, and I'm also to blame for doing it many times myself. It's like a typical "something as a service" scenario, in this case, Postgres as a service versus making your own decision.

The first question is, can we even use remote Postgres instead of local ones? For example, isn't there an issue with GDPR that data needs to be stored locally? Can we offload it somewhere, or do we need someone to do security approval before we start a new cluster? This might be a down-to-earth problem that you face, and you just can't do automation unless your security team approves it. So, can you make them part of the process naturally, or maybe you are not able to do DBaaS at all with this?

Typically, I would require all workloads to be of a similar kind. If you are running OLTP, OLAP, and some machine learning stuff, this is probably too wide for a DBaaS. You will need Postgres experience, and you will need Kubernetes experience, and I would say a lot of it. Because everyone probably remembers how difficult it was to onboard Kubernetes and get everything right. Here, in this service, you will run only in Kubernetes, and if something goes wrong, it will go wrong at scale because you are probably not doing this to provision one database - you will have at least a few dozen of them. So, if you make any mistake, for example, a bad decision on how the API is constructed, this will definitely come back to bite you very significantly.

I would maybe summarize this to one simple question or several simple questions. If you would like to build your Postgres as a service for several years to come and keep investing in it next to your main products, do you intend to treat it as a product yourself and your developers as customers? And is it really feasible among all the alternatives? If you answer yes to all these questions, just go ahead and do it. But I think that, like most of us, you won't.

Bart: Fair point. Many teams might be reconsidering their approach at this point. What are the alternatives to running Postgres on Kubernetes?

David: Typically, a well-defined Postgres as a service can be used. You can opt for large cloud providers or traditional consulting companies that have their own offerings. These companies can install Postgres in your infrastructure with their parameters, typically including enterprise features and support, which may be feasible for you. There are also several projects that offer serverless Postgres as a service, which is an interesting concept, especially for low-volume webhook-type applications. I was surprised by the stability of these services. Some providers offer features like database branching, allowing you to create a new branch with the same starting point as your main branch, similar to Git, which enables easy experimentation, such as adjusting queries or tuning parameters, without affecting your main branch. This approach can even be used in production. However, it is advisable not to mix managed services with your Kubernetes offering, as this can result in two different levels of quality for Postgres. Typically, managed services are cheaper and used for staging, while production environments require more robust solutions with different parameters, especially regarding performance and long-term maintenance. If you opt for a managed service, consider using it for all workflow stages, including development and staging environments, if it is feasible from a pricing perspective. Additionally, you may want to explore PgBouncer or Pgpool-II for connection pooling, Patroni for high availability, or the Bitnami Helm charts for deployment.

Bart: For teams that decide to proceed with Kubernetes, there are several operator options currently available. How are these operators typically categorized, for example using Custom Resource Definitions (CRD)?

David: So, I see at least four different families of Postgres operators. The first is running a single instance, which can be done using the Bitnami Helm chart. However, this approach lacks operational support and is difficult to run in the long term, making it more suitable for demos or examples.

The second family is StatefulSet, where you have one StatefulSet that can grow or shrink. This approach has several drawbacks, as discussed in Jakub Scholz's talk on the issues with StatefulSets. The Strimzi team, which develops a tool to run Kafka, had to move away from StatefulSets and instead used an operator to create pods with their own conditions.

The third family is the Patroni family. Patroni is a high-availability tool that is widely used outside of Kubernetes and can be considered a de facto standard. However, it presents significant challenges because Patroni works as a Python wrapper script around Postgres, using etcd or the Kubernetes control plane as a source of truth. While Patroni can handle high-availability failovers, switchovers, and replica provisioning, it can be difficult to understand the status of the pod. This approach seems to be a lift-and-shift of a pre-Kubernetes VM approach, which may not be cloud-native.

The last category is cloud-native Postgres operators. This approach uses Kubernetes primitives or databases that can run cloud-natively by default, without any middle layers. This is the most cloud-native way to run Postgres.

Bart: Now, teams often want to start small with Postgres on Kubernetes, and then things immediately snowball from there. Based on your experience, why does this approach typically fail?

David: I don't mean that starting with anything is a bad idea, especially when trying to prove a concept or create a minimum viable product (MVP). You want to see if it can run at all, so you need to start somewhere. However, what is typically a problem is if this is not a systematic effort, as it can fail easily. This can result in bad naming conventions, provisioning something that the security team does not understand or did not approve, because it was deemed a playground area. At some point, it outgrows this stage and needs to go to production, at which point you face all these issues. If you take this technology seriously and evaluate it, it needs to have a life cycle. You should be aware of where you are in the project and whether you are ready to go to production or still need security approval. This is fine with anything you try, and it's culturally dependent. If your company culture pushes projects correctly, there are typically no problems.

The problem is that you need a large amount of knowledge about Kubernetes, cloud, and Postgres, which can be difficult to obtain. Sometimes developers have simple requirements, such as wanting their Postgres to store data, but it's difficult for them to voice other significant requirements from a security, infrastructure, or high availability standpoint. If you can get these people in the same room, something good might come of it. This is a useful exercise, as it can help prevent people from becoming scared of going to production for one reason or another.

From the developer's perspective, they might say that they have only a small-scale Postgres for staging, even though they have tested it, and they are unsure how it will perform in production. They might not be able to help the operations or infrastructure department because they don't know what options are available for high availability. On the other hand, if you have a Database Administrator (DBA) who is typically conservative and likes their Patroni on VMs, they need to be part of the development process to oversee and supervise the Postgres operator clusters created for staging. They need to understand what tooling they have and what to do if something goes wrong.

There are still many myths and misconceptions, such as the idea that Postgres running in a container will not perform as well as Postgres running without one. These myths need to be dispelled, but it's a long process. DBAs are often afraid of the break-glass scenario because if the operator crashes and they only have psql to connect to the database, they don't understand Kubernetes well enough to know what to do. Typically, DBAs need to understand Kubernetes to some extent. Even with an operator and help, it's a long process to get everything right.

Bart: There are numerous tutorials that promise simple Postgres deployments on Kubernetes. When I Googled, the first result was a DigitalOcean tutorial on how to run Postgres. What makes these approaches problematic?

David: I think I've hit the same issue with the tutorial. The tutorial is very interesting and looks professional, but it suggests running a Postgres cluster as a Deployment of three replicas. This means sharing the data directory, which is not suitable for Postgres. You will encounter strange errors immediately. What's even more interesting is that, on the VM level, you wouldn't be able to do this because Postgres has built-in checks, such as storing the PID file. However, in containers, this check does not work correctly. It's probably best to use an operator and not try to do this yourself, possibly using Custom Resource Definitions (CRD).
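A toy model can show why a PID-file style check protects a shared data directory on one VM but not across containers. This is a deliberately simplified sketch and not Postgres's actual implementation. It assumes only that the check amounts to "refuse to start if the PID recorded in the data directory belongs to a process I can see", and that each pod has its own PID namespace. The function name and PID values are hypothetical.

```python
from typing import Optional


def can_start(pidfile_pid: Optional[int], visible_pids: set[int]) -> bool:
    """Simplified lock check: start only if no visible process owns the data directory."""
    if pidfile_pid is None:
        return True  # no PID file, nobody claims the directory
    return pidfile_pid not in visible_pids


# Same VM: the second server sees the first one (PID 4242) and refuses to start.
vm_second_server = can_start(pidfile_pid=4242, visible_pids={1, 4242, 5000})

# Two pods sharing one volume: pod A recorded PID 27 in its own PID namespace.
# Pod B cannot see pod A's processes, so PID 27 looks dead and the check passes.
pod_b = can_start(pidfile_pid=27, visible_pids={1, 12})

print(vm_second_server, pod_b)
```

In this model the second pod starts on a data directory that is already in use, which matches the "strange errors" described above.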

Bart: The Bitnami Helm Chart is often seen as a middle ground between DIY and full operators. What are its practical limitations?

David: I like Bitnami as a project, and I've used many tools from them. However, when you delve deeper into the Helm charts, it can be overwhelming due to the numerous options available. The Helm charts are not suitable for me to bootstrap from, as they require adjustments. The primary issue lies in the basic Helm chart's day-two operations, where the chart has different defaults, such as replicas. This can lead to difficulties in keeping the cluster up and running, making it easy to encounter problems like halted pods or incorrect connections to the primary, resulting in a loss of streaming replication. The cluster can become disconnected, and the only observability is through log messages. This makes the system fragile. Although it's possible to run Bitnami with a replication manager (repmgr), I would consider this a legacy tool. Running the vanilla Bitnami chart can result in downtime, such as a primary pod restart causing an outage of two to three minutes. If you restart your replica pod and have only one, it takes one to two minutes. However, if you have multiple replicas, Kubernetes networking will work well, resulting in minimal downtime. Nevertheless, operators can perform better, and Postgres itself has built-in tools for failovers, which can achieve almost instant failover, maybe under one second of downtime, if configured correctly. Therefore, I would advise against starting with Bitnami, even if you like the project in general, and instead recommend using an advanced operator, possibly utilizing Custom Resource Definitions (CRD) for better management.
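To put the restart times above in terms of an availability target, here is a short arithmetic sketch. The 99.9% monthly SLO and the 30-day month are assumptions chosen for the example; only the "two to three minutes" and "under one second" figures come from the discussion.

```python
def allowed_downtime_seconds(slo: float, period_days: int = 30) -> float:
    """Error budget: how many seconds of downtime an SLO allows over a period."""
    return (1.0 - slo) * period_days * 24 * 3600


def restarts_within_budget(slo: float, seconds_per_restart: float,
                           period_days: int = 30) -> int:
    """How many primary restarts fit into the error budget for the period."""
    return int(allowed_downtime_seconds(slo, period_days) // seconds_per_restart)


budget = allowed_downtime_seconds(0.999)            # about 2592 seconds per 30 days
slow_restarts = restarts_within_budget(0.999, 180)  # three-minute outage per restart
fast_failovers = restarts_within_budget(0.999, 1)   # one-second outage per failover

print(round(budget), slow_restarts, fast_failovers)
```

Under these assumptions a three-minute restart can happen about 14 times a month before the budget is spent, while a one-second failover leaves room for routine node drains and upgrades.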

Bart: Throughout this discussion, you have touched on various aspects of running Postgres at scale. Could you clarify the distinction between being production-ready and platform-ready?

David: So, production-ready basically means that you know what you are doing and can run one instance of a Postgres cluster in production for several days. This is very different from platform-ready, which means you're able to scale it. For example, running 100 instances for five years, and as a team responsible for running this at scale, you can perform all basic operations for your customers and developers. This includes upgrading to a minor version, which might be easy, and migrating to a major version, which might not be that easy because some operators do not support it yet. You need to integrate this with your DevOps tools and stack, as well as your security stack. Additionally, you will need to provide long-term support, informing people that you are stuck with a particular version due to operator connections, etc. Platform-ready means running at scale and considering what might go wrong, such as bad automation around upgrades, which could have disastrous results not just for one team or pod, but for most of your company.
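One possible guard against an upgrade going wrong across the whole fleet is to roll it out in growing waves and stop at the first failure. The sketch below is only an illustration of that idea and not a method described in the episode. The wave sizes, the cluster names, and the `upgrade` callback are hypothetical.

```python
from typing import Callable, Optional


def plan_waves(clusters: list[str], first_wave: int = 1, growth: int = 3) -> list[list[str]]:
    """Split clusters into waves that start small and grow by a constant factor."""
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be at least 1")
    waves: list[list[str]] = []
    index, size = 0, first_wave
    while index < len(clusters):
        waves.append(clusters[index:index + size])
        index += size
        size *= growth
    return waves


def roll_out(waves: list[list[str]],
             upgrade: Callable[[str], bool]) -> tuple[list[str], Optional[str]]:
    """Upgrade wave by wave; stop at the first failure and report where it happened."""
    done: list[str] = []
    for wave in waves:
        for cluster in wave:
            if not upgrade(cluster):
                return done, cluster  # halt: later waves are left untouched
            done.append(cluster)
    return done, None


fleet = [f"pg-{i:03d}" for i in range(100)]  # hypothetical fleet of 100 clusters
waves = plan_waves(fleet)
print([len(w) for w in waves])

# Hypothetical upgrade function that fails on one cluster in the second wave.
done, failed = roll_out(waves, upgrade=lambda name: name != "pg-002")
print(len(done), failed)
```

With the failure injected in the second wave, only two clusters are touched before the rollout halts, which limits the damage to a small part of the platform.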

Bart: Congratulations on making it this far and for sharing such in-depth knowledge about running Postgres on Kubernetes. I read somewhere that your wife wasn't convinced by the blue color of your Kubestronaut jacket. Does she like it now? Can you tell us about your journey to getting certified?

David: For sure, the blue jacket is something you should get when you become a Kubestronaut. The blue color is very vivid. My wife suggested that I wear something more elegant, or she won't go with me anywhere. I like the jacket and would wear it proudly when I get it for the certifications. I also like certifications as a learning tool because I am very productive when I set a deadline for the exam. I know I need to get all the documentation ready by that date, so I prepare thoroughly. I'm not the type to try it and see how it goes; instead, I overprepare. This is the perfect scenario for me, having a tight schedule and doing it this way. I even think I was the first person to get all the current CNCF certifications. I highly recommend them, especially those with hands-on parts, because I like the idea that there's a task to complete, and if you pass it, you'll also pass the exam. I definitely recommend this.

Bart: What's next for you?

David: My wife is delivering a baby in two or three weeks or something like this. So, probably for the next year, I have my schedule rather full.

Bart: What's the best way for people to get in touch with you?

David: I would suggest LinkedIn. Definitely this one.

Bart: It's what worked for us. I can definitely speak to that from experience. David, thank you for sharing your time, experience, and knowledge with us today. I look forward to seeing the following parts in your blog series, as I believe there are four or five more coming out.

David: Yes, hopefully.

Bart: We will link that in the show notes. Thanks for joining us on KubeFM. We will speak to you soon. Take care.

David: Take care.