Patroni Backups: when pgBackRest and Argo CD have your back (literally)
Feb 3, 2026
This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
Your database backup strategy shouldn't be the thing that takes your production systems down.
Ziv Yatzik manages 600+ Postgres clusters in a closed network environment with no public cloud. After existing backup solutions proved unreliable — causing downtime when disks filled up — his team built a new architecture using pgBackRest, Argo CD, and Kubernetes CronJobs.
In this episode:
Why storing WAL files on shared NAS storage prevents backup failures from cascading into database outages
How GitOps with Argo CD lets them manage backups for hundreds of clusters by adding a single YAML file
The Ansible + Kubernetes hybrid approach that keeps VM-based Patroni clusters in sync with Kubernetes-orchestrated backups
A practical blueprint for making database backups boring, reliable, and safe.
Relevant links
Transcription
Bart Farrell: How to design database backups that won't take your systems down when something goes wrong. Today on KubeFM, we're joined by Ziv Yatzik, a senior DevOps engineer working in a closed network environment with no access to public cloud providers. In this episode, Ziv breaks down how his team designs and operates backups for hundreds of Postgres clusters running Patroni, focusing on point-in-time recovery, failure handling during switchovers, and why many existing backup solutions fall short in production. We look at the architecture behind the system: shared NAS storage for WAL files, pgBackRest, Ansible-driven automation, and a GitOps workflow using Argo CD to manage backups at scale through Kubernetes. This is an experience-driven discussion on making database backups boring, reliable, and safe, even in highly constrained, high-stakes environments. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped Kubernetes engineers from all over the world level up through Kubernetes courses. Courses are instructor-led and are 60% practical, 40% theoretical. Students have access to course materials for the rest of their lives. They are given in person and online, to groups as well as individuals. For more information about how you can level up, go to learnkube.com. Now let's get into the episode with Ziv. All right, Ziv. So what are three emerging Kubernetes tools that you are keeping an eye on?
Ziv Yatzik: Well, I really like Argo CD. I use it a lot in my work. I can deploy whatever I like on Kubernetes, and I can work using the GitOps methodology, which is the best methodology I've ever come across. I used to make a lot of mistakes, and once I adopted this way of working, using Git for everything I deploy, I started getting better and better, making fewer mistakes and becoming more organized. That's one tool I use a lot. Another tool I used for the first time about two months ago is k9s. I like the CLI; it's more flexible than a GUI and gives a lot of flexibility. And the third one I really like to use is ScaleOps. I use it to optimize the requests and limits of every customer in my Kubernetes cluster. That's how we manage CPU and RAM in a more organized, better, and more cost-efficient way.
Bart Farrell: So Ziv, tell us more about what you do.
Ziv Yatzik: I work for a company that provides infrastructure within a closed network environment, without relying on any public cloud providers like AWS or Azure. I've been part of the database DevOps team, which focuses on database-related infrastructure and automation. Over time, I moved from being a team member to leading the team, and now my new role is senior DevOps engineer in our organization, no longer focused specifically on databases.
Bart Farrell: Okay. And Ziv, how did you get into Cloud Native?
Ziv Yatzik: I was given the opportunity to take a six-month course that introduced me to the world of full-stack development. During the course, I became particularly interested in the DBA field and decided to dive deeper into database technologies. The company that sponsored my training and guaranteed a job placement after the course didn't have an open DBA position at the time. The closest role available was the one I eventually took, partly DBA and partly DevOps. So after completing the six-month full-stack program, I transitioned into a database DevOps role, where I began my first job in the high-tech industry as a DBA and DevOps engineer.
Bart Farrell: And now, Ziv, the Kubernetes ecosystem, it moves very quickly. It's very fast-paced. How do you stay up to date?
Ziv Yatzik: I try to keep up by reading a lot of articles, exploring new tools, and understanding how they can be applied and help my organization. I don't want to know only what we do today and keep using tools that merely work when things could be better. So I constantly try new tools and ask people from the industry, because we work in a closed environment and sometimes we may have missed a tool that exists and that people are already using. Take Argo CD: I first heard about it at a meetup in Tel Aviv about two years ago. I started working with it and it showed me a whole new world of DevOps and opportunities to make our organization better. Most of the time, when I reach for a new tool, it's because there is a pain point in my organization. If a tool isn't good enough, if we have a lot of problems with it and I get calls every night, I know there must be a change. That's when we start digging and looking for a newer tool with a better architecture. That's how we really keep up with the industry.
Bart Farrell: And if you could go back in time and share one career tip with your younger self, what would it be?
Ziv Yatzik: I don't really have any specific tips I would give to my past self. I truly believe in the saying that everything that goes wrong happens for the best. There is a reason for everything that happens, and every mistake I made along the way led me to where I am today, and I'm grateful for that. So just keep making mistakes, keep getting better, keep digging and looking for a better solution. That's it.
Bart Farrell: As part of our monthly content discovery, we found an article you wrote titled Patroni Backups: when pgBackRest and Argo CD have your back (literally). We want to look at it in more detail with the questions we're going to share. So before we dive into the technical details, let's set the stage. Can you tell us about the infrastructure you're working with, the scale, the workloads, and what kind of environment we're talking about here?
Ziv Yatzik: Yeah, sure. In our organization, we manage around 600 database clusters across multiple technologies and environments, from development and integration to pre-production and production. It's a large-scale setup that requires strong automation, monitoring, and DevOps practices to keep everything consistent and reliable.
Bart Farrell: You're running Postgres on Patroni clusters for high availability, but backups became a major pain point. What made point-in-time recovery so difficult in this setup?
Ziv Yatzik: When you work with reliable, highly available databases, the architecture naturally becomes more complex. One of our main requirements was to support point-in-time recovery for databases. In PostgreSQL with Patroni, we needed to copy every WAL file to separate storage to keep the timeline and be able to restore a database to any second if needed. After a switchover, we had to make sure we continued collecting WALs from the new primary. We chose pgBackRest because it integrates perfectly with Patroni and gives us stable, consistent backups. Other paid tools we tested added too much complexity and could affect database availability, which is exactly what we didn't want, so we preferred to build the solution ourselves using existing tools, keeping things reliable and avoiding unnecessary risks.
Bart Farrell: You looked at existing backup solutions, both commercial and open source. What were the specific gaps you discovered that made you decide to build your own solution?
Ziv Yatzik: Well, most of the tools we tested were not compatible with Patroni at all. Some companies even built new solutions specifically for clients using Patroni, but they were very buggy and the overall architecture was fragile. Their approach required having a local disk on each cluster member where Postgres would send the WAL files. The same disk had to be mounted on all servers, so that after a switchover the new primary could continue writing WALs there. The problem was that if the disk filled up before a backup ran, the Postgres instance would go down. It was the best solution we found at the time, but it was far from ideal. We kept getting alerts about database instances going down. I had a strong feeling there had to be a cleaner, simpler, and more reliable solution. Point-in-time recovery is not a core database feature, and it didn't make sense to me that we should experience downtime just to have point-in-time recovery available.
Bart Farrell: Your solution is built on several layers, starting with automation. Why did you choose Ansible as the foundation, and what does it actually automate in your backup architecture?
Ziv Yatzik: My team provides databases as a service when a new database is requested. Our Ansible automation deploys all the components, Patroni, HAProxy, PostgreSQL, etcd, and keepalived, with the required configuration across the VMs. So for our new backup architecture, we mounted a large NAS storage on every server and updated the PostgreSQL archive command to store WAL files there. Ansible applies all the needed configuration on the servers and ensures everything is idempotent, so we can safely rerun it at any time. In a separate step, Ansible also pushes a new commit to our Git repository, adding the new consumer for backup management. Once that's done, we simply sync Argo CD, which applies all the necessary objects and configuration in Kubernetes, leaving us with a clean, consistent setup and no leftovers from the old solution. This process is also part of every new installation: the customer just needs to request backups when we ask, and the automation runs as part of the standard deployment.
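For readers who want a concrete picture of that Ansible step, here is a minimal sketch under stated assumptions: the module names are real Ansible modules, but the mount paths, NAS export, repository layout, and the cluster_name variable are illustrative, not taken from Ziv's environment.

```yaml
# Sketch: prepare a Patroni node for the new backup architecture and register
# the cluster as a backup consumer in Git. Paths and hostnames are assumptions.
- name: Mount the shared NAS used for WAL archives and backups
  ansible.posix.mount:
    path: /mnt/pgbackrest
    src: nas.example.internal:/exports/pgbackrest
    fstype: nfs
    opts: rw,hard
    state: mounted

- name: Render the pgBackRest configuration on the node
  ansible.builtin.template:
    src: pgbackrest.conf.j2
    dest: /etc/pgbackrest/pgbackrest.conf
    owner: postgres
    group: postgres
    mode: "0640"

- name: Add the cluster's backup consumer to the GitOps repository
  ansible.builtin.shell: |
    mkdir -p clusters/{{ cluster_name }}
    cp templates/values-template.yml clusters/{{ cluster_name }}/values.yml
    git add clusters/{{ cluster_name }}/values.yml
    git commit -m "Add backup consumer for {{ cluster_name }}"
    git push
  args:
    chdir: /opt/patroni-backups-gitops
  delegate_to: localhost
  run_once: true
```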
Bart Farrell: One of the key architectural decisions was storing WAL files on shared NAS storage rather than locally. This might seem like a simple choice, but what problem does it actually solve in a Patroni environment?
Ziv Yatzik: In Patroni, we face the challenge of keeping a consistent backup timeline during switchovers. pgBackRest turned out to be the best tool for this because it can be used directly in the archive command, sending WAL files anywhere you want while preserving the correct timelines. So we decided to use a single large NAS storage for all clusters across the company. This way, when a switchover happens, the new primary already has the same mount point and continues sending WALs to the same location using the pgBackRest CLI, keeping the timeline fully consistent. Then came another challenge: what if the NAS storage fills up? The pgBackRest command inside the archive process can be configured to return a success signal to Postgres even when the upload fails, so the database doesn't go down no matter what. It logs the error instead of failing the archive, and our exporter triggers an alert that we can see in Grafana and wherever else we store alerts. Each environment has its own NAS storage. During early testing, our dev environment filled up and stayed full for three days, but not a single database went down. That was amazing. We couldn't achieve that before, and it happened by mistake, but it still showed that everything works and that we can skip backups without affecting any database. Of course, after that mistake we now actively monitor the NAS in all other environments and receive alerts long before it gets anywhere near 90% full.
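As a rough illustration of how pgBackRest can be told to keep archiving to the shared NAS without ever taking the database down, here is a sketch of the relevant options, written as an Ansible task for consistency with the rest of the setup. The option names (repo1-path, archive-async, spool-path, archive-push-queue-max) are real pgBackRest settings, but the paths and the size limit are assumptions; Ziv's exact configuration isn't shown in the episode.

```yaml
# Sketch: the "never take Postgres down because of archiving" knobs.
- name: Configure WAL archiving to the shared NAS in pgbackrest.conf
  community.general.ini_file:
    path: /etc/pgbackrest/pgbackrest.conf
    section: global
    option: "{{ item.option }}"
    value: "{{ item.value }}"
  loop:
    - { option: repo1-path, value: /mnt/pgbackrest }        # same NAS mount on every node
    - { option: archive-async, value: "y" }                 # push WALs in the background
    - { option: spool-path, value: /var/spool/pgbackrest }  # local queue for async pushes
    - { option: archive-push-queue-max, value: 10GiB }      # when the repo is full or unreachable,
                                                            # drop queued WAL with a warning instead
                                                            # of failing the archive_command
```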
Bart Farrell: Now, let's talk about the Kubernetes side of things. You're using Helm charts with Argo CD to manage backups. For folks who might not be familiar with this pattern, how does GitOps change the way you handle database backups?
Ziv Yatzik: Before I became the team leader, all company backups were managed by a single VM that ran a nightly cron job using a text file listing all clusters. If that VM went down, there were no backups at all, and the storage could fill up, taking databases offline. It was a fragile setup and honestly pretty scary; we got a lot of calls when that VM was down. So I suggested moving everything to Kubernetes. We built a new backup system using Argo CD and Helm charts. The chart includes a ConfigMap with all the settings, cluster names, backup types, destinations, and more, and creates two CronJobs, one for full backups and one for incremental. Each job runs separately for each Patroni cluster. When we add a new cluster, we just create a values.yml file in our Git repository, and Argo CD detects it and automatically deploys a new application for that cluster's backups. Any change in Git triggers Argo to sync the actual state with what's defined in the repo. After our Ansible automation prepares the servers and pushes the configs to Git, I can trigger backups anytime or even disable them by setting backups to false. Managing backups for hundreds or even thousands of clusters now feels exactly the same: it's all automated, clean, and version controlled. And now we can provide as many databases as anyone wants.
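To make the GitOps flow concrete, here is a sketch of what a per-cluster values file might look like. The chart's actual schema isn't published in the episode, so every key and name below is an illustrative assumption; the point is that adding one small file like this is all it takes to onboard a cluster.

```yaml
# clusters/billing-prod/values.yml (hypothetical schema)
cluster:
  name: billing-prod            # Patroni cluster / pgBackRest stanza name
  nodes:
    - pg-billing-1.internal
    - pg-billing-2.internal
    - pg-billing-3.internal
backups:
  enabled: true                 # set to false to pause backups for this cluster
  full:
    schedule: "0 1 * * 0"       # weekly full backup, Sunday 01:00
  incremental:
    schedule: "0 1 * * 1-6"     # incremental backups the other nights
repo:
  path: /mnt/pgbackrest         # shared NAS mount
```

Argo CD picks the file up and renders the two CronJobs from the Helm chart, one per schedule.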
Bart Farrell: Your pgBackRest configuration uses some interesting optimizations, running backups with up to eight parallel processes and enabling backups from standby servers. What's the thinking behind these specific configurations?
Ziv Yatzik: pgBackRest can use parallel processing, meaning it can run a backup using multiple CPUs instead of just one. This makes backups much faster and more efficient, especially on systems with several cores. It also supports standby backups, so we can take backups from a replica node instead of the primary, reducing the load on the production databases. We can also adjust compression levels and use asynchronous archiving to optimize performance in our high-throughput environments. I remember when we tried to find the best compression level, balancing storage space and backup time, we ran several types of backups on a one-terabyte database. We found that a medium compression level was the best: it didn't take too much storage, and it avoided the ten-to-eleven-hour backups, which were far longer than they needed to be. That was the best balance we could find between storage and time for each backup.
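Here is a sketch of how those options might show up in the full-backup CronJob the Helm chart renders. The pgBackRest flags (process-max, backup-standby, compress-type, compress-level, type) are real options; the image name, schedule, and the assumption that the runner container can reach the stanza and the NAS repo are illustrative.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: billing-prod-backup-full
spec:
  schedule: "0 1 * * 0"
  concurrencyPolicy: Forbid          # never overlap two backups of the same cluster
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pgbackrest
              image: registry.internal/db/pgbackrest-runner:latest  # hypothetical image
              command:
                - pgbackrest
                - --stanza=billing-prod
                - --type=full
                - --process-max=8      # parallelize the backup across CPU cores
                - --backup-standby     # read data files from a replica, not the primary
                - --compress-type=gz
                - --compress-level=6   # a "medium" level that balances size and duration
                - backup
```

The incremental CronJob would be identical except for --type=incr and its own schedule.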
Bart Farrell: You mentioned that the only change needed in Patroni itself is configuring how WAL files get archived. That seems surprisingly minimal. Walk us through what's actually happening in that integration.
Ziv Yatzik: When a WAL file is completed, Postgres triggers the archive command. In our setup, Patroni defines this command to use pgBackRest, which takes the WAL file and pushes it to the shared NAS storage. Every node knows exactly where to send its WALs, and after a switchover, the new primary keeps writing to the same location. The best part is that changing the archive command doesn't require a database restart, so we applied it safely, with no downtime and no impact on any database in our network. With this single configuration, the whole backup system stays connected and reliable, simple by design to minimize risk during deployment.
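In Patroni terms, the change looks roughly like the snippet below, applied through Patroni's dynamic configuration. The stanza name is a placeholder. archive_command itself is reloadable, so no restart is needed, although archive_mode is assumed to already be on (turning it on from off would require a restart).

```yaml
# Patroni dynamic configuration (e.g. via patronictl edit-config); stanza name is illustrative
postgresql:
  parameters:
    archive_mode: "on"
    archive_command: "pgbackrest --stanza=billing-prod archive-push %p"
```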
Bart Farrell: Observability is a huge part of your solution with Prometheus and Grafana. What metrics are you actually tracking? And what kind of alerts have proven most valuable in production?
Ziv Yatzik: We collect metrics using the pgBackRest exporter on the Patroni nodes themselves. We also collect metrics from our Kubernetes cluster so we can see if a CronJob fails. Using those metrics, we know when a job is failing. We also have the logs of every failed job on the container; we're going to ship them to Splunk and monitor those logs as well in the future. And we use the metrics for Grafana dashboards and alerts to see when the last backup ran, how long it took, the sizes of everything, and more.
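A sketch of the kind of Prometheus rules this enables is below. The first expression follows the metric naming of the pgbackrest_exporter project and the second uses kube-state-metrics; verify both metric names and the job-name pattern against your own exporters before relying on them, since neither is quoted in the episode.

```yaml
groups:
  - name: patroni-backups
    rules:
      - alert: PgBackRestBackupTooOld
        # fires if no completed backup in more than two days (threshold is an example)
        expr: pgbackrest_backup_since_last_completion_seconds > 2 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No recent pgBackRest backup for stanza {{ $labels.stanza }}"
      - alert: BackupCronJobFailed
        # assumes the CronJobs follow a "<cluster>-backup-<type>" naming convention
        expr: kube_job_status_failed{job_name=~".*-backup-.*"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Backup job {{ $labels.job_name }} failed"
```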
Bart Farrell: Let's talk a little bit more about the multi-node setup. Your configuration defines three Postgres nodes. How does pgBackRest decide which node to back up from? And what happens during a failover?
Ziv Yatzik: We use the backup-standby parameter to make pgBackRest back up from a standby only. It checks whether the server it connects to is a standby or a primary. If it's a standby, it just starts the backup; if not, it switches the connection to another instance and checks again. We do it because we don't want to put load on the primary: a backup requires a lot of reads from the disk, and we don't want that hitting the primary. During a failover there are no failures, because all the WALs are in the shared NAS and every instance can reach that folder and add new WALs to it.
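The node selection Ziv describes comes from listing all three members in the pgBackRest stanza together with backup-standby. A sketch, again as an Ansible task, with placeholder hostnames and data directory:

```yaml
# Sketch: define all three Patroni members so pgBackRest can find a standby to back up from.
- name: Define the cluster members in the pgBackRest stanza
  community.general.ini_file:
    path: /etc/pgbackrest/pgbackrest.conf
    section: billing-prod
    option: "{{ item.option }}"
    value: "{{ item.value }}"
  loop:
    - { option: pg1-host, value: pg-billing-1.internal }
    - { option: pg1-path, value: /var/lib/postgresql/data }
    - { option: pg2-host, value: pg-billing-2.internal }
    - { option: pg2-path, value: /var/lib/postgresql/data }
    - { option: pg3-host, value: pg-billing-3.internal }
    - { option: pg3-path, value: /var/lib/postgresql/data }
```

With backup-standby enabled, pgBackRest starts and stops the backup on whichever member is currently the primary but copies the data files from a standby, so a failover only changes which member plays which role, not where the WALs or backups go.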
Bart Farrell: One of your bold claims is that you couldn't find any existing solution, free or paid, that handled this as comprehensively. What makes this approach genuinely different from commercial backup solutions?
Ziv Yatzik: All of the solutions in the market used old-fashioned approaches. They are not using modern, DevOps-oriented methods that are distributed, monitored, easy to implement, and safe for the database. Most DBAs today don't combine their problem solving with new DevOps tools or an innovative mindset like this, and many are still stuck in the past. This is why I published this backup solution. I've heard that even in large companies, people still work in ways that put their databases at risk, or they provide low-quality backups just to avoid endangering the database. I wanted to show that you can do both: provide point-in-time recovery and still protect the database without affecting it.
Bart Farrell: Looking at the broader picture, you mentioned this architecture can serve as a blueprint for other database backup scenarios beyond Postgres. How would someone adapt this for, say, MongoDB or another database system?
Ziv Yatzik: Everything in this architecture is easy to adapt to other tools. We already finished modifying it to back up MongoDB as well in our company, and it is already running in production. The only things you need to change are the image of the backup CronJob and the backup script itself. And if you also want to install with Ansible, you need to modify the installation for each backup architecture. For MongoDB we only have dumps, not point-in-time recovery, for now, so we haven't tried that yet. Point-in-time recovery there is also enterprise-only, so we are talking with MongoDB these days about getting licensed, and then we will try to do point-in-time recovery as well.
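As a rough idea of how small the MongoDB change is, here is a sketch of the container section of the same CronJob template with the image and command swapped. The image name, secret, and paths are assumptions; mongodump and its flags are standard MongoDB tooling.

```yaml
# Only the runner image and the backup command change; the chart, schedules,
# and Argo CD flow stay the same.
containers:
  - name: mongo-backup
    image: registry.internal/db/mongo-backup-runner:latest   # hypothetical image
    envFrom:
      - secretRef:
          name: billing-mongo-backup-credentials             # provides MONGO_URI
    command: ["/bin/sh", "-c"]
    args:
      - >
        mongodump --uri="$MONGO_URI" --gzip
        --archive=/mnt/backups/billing-mongo/$(date +%F).archive
    volumeMounts:
      - name: nas-backups
        mountPath: /mnt/backups
```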
Bart Farrell: If a team wanted to implement this solution, what would you say is the most challenging part and where should they focus their initial efforts?
Ziv Yatzik: Every team that wants to implement this solution must know the database they want to back up and the backup tool they want to use. Our solution is the orchestrator for all backups; it's not the technology that creates the backup itself. After that, you need to learn a bit of GitOps and Ansible, and you're good to go.
Bart Farrell: Okay. You ended the article with "backups are like coffee, essential for surviving those unexpected moments." Beyond the humor, what's your final advice for teams managing critical databases in production?
Ziv Yatzik: My best advice for keeping every production database working is to touch it as little as possible, because humans make mistakes all the time. As a human, I want to touch my database as little as possible to prevent those human errors, but I also don't want to leave it unmaintained. So I make everything automatic: backups, installation, monitoring, configuration management, and whatever else the database needs. Sometimes, when you must do something on the database yourself, you want someone next to you to double-check everything and approve every command you want to run. Be super organized: create a list of commands before you run them, and explain every command, and why you're running it, to the person double-checking you. That's how we do it, and we prevent a lot of human errors and a lot of mistakes.
Bart Farrell: Ziv, what's next for you?
Ziv Yatzik: I'm moving now into a broader field beyond databases. I'm learning new approaches and new tools to work in a better, more efficient way, and I'm implementing them in my organization, like I did with databases and database backups. Now I want to do the same in more fields I don't know yet. I just moved into this role, so I'm very new to it.
Bart Farrell: And if people want to get in touch with you, what's the best way to do that?
Ziv Yatzik: By email or on LinkedIn.
Bart Farrell: Fantastic. Thank you so much for joining us and for sharing your knowledge with our community. Really appreciate all the effort you put into your article and wish you nothing but the best in the future.
Ziv Yatzik: Take care. Thank you.
