Our Journey to GitOps: Migrating to ArgoCD with Zero Downtime

Oct 28, 2025

Host:

  • Bart Farrell

Guest:

  • Andrew Jeffree

This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

Andrew Jeffree from SafetyCulture walks through their complete migration of 250+ microservices from a fragile Helm-based setup to GitOps with ArgoCD, all without any downtime. He explains how they replaced YAML configurations with a domain-specific language built in CUE, creating a better developer experience while adding stronger validation and reducing operational pain points.

You will learn:

  • Zero-downtime migration techniques using temporary deployments with prune-last sync options to ensure healthy services before removing legacy ones

  • How CUE lang improves on YAML by providing schema validation, early error detection, and a cleaner interface for developers

  • Human-centric platform engineering approaches that prioritize developer experience and reduce on-call burden through empathy-driven design decisions

Transcription

Bart: In this episode of KubeFM, I talk with Andrew Jeffree from SafetyCulture about their journey migrating hundreds of microservices to GitOps with Argo CD, without downtime. Andrew walks us through how his team replaced a fragile Helm-based setup with a domain-specific language built in CUE, giving developers a cleaner interface and stronger validation. We get into the practical challenges of moving more than 250 applications across clusters, the tricks they used to ensure zero disruption, and how automation like approval bots has reshaped their operations. It's a deep dive into scaling Kubernetes safely, cutting out YAML pain, and building a developer experience that actually works at scale.

Special thanks to Testkube for sponsoring today's episode. Are flaky tests slowing your Kubernetes deployments? Testkube powers continuous cloud-native testing directly inside your Kubernetes clusters. Seamlessly integrate your existing testing tools, scale effortlessly, and release software faster and safer than ever. Stop chasing bugs in production. Start preventing them with Testkube. Check it out at testkube.io.

Now, let's get into the episode. All right, Andrew, what are three emerging Kubernetes tools that you're keeping an eye on?

Andrew: I think if I had to pick the first technology off the list, it would be Argo CD Agent. It's a multi-cluster Argo solution without the drawbacks of using a single management cluster for manifest generation. I'm also really excited about in-place pod resource resizing in Kubernetes 1.33. Being able to dynamically adjust a pod's resources while it's running will be beneficial for many use cases.

The other technology I'm watching is the ambient mesh in Istio. It's not fully featured yet, but it's getting closer by the day. I'm really excited to eventually move to that.

Bart: Can you give us a quick introduction about who you are, what you do, and where you work?

Andrew: I'm Andrew Jeffree, a staff cloud infrastructure engineer at SafetyCulture, based in Sydney. I work in a team that manages the entire cloud infrastructure platform, including Kubernetes deployment, CI/CD, and related technologies.

Bart: How did you get into cloud native?

Andrew: I used to be a Linux systems administrator for a consulting company in Sydney. We started selling more cloud solutions, and I began doing more cloud-native work. It just sort of stuck.

Bart: Previously, you mentioned the tools and how the ecosystem moves very quickly. In your experience, what's the best way to stay up to date? Do you read blogs? Do you watch videos? What works best for you?

Andrew: I tend to read a lot of blogs. I'm also active in community Slacks, Discords, and various other platforms. Hacker News always has something interesting. I typically pick a topic that catches my eye. Obviously, we have a passionate team internally who are constantly sharing articles they've discovered in their free time. This leads to ongoing discussions about cool new things.

Bart: If you could go back in time and share a career tip with your younger self, what would it be?

Andrew: Write more code. I started out as a systems administrator. These days, I'm pretty good at writing code, but there are times when I've looked back and thought, "That's a code problem. That's not my problem" or "I don't want to go into that. That's someone else's problem." Looking back, I could have progressed differently and solved some problems earlier.

Bart: As part of our monthly content discovery, we came across an article you wrote titled "Our Journey to GitOps: Migrating to ArgoCD with Zero Downtime". You mentioned that SafetyCulture runs hundreds of microservices across multiple Kubernetes clusters. Can you describe your deployment infrastructure before considering any changes?

Andrew: We operate in three regions with three production clusters: one in Europe, one in America, and one in Australia. We were using Helm with a monorepo for all our Helm configurations and services, using a hierarchical configuration approach. We had default, global, and environmental/regional configs per service, all defined in YAML.

The system worked with a pipeline per cluster. Each service pipeline would trigger child pipelines to deploy specific software versions. We used a microservice Helm chart to control our template structure and maintain default standards across services, with similar approaches for CronJobs.

We also had additional pipelines to enable manual scaling after hours without requiring Git changes and approvals. However, the system was starting to become complex and was beginning to fall apart.

Bart: It sounds like this Helm-based approach worked initially, but you mentioned it became increasingly painful as you scaled. What specific incidents or pain points made you realize that you needed a different solution?

Andrew: There were multiple pain points. YAML is not the nicest language to define things in. I don't think I've found anyone who really says they love it. It's not particularly user-friendly. If you work in the Kubernetes space, you understand the indentation and the specs. You know an image version goes under image, because that's the standard way we write YAML for Kubernetes definitions. But the average software developer shouldn't need to know that.

The inheritance issues were also causing problems. People would change one thing and not realize they had to modify other files. This was creating significant challenges.

We also had incidents where an engineer would get woken up at 3 am by an alarm indicating more traffic than expected because the auto-scaling hadn't worked for some reason. In that moment the focus is remediation, not full problem diagnosis. They would typically scale up manually through a pipeline to avoid a repository review.

They would run Helm commands, adjust versions, and increase replicas. The next morning, their teammate would deploy changes to production, potentially unaware of the previous night's incident. This could lead to unresolved underlying issues or emergency configuration changes that were hastily implemented.

As we added more services and customers, these problems became increasingly problematic for everyone involved.

Bart: Given these pain points with Helm, you decided to migrate to GitOps with ArgoCD. So, some basic questions before we get into it further: Why GitOps? And why Argo and not Flux? What were you hoping to achieve with this migration?

Andrew: So, I'll call it a team effort. I was one of many people on the project and came in halfway through the actual migration. The reason we went with GitOps is we wanted a source of truth. We were already doing something pseudo-GitOps with a monorepo containing all Helm configurations, so we were halfway there in some aspects.

We looked at it and wanted a source of truth to prevent configuration drift. We didn't want to struggle with updating files or adding pipeline steps as we scale and add more Kubernetes clusters. We wanted a streamlined approach that moves away from complex pipeline management.

The goal was to have automatic reconciliation, a clear source of truth, and historical change records that make it easier to track who changed what and when. As for choosing Argo over Flux, we ultimately selected it because it had a better plugin architecture, though I'm not fully certain of the specific details since this was before my time.

We also wanted to address developer pain points. Previously, when developers merged changes, the pipeline would go green, but this only meant the configuration was validated—not that the change was actually rolled out. This caused confusion, which we aimed to resolve with our new approach.

Bart: Okay. Instead of simply moving your Helm deployments to Argo CD, you chose to implement a domain-specific language using CUE lang. Now, for folks that may not have experience, it sounds like a pretty significant investment. What drove that decision?

Andrew: We had a conversation and explored multiple approaches. We looked at keeping Helm charts, using KCL, exploring other languages, and even generating output directly from Go. Ultimately, we focused on the developer experience, aiming to create an internal platform with a clear, well-defined interface.

As we expand our platform capabilities, we want to make things more streamlined. For example, with KEDA scaling, instead of manually configuring multiple arguments, we want to simplify it to something like "I want metric-based scaling for my Kafka" and have the system handle the complex configuration behind the scenes.

In the end, we decided to adopt a domain-specific language (DSL) and chose CUE lang early on, specifically version 0.3 or 0.4. It's now on version 0.13, with version 0.14 in the release calendar. In hindsight, it was a good decision, though we did encounter challenges along the way.
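
To make the DSL idea concrete, here is a minimal, hypothetical CUE sketch of the kind of developer-facing interface Andrew describes. The #Service definition, field names, and limits are invented for illustration rather than SafetyCulture's actual schema; the point is that a developer writes a few high-level fields and the platform expands them into full Kubernetes and KEDA manifests elsewhere.

    package platform

    // Hypothetical platform-side definition: constrained fields with sensible defaults.
    #Service: {
      name:     string & =~"^[a-z0-9-]+$"
      image:    string
      replicas: *3 | int & >=2 & <=50
      scaling?: {
        // "I want metric-based scaling for my Kafka": the platform expands
        // this block into a full KEDA ScaledObject behind the scenes.
        type:  "kafka"
        topic: string
        lag:   int | *1000
      }
    }

    // Developer-facing config: only the high-level intent is written by hand.
    checkout: #Service & {
      name:  "checkout-service"
      image: "registry.example.com/checkout:1.42.0"
      scaling: {
        type:  "kafka"
        topic: "orders"
      }
    }

Running cue export over a file like this produces concrete output, with defaults such as replicas: 3 filled in automatically when the developer doesn't set them.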

Bart: You mentioned that CUE can import and validate against existing Go types and OpenAPI schemas from Kubernetes and CRDs. How has this schema validation capability prevented issues in production?

Andrew: Part of our process is having CI checks in the pipelines where we can run CUE commands to validate our CUE configurations. Since it imports schemas, it also imports validations on those schemas around requirements: required fields, regex checks, minimum values, and other constraints.

When you render a manifest using CUE, it immediately identifies if you're missing a required field or if the provided values don't match the required specifications. This means we can catch issues much earlier compared to Helm, where people typically just eyeball YAML or templated YAML, which can be complex.

With Helm, you often have to manually deploy to a cluster to verify configurations and check limits to ensure you're not over-defining resources in the Custom Resource Definition (CRD). In contrast, CUE provides native checking, allowing us to catch problems during the pull request stage.

If someone raises a PR with changes, our CI checks will prevent the merge if any validation fails. We can surface these errors much earlier, unlike in the Helm world where technically valid YAML might pass, but deployment reveals configuration issues—like incorrectly set replica counts or improperly placed fields.
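
As a rough illustration of that workflow, the sketch below assumes the Kubernetes apps/v1 Go types have already been imported into the CUE module with `cue get go k8s.io/api/apps/v1`; the package layout is an assumption on my part, not SafetyCulture's repository. A CI step running `cue vet ./...` then fails the pull request if a manifest is missing required fields or provides values of the wrong type.

    package manifests

    // Definitions generated by `cue get go k8s.io/api/apps/v1`.
    import appsv1 "k8s.io/api/apps/v1"

    deployment: appsv1.#Deployment & {
      apiVersion: "apps/v1"
      kind:       "Deployment"
      metadata: name: "checkout-service"
      spec: {
        // `cue vet` rejects this file if a required field is missing or a
        // value has the wrong type (for example, replicas: "three").
        replicas: 3
        selector: matchLabels: app: "checkout-service"
        template: {
          metadata: labels: app: "checkout-service"
          spec: containers: [{
            name:  "checkout-service"
            image: "registry.example.com/checkout:1.42.0"
          }]
        }
      }
    }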

Bart: With 250 to 300 applications across 20 teams, this migration sounds massive. Walk me through your strategy and why you chose a team-by-team approach.

Andrew: We have many standards around how we build services. We write most of our code in Go and use internal modules to standardize processes. This has made things easier.

With any large-scale project, you must start somewhere and have a clear plan. You can't approach it piecemeal; you need achievable goals to report progress to leadership and demonstrate value to the engineering team.

Most of our teams work primarily on their services, which makes it easier to move and determine migration strategies. This approach provides a clear developer experience and reduces cognitive complexity. Instead of having services spread across different systems like Argo and Helm, we create a consistent approach.

We focused on supporting teams by meeting with them beforehand, explaining the benefits, and addressing their concerns. Initially, there was a perception that the platform team was introducing technology for technology's sake. We countered this by highlighting practical benefits, such as addressing scaling issues and reducing middle-of-the-night service disruptions.

Our approach was to show teams how we could solve their existing pain points and improve their overall experience.

Bart: Can you explain your zero-downtime migration process that involved creating temporary deployments with a temp suffix?

Andrew: We were looking at how to mitigate configuration drift and potential human mistakes. We ended up creating a temporary deployment with a separate name. We kept the old Helm deployment and ensured they had the same labels, which meant selectors on virtual services would route traffic to the new temporary deployments.

We leveraged ArgoCD's prune-last sync option on the legacy deployment. This meant ArgoCD would not remove the resource until the temporary deployment was healthy. Argo checks pod health, and we also monitored metrics and dashboards for each service during the process. This approach ensured that working pods would not be removed if the new pods were unhealthy.

This strategy saved us multiple times, especially with services that have unique configurations or potential issues. For instance, we had about a dozen singleton services that could only run one copy due to their nature.

One of our engineers suggested implementing leader election for these services. This approach allows running multiple copies, with additional replicas sitting idle. If the original leader pod fails, another can be elected much faster than launching a new node, downloading the image, and initializing a new pod. We proactively implemented this change across all singleton services before migration.
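
ArgoCD exposes the behaviour Andrew describes through per-resource sync options. Below is a minimal, hypothetical sketch, expressed in CUE to match the rest of their tooling, of a legacy Deployment annotated with PruneLast alongside a temporary Deployment that shares its labels; the names and labels are made up for illustration, and the pod specs are omitted for brevity.

    package migration

    // Old Helm-managed Deployment: kept in the rendered output during migration,
    // annotated so ArgoCD prunes it only at the end of a successful sync.
    legacy: {
      apiVersion: "apps/v1"
      kind:       "Deployment"
      metadata: {
        name: "checkout-service"
        labels: app: "checkout-service"
        annotations: "argocd.argoproj.io/sync-options": "PruneLast=true"
      }
      // spec omitted for brevity
    }

    // New ArgoCD-managed Deployment with a temporary name but the same labels,
    // so existing selectors and virtual services keep routing traffic to it.
    temporary: {
      apiVersion: "apps/v1"
      kind:       "Deployment"
      metadata: {
        name: "checkout-service-temp"
        labels: app: "checkout-service"
      }
      // spec omitted for brevity
    }

Once the cut-over is complete and the legacy Deployment is dropped from the rendered output, the prune-last behaviour means it is removed only after the rest of the sync has succeeded, which is the safety net Andrew describes.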

Bart: You encountered some technical challenges during the migration, particularly with DSL coverage and ArgoCD performance. How did you handle these issues while maintaining migration momentum?

Andrew: It was hard. A lot of it involved long days and angry, rude words directed at various programming software. We have services that comply with our standards. As with every rule, there are exceptions. These exceptions are usually either older services that have been around since before time—critical linchpins to the entire platform—or services brought in through acquisitions.

In our case, it was a mix of both. We were actively bringing in services from acquisitions and integrating them into our system. This meant they needed extra configuration to work as we needed while we were still untangling things. We also had legacy services missing certain functionality, with hard-coded logic in our Helm charts—for example, configuring services based on specific service names.

Thankfully, we'd set up the application in Argo's Diff Mode, essentially with Sync turned off. This allowed us to look at Argo and see the differences in resources, ensuring we had all configurations aligned. If something was missing, we could track it down. Fortunately, it was only a small subset of issues, so it wasn't too difficult to manage.

When challenges arose, we would adjust our migration strategy—moving a team further back in the migration list, fixing configurations, updating test services to cover new features, and focusing on more standardized teams in the meantime.

Bart: So the migration delivered several measurable improvements, from automated reconciliation to environment-wide deployments. Which improvements have had the biggest impact on your team's daily operations?

Andrew: The GitOps approach, fully adopted, has been the biggest benefit to the teams. Previously, we had a pseudo GitOps approach that wasn't tracking versions, which meant we lacked a disaster recovery aspect. If we spun up a new cluster or needed to recover a cluster, we'd have to re-trigger every service pipeline to get deployments and versions in place—a huge amount of work.

Moving to this new model with GitOps and the DSL, we can roll out platform-wide changes at the environmental level or across the entire platform safely, in a more structured manner. We can ensure changes are applied when we want them, instead of waiting for teams to progressively redeploy services or manually redeploying every service to implement new changes. This has been a significant time-saver, and having a history of changes is invaluable.

Bart: The article mentions that you developed automated approval workflows, including bots that review scaling adjustments. Can you explain how this automation has changed your operational model?

Andrew: In the initial move to DSL, the first team encountered a scaling issue that required middle-of-the-night approvals. They realized that waking up two people to scale a service was inefficient.

To address this, they developed a GitHub bot solution. The bot would vet changes using CUE validation, checking against predefined schemas to ensure changes meet required minimums and maximums. The bot would verify:

  • Changes are from an approved author

  • Only one service is being modified

  • Changes are for a specific environment or region

With these checks, the bot could automatically approve the PR, allowing developers to merge changes in the middle of the night without waking up team members. All actions would be logged for future reference.
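
As a hypothetical sketch of the kind of bounds such a bot could vet with `cue vet` before auto-approving a scaling PR, the schema below encodes a handful of limits; the field names, regions, and numbers are invented for illustration and are not the team's real schema.

    package approvals

    // Illustrative bounds an approval bot could validate a scaling change against.
    #ScalingChange: {
      service:     string
      environment: "dev" | "staging" | "prod"
      region:      "au" | "us" | "eu"
      minReplicas: int & >=2
      maxReplicas: int & <=50 & >=minReplicas
    }

    // Example change raised at 3 am: within bounds, so the bot can auto-approve it.
    change: #ScalingChange & {
      service:     "checkout-service"
      environment: "prod"
      region:      "au"
      minReplicas: 4
      maxReplicas: 20
    }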

Bart: You've shared several valuable lessons, including "respect muscle memory" and "plan for scale from the beginning". What advice would you give to teams considering a similar migration?

Andrew: In regards to muscle memory, it's really about the interface you provide to developers. We did shift the interface significantly by moving from YAML definitions to CUE. I'd talk more about the tooling in this case.

Previously, all the runbooks involved going to a specific pipeline and performing certain actions—it was muscle memory for people after years of doing it. Now, they go into our Internal Developer Platform (IDP), click a button, fill in a form, and it scales and changes across all our environments.

In many cases, we had to update our tools to be smarter and more realistic. We added checks like: Is this tool migrated? If so, we provide clear guidance, saying, "That's not the way to do it. Go here," or better yet, do it for them the new way and inform them, "FYI, this is the new way. We've done it for you this time, but we don't promise this will continue to work."

It's about minimizing pain in the transition. Our platform was fundamentally built with bash scripts, which can be challenging to write, especially as complexity increases. These scripts almost became their own applications. We had to decide: Do we continue investing effort in this approach, or do we push people towards new patterns?

It's a balance of finding middle ground—sometimes it's easier to absorb the cost of maintaining these scripts and slowly rewrite them into proper languages and tools. This allows teams to continue having a central, consistent interface instead of a complete mindset shift.

Over time, we're gradually eliminating older interfaces to reduce technical debt. Ultimately, it's a developer experience problem. You need to treat developers as your customers and ensure their satisfaction.

Bart: Looking forward, you mentioned implementing Argo Rollouts and evolving toward a more pure GitOps approach. The article also briefly mentions using pre-sync hooks for database migrations. How do these fit into your future GitOps workflow?

Andrew: Part of the separation of CI and CD is where the project starts. The holy grail of DevOps platform engineering is that separation of state. The long-term vision is that build pipelines should build the image, run tests (unit testing, etc.), check it, and after the pipeline is green, deployment is no longer the developer's problem.

To achieve this, there is a need for safety and surety. Previously, teams were using one of our clusters as a canary cluster to allow promotion between regions in the production environment. When we moved to Argo, we removed that option and mandated environmental deployment. Teams must now use feature flagging and proper risk management techniques, validating things earlier.

We looked at how to leverage these patterns and enable teams to deploy safely, especially for complex services like identity services that every other service depends on. We implemented Argo Rollouts to provide proper canary deployments within clusters, using metric promotion to evaluate service health. Teams can now define custom metrics for HTTP endpoints and determine when a service is ready to promote.
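
For readers unfamiliar with Argo Rollouts, here is a rough, hypothetical sketch (again in CUE) of a canary strategy that shifts traffic in steps and gates promotion on an analysis template; the weights, metric name, and service name are assumptions, not SafetyCulture's configuration.

    package rollouts

    // A canary Rollout that shifts traffic in steps and gates promotion on a metric check.
    rollout: {
      apiVersion: "argoproj.io/v1alpha1"
      kind:       "Rollout"
      metadata: name: "identity-service"
      spec: strategy: canary: steps: [
        {setWeight: 10},
        // Promotion continues only if the AnalysisTemplate's metric query passes.
        {analysis: templates: [{templateName: "http-success-rate"}]},
        {setWeight: 50},
        {pause: duration: "10m"},
        {setWeight: 100},
      ]
      // replicas, selector, and pod template omitted for brevity.
    }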

Regarding pre-sync operations, we recognized that running database migrations in CI requires production environment access, which is a security risk. We moved database migrations to pre-sync hooks, ensuring that if migrations fail, the service does not deploy. This approach has worked well, with some initial challenges around observability and surfacing the right information to developers.
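
The pre-sync migration pattern Andrew mentions is commonly implemented with ArgoCD resource hooks. A minimal, hypothetical sketch in CUE might look like the Job below, where a failed migration fails the sync and the new service version never rolls out; the image and command are placeholders.

    package hooks

    migrationJob: {
      apiVersion: "batch/v1"
      kind:       "Job"
      metadata: {
        generateName: "checkout-db-migrate-"
        annotations: {
          // Run before the rest of the sync; replace any previous hook Job first.
          "argocd.argoproj.io/hook":               "PreSync"
          "argocd.argoproj.io/hook-delete-policy": "BeforeHookCreation"
        }
      }
      spec: {
        backoffLimit: 0
        template: spec: {
          restartPolicy: "Never"
          containers: [{
            name:    "migrate"
            image:   "registry.example.com/checkout-migrations:1.42.0"
            command: ["/app/migrate", "up"]
          }]
        }
      }
    }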

Bart: And Andrew, for teams out there that are currently struggling with similar Helm-based deployment challenges at scale, what would be your key takeaway from this entire migration experience?

Andrew: It's tough. It really depends on where you are in the maturity cycle of your deployments. Helm has its place, but I wouldn't adopt it personally if I went to a new organization. I know they're talking about a major rework in Helm v4, though I've not looked into it in depth to see what improvements might change my mind.

I would err on the side of building a platform with something out of the box or more templated. Timoni (I think that's how it's pronounced) comes to mind—another tool written in CUE that's doing something very similar to what we've done internally. It uses a domain-specific language to build your platform, allowing you to migrate with a clear plan, have a robust testing strategy, properly scale, validate everything, and create clear documentation in advance for your users.

Bart: One of the things that stands out for me in this conversation is the question you keep coming back to: if this problem happens, how many people have to wake up? It's really about human empathy. At the end of the day we choose a solution, but what is that going to involve in terms of the human effort needed to move things forward?

What do you think is causing some organizations to not see that, and what can be done to have a more human-centric approach of understanding that, as much as we're talking about very technical things, we're ultimately talking about quality of life for people?

I've been working in companies where DevOps teams take an extra phone with them to have OpsGenie available 24-7, 365. Some people have children and are waking up in the middle of the night to address incidents. I've seen co-workers have to take out a laptop in a gym or at a company dinner.

What piece of advice would you have for teams that want to build this more into their strategy? What are the things they should consider?

Andrew: A core building block is empathy. You should be on call and available at the escalation point. As much as no one wants to get woken up, the key is understanding that people are doing their best not to trigger problems they have the power to fix.

As a service provider, you should ensure the service is as stable as possible and that users can self-service fixes when needed. You need to learn from incidents, attend postmortems, and work in partnership—not as a dictator. It's about making friends, not enemies.

Think of it like a product team: engage with your co-workers, do user research, and show them potential solutions. Ask for their perspective because you can only understand so much from your own lens. Involve them frequently and don't be afraid to ship iteratively.

Start with small steps. Share a roadmap that shows progression. For example, begin with manual rollout promotions to build confidence in the underlying technology. Then sit down and ask: What are you doing in these manual steps? What are you validating? What are your concerns? What tasks are painful? How can we automate and improve these processes?

Bart: Andrew, what's next for you?

Andrew: I'm working on a bunch of projects as a staff engineer. I don't usually get to focus on just one thing. I'm currently rebuilding our development environment. We've got over 60 namespaces in our development environment, and each has a copy of every service. This means it's actually a bigger cluster than any of our production clusters and costs more money. It's not really production-like due to its nature, so it ends up being quite clunky and painful. I'm in the process of rebuilding it into a new model where we have a single namespace and deploy only the dependent services someone is testing, using header-based routing for them.
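
As a loose illustration of header-based routing of the kind Andrew describes, and assuming an Istio-style VirtualService (which is my assumption, not a detail he gives), a request carrying a branch header could be steered to a branch-specific deployment while everything else falls through to the shared copy. The names and header key below are made up.

    package devenv

    // Route requests carrying a branch header to a branch-specific copy of the
    // service; everything else falls through to the shared default deployment.
    virtualService: {
      apiVersion: "networking.istio.io/v1beta1"
      kind:       "VirtualService"
      metadata: name: "checkout-service"
      spec: {
        hosts: ["checkout-service"]
        http: [{
          match: [{headers: {"x-dev-branch": {exact: "feature-123"}}}]
          route: [{destination: host: "checkout-service-feature-123"}]
        }, {
          route: [{destination: host: "checkout-service"}]
        }]
      }
    }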

Bart: If people want to get in touch with you, what's the best way to reach you?

Andrew: The best method is probably LinkedIn. As much as it's the bane of everyone's existence, it's a good filtering tool. If you send me a message, I'm usually happy to respond, or you can connect with me if you can't send a message. I don't bite. I'm also on various community Slacks, including the CNCF Slack, so you'll probably find me there as well; feel free to send me a message.

Bart: Fantastic. Well, Andrew, thank you so much for sharing your knowledge and experience with us today. I look forward to talking to you in the future. Take care.

Andrew: No worries. Thank you for having me.