Kubernetes upgrades: beyond the one-click update
Host:
- Bart Farrell
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
Discover how Adevinta manages Kubernetes upgrades at scale in this episode with Tanat Lokejaroenlarb. Tanat shares his team's journey from time-consuming blue-green deployments to efficient in-place upgrades for their multi-tenant Kubernetes platform SHIP, detailing the engineering decisions and operational challenges they overcame.
You will learn:
How to transition from blue-green to in-place Kubernetes upgrades while maintaining service reliability
Techniques for tracking and addressing API deprecations using tools like Pluto and Kube-no-trouble
Strategies for minimizing SLO impact during node rebuilds through serialized approaches and proper PDB configuration
Why a phased upgrade approach with "cluster waves" provides safer production deployments even with thorough testing
Transcription
Bart: In this episode of KubeFM, Tanat discusses the engineering decisions and operational challenges behind managing SHIP, Adevinta's internal multi-tenant Kubernetes platform we spoke about in a previous episode. Tanat walks us through their shift from blue-green to in-place upgrades on EKS, the impact of cluster scale, and how they optimize for SLOs during full node pool rebuilds.
We explore how his team built custom dashboards to surface deprecated APIs using tools like Pluto and Prometheus, enabling safer upgrade cycles across teams. Tanat also shares hard-earned lessons on configuring pod disruption budgets, warming up ELBs in advance, and using canary clusters to mitigate production risk.
The conversation also covers emerging tooling like vCluster for multi-tenancy, Karpenter for cost-efficient autoscaling, and the use of eBPF via Cilium and Hubble. If you're responsible for Kubernetes operations or platform stability at scale, this episode offers pragmatic strategies rooted in real production experience.
This episode of KubeFM is sponsored by Learnk8s. Since 2017, Learnk8s has helped engineers all over the world level up through in-person and online courses. Courses are instructor-led and can be taught to individuals as well as to groups. Students have access to the course materials for life, and courses are 60% practical and 40% theoretical. For more information, check out Learnk8s.io.
Now, let's get into the episode. Well, Tanat, welcome to KubeFM. What are three emerging Kubernetes tools that you're keeping an eye on?
Tanat: The first technology I'm keeping an eye on is KRO, a new emerging open-source project from AWS. It stands for Kube Resource Orchestrator. As a platform engineer, I see great benefits from using KRO. For example, it provides an easy abstraction layer on top of Kubernetes APIs and allows defining dependencies among them. This could be super useful to simplify platform operations and enhance the developer experience.
The second technology is vCluster or virtual cluster. As someone running multi-tenant Kubernetes, we often face challenges where people rely on the same CRD inside the same clusters. When we want to upgrade the version for one, it conflicts with another. vCluster can provide an isolated view of Kubernetes for each tenant and can help speed up development by quickly spinning up new development environments.
Lastly, I'm interested in anything related to eBPF. I'm personally using Istio in Ambient mode and keep an eye on technologies like Retina, Cilium, and Hubble in this space.
Bart: Now, just for the sake of introduction, who are you? What do you do? Where do you work?
Tanat: My name is Tanat. I'm originally from Bangkok, Thailand, but now I'm living and working in Barcelona, Spain, with a company called Adevinta. We develop digital marketplaces around Europe.
Bart: And how did you get into cloud native?
Tanat: I started my career as a software engineer. Around 2017, my company began a proof-of-concept project exploring cloud native applications and twelve-factor principles. I volunteered to participate. The first platform I worked with was Red Hat OpenShift. I still remember the first out-of-memory (OOM) error that kickstarted my career in cloud native technologies. After that, I was introduced to Terraform, Kubernetes, and other related technologies.
Bart: Now, the Kubernetes ecosystem moves very quickly. How do you stay updated? What resources do you use?
Tanat: I think it's almost impossible to keep up with such a fast-paced industry. But most of my go-to sources are my LinkedIn network. I try to connect with a lot of people. When they share or react to something, I see interesting things come up. Those are my sources. Also, I have a few favorite sources such as KubeFM, Kubernetes learning platforms, and publications like SRE Weekly and Pragmatic Engineer, which are very good.
Bart: Thanks for the compliment. If you could go back in time and share one career tip with your younger self, what would it be?
Tanat: Write blog posts. They can be an excellent way to deepen your understanding: when you want to write about something, you have to understand it clearly, contemplate it, and think about the core idea of what you want to say. This helps me learn a lot about the topics I write about. Additionally, sharing blog posts gives you exposure. That's my recommendation.
Bart: As part of our monthly content discovery, we found an article you wrote titled "Unpacking the Complexities of Kubernetes Upgrades Beyond the One-Click Update". We're going to talk about this in more depth. It's also a subject that's come up on other podcast episodes, so we know there's a fair amount of interest in it. Can you tell us about the Kubernetes platform you run at Adevinta and what your team is responsible for?
Tanat: The platform I'm working with is called SHIP. It was mentioned in a previous episode with some of my colleagues. To recap, SHIP is our internal Kubernetes platform as a service, designed to be multi-tenant by default. This means that many developer teams share the same underlying Kubernetes clusters, but each has its own isolated namespaces to work with. My team is responsible for maintaining SHIP, including upgrades, feature additions, and on-call support.
Around three years ago, we decided to transition from using kube-aws, a command-line tool for provisioning Kubernetes clusters on AWS, to a managed service like EKS. Two main drivers prompted this change:
First, kube-aws was being deprecated and had stopped at Kubernetes 1.15, while upstream had already progressed to versions 1.19 or 1.20. This meant we were missing critical features and security patches, like volume snapshots and ingress classes. Additionally, many cluster components stopped supporting older versions, leaving us stuck with outdated components.
Second, we wanted to shift our team's focus toward higher layers of the stack, providing more features to SHIP users instead of managing infrastructure. With self-managed Kubernetes, we had to handle everything from creating etcd clusters to managing API servers. We experienced critical incidents involving etcd health and unstable API servers, which consumed significant time and resources.
By moving to EKS, we allowed AWS to manage the underlying infrastructure, enabling our team to focus on building on top of SHIP. While managing infrastructure was challenging, the experience of troubleshooting etcd and API server issues remains valuable, providing insights I can apply when problems arise.
Bart: Cluster upgrades seem to be a significant challenge at your scale. What problems were you facing with upgrades, and what strategy did you initially use to handle them?
Tanat: In the early days when we had just a handful of SHIP users, we used a strategy called blue-green upgrade. This meant provisioning a brand new, empty cluster and loading it with the updated version. We worked with the development team so they could start deploying the same workload in the new cluster and ensure everything worked before gradually shifting traffic from the old cluster to the new cluster—starting with five percent, then ten percent, and continuing incrementally.
This was a safe method because the team could first verify if the application could work properly with the new version. However, the problem emerged as our user base grew, making this approach unsustainable. Each migration required high coordination between our team and our users. We had to schedule maintenance windows with every team, pre-create new clusters, and work hand in hand to slowly shift traffic.
These migrations normally took months to complete and consumed an entire quarter just to perform one upgrade. The situation became more challenging when we moved to EKS, as EKS versions come with an expiration date—each version is supported for only 13 months before becoming end-of-life. This added pressure, forcing us to constantly run against the clock.
Bart: You mentioned that scale became a significant challenge in your upgrade process. What specific scaling issues did you encounter when migrating between clusters?
Tanat: One of the hidden complexities of the blue-green approach is what we call a scale mismatch. I think of every Kubernetes cluster as a living organism or living system that evolves over time through auto-scaling and organic traffic growth. When you spin up a brand new cluster, it starts empty, while the existing one already has mature traffic and workload optimized for real-world conditions.
For the new cluster to mature to the same stage, it requires multiple auto-scaling cycles and restarts, which introduces migration noise for users. For example, ingress controllers in the old cluster are ready and scaled to handle production workloads, while the new one needs time to warm up. Similarly, Prometheus in the existing cluster might be scaled based on the number of pods and objects, potentially using up to 80 gigabytes of memory, compared to just a few gigabytes in the new cluster.
The real problem is that even cloud providers need time to warm up. I remember having to submit a request to AWS to warm up elastic load balancers in advance before migration to ensure they could handle the same traffic as the old clusters.
Additionally, some applications are not designed to be moved this way. Stateful applications relying on persistent volumes bound to a specific cluster require downtime. To migrate such applications, you must stop traffic in the old cluster, take a backup, restore it into the new cluster, and then shift traffic, much like a disaster recovery exercise.
With these two main problems, it's almost impossible to do this migration in a sustainable way.
Bart: This led you to investigate in-place upgrades with EKS. Many might think that this is just running a single command, but you found that it's more complex. What were your main concerns when considering in-place upgrades?
Tanat: Many people may have the perception that an in-place upgrade in EKS is a default behavior—just a click in the UI or a single command line. However, there are many complexities behind this process.
There are two main concerns when doing an in-place upgrade that must be considered. The first is the deprecation and removal of APIs that could break existing workloads once you upgrade. This process is irreversible, so once you upgrade, you cannot go back.
Another significant issue is that an in-place upgrade requires replacing all existing node pools or nodes to move to the new version. This is what we call a "full cluster rebuild", and it generally causes problems, potentially including customer downtime or service degradation.
We decided to first examine the node rebuild scenario because it not only affects cluster upgrades but also impacts any operation requiring a full node rebuild—such as changing the AMI version of a cluster, which necessitates replacing every node. Investigating this issue benefits both scenarios.
Bart: You ran some interesting experiments to understand the impact of node rebuilds on your service-level objectives (SLOs). What strategies did you test, and did you consider using Karpenter for this?
Tanat: When we were doing this, we were looking for a benchmark of how impactful each node rebuild is. The most obvious indicator we could think of was our SLOs, because they are effectively a contract with our users. If they are impacted, we know our customers are impacted.
We tried two strategies to evaluate the impact on SLOs during the rebuild. The first strategy was a non-serialized approach: all nodes are cordoned, preventing new workloads from being scheduled, and drained at the same time. Then we spin up a new set of nodes with the new version, wait until the new pool is ready, and terminate the previous nodes. This approach is very fast because many nodes are replaced at once; if I remember correctly, it took about half an hour for a small cluster. However, the impact on SLOs was severe because all nodes were being destroyed simultaneously.
The second strategy is safer: a serialized approach where only one node pool in the cluster goes through the process at a time, getting cordoned, recreated, and then destroyed. The result is quite safe, with much less impact on SLOs. The downside is that it takes more time; if a customer has many node pools, it can take hours to finish. We ultimately chose the serialized approach because we prefer minimal user impact over faster upgrade times.
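To make the serialized idea concrete, here is a minimal sketch in Python driving kubectl. The node-group label, group names, and soak time are assumptions for illustration, not SHIP's actual automation.

```python
# Hypothetical sketch of a serialized node rebuild: cordon and drain the nodes
# of one node group at a time, letting the node-group controller (ASG,
# Karpenter, etc.) bring up replacement nodes on the new version.
import json
import subprocess
import time

def kubectl(*args: str) -> str:
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

def nodes_in_group(group: str) -> list[str]:
    # Assumes node groups are identifiable by a label such as
    # "eks.amazonaws.com/nodegroup" (an assumption about the cluster setup).
    out = json.loads(kubectl("get", "nodes", "-l",
                             f"eks.amazonaws.com/nodegroup={group}", "-o", "json"))
    return [n["metadata"]["name"] for n in out["items"]]

def rebuild_group(group: str, soak_seconds: int = 300) -> None:
    for node in nodes_in_group(group):
        kubectl("cordon", node)                 # stop new pods landing here
        # drain respects PodDisruptionBudgets, which is why PDB tuning matters
        kubectl("drain", node, "--ignore-daemonsets", "--delete-emptydir-data")
        time.sleep(soak_seconds)                # let replacements become Ready

# Serialized: one node group at a time, instead of draining everything at once.
for group in ["workers-a", "workers-b"]:
    rebuild_group(group)
```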
By the end of the experiment, we analyzed the data and identified affected SLOs, which mostly included ingress availability, ingress latency, locking latency, and DNS. With this data, we investigated each impacted component underlying the SLOs. For ingress SLOs, we examined the NGINX ingress controller and fine-tuned its configuration—configuring proper graceful shutdown periods, setting sufficient pod disruption budgets, and ensuring minimum replicas to minimize impact during scale-up.
After fine-tuning these components and re-running the process, we found significant improvements in SLO impact.
Regarding Karpenter, at the time I wrote the blog post, it was not as mature as today. We were just exploring it but didn't use it. Now, we are fully utilizing Karpenter. I wrote another blog post called "The Karpenter Effect" that covers everything related to this scenario, which I recommend checking out.
Bart: For many Kubernetes administrators, a major concern with version upgrades is API deprecation. How did you approach the challenge of ensuring your customers' workloads wouldn't break after an upgrade?
Tanat: There are two fundamental things to take into account. The first is the deprecated resources themselves: the objects inside the cluster whose APIs will be deprecated or removed when upgrading to the next version. The other is the API calls that applications are still making against those deprecated APIs.
Our approach to address this concern is to gain insights about these deprecations in the form of metrics. As a team, we envisioned a dashboard that shows how many objects will be impacted in the next version, which objects will be removed, and the current API calls being made to these deprecated objects.
We did some research in open source projects and found two good candidate tools: Kube No Trouble and Pluto. They are similar in that you can run these command-line tools against a running cluster to get data about impacted objects. We chose Pluto because it had a more comprehensive data set at the time.
We built a thin wrapper around this tool to get the results and create Prometheus metrics, which we call ShipDeprecatedObjects metrics. These metrics have labels such as the cluster, namespace, API group, version, and name of the object. We can use these metrics to create dashboards and visualizations.
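As a rough illustration of what such a thin wrapper could look like, here is a sketch that shells out to Pluto and exposes a Prometheus gauge. The JSON field names, the metric name ship_deprecated_objects, and the labels are assumptions rather than SHIP's real implementation.

```python
# Sketch of a thin wrapper around Pluto that exposes deprecated objects as a
# Prometheus gauge. The JSON field names, metric name, and labels are assumed.
import json
import subprocess
import time

from prometheus_client import Gauge, start_http_server

DEPRECATED = Gauge(
    "ship_deprecated_objects",   # hypothetical metric name
    "Objects using APIs that are deprecated or removed in upcoming versions",
    ["cluster", "namespace", "api_version", "kind", "name"],
)

def scrape(cluster: str) -> None:
    # Run Pluto against the live cluster and parse its JSON output.
    out = subprocess.run(
        ["pluto", "detect-all-in-cluster", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for item in json.loads(out).get("items", []):
        DEPRECATED.labels(
            cluster=cluster,
            namespace=item.get("namespace", ""),   # field names are assumptions
            api_version=item.get("api", ""),
            kind=item.get("kind", ""),
            name=item.get("name", ""),
        ).set(1)

if __name__ == "__main__":
    start_http_server(9100)       # expose /metrics for Prometheus to scrape
    while True:
        scrape(cluster="prod-eu-west-1")
        time.sleep(3600)
```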
For API calls, Kubernetes has a built-in mechanism: the API server marks audit events for requests to deprecated APIs with an annotation like k8s.io/deprecated: "true". Since we send all our EKS API server audit logs to Grafana Loki, we can easily look up API calls with this annotation and use them as part of our dashboard.
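A minimal sketch of that lookup against Loki's HTTP API might look like this. The Loki URL and the log stream selector are assumptions about the logging setup; the k8s.io/deprecated annotation is the upstream Kubernetes marker.

```python
# Sketch of querying Grafana Loki for API-server audit events that carry the
# k8s.io/deprecated="true" audit annotation. The Loki URL and the stream
# selector are assumptions about the logging setup.
import time

import requests

LOKI = "http://loki.example.internal:3100"
QUERY = '{app="eks-audit-logs"} |= `"k8s.io/deprecated":"true"`'

end = int(time.time() * 1e9)        # Loki expects nanosecond timestamps
start = end - 24 * 3600 * 10**9     # look back 24 hours

resp = requests.get(
    f"{LOKI}/loki/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "limit": 100},
    timeout=30,
)
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    for _ts, line in stream["values"]:
        print(line)   # each line is one audit event calling a deprecated API
```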
Bart: You mentioned creating dashboards similar to GKE's insights. How did you transform this data into actionable information for both your team and your customers?
Tanat: From both data sources mentioned earlier, we can create a dashboard for the platform team to answer critical questions like, "Are we ready to perform an in-place upgrade on this cluster safely?" In other words, the dashboard shows zero objects that will be deprecated or removed in the next version and zero API calls to deprecated objects in the cluster. As cluster maintainers, when we see this, we can proceed to upgrade with confidence that nothing will break API-wise.
For our users, this dashboard is also super useful because we can share a slice of it with each tenant. Since SHIP is a multi-tenant platform, each user has their own namespace. We provide a dashboard slice with their specific namespace where they can view their deprecated objects and track progress. We tell them, "We want to upgrade in this period; here's your dashboard with deprecated objects, please take care of it." This self-service approach frees us from chasing everyone to make upgrades.
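A per-tenant panel could be backed by a query like the one below, reusing the hypothetical ship_deprecated_objects metric from the earlier sketch; the Prometheus URL and namespace are illustrative.

```python
# Sketch of the kind of per-tenant query a sliced dashboard panel could run,
# reusing the hypothetical ship_deprecated_objects metric. URL and label
# values are illustrative.
import requests

PROM = "http://prometheus.example.internal:9090"
namespace = "team-marketplace"   # the tenant's own namespace

query = f'count(ship_deprecated_objects{{namespace="{namespace}"}}) or vector(0)'
resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

result = resp.json()["data"]["result"]
remaining = float(result[0]["value"][1]) if result else 0.0
print(f"{namespace}: {remaining:.0f} deprecated objects to fix before the upgrade")
```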
Even though EKS now has a similar built-in dashboard, that wasn't the case when we were building this. Our sliced version remains useful in a multi-tenant platform because the EKS view is cluster-level, whereas our approach is tailored to each tenant's deprecations. So our investment was not a waste.
Bart: Once you had all this preparation in place, how did you approach the actual Kubernetes upgrade process? Did you dive in or take a more gradual approach?
Tanat: I mentioned two main concerns that we tackle: node rebuilds and API deprecations. These should ideally solve 95% of the problems, but in production, nothing is guaranteed to go smoothly. For example, I read a post-mortem from Reddit where they upgraded Kubernetes from version 1.23 to 1.24. Even though they tested everything in the dev environment, they still faced a major outage when deploying in production because some labels were removed from the control plane or nodes.
It's always wise to upgrade in phases or waves. We use the concept of cluster waves by defining a set of clusters that are less critical based on their workload. We start by upgrading less critical clusters first and then wait a day or two to ensure nothing goes wrong. After that, we gradually move to higher-critical clusters.
This approach gives us more confidence because we've performed the upgrade in production, albeit on less impacted systems. The key lesson learned from the Reddit incident is that production environments are fundamentally different from testing environments, no matter how thoroughly you prepare.
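A conceptual sketch of such a wave rollout, driving the AWS CLI from Python, might look like the following. The cluster names, wave grouping, target version, and soak time are invented, and a real pipeline would also gate each wave on SLO and alert checks.

```python
# Conceptual sketch of "cluster waves": upgrade the least critical clusters
# first, soak for a day or two, then move on to more critical ones. Cluster
# names, wave grouping, target version, and soak time are invented.
import subprocess
import time

WAVES = [
    ["dev-sandbox", "internal-tools"],       # wave 1: least critical
    ["staging-eu", "staging-us"],            # wave 2
    ["prod-eu-west-1", "prod-us-east-1"],    # wave 3: most critical
]
TARGET_VERSION = "1.31"
SOAK_SECONDS = 2 * 24 * 3600                 # "wait a day or two"

for wave in WAVES:
    for cluster in wave:
        subprocess.run(
            ["aws", "eks", "update-cluster-version",
             "--name", cluster, "--kubernetes-version", TARGET_VERSION],
            check=True,
        )
    # In practice you would verify dashboards, SLOs, and alerts here before
    # proceeding to the next, more critical wave.
    time.sleep(SOAK_SECONDS)
```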
Bart: After implementing these in-place upgrades, what benefits did your team and your customers realize compared to the previous blue-green approach?
Tanat: For us, this approach freed up our effort immensely. As I mentioned earlier, a Kubernetes upgrade normally took a whole quarter to finish, with all the coordination and preparation. Now it's only a matter of a week or two at most to complete. All the time we've saved can be used to provide more features or to mature this mechanism further.
For our users, it also changed their perception significantly. In cases where they don't have any deprecated APIs, it's almost seamless. How it works is that we tell them their pod might be restarted due to the upgrade on a specific day. During the upgrade, their pod will be restarted and moved to a new node, and that's it. No action is required from them, compared to the past where they had to work hand in hand with us and do traffic shifting together.
Moreover, in terms of benefits for our team, we've also gained confidence in the process because we do it quite frequently, unlike in the past when we might do it only once a year. With these frequent rollouts, we're executing the process smoothly and have more confidence in our approach.
Bart: Your article title mentions this is beyond the one-click update. Despite your success with in-place upgrades, you're still investing in blue-green upgrades too. Why is that, and what limitations or future improvements are you considering?
Tanat: As of today, in-place upgrades are our default strategy for customer clusters. However, we still keep blue-green in our toolkit. This is what we learned from running Kubernetes at scale: the bigger the cluster, the more problems come with it. For example, more pods means larger metrics storage, more work for the reconcilers, and a more congested network, and it introduces a larger risk surface. If one cluster with 1,000 nodes goes wrong, many pods can be affected simultaneously.
Our experience shows that a cluster shouldn't exceed 300 or 500 nodes. Once the cluster reaches this threshold, it's time to implement blue-green deployment and slowly migrate some traffic or workloads to the newer cluster.
Bart: In hindsight, looking back at this entire journey, what would you have done differently if you were starting this Kubernetes process today?
Tanat: Most of the issues we found during upgrades came from the node pool rebuild process and the trade-off between its serialized and non-serialized approaches. These issues have now been addressed by moving to Karpenter. If I had known how much it would help, I would have moved to Karpenter earlier.
I would also focus more on ensuring users have consistent Pod Disruption Budget configurations to minimize customer impact during upgrades. Configuring a default pod disruption budget would be beneficial, but it requires caution—overly strict defaults might block the upgrade process.
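As an illustration, a default PodDisruptionBudget for a tenant workload could be applied like this with the Kubernetes Python client; the namespace, label selector, and maxUnavailable value are assumptions. The point is that a permissive default keeps drains moving, whereas an overly strict one would block the upgrade.

```python
# Sketch of applying a default PodDisruptionBudget for a tenant workload with
# the Kubernetes Python client. Namespace, label selector, and the
# maxUnavailable value are assumptions; a permissive default lets node drains
# proceed, while an overly strict PDB would block them.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a pod

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="default-pdb", namespace="team-marketplace"),
    spec=client.V1PodDisruptionBudgetSpec(
        max_unavailable=1,   # evict at most one pod of the workload at a time
        selector=client.V1LabelSelector(match_labels={"app": "checkout"}),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="team-marketplace", body=pdb
)
```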
Lastly, implementing metrics that track node rebuild performance and its correlation with our Service Level Objectives (SLOs) would increase our confidence in this area.
Bart: We recently had a guest, Matt Duggan, who suggested that Kubernetes should offer an LTS (long-term support) version to address upgrade challenges. Based on the work you've done at Adevinta, do you agree with this perspective?
Tanat: Having an LTS is a good option because different companies have different needs and priorities, and not everyone has time to invest in upgrade mechanisms. EKS now also has an extended support offering at an increased cost. However, I personally still prefer frequent upgrades, because we were stuck in the past. With Kubernetes versions coming out every four months, it's better to get familiar with the upgrade process and mature your tooling and post-mortem process to keep up. One middle ground that my team and I have been working on is called Double Jump, where we upgrade across two versions at once. This means going, for example, from 1.30 to 1.32 in a single effort, which is possible by using Karpenter, as I described in another blog post.
Bart: You've written many good articles and shared your experience with the community. What drives you?
Tanat: I think it stems from my past, where I used to work in a much smaller company and was just an engineer who enjoyed reading blog posts from bigger companies with larger scale. How do they do things? Now that I get to work in one of these big platforms, it's my passion to share this with other people. It reminds me of myself in the past who wanted to read about such interesting stuff.
Bart: What's next for you?
Tanat: I still have many blog posts I want to write. Now I'm interested in AI and LLM, especially how they could help SREs in their day-to-day work. I will try to write about these topics as well.
Bart: Before we wrap up, how can people get in touch with you? I know you're on LinkedIn and you publish on the Adevinta tech blog, where you've written articles such as "The Karpenter Effect".
Tanat: I'm around on LinkedIn, so ping me anytime.
Bart: I can definitely say from first-hand experience that it worked reaching out to connect with you. Thank you for your time and for sharing your knowledge with the community. I hope our paths will cross soon.
Tanat: Take care. Thank you so much, Bart. Cheers.