Kubernetes Upgrades Without Fear

Mar 31, 2026

Guest:

  • Jason Deal

Kubernetes upgrades are among the most feared operations for platform teams — and most of that fear stems from not having the right tools or automation in place.

Jason Deal, Software Development Engineer II at AWS working on Karpenter and EKS Auto Mode, explains how PDB-aware infrastructure tools remove the guesswork from rolling upgrades, and why treating nodes as cattle rather than pets is the key to upgrading without fear.

In this interview:

  • How KRO (Kubernetes Resource Orchestrator) and ACK (AWS Controllers for Kubernetes) can replace custom operators with declarative automation

  • Why PDBs only protect you if your infrastructure provider actually respects them — and what to use instead

  • The hidden edge cases in topology spread constraints that can silently violate your high-availability setup

  • What DRA (Dynamic Resource Allocation) signals about where Kubernetes is headed in the next decade

The thread tying it all together: pressure-test your automation before you need it, not during an emergency.

Subscribe to KubeFM Weekly

Get the latest Kubernetes videos delivered to your inbox every week.

Transcription

Bart Farrell: So first things first, who are you? What's your role? And where do you work?

Jason Deal: I'm Jason Deal. I'm a software engineer at AWS. I mostly work on Karpenter and EKS Auto Mode, and then just other Kubernetes things.

Bart Farrell: What three emerging Kubernetes tools are you keeping an eye on?

Jason Deal: I guess the two that come to mind, and that can be packaged together really nicely, are KRO and ACK. KRO is the Kubernetes Resource Orchestrator, a tool that started here at AWS. The way I've described it to people who are unfamiliar with it is that it's almost a declarative way of writing an operator, though it can do far more than that. It lets you build very complex automations based on custom resources in your Kubernetes cluster, using the same tooling you're familiar with for pretty much everything else in Kubernetes. I think that pairs really nicely with ACK, the AWS Controllers for Kubernetes, which gives you building blocks for AWS primitives. Combined with the automations KRO enables, you can do very complex infrastructure automation. And then the other one that's not so emerging, but that I'm definitely keeping an eye on because I work on it a lot, is Karpenter. We graduated to v1 a little over a year ago, but even after that graduation there's still a lot of ongoing work. Some of the interesting recent additions make custom hardware for accelerated workloads easier to use with autoscaling, and let you prioritize reserved capacity.

Bart Farrell: Kubernetes upgrades are one of the top reasons teams struggle with operations. What patterns actually work for maintaining uptime during cluster upgrades?

Jason Deal: This is definitely one of the pain points we knew about when we were building out tools like Karpenter and EKS Auto Mode. The side of it I'm most familiar with is the data plane rollout: rolling all your worker nodes so they get the new kubelet version. This can be a painful process for a lot of customers because you can take application downtime when you drop those nodes. Now, there are a lot of tools in Kubernetes to help mitigate this, like PDBs, which ensure you maintain a certain number of replicas of your workload in the cluster. But those only work if your infrastructure provider is actually aware of these Kubernetes-level constraints. If you're just using an ASG with self-managed nodes and you let them roll, that can bypass your PDBs entirely. The nice thing about tools like Karpenter and EKS Auto Mode is that we're PDB-aware. We can take advantage of these native Kubernetes mechanisms for maintaining application availability, and only migrate nodes once we know we can do so without violating those PDBs.
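The check described above can be sketched in a few lines. This is an illustrative model, not Karpenter's actual implementation; all names and the simplified PDB representation are hypothetical. A PDB's eviction budget is how many more matched pods may be disrupted right now, and a node is safe to drain only if every PDB covering a pod on that node has budget for all of its pods living there:

```python
def allowed_disruptions(healthy: int, desired_healthy: int) -> int:
    """How many more pods this PDB permits us to evict right now.

    healthy: currently healthy pods matched by the PDB's selector
    desired_healthy: minimum the PDB requires (minAvailable, or
    replicas minus maxUnavailable)
    """
    return max(0, healthy - desired_healthy)

def can_drain_node(pods_on_node: list[str], pdbs: dict[str, dict]) -> bool:
    """True only if, for every PDB, evicting all of its pods that sit on
    this node stays within that PDB's current eviction budget."""
    for pdb in pdbs.values():
        victims = sum(1 for p in pods_on_node if p in pdb["matched_pods"])
        if victims > allowed_disruptions(pdb["healthy"], pdb["desired_healthy"]):
            return False
    return True

# A PDB requiring 2 of 3 healthy replicas leaves a budget of exactly 1:
pdbs = {"web": {"matched_pods": {"web-a", "web-b"},
                "healthy": 3, "desired_healthy": 2}}
can_drain_node(["web-a"], pdbs)           # one victim, within budget -> True
can_drain_node(["web-a", "web-b"], pdbs)  # two victims, over budget -> False
```

An ASG rolling instances on a timer skips this check entirely, which is the bypass described above; a PDB-aware tool simply refuses (or defers) the drain when the budget is exhausted.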

Bart Farrell: What should platform teams focus on to build confidence and stay current? From an engineering perspective, when does automation actually reduce risk?

Jason Deal: I think the main thing is taking advantage of automation and not snowflaking. With nodes, and even with clusters, we really want to live by the mantra of treating everything as cattle, not pets: anything can be brought up, brought down, and replaced. Once you build out automation on that basis and pressure-test it, that's going to be the best mechanism for doing these upgrades without fear, because you've already pressure-tested it. You're already used to these disruptions in normal situations, so there's not really any fear left when performing upgrades.

Bart Farrell: You tested applications through pod failures, node outages, and AZ failures. What surprised you most about which configurations actually mattered?

Jason Deal: One of the interesting ones is topology spread constraints: most people in Kubernetes are aware of them, but not necessarily of some of the edge cases around them. Topology spread constraints are, of course, extremely important for maintaining high availability when you want to spread your application across different failure domains; in AWS, that would typically be availability zones. But there are a number of caveats that some customers aren't aware of. One is that they're only a scheduling-time construct: they're only evaluated when a pod is actually bound to a node, so as your cluster churns over time, they can end up violated. Using tools like Descheduler to ensure conformance long-term can be super valuable. The other thing I've seen relatively recently is people running into cases where TSCs end up matching against two rollouts of the same deployment at once, even though one set of pods won't exist once the other is fully rolled out, and this can leave the skew violated by the time the rollout completes. There was a relatively recent feature added to topology spread constraints for this, `matchLabelKeys`, which lets you match your TSC against the pod-template-hash label, so you're only matching one rollout of a deployment, not two rollouts of the same deployment.
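The scheduling-time caveat above can be made concrete with a small sketch of the skew computation a conformance tool such as Descheduler might perform. This is an illustrative model, not Descheduler's actual code; the zone names are hypothetical. Skew is the difference between the most and least populated topology domains, counting domains with zero matching pods:

```python
from collections import Counter

def zone_skew(pod_zones: list[str], all_zones: list[str]) -> int:
    """Skew = (pods in the fullest zone) - (pods in the emptiest zone),
    where zones with no matching pods still count as candidates."""
    counts = Counter({z: 0 for z in all_zones})  # seed every zone at zero
    counts.update(pod_zones)                     # add one per running pod
    return max(counts.values()) - min(counts.values())

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

# At schedule time the spread satisfies maxSkew: 1 ...
zone_skew(["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1a"], zones)  # 1

# ... but after churn repacks the surviving pods into one zone, nothing
# re-evaluates the constraint, and the measured skew drifts to 3.
zone_skew(["us-east-1a", "us-east-1a", "us-east-1a"], zones)  # 3
```

The scheduler never revisits the second case on its own, which is why periodically recomputing skew and evicting to rebalance (as Descheduler does) is needed for long-term conformance.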

Bart Farrell: Looking forward, how should Kubernetes platforms evolve to reduce operational burden of upgrades?

Jason Deal: I think it really goes back to what we were talking about in the first question: building up those automations and being able to pressure-test them. Ensure that the signals you have for application availability, like your pod readiness checks and PDBs, are strong enough that those automations can use them to drive rollouts without relying on manual intervention to confirm application health. Being able to do this pressure testing in low-stakes scenarios, rather than when you need to upgrade because your version is about to lose support, definitely helps build that confidence long-term. The one place I think platforms will be evolving, at least in the near future, is better primitives for modeling this for stateful applications, which can't always be modeled by PDBs. That's something we're going to be working on in the Kubernetes space in the coming years.

Bart Farrell: Kubernetes turned 10 years old a while back. What should we expect in the next 10 years to come?

Jason Deal: Something we've seen a lot at this conference, and at the last KubeCon NA, is talk around features like DRA and workload-aware scheduling. I think DRA is particularly interesting because it highlights the much greater diversity of devices we're running in Kubernetes, whether that's accelerators, TPUs, or a whole host of devices we might not be thinking about yet. The cool thing about DRA is that the framework it sets out gives us a mechanism to express the flexibility we'll need for these future devices, so I'm looking forward to seeing how it continues to evolve. I think there are already somewhere in the ballpark of 12 to 15 KEPs to keep evolving the feature, and I expect we'll continue to see more.

Bart Farrell: What's next for you, Jason?

Jason Deal: Continuing to work on Karpenter and EKS Auto Mode. I think there's a lot of evolution still to come in the autoscaling space, particularly integration with new features like DRA and workload-aware scheduling. One thing that's been particularly interesting in conversations here at KubeCon is tighter integration between the scheduler and the autoscaler, so they can make better upfront decisions about potential capacity. I'm looking forward to seeing how those conversations continue to evolve.

Bart Farrell: How can people get in touch with you?

Jason Deal: I'm active on the Kubernetes Slack: there's the Karpenter channel for end users, and karpenter-dev if you're interested in contributing. We also have bi-weekly working group meetings, and I'm usually at those. And then, of course, I'm on the Karpenter GitHub.
