Migrating to Karpenter: Fun Stories

Mar 3, 2026

Host:

  • Bart Farrell

Guest:

  • Adhi Sutandi

Running multiple Kubernetes clusters on AWS with the cluster autoscaler? Every four months, you face the same grind: upgrading Kubernetes versions, recreating auto scaling groups, and hoping instance type changes stick.

Adhi Sutandi, DevOps Engineer at Beekeeper by LumApps, shares how his team migrated from the cluster autoscaler to Karpenter across eight EKS clusters — and the hard lessons they learned along the way.

In this episode:

  • Why AWS auto scaling groups are immutable and how that creates upgrade bottlenecks at scale

  • How the latest AMI tag accidentally turned less critical clusters into chaos engineering environments, dropping SLOs before anyone realized Karpenter was the cause

  • Why pre-stop sleep hooks solved pod restartability problems that Quarkus's built-in graceful shutdown couldn't

  • The case for pod disruption budgets over Karpenter annotations when protecting critical workloads during node rotations

  • How Karpenter's implicit 10% disruption budget caught the team off guard — and the explicit configuration that fixed it

Transcription

Bart Farrell: In this episode of KubeFM, I'm joined by Adhi, who's a DevOps engineer at Beekeeper by LumApps, where his team operates large-scale Kubernetes workloads across AWS and GCP. This is a deeply technical discussion focused on cluster autoscaling, upgrade mechanics, and workload reliability. We walk through Beekeeper's migration from the Kubernetes cluster autoscaler to Karpenter, covering the operational pain points that triggered the move, including immutable autoscaling groups, Kubernetes version churn, and the inability to flexibly change instance types during upgrades. Adhi breaks down how Karpenter behaves in practice: AMI drift detection, consolidation behavior, implicit disruption budgets, and why default settings can surprise you if you're not careful. We also go deep on pod restartability, including why application-level graceful shutdowns fail in production, how pre-stop sleep hooks improved SLOs, and what happens to long-lived TCP connections when stateful services like Redis are disrupted. Finally, we talk about using pod disruption budgets over Karpenter-specific annotations, treating the platform as a product with Helm and GitOps, and the concrete lessons Adhi would share with any team considering a similar migration. If you're responsible for Kubernetes upgrades, autoscaling strategy, or production reliability, this episode is packed with hard-earned lessons. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped Kubernetes engineers from all over the world level up through training. Courses are instructor-led and are given in person as well as online. They are 60% practical and 40% theoretical. Courses are given to individuals as well as to groups, and students have access to the course materials for the rest of their lives. For more information, go to learnkube.com. Now, let's get into the episode. Adhi, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?

Adhi Sutandi: Oh, hi, Bart. Thanks for having me. First things first, I'm looking forward to Helm 4. It has actually been released already, but Helm 4 offers the server-side apply support that I'm looking forward to, which should handle the issues introduced by the three-way merge patch pattern. Next is eBPF, everything related to eBPF. I'm looking forward to a streamlined operational layer that helps with observability and security. And the last thing is the Kubernetes Gateway API. This has already been discussed a lot in the community, but I'm really looking forward to trying implementations in this space, like kgateway, where there is now first-class support for machine learning inference.

Bart Farrell: Very good. And Adhi, for people who don't know you, can you tell us about what you do and who you work for?

Adhi Sutandi: Sure. I'm a DevOps engineer. I'm based in Zurich, in Switzerland, and I work at Beekeeper by LumApps. I work in the DevOps team; we are a team of eight people now, and we work remotely. We have people in France, we have people in Poland, and me, myself, I'm in Zurich.

Bart Farrell: Fantastic. And how did you get into Cloud Native?

Adhi Sutandi: That's a good question. I was lucky enough that in my first full-time employment I already had exposure to Kubernetes, and I would say I have fallen in love with Kubernetes, despite all the complexity. It was a nice experience dealing with Kubernetes and containerization in general. Those skills are valuable and I've stuck with it until now. It's been enjoyable. That's it.

Bart Farrell: Very good. And what were you doing before getting into Cloud Native?

Adhi Sutandi: Before getting into Cloud Native, I had tried several roles. I tried a QA engineering position, I also tried becoming a system admin, and I worked with Oracle infrastructure as a DevOps engineer. At some point before Cloud Native I would also have considered myself a Java developer. So that was me before Cloud Native.

Bart Farrell: How do you keep up to date with the Kubernetes and Cloud Native ecosystem? Things move very quickly. What resources work best for you?

Adhi Sutandi: Yes, so it's a bit challenging to keep up with all the new technologies, all the shiny things coming from the community. The first one I would like to mention is the SRE Weekly subscription. This is an email subscription with a summary of nice articles written around the area of SRE. That doesn't necessarily mean it is related to cloud native, but from time to time you see articles about cloud native and how people are using it. It's very nice exposure to these emerging things because it also gives you the perspective of how other companies are using the technology. Second, I would say conferences help. I try to attend conferences and talks from AWS, from GCP, these kinds of things. Last year I attended KubeCon, for example, and I would strongly recommend this conference because it gives you a summary of what's happening. You just attend the sessions, and then you get to know what other people are doing, how they are doing things, what the emerging technologies are, and how they are used. It puts together a good summary of all of that. And last but not least, of course, going through the Kubernetes release blogs. But these can sometimes be a bit too long, so I like to read someone else's summary, some blog posts on Medium. Those are the sources I normally use to keep up with the emerging technologies in the cloud-native space.

Bart Farrell: And if you had to go back in time and give your younger self a career tip, what would it be?

Adhi Sutandi: Wow, that's a tricky one. I would say I would rather have tried to stick with the cloud-native space. Long story short, I left the cloud-native space when I joined a big enterprise and spent a few years there. I think I lost track of the cloud-native space during that time, and I regret that, because those years will never come back. If I could go back to the past, I would keep telling myself: hey, chase your passion for cloud-native technologies and try to stick with it. Don't just go wild trying new experiences at any other company, but stick with what you are actually interested in.

Bart Farrell: Okay, fair enough. But to be fair, we are digging into this topic deeply today. It's a good sign that you're back on track with the things that you're passionate about. As part of our monthly content discovery, we found an article that you wrote, titled Karpenter and Beekeeper by LumApps: Fun Stories. We want to dig into this a little bit more. But before we dive into the technical details of your Karpenter migration, can you give our listeners some background on Beekeeper by LumApps? What kind of platform are you building? What does your infrastructure look like? And what's your core tech stack?

Adhi Sutandi: Sure. At Beekeeper by LumApps, we are delivering a one-stop solution for frontline workers. We are a SaaS company and our product is an all-in-one platform for non-desktop workers; these are the people we refer to as frontline workers. We have solutions integrated inside our platform, for example chat, shift management, task management, analytics, and so on. We are merging with another company, which is why the name is Beekeeper by LumApps. Originally we were just Beekeeper, and now we are merging to become a bigger company. We no longer focus only on the frontline market, but also on the market of enterprise customers. Right now, from the platform and operational perspective, we run a few Kubernetes clusters: eight Kubernetes clusters in AWS and four Kubernetes clusters in GKE, in Google Cloud. Some of them are, of course, for development, and some of them are less critical clusters, let's say for internal tooling and things like that. On average, each cluster runs around 1,500 pods, and each cluster has roughly 40 nodes. That's the ballpark number. As for the tech stack, the applications we run are mostly Java services, Java applications written in the Quarkus or Dropwizard frameworks. And of course we do some standard stuff like databases, persistence, and so on. For the databases, we run MySQL and PostgreSQL, and we have a caching solution implemented in front of these database engines: we run ProxySQL and Redis. We self-host these two caching solutions inside the Kubernetes cluster itself. We run them mostly for performance, so that we don't throttle our databases and we improve the overall execution of end-to-end requests.

Bart Farrell: Many organizations running Kubernetes on AWS face the challenge of keeping up with frequent version updates and managing node scaling efficiently. What prompted your team to move away from the traditional cluster autoscaler with managed autoscaling groups over to Karpenter?

Adhi Sutandi: All right. Kubernetes is great for scalability: you can scale in, you can scale out your workload easily. But there is another caveat here: how about the scalability of the Kubernetes cluster itself? Like I mentioned, we run a few Kubernetes clusters, so how about the scalability of operating all of those clusters? That is what we tried to address, and that was the main point of the article. The challenge is that Kubernetes has a release cycle of roughly every four months, and AWS, for example, only supports the three latest versions under standard support. If we run older versions that are no longer under standard support, we fall into extended support, which costs way more. They still keep our version running, we just pay more. And this pushes us to always keep our Kubernetes clusters up to date, roughly every four months. That becomes a hassle if we multiply it by the number of Kubernetes clusters we run and by the time each cluster update takes. So that's the first thing. The second is a limitation with auto scaling groups when we run our EKS clusters with the cluster autoscaler. An auto scaling group is designed to be immutable, meaning we cannot just change the instance types there easily and hope the change will stick. And third, with the cluster autoscaler and our auto scaling group design, we lacked an out-of-the-box solution to choose which machine actually fits us best. Let's say we want to optimize cost: we want a solution that always runs the optimal machine types from a cost perspective. That's what we were missing in our setup. The idea was already addressed by Karpenter at that point. When we first started looking at this, EKS Auto Mode had not yet been released. We ran our POC at that time and proved in development that those limitations could already be addressed with Karpenter, so we decided to run our own self-hosted Karpenter. Looking back at it, the way we designed it was simple enough and Karpenter was not that complicated to manage. We are quite happy, even though we self-host Karpenter in our clusters.

Bart Farrell: And you mentioned that modifying instance types in an already running auto scaling group was particularly problematic. Can you explain what specific AWS limitation you encountered and why this was such a blocker for your team?

Adhi Sutandi: Sure. Beneath the surface, what happens when we upgrade the Kubernetes cluster is that we update the Kubernetes API server, which is managed by AWS. When we update this, we have a mechanism to update our launch templates. By updating a launch template, we mean specifically updating its AMI version to match the Kubernetes API server version. From there, using the updated launch template, we refresh the auto scaling group, which results in a rollout of worker nodes running the version that matches the Kubernetes API server. The problem is that when we update the launch template and refresh the auto scaling group in AWS, the refresh does not take into account the changes we made. This is because auto scaling groups are designed to be immutable, which was something we discovered by accident. When we refresh the auto scaling group, it only uses the original auto scaling group specification, including the list of instance types and their priorities, which instance type we prefer over the others. During the refresh, it just uses this original specification as the source of truth. This implies there is no flexibility to add new instance types to an already running auto scaling group, which is not nice from the operational perspective: say we want to add a new machine type, remove a machine type, or reorder instance types to prioritize a specific instance type over another. We raised this with AWS support, we opened a ticket, and they replied with a link to the documentation that basically confirmed this behavior. So that explained all the operational overhead we were facing. Because of the inflexibility of managing auto scaling groups, we need to recreate the auto scaling group not only when we update the Kubernetes cluster, but also when we need to add or reorder instance types. And this becomes a snowball effect when we run a lot of Kubernetes clusters. Even scripted, it consumes maybe one to two hours for every Kubernetes cluster, and then we multiply this by the number of Kubernetes clusters we run and by the release cycle, and on top of that we add the burden of the inflexibility to add or modify instance types in the auto scaling group.

Bart Farrell: After deploying Karpenter v1.1.3 to your EKS clusters running Kubernetes 1.31, what were the immediate benefits you observed compared to your previous setup?

Adhi Sutandi: Yes, so we had fun with Karpenter. When we tried Karpenter in our development environment, we wanted to verify the promises Karpenter made at the beginning. At that time, when we defined the Karpenter node pool, we also defined the AMI selection, and we used the latest tag there. That proved the point: with this latest tag, you can just update the Kubernetes API server, and Karpenter immediately notices there is drift, because the Kubernetes nodes are running a version that does not match the Kubernetes API server. It detects the drift event immediately and just starts rolling the Kubernetes nodes one by one. This is very interesting; this is what we wanted to test with the latest tag, and it does the job. It allows us to do a one-click Kubernetes update. Another benefit we see is that Karpenter is configured through custom resource definitions. Internally, we use GitOps for managing our workloads deployed in Kubernetes, and with Karpenter's native custom resource definitions, the node pool spec and so on, it's pretty much the same as managing a normal application. Let's say a developer wants to try a new instance generation, say M8 machines, or M9 machines later. The developer can just change the node pool spec in Git, push it, and then we have it. It's easy. And the third benefit is that we finally have an autoscaler that automatically picks the instance type and optimizes the selection from a cost perspective. This is awesome, because among the many instance types available from AWS we can narrow it down to the instance types and generations we want to run, so we don't end up on an instance that is too old, maybe running an older generation of Intel CPU. That balances keeping operational excellence with keeping the cost down as much as possible.
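For readers who want to picture this setup, here is a minimal, hypothetical sketch of a Karpenter v1 NodePool and EC2NodeClass along the lines Adhi describes. The resource names, instance-family choices, IAM role, and discovery tags are illustrative assumptions, not Beekeeper's actual manifests.

```yaml
# Hypothetical sketch: a NodePool constrained to a sensible set of instance families and
# generations, paired with an EC2NodeClass whose AMI selection tracks the latest AL2023 release.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]                       # avoid older CPU generations
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest                    # "latest" hands rollout timing to upstream AMI releases
  role: KarpenterNodeRole-my-cluster          # hypothetical IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster    # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

With a setup like this, changing the allowed instance generations or families is a one-line change to the requirements list in Git, which is the GitOps workflow Adhi mentions.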

Bart Farrell: Your article mentions that using the @latest tag for AMI selection in Karpenter effectively turned your less critical clusters into chaos engineering environments. For listeners who might not be familiar with AMI drift, can you explain what was happening and how it impacted your services?

Adhi Sutandi: Sure. That was actually a fun discovery. When we use the latest tag in the AMI selection, we let Karpenter periodically check whether there is a new AMI from the upstream release. This gave us little to no control over our rollouts. When there is a difference between the AMI version Karpenter expects and what we actually run, Karpenter detects it as drift, and as a result of that drift event it just rolls the nodes. By putting the latest tag there, we have no control over this; it all depends on how frequently the AWS team or the OS team pushes a new AMI upstream. That was what we discovered. As a result, we had very frequent rollouts in the development cluster and the other less critical clusters where we had allowed this latest tag, because we wanted the benefit of one-click infrastructure upgrades. We started to notice a drop in the SLOs, but at that point we did not suspect it was caused by Karpenter. Because it's the nature of a development cluster, we assumed it was just applications being broken by bad code, something like that. Until at some point we noticed an increase in alerts, both during and outside business hours. We drilled down to narrow down the cause, and we discovered this was it: every time a node rotation happened, we saw the drop in SLOs, and of course the business-hours alerts that follow from SLOs dropping. At that point we also noticed there is a warning on the Karpenter website saying you should not use the latest tag for your AMI selection, precisely because it gives you no control over when drift will be detected. Those were our learnings, and we realized this was mimicking chaos engineering, which was actually fun: it tested how the rollout of nodes causes application restarts and reshuffling of pods, and how that reshuffling impacts our business continuity. It was very interesting to observe at the time, and we discovered that when it comes to restartability, we were not doing our best at that point. But thanks to this setup in development, we discovered the issue.

Bart Farrell: When investigating the reliability drops caused by node rotation, you discovered issues with application restartability in your Quarkus services. What did you initially try, and why didn't the standard Quarkus shutdown configurations work as expected?

Adhi Sutandi: Yes, it was actually unexpected in the beginning that we ran into SLO drops, that our reliability metrics went down when there was pod reshuffling, because we thought we had already covered this. We had the Quarkus shutdown delay implemented, and we had set terminationGracePeriodSeconds to match that Quarkus shutdown delay setting. We thought we were covered, but it turned out that was not the case. We were running Quarkus 3.8 at that point, and the setting just didn't work for us. What we observed is that when a SIGTERM is sent by the kubelet, the Quarkus pod immediately exits, which is not what we want. We want Quarkus to have a delay: to hold after the SIGTERM and continue to serve traffic until it is killed at the end of the termination grace period. That is what we want, but it did not happen. At that time we decided not to investigate any further why; it was just a reliability pattern we noticed. There was another thing about our Quarkus applications: in some of them we implement a hard dependency on our caching solution, meaning that if the caching solution is not available, the service immediately fails. We discovered this was not the best reliability pattern, because a good reliability pattern should assume that at some point a dependency will fail, and we should cope with that. That's what we discovered.

Bart Farrell: And the solution you ultimately implemented was a pre-stop sleep hook, which was inspired by a KubeCon talk on pod restartability. For listeners who want to visualize what's happening here, there's an excellent flowchart diagram at learnkube.com slash graceful shutdown, which we'll include in the show notes, that shows the pod termination sequence. Can you explain why adding this sleep before shutdown works better than relying on the application's built-in graceful shutdown mechanisms?

Adhi Sutandi: Sure. What happened at that point, and I think it's still the case, is that when a pod is deleted, reshuffled, or otherwise terminated, the termination happens in parallel with the removal of routing to the pod. It happens, I would say, asynchronously, but basically in parallel with the pod termination or deletion. There is no guarantee that the routing components have already been updated so that no in-flight requests are routed to the pod that is being terminated; there is no such guarantee in the Kubernetes mechanism. We were already aware of this, and we knew the remediation is to add a bit of a pause so that the routing components, such as kube-proxy, have already updated the iptables rules and so on before the pod actually terminates. We expected this behavior when we set the Quarkus shutdown delay config, but it was not working; like I mentioned, Quarkus just exited immediately on SIGTERM. Then during the talk at KubeCon, the pre-stop sleep was emphasized as a powerful solution to pod restartability that is quite commonly overlooked. So we replaced our Quarkus shutdown config with this pre-stop sleep, and it did what we wanted. Basically, it lets kube-proxy finish updating the iptables rules and routing before the pod is actually terminated. Any in-flight request still goes to the pod, and by the moment it is terminated we are sure no subsequent requests are coming to the terminated pod. That's what it does, and that's what we discovered.
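For illustration, a pre-stop sleep like the one described here is only a few lines of pod spec. A minimal sketch, assuming a 20-second sleep and an image that ships a sleep binary; the numbers and names are placeholders, not Beekeeper's values:

```yaml
# Hypothetical Deployment fragment: pause before shutdown so endpoint and iptables
# updates can propagate before the application stops accepting traffic.
spec:
  terminationGracePeriodSeconds: 50          # must cover the preStop sleep plus the app's own shutdown
  containers:
    - name: app
      image: registry.example.com/app:1.0    # placeholder image (needs a `sleep` binary)
      ports:
        - containerPort: 8080
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "20"]         # keep serving in-flight requests while routing drains
```

Recent Kubernetes releases also offer a native sleep action for preStop hooks, which avoids relying on a sleep binary inside the container image.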

Bart Farrell: Great. Now, beyond application shutdown, you also encountered problems with long-lived TCP connections to your Redis cluster. For teams running stateful services like Redis inside Kubernetes, what specific failure pattern did you observe during node rotations?

Adhi Sutandi: Yes. When we talk about stateful services, we normally focus too much on data persistence; that's what stateful usually means to people. But we overlook the fact that connections are also stateful. In our case with Redis, our applications use long-lived TCP connections to Redis, again for caching database queries to improve performance. When the Redis pod rotates, those TCP connections are terminated, they are gone, and this causes the applications to see a brief unavailability of Redis. And like I said, some of the applications did not implement the best reliability pattern: they just have this hard dependency on Redis. The moment Redis is unavailable, the pod immediately enters an unhealthy state, which effectively removes it from the service endpoints. This also contributed to the worsening of our SLOs at that time. So this was something we noticed we should improve, and we simply improved it by adding a fallback mechanism: when the caching solution is not available, the application should call the database instead of assuming the caching solution will always be available.

Bart Farrell: Your solution was to disable the Redis health check from affecting pod readiness. Some might argue this masks potential real issues. How did you weigh the trade-offs and what made this the right choice for your architecture?

Adhi Sutandi: Again, I think this was not the best pattern we had implemented in our applications. We thought about it and came up with the solution that we should let the query fall back to our database engine. The trade-off is basically this: if we always assume the cache works, we get the best performance, because the cache serves the data instead of going to the DB, but that comes with the hard assumption that Redis is always there. So we are talking about a trade-off between performance and availability here. If we fall back to MySQL, we are sure we have higher availability. If we keep the original design, where the application assumes Redis will always be there, we hard-fail when the cache is not available, meaning we guarantee performance. But looking back at our metrics, we saw that the slow queries, the slower end-to-end latency, were actually just a small part of the request distribution. We can live with that part. So we chose to have more availability by implementing the fallback to the database. For the small percentage of slow queries, in the worst case the request goes to the DB, but we still serve the majority of end-to-end requests that don't involve those slow queries at all. That's the trade-off we chose. Looking back at it, the health check mechanism that assumed Redis should always be available brought us more disadvantages than benefits.

Bart Farrell: You mentioned using a platform as a product approach with Helm charts to roll out these fixes across all clusters. How did this organizational pattern help you respond to the issues you discovered?

Adhi Sutandi: Regarding this, I got inspired by reading articles written by the Spotify engineering team about treating the platform as a product, and how their approach lets them patch company-wide security vulnerabilities or bugs very quickly. That is the mental model we're trying to adopt here. Thanks to Helm, we have our Kubernetes manifests wrapped nicely and versioned, so we already had a tool at our disposal to implement the same platform-as-a-product mental model. We push these versions with internal developer tooling, which lets us easily control and observe which applications have already been updated and which have not. It's a nice approach for seeing how you improve your time to deliver solutions company-wide. And we are talking about the platform here, not application software versions: we treat the platform as versioned software. I know it sounds a bit abstract, but the mental model lets you see platform operations differently, and it gives us more confidence when there is a frequent need to change something at the platform level, like implementing the pre-stop sleep. That can be baked into our Helm chart, which is used by all of our applications, and then we just push it to all of our applications without significant delays; we ship the new configuration everywhere in no time. It also transcends ownership: ownership of the application still belongs to the application developers, but anyone in the team, including the developers and including us DevOps engineers, can push this for everyone in the company, which gives a better time to deliver the solution.
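As a rough illustration of the "baked into the shared chart" idea, the pre-stop behaviour could live in a common chart template and be driven from values. The chart structure and value names below are assumptions for the sketch, not Beekeeper's actual chart:

```yaml
# templates/deployment.yaml (fragment) of a hypothetical shared chart.
# Every application inheriting the chart picks up the hook when the value is set,
# e.g. in values.yaml:
#   gracefulShutdown:
#     preStopSleepSeconds: 20
#     terminationGracePeriodSeconds: 50
spec:
  template:
    spec:
      terminationGracePeriodSeconds: {{ .Values.gracefulShutdown.terminationGracePeriodSeconds }}
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          {{- if .Values.gracefulShutdown.preStopSleepSeconds }}
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "{{ .Values.gracefulShutdown.preStopSleepSeconds }}"]
          {{- end }}
```

Bumping the shared chart version then rolls the change out to every application that consumes it, which is the platform-wide delivery mechanism described above.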

Bart Farrell: For protecting critical workloads, like ProxySQL, from Karpenter's consolidation events, you evaluated two approaches: pod disruption budgets and the Karpenter-specific do-not-disrupt annotation. Why did you ultimately prefer pod disruption budgets over the annotation?

Adhi Sutandi: Yes, those two achieve the same purpose: they are used to prevent Kubernetes from affecting certain pods during a node drain or a cluster-initiated disruption. A pod disruption budget is simply better because it transcends the autoscaler: it is not tied to Karpenter and can be used with any autoscaler you run. It is also very nice because it comes with label selectors that let you choose which pods, even across many different deployments, you want to put a budget on, so pods with certain labels cannot be disrupted massively. That's the first thing. The second is the annotation: the annotation only works for Karpenter-specific purposes. Also, when it comes to rolling out changes, say we have a Kubernetes update and we want to allow some disruption: with annotations we would need to remove them at the pod level. If we have, say, 20 critical applications, we would need to remove the annotations from all of their pods. Of course we can script this, but in the longer run, if you have complicated setups, many critical applications in different namespaces, with different deployments and different tolerance for failure, you cannot just remove all of them at once. And what about their placement: if they run on the same node, you need to control which one is actually critical and which ones are allowed to be disrupted at the same time. This is just not scalable, I would say. And if we want to remove the annotation for good, we need to change it at the Deployment, StatefulSet, or DaemonSet level, which triggers a rollout of the ReplicaSet, so that's another overhead. Meanwhile, with a pod disruption budget, we have a label selector and we can put a number or percentage there; it applies to any pod that has the label specified in the pod disruption budget. And in the event of an update, we just increase maxUnavailable or decrease minAvailable. For example, for a critical application we define maxUnavailable as zero because we don't want any of those pods disrupted, and during a Kubernetes update we just increase it to one, so we allow one of these critical pods to be rotated at a time, and that unblocks the update process.
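As a concrete sketch of the two options, here is a hypothetical PodDisruptionBudget selecting pods by a critical label, with the Karpenter per-pod annotation shown for comparison; the names, label, and values are illustrative:

```yaml
# Hypothetical PDB: block voluntary disruptions of pods labelled tier=critical.
# During an upgrade window, maxUnavailable can be raised to 1 to let nodes rotate one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-workloads
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      tier: critical

# The Karpenter-specific alternative is an annotation on each pod (set via the pod template),
# which only Karpenter honours and has to be removed per pod to allow disruption again:
#   metadata:
#     annotations:
#       karpenter.sh/do-not-disrupt: "true"
```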

Bart Farrell: Both pod disruption budgets and the do not disrupt annotation essentially block node draining for critical workloads, which creates a challenge during full cluster upgrades. How do you handle Kubernetes version updates when these protections are in place?

Adhi Sutandi: It's true, those two have the same purpose: they are used to prevent a cluster-initiated maintenance or disruption event from evicting the pods. They behave the same, but the difference is how easily we can relax them. During a Kubernetes update, we want the cluster to upgrade all the nodes, in sequence or in parallel, and we want the update to complete; we don't want these protections to impede or block the cluster upgrade, so we need to relax them. How do we relax them? With a pod disruption budget it is much easier: we just increase maxUnavailable or decrease minAvailable, depending on the original setup. We normally want maxUnavailable to be as small as possible, and we increase it for the duration of the update, and vice versa for a minAvailable-based budget. Meanwhile, with the annotation, during an update window we would need to remove it at the pod level, and when the pods are recreated the annotation is back, meaning we need to redo it again. Like I said, if it's just one or two critical services, removing the annotation from the pods is okay, even if they run 20 or 30 pods. But what if we have many critical applications, and how do we coordinate them if they run on different machines? This just means a lot of scripting if we want to keep using annotations. With a pod disruption budget, we just give all these critical workloads the same label, implement a budget based on that critical label, and either allow one of them to be unavailable or force, say, a minimum of 90% availability from these critical workloads. It is more convenient and more scalable to use pod disruption budgets compared to the Karpenter annotation.

Bart Farrell: One of the more subtle gotchas you discovered was Karpenter's default 10% disruption budget behavior. Can you walk us through how this implicit default caught your team off guard?

Adhi Sutandi: This was another interesting discovery. When we read the documentation, we saw that there is the ability to define a budget for disruption, and also to narrow down which budget corresponds to which consolidation reason, like Underutilized, Empty, or Drifted. We made the assumption that if we only allowed the Underutilized and Empty reasons, each with a minimal budget, we were implicitly telling Karpenter not to allow the Drifted events to happen. We discovered that this is not the case: Karpenter requires explicit budgets for all reasons. If you don't cover a reason, the default 10% disruption budget applies to any reason that does not fall under the conditions you specified. This is what caught us off guard. Because of the latest tag we used in our less critical clusters, drift still happened during windows where we did not expect it, which contributed to the problem. We had assumed an implicit budget in Karpenter, but that's not the case; it requires explicit budgets for all the reasons you care about.
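To make the gotcha concrete, here is roughly what that assumption looks like in a NodePool spec; the node pool name and numbers are illustrative:

```yaml
# Hypothetical NodePool fragment. The intent was "only Empty/Underutilized may disrupt",
# but a reason not covered by any budget (here: Drifted) still falls under the
# default 10% budget, so drift-driven node rotations can still happen.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  disruption:
    budgets:
      - nodes: "1"
        reasons: ["Empty", "Underutilized"]
      # No budget mentions Drifted -> the implicit 10% budget still applies to it.
```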

Bart Farrell: After understanding the implicit defaults, what disruption budget configuration did you settle on for your less critical clusters and why?

Adhi Sutandi: In general, our approach is that we want a maximum of one disruption at a time. Ideally, only one node is rotated at any given moment, for any reason. Given what we found out about the Karpenter settings, we realized we should simply remove the reason qualifiers for Underutilized and Empty. By removing the qualifiers, the budget applies to all reasons, including drift, so we only allow one disruption at a time. And if we have scheduled windows, we just specify which consolidation events are allowed during that window and what budget we give them. So it's simple: set the budget to one node and don't put any reasons on it, and the result is that at any time, under any condition, at most one node is being replaced. I guess that's desirable for most use cases.
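A sketch of that simpler configuration: one budget with no reason qualifier, so it covers every disruption reason, plus an optional scheduled window. The cron schedule, duration, and numbers are illustrative assumptions:

```yaml
# Hypothetical NodePool disruption settings along the lines described above.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  disruption:
    budgets:
      # No "reasons" field: this budget applies to Empty, Underutilized and Drifted alike.
      - nodes: "1"
      # Optional: block all disruptions during an (illustrative) daytime window,
      # starting 06:00 and lasting 14 hours; outside the window the "1" budget applies.
      - nodes: "0"
        schedule: "0 6 * * *"
        duration: 14h
```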

Bart Farrell: Okay, and for teams out there that are considering a migration from the cluster autoscaler to Karpenter, what are the top three lessons or pieces of advice that you'd share based on your experience?

Adhi Sutandi: Okay. First things first, I would advise people not to use the latest tag. It was fine on our side because we never promoted it to the critical clusters; we discovered the issue during the POC and during the rollout to the less critical clusters. For us, it enables a one-click infrastructure upgrade, but it is not advisable, again because the release of the AMI is out of our control. It's not on our release cycle, we don't monitor it, so we have little to no control over it. They put that warning in the documentation for a reason. So I advise against using the latest tag, unless you really want that situation, which can actually be beneficial: if you want a kind of chaos engineering test of your environment, the infrastructure gets rotated at unpredictable times, you can treat it as a failure event, and it's good in the sense that it tests your applications' ability to recover from interruptions you did not expect, because again they don't happen in your window. Related to this, and it does not only apply to using Karpenter as the autoscaler but to using Kubernetes in general: you need to analyze your applications' ability to handle restarts. Like I mentioned, the pre-stop sleep helps a lot with restartability. Next, PDBs were introduced for a good reason: favor PDBs for handling this, especially for setting the bare minimum of pods that should keep running in your cluster, and favor them over any autoscaler-specific means of preventing disruption, because PDBs are universal and work with any autoscaler. And last but not least, you should really read the Karpenter documentation and understand it before you start using it, so you don't fall into the same pitfall we did, especially the need for explicit budgets for all disruption reasons; you should not assume there is an implicit disruption budget just because you defined something else explicitly. That's it. And for those who are still doubting whether to adopt Karpenter, I would encourage you to start looking at it, because it really gives a lot of benefits. I give my shout-out to the people in the Karpenter community on Slack who helped answer one or two questions from our side, and of course to the great people I met at KubeCon 2025.
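On that first lesson, pinning the AMI instead of tracking latest is a small change to the EC2NodeClass AMI selector, for example something like the fragment below. The version string is a placeholder, and the rest of the EC2NodeClass (role, subnet and security group selectors) stays as before:

```yaml
# Hypothetical EC2NodeClass fragment: pin to a specific AL2023 release so node rotation
# happens on your own schedule, not whenever a new upstream AMI is published.
spec:
  amiSelectorTerms:
    - alias: al2023@v20240807   # placeholder version string, bump deliberately per upgrade
```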

Bart Farrell: Fantastic. What's next for you, Adhi?

Adhi Sutandi: All right. I think platform engineering, SRE, and DevOps are never-ending games: once you unlock a level, there is a next level to unlock. It's the same for everything that falls under the umbrella of CNCF and cloud-native technologies. For me, what's next is to keep sharpening my skills and, of course, to keep myself updated on the implementations and discoveries made by other companies, enterprises, and organizations, because those are great ways to learn about the technology out there. The technologies under CNCF and in the cloud-native space are emerging and advancing really fast, so to keep up I would like to stay informed: attending conferences from time to time, attending workshops and seminars, and of course doing my own reading of tech articles and blog posts on Medium. Basically, long story short: keep exploring and geeking out, I would say.

Bart Farrell: I like that a lot. And Adhi, if people want to get in touch with you, what's the best way to do that?

Adhi Sutandi: If people want to get in touch, I'm always open to discussion. You can always send me a message on LinkedIn, and I will try to reply to those messages.

Bart Farrell: Fantastic. Well, this is your first podcast. It definitely won't be your last. I look forward to our paths crossing again in the future. Thank you so much for being generous with your time and your knowledge and helping others grow, facing these challenges in the Kubernetes ecosystem when it comes to making the right decisions and understanding the trade-offs. I look forward to seeing what's next. Take care and have a great day. Cheers.

Adhi Sutandi: Thanks, Bart. Cheers.
