We Broke Our EKS Cluster Autoscaler with the AL2023 Migration
Jan 13, 2026
This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
Dilshan Wijesooriya, Senior Cloud Engineer, discusses a real incident where migrating EKS nodes to AL2023 caused the cluster autoscaler to lose AWS permissions silently.
You will learn:
Why AL2023 blocks pod access to instance metadata by default, breaking components that relied on node IAM roles (like cluster autoscaler, external-DNS, and AWS Load Balancer Controller)
How to implement IRSA correctly by configuring IAM roles, Kubernetes service accounts, and OIDC trust relationships, and why both AWS IAM and Kubernetes RBAC must be configured independently
The recommended migration strategy: move critical system components to IRSA before changing AMIs, test aggressively in non-production, and decouple identity changes from OS upgrades
How to audit which pods currently rely on node roles and clean up legacy IAM permissions to reduce attack surface after migration
Transcription
[00:00] Bart Farrell: What actually breaks when you migrate an EKS cluster to Amazon Linux 2023, and why? Today on KubeFM, we're joined by Dilshan, a senior cloud engineer working on Kubernetes platforms and EKS. In this episode, Dilshan walks through a real incident his team hit during an AL2023 migration, where the cluster autoscaler silently lost AWS permissions and began failing in production-like conditions. We unpack the hidden assumptions behind node IAM roles, how AL2023 changes access to instance metadata, and why relying on node roles for system components eventually becomes a liability. Dilshan explains the correct use of IRSA, the separation between AWS IAM and Kubernetes RBAC, and how to approach this migration safely before deprecation deadlines force your hand. This is a practical discussion for teams running EKS at scale who want to avoid discovering identity and autoscaling failures the hard way. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped Kubernetes engineers from all over the world level up through Kubernetes courses. Courses are instructor-led and are 60% practical, 40% theoretical. Students have access to course materials for the rest of their lives. They are given in-person and online, to groups as well as to individuals. For more information about how you can level up, go to learnkube.com. Now, let's get into the episode.
[01:19] Dilshan Wijesooriya: (instrumental music plays) You're tuned in to KubeFM.
[01:24] Bart Farrell: Welcome to KubeFM. What are three emerging Kubernetes tools that you're keeping an eye on?
[01:29] Dilshan Wijesooriya: The first one is Lens, the Kubernetes IDE. I like Lens because it turns kubectl plus a bunch of YAML into something you can actually see and navigate quickly: the cluster, namespaces, workloads, logs, events, all in one place. It lowers the cognitive load of working with Kubernetes, especially when you are juggling multiple clusters or trying to understand an unfamiliar environment. For platform teams, it's interesting because it makes kubectl feel much less hostile for application developers. Instead of teaching everyone a dozen kubectl commands, you can hand them a UI that still respects RBAC and gives them a clear window into their apps and clusters. That means fewer support tickets and faster feedback loops when debugging. Secondly, vcluster, a virtual cluster on top of a single cluster. I really like the idea of spinning up virtual clusters for teams or environments on top of one physical cluster. It's an elegant way to give strong isolation and a clean API surface to developers without exploding the number of real clusters you have to manage. For platform teams, it's interesting because it could change the trade-off between one big multi-tenant cluster and too many small clusters. And thirdly, Kueue, batch and job scheduling for Kubernetes. The batch and data side of Kubernetes is still not as mature as the long-running service side. Kueue is nice because it brings more structured job orchestration: queuing, fairness, and resource-aware scheduling across multiple job frameworks. For organizations that run both microservices and data workloads on the same cluster, tools like Kueue might help stop one side from taking over everything.
[03:29] Bart Farrell: In terms of, who you are and your role and where you work, can you just give us a bit of background about yourself?
[03:34] Dilshan Wijesooriya: Sure. I'm Dilshan, originally from Sri Lanka and currently working as a senior cloud engineer in the Netherlands. Day to day, I work on a Kubernetes platform on EKS and everything around it: infrastructure automation with CDK, Terraform, and Pulumi, scaling, security, and making sure the deployment teams can ship features quickly without worrying about the underlying cloud infrastructure. Practically, that means I spend a lot of time designing and operating cloud infrastructure, improving how we handle things like reliability, scalability, observability, and, obviously, cost, and building internal platforms that other teams can reuse instead of reinventing the wheel. I've been working in the cloud for about nine years now across AWS, Azure, and GCP, and for the last few years, most of my energy has gone into Kubernetes-heavy environments.
[04:34] Bart Farrell: And Dilshan, how did you get into cloud native?
[04:38] Dilshan Wijesooriya: Yeah, I definitely didn't start with cloud native. My first role was very traditional: application support and system engineering, looking after bare metal and VM-based workloads, managing servers, patching them, fixing outages, all the classic stuff. Over time, I got pulled into automation almost by necessity. We had too many manual deployments and too many "works on my machine" incidents, so I started scripting deployments, setting up Jenkins pipelines, and standardizing how we configure environments. That actually led me into the cloud more seriously. The turning point for cloud native especially was when the team I worked with started containerizing applications. Kubernetes first appeared as a new thing we should experiment with, and then suddenly it was running real workloads. I was one of the few people who enjoyed digging into how it actually works: the networking, the scheduling, the controllers. So I gradually shifted into building and running Kubernetes platforms on Azure, GCP, and EKS. It was not a big bang career change, more just following the pain points and my interest in the problems.
[05:53] Bart Farrell: What were you before cloud native?
[05:55] Dilshan Wijesooriya: Yeah. Before cloud native, I looked after Linux and Windows servers, application servers like Tomcat, and databases like Oracle and MySQL. A typical day could be anything from troubleshooting a memory leak on an application server, to restoring a database backup, to manually deploying a new version in the middle of the night. It was a very ticket-driven environment, based on change requests, scheduled releases, and quite a bit of firefighting. We did have some automation, like shell scripts and maybe some configuration management, but nothing like GitOps or infrastructure as code. A lot of knowledge lived in people's heads and in runbooks that were not in version control. That background is still useful, by the way: when something weird happens on an old box in the network, that old-school sysadmin muscle memory still helps.
[06:57] Bart Farrell: And the Kubernetes ecosystem moves very quickly. How do you stay up to date? What resources work best for you?
[07:04] Dilshan Wijesooriya: For news, I follow the Kubernetes release notes and the changelogs of the tools we actually run in production, like the autoscaler, ingress, service mesh, and observability stack. I also follow KubeCon talks, especially the incident and war-story style talks. For deeper learning, I like long-form content: blog posts and migration write-ups where people show their architecture, trade-offs, and mistakes. That's also how I try to write myself, including the AL2023 article that led to this episode. And honestly, a lot of learning comes straight from work. We run EKS at a decent scale, so a new Kubernetes or AWS feature quickly becomes a real requirement.
[07:55] Bart Farrell: And if you could go back in time and share one piece of career advice with your younger self, what would it be?
[08:03] Dilshan Wijesooriya: I would say focus on fewer things, but more deeply. Early in my career, I tried to touch everything: every cloud, every tool. It's fun, but it stays quite shallow. I would advise my younger self to pick a couple of key areas, for example Kubernetes and maybe one cloud provider, and go very deep there first. Once you have that depth, it's much easier to generalize and learn new technologies later. And you are more valuable to your team because you are the person who can really debug and design, not just install things.
[08:37] Bart Farrell: As part of our monthly content discovery, we found an article that you wrote titled, "We broke our EKS cluster autoscaler during Amazon AL 2023 migration and fixed it. Here's what we learned." So we want to get into this a little bit more deeply with the following questions. AWS has announced that Amazon Linux 2 AMIs for EKS will be deprecated after November 2025, which means many teams are now planning their migration to Amazon Linux 2023. Can you walk us through what prompted your team to tackle this migration?
[09:08] Dilshan Wijesooriya: Yeah, for us it was a mix of deadline and opportunity. On the deadline side, AWS has been very clear: Amazon Linux 2 for EKS is on the way out. After November 26th this year, they stop publishing new AL2 AMIs for EKS, and Kubernetes 1.32 will be the last version that gets AL2 images. So if you want to stay current with Kubernetes, you just can't pretend AL2 will be there forever. On the opportunity side, AL2023 actually gives us things we care about: better security defaults, an updated kernel, and a support window that aligns better with what we want our platform to be in a few years. So we said, we have to move anyway, let's do it on our own terms, learn as much as possible, and do it in non-production before the pressure is on. On paper, it looked like a fairly routine infrastructure upgrade that we could do with almost no visible impact. In reality, we discovered a hidden assumption the hard way.
[10:24] Bart Farrell: And before we look at exactly what went wrong, let's talk about the migration strategy that your team planned. What was the approach that you decided to take for switching your EKS nodes to AL 2023?
[10:37] Dilshan Wijesooriya: Yeah, we chose what I would call the sensible, boring strategy, the one everyone recommends. We spin up a new node group using AL2023 images, cordon and drain the old AL2 nodes, and then let the cluster autoscaler and scheduler do their job. It's basically a blue/green migration. We had done similar patterns before for other changes, and it's well documented in the Kubernetes and autoscaler world, so nothing exotic. That's the part that makes this incident interesting: the strategy itself was fine. The problem was in the assumption about how the autoscaler was getting its AWS permissions, which only became visible once we changed the underlying AMI.
[11:30] Bart Farrell: Now, for listeners out there who might not be familiar with the difference, AL 2023 isn't just a version bump from AL 2. There are some significant under-the-hood changes. What are the key technical differences the team should be aware of?
[11:44] Dilshan Wijesooriya: Yeah, there are a lot of changes, but from a Kubernetes operator's point of view, I would highlight two layers. First, the obvious ones: AL2023 switched from yum to dnf as the package manager, it ships with a newer kernel, and it tightens the security defaults in general. Those are the things you notice quickly when you log into a node. The second layer was more subtle for us, and more impactful: how instance metadata and permissions are handled. AL2023 is much stricter by default about pods being able to reach the EC2 instance metadata service and borrow the node's IAM role. On AL2, that behavior was very easy, maybe too easy; it let pods rely on the node's role. On AL2023, that shortcut effectively disappears unless you deliberately re-enable it or redesign your identity story. That was the key difference that turned our AL2023 migration from just an OS change into "oh, the autoscaler can't talk to AWS anymore."
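For readers who want to see where their own clusters stand, here is a minimal sketch (not from the episode) that uses boto3 to inspect the instance metadata options on EKS worker nodes. The cluster-name tag used in the filter is an assumption about how your nodes are tagged, so adjust it to your environment; an IMDSv2-only endpoint combined with a hop limit of 1 is the setting that stops pods from borrowing the node role.

```python
"""Sketch: inspect IMDS settings on EKS worker nodes with boto3.

Assumption: nodes carry the kubernetes.io/cluster/<name>=owned tag
(adjust the filter if your node groups are tagged differently).
"""
import boto3

CLUSTER_NAME = "my-cluster"  # hypothetical cluster name

ec2 = boto3.client("ec2")

pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": f"tag:kubernetes.io/cluster/{CLUSTER_NAME}", "Values": ["owned"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            opts = instance.get("MetadataOptions", {})
            # HttpTokens=required means IMDSv2 only; HttpPutResponseHopLimit=1
            # means the token response does not survive the extra network hop
            # from a pod, so pods cannot reach IMDS and borrow the node role.
            print(
                instance["InstanceId"],
                "tokens:", opts.get("HttpTokens"),
                "hop_limit:", opts.get("HttpPutResponseHopLimit"),
                "endpoint:", opts.get("HttpEndpoint"),
            )
```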
[13:00] Bart Farrell: Now let's get more into the incident itself. After you switched to the AL2023 node group, things started breaking. What were the first symptoms your team noticed that indicated something was wrong?
[13:11] Dilshan Wijesooriya: Yeah, it started quietly, which is always the dangerous kind of failure. The first thing we noticed was on the observability side: some of our Datadog agents stopped reporting metrics from certain nodes. When your monitoring goes dark, you know you are in for an interesting story. A little later, we started seeing workloads misbehaving: services crashing, health checks turning red, pods not getting scheduled the way we expected. At that point, it wasn't obvious that the autoscaler was the root cause; it just looked like the cluster was in a bad state. When we dug into the cluster autoscaler logs, we found the key error. It was saying something like "failed to get nodes from API server: unauthorized." That was the moment the whole story clicked. The autoscaler itself couldn't do its job because it was suddenly unauthorized to talk to the pieces it needed, and everything else was just a consequence of that.
[14:12] Bart Farrell: The error message mentioned unauthorized access to the API server. For teams running EKS clusters, understanding how pod permissions work is critical. Can you explain what was actually happening under the hood that caused this failure?
[14:25] Dilshan Wijesooriya: Yes, under the hood, the issue was about how the autoscaler gets its AWS credentials, and the fact that we'd been relying on an invisible side effect for years. On AL2, the cluster autoscaler pod was effectively using the EC2 instance role from the node, reached via the instance metadata service. We never gave it its own dedicated IAM role at the pod level. It just worked, so it faded into the background. With AL2023, access from pods to instance metadata is much stricter by default, so when the autoscaler moved to AL2023 nodes, the old path to the credentials was blocked. From the autoscaler's point of view, it suddenly looked like it had lost the ability to assume the node's role. The key point is that we didn't change the autoscaler Helm chart, we didn't change the config, but we changed the AMI, and that exposed a hidden dependency we didn't realize we had.
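To make that hidden dependency visible, here is a small, hypothetical probe you could run from inside a pod, using only the Python standard library. It follows the public IMDSv2 token flow; treat it as an illustration rather than the team's actual tooling.

```python
"""Sketch: probe the EC2 instance metadata service from inside a pod.

If the node enforces IMDSv2 with a hop limit of 1 (the stricter default
discussed in the episode), the token request below times out and the
pod cannot fall back to the node's IAM role.
"""
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254"


def node_role_reachable(timeout: float = 1.0) -> bool:
    try:
        # IMDSv2: first fetch a short-lived session token...
        token_req = urllib.request.Request(
            f"{IMDS}/latest/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()

        # ...then ask which IAM role credentials are exposed on this node.
        creds_req = urllib.request.Request(
            f"{IMDS}/latest/meta-data/iam/security-credentials/",
            headers={"X-aws-ec2-metadata-token": token},
        )
        role = urllib.request.urlopen(creds_req, timeout=timeout).read().decode()
        print(f"Pod can still borrow the node role: {role.strip()}")
        return True
    except (urllib.error.URLError, TimeoutError, OSError):
        print("IMDS not reachable from this pod (the stricter AL2023-style behaviour).")
        return False


if __name__ == "__main__":
    node_role_reachable()
```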
[15:37] Bart Farrell: So when the cluster autoscaler lost its AWS permissions, what specific capabilities did it lose and how did that cascade into the broader service outages you experienced?
[15:48] Dilshan Wijesooriya: Yes. Once the autoscaler lost its AWS permissions, it basically went blind on the AWS side. It could no longer describe autoscaling groups and instances, set the desired capacity for the node groups, terminate instances in the autoscaling group, or look up launch templates and instance types. So when we cordoned and drained the old AL2 nodes, the autoscaler saw pending pods but had no way to tell AWS, "Hey, I need some more nodes over there." The result was a kind of slow-burn incident: the old nodes disappeared from the cluster as we drained them, and the new nodes were not being created. Pods like the Datadog agents and some critical services were stuck in a Pending state. From the outside, it looked like a scheduling or capacity problem, but the real issue was that the control loop between Kubernetes and AWS had silently lost its keys.
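As a rough illustration of what "going blind on the AWS side" means, these are the kinds of AWS calls the cluster autoscaler depends on. The sketch below exercises the read-only ones with boto3 (the autoscaling group name is a placeholder), which also doubles as a cheap smoke test that whatever credentials the autoscaler pod ends up with actually work.

```python
"""Sketch: the kind of AWS calls the cluster autoscaler depends on.

Running the read-only calls is a cheap way to verify that whatever
identity the autoscaler pod has can actually reach AWS.
"""
import boto3

ASG_NAME = "eks-al2023-nodegroup-asg"  # hypothetical autoscaling group name

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

# Read side: discover node groups, their sizes and launch configuration.
groups = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"]
for group in groups:
    print(group["AutoScalingGroupName"], group["DesiredCapacity"], group["MaxSize"])

launch_templates = ec2.describe_launch_templates()["LaunchTemplates"]
print(f"{len(launch_templates)} launch templates visible")

# Write side (commented out): this is what scaling up and down needs.
# autoscaling.set_desired_capacity(
#     AutoScalingGroupName=ASG_NAME, DesiredCapacity=3, HonorCooldown=False
# )
# autoscaling.terminate_instance_in_auto_scaling_group(
#     InstanceId="i-0123456789abcdef0", ShouldDecrementDesiredCapacity=True
# )
```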
[16:52] Bart Farrell: Now IRSA, or IAM roles for service accounts, is the AWS-recommended way to grant AWS permissions to pods in EKS. For listeners who haven't implemented it before, can you explain at a high level what IRSA is and why it's more secure than relying on node IAM roles?
[17:11] Dilshan Wijesooriya: Yes. IAM roles for service accounts, also known as IRSA, is the AWS-recommended way to give EKS pods permission to call AWS APIs. Instead of every pod on a node effectively sharing the same node IAM role, you create a dedicated IAM role with a specific policy, then create a Kubernetes service account for the workload, and finally link them with an OIDC trust relationship plus an annotation on the service account. On the pod side, only that service account can assume the role. Compared to node IAM roles, it gives you much tighter least privilege and a better blast radius; for example, a single compromised pod doesn't hand over the full node-level AWS permissions. And it gives you clearer auditing and reasoning. AL2023 effectively removed the easy path to the node role via metadata, so IRSA goes from a nice security best practice to: this is actually how you should be doing it if your pods want to talk to AWS.
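A minimal sketch of the AWS side of that wiring, assuming boto3 and placeholder account, region, and OIDC provider values: the trust policy is what restricts the role to tokens issued for one specific service account.

```python
"""Sketch: create an IRSA role whose trust policy only accepts tokens
issued for one specific Kubernetes service account.

The account ID, region and OIDC provider ID below are placeholders.
"""
import json

import boto3

ACCOUNT_ID = "111122223333"  # placeholder
OIDC_PROVIDER = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
SERVICE_ACCOUNT = "system:serviceaccount:kube-system:cluster-autoscaler"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{ACCOUNT_ID}:oidc-provider/{OIDC_PROVIDER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Only the projected token of this exact service account
                    # may assume the role.
                    f"{OIDC_PROVIDER}:sub": SERVICE_ACCOUNT,
                    f"{OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                }
            },
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="cluster-autoscaler-irsa",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="IRSA role for the cluster autoscaler",
)
print(role["Role"]["Arn"])
```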
[18:25] Bart Farrell: And now when setting up IRSA for the cluster autoscaler, there are several components that need to work together. At a high level, what are the key pieces that teams need to have in place, and could you tell us more about how do they connect?
[18:39] Dilshan Wijesooriya: At a high level, there are three major building blocks, I would say. The first one is the IAM role and policy. We define an IAM policy with exactly what the cluster autoscaler needs, like describing autoscaling groups and instances, setting the desired capacity, terminating instances, and describing launch templates and instance types, and then we attach this policy to an IAM role. Secondly, we need a Kubernetes service account. In the kube-system namespace, we created a service account for the cluster autoscaler; this becomes the identity the pod runs as on the Kubernetes side. And thirdly, we need OpenID Connect, also known as OIDC, and some annotations. The EKS cluster needs an OIDC provider configured, and we then annotate the service account with the IAM role's ARN. This annotation is the bridge that tells AWS: if a token comes from this service account, via this OIDC provider, it is allowed to assume this role. Once that wiring is done, the autoscaler just uses the normal AWS SDK inside the pod, but behind the scenes those calls use the IRSA role instead of the node's role.
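To make the second and third building blocks concrete, here is a sketch of the annotated service account using the official Kubernetes Python client. In practice the cluster autoscaler Helm chart usually creates this for you and you only pass the annotation through its values; the role ARN below is a placeholder.

```python
"""Sketch: the service-account side of IRSA, via the Kubernetes client.

The role ARN is a placeholder; normally the Helm chart creates this
service account and you only supply the annotation through its values.
"""
from kubernetes import client, config

ROLE_ARN = "arn:aws:iam::111122223333:role/cluster-autoscaler-irsa"  # placeholder

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

service_account = client.V1ServiceAccount(
    metadata=client.V1ObjectMeta(
        name="cluster-autoscaler",
        namespace="kube-system",
        # This annotation is the bridge: EKS injects a projected OIDC token
        # and the AWS SDK exchanges it for credentials of this role.
        annotations={"eks.amazonaws.com/role-arn": ROLE_ARN},
    )
)

core.create_namespaced_service_account(namespace="kube-system", body=service_account)
print("service account kube-system/cluster-autoscaler created")
```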
[20:05] Bart Farrell: I think there's something here that catches many teams off guard. You mentioned that setting up IRSA alone wasn't enough. You also need to configure Kubernetes RBAC. RBAC is a bit of a pain point for a lot of people out there in the ecosystem. Can you explain why both are necessary and what each one is responsible for?
[20:23] Dilshan Wijesooriya: Yes, this is something that trips up a lot of people. IRSA and RBAC solve completely different problems. IRSA controls what your pod can do in AWS, like calling the autoscaling APIs, EC2 APIs, maybe S3 or SQS depending on the component. Kubernetes RBAC controls what your pod can do in Kubernetes, like listing nodes, watching pods, and updating leases and node status. The cluster autoscaler needs both. It needs AWS permissions to change node group capacity, and it needs Kubernetes permissions to watch workloads and nodes and update node annotations in order to make scaling decisions. So you can have perfect IRSA and still a broken autoscaler if your RBAC is too strict, or you can have wide-open RBAC, but if the IRSA is wrong it will not be able to talk to AWS. Thinking of them as two separate access planes, one for the cloud and one for the cluster, really helps when you debug these kinds of issues.
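To show the "second knob" concretely, here is a trimmed sketch of the Kubernetes-side permissions using the RBAC API. The upstream cluster autoscaler chart ships the full, authoritative rule set and also binds it to the service account; this subset is only illustrative.

```python
"""Sketch: a trimmed ClusterRole for the autoscaler's Kubernetes-side
permissions. The upstream Helm chart ships the full, authoritative rule
set and also creates the ClusterRoleBinding; this only shows the shape.
"""
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

cluster_role = client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="cluster-autoscaler-demo"),
    rules=[
        # Read the cluster state the autoscaler makes scaling decisions about.
        client.V1PolicyRule(
            api_groups=[""],
            resources=["nodes", "pods", "services", "namespaces"],
            verbs=["get", "list", "watch"],
        ),
        # Leader election between autoscaler replicas.
        client.V1PolicyRule(
            api_groups=["coordination.k8s.io"],
            resources=["leases"],
            verbs=["get", "create", "update"],
        ),
    ],
)
rbac.create_cluster_role(body=cluster_role)
print(
    "ClusterRole created; bind it to kube-system/cluster-autoscaler "
    "with a ClusterRoleBinding (the Helm chart normally does this for you)."
)
```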
[21:38] Bart Farrell: And now, after implementing the fix, you mentioned cleaning up legacy IAM access from the node group role. Why is this cleanup step so important, and what are the risks of leaving those permissions in place?
[21:50] Dilshan Wijesooriya: Once IRSA is in place and tested, leaving the old permissions on the node role becomes unnecessary risk. If some pod, now or in the future, somehow gets access to the instance metadata service, it could potentially still assume the node's role. If the node role is still carrying broad permissions, like scaling node groups, you effectively have a hidden backdoor to more powerful AWS actions. By stripping those permissions from the node group role, we reduce the attack surface, we make it much clearer which pod is allowed to do what, and we avoid confusion later where people might not know if something was using IRSA or just falling back to the node role. It's the classic clean-up-after-yourself step: once the new least-privilege path is working, close the old ones.
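A sketch of what that audit and cleanup can look like with boto3, assuming a placeholder node role name and a dry-run flag so nothing is detached until you have confirmed nothing on the nodes still needs it:

```python
"""Sketch: audit (and optionally detach) autoscaling permissions that
are still attached to the node group's IAM role after moving to IRSA.

The role name is a placeholder; set DRY_RUN = False only once you are
sure nothing running on the nodes still needs these permissions.
"""
import boto3

NODE_ROLE = "eksctl-my-cluster-nodegroup-NodeInstanceRole"  # placeholder
DRY_RUN = True

iam = boto3.client("iam")

# Managed policies attached to the node role.
attached = iam.list_attached_role_policies(RoleName=NODE_ROLE)["AttachedPolicies"]
for policy in attached:
    if "autoscal" in policy["PolicyName"].lower():
        print("would detach" if DRY_RUN else "detaching", policy["PolicyArn"])
        if not DRY_RUN:
            iam.detach_role_policy(RoleName=NODE_ROLE, PolicyArn=policy["PolicyArn"])

# Inline policies are easy to forget; list them so they get reviewed too.
for name in iam.list_role_policies(RoleName=NODE_ROLE)["PolicyNames"]:
    print("inline policy still on the node role:", name)
```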
[22:49] Bart Farrell: Now, EKS pod identity is a newer alternative to IRSA that's gaining some traction in the ecosystem. Your team considered it, but stuck with IRSA. What were the factors that influenced that decision and when might pod identity be the better choice?
[23:06] Dilshan Wijesooriya: We are definitely interested in Pod Identity; it's a nice evolution of the story. But for this specific incident, IRSA was the pragmatic choice for a few reasons. We already had OIDC and IRSA configured on the cluster for other purposes. Our cluster autoscaler Helm chart fits very naturally with the IRSA pattern of a service account plus annotations. And our IaC, the CDK code, already had patterns for creating service accounts with IAM roles that way. Introducing a second identity model right in the middle of a time-sensitive migration would have added risk and cognitive load. IRSA gave us a direct, well-understood path to fix the problem quickly. I think Pod Identity makes a lot of sense for new clusters where you can design the identity story from day one, or for teams that don't want to manage OIDC providers themselves. So we are not against Pod Identity at all; we just chose IRSA as the fastest safe fix for this particular outage, but we will definitely consider it in the future.
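For comparison, and purely as a sketch of the alternative they considered, EKS Pod Identity replaces the OIDC trust policy and service-account annotation with an association managed through the EKS API, and the role instead trusts the pods.eks.amazonaws.com service principal. The names below are placeholders, and it is worth confirming the exact API shape against the current boto3 and EKS documentation.

```python
"""Sketch: the EKS Pod Identity alternative to IRSA.

Instead of an OIDC trust policy plus a service-account annotation, the
role trusts the pods.eks.amazonaws.com service principal and the link
is an association object on the EKS side. Names are placeholders;
confirm the API shape against current boto3/EKS docs.
"""
import json

import boto3

iam = boto3.client("iam")
eks = boto3.client("eks")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }
    ],
}

role = iam.create_role(
    RoleName="cluster-autoscaler-pod-identity",  # placeholder
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Link the role to a service account; no OIDC provider or annotation needed.
eks.create_pod_identity_association(
    clusterName="my-cluster",  # placeholder
    namespace="kube-system",
    serviceAccount="cluster-autoscaler",
    roleArn=role["Role"]["Arn"],
)
```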
[24:19] Bart Farrell: This incident happened in your testing environment which prevented production impact. For teams preparing to do this migration in production, what's the strategy that you're now recommending?
[24:30] Dilshan Wijesooriya: Yeah. The big change is that we no longer treat IAM and RBAC as background details; now they are front and center. Our recommended approach is: before touching your AMIs, move the key system components, like the autoscaler, external-dns, the load balancer controller, and so on, to IRSA or Pod Identity while you are still on AL2. Get the IAM and RBAC right in a world you still understand. Then test aggressively in non-production: cordon and drain, watch how the autoscaler behaves, read the logs, and make sure scaling actions still work exactly as you expect. When that's solid, introduce AL2023 nodes and do the usual cordon and drain of the AL2 nodes. For production, treat it like planned maintenance: communicate with the teams, have a rollback option, and watch your observability closely even if you are expecting zero visible downtime. Treating it as a risky change makes everyone more prepared. The key mindset is: don't couple how identity works with the OS change in production at the same time. If you can avoid it, decouple them.
[25:57] Bart Farrell: Now, looking back at this experience, what are the key lessons learned that you'd want other EKS operators to take away from this incident?
[26:05] Dilshan Wijesooriya: I would boil it down to a few points. First, AL2 let us get away with some shortcuts. Using the node IAM role via metadata for system components felt convenient, but it was always a bit of a cheat. AL2023 forced us to fix this, and that's ultimately a good thing. Second, for anything critical that talks to AWS APIs, like the cluster autoscaler, external-dns, controllers, and operators, pod-level identity like IRSA or Pod Identity is no longer optional; it's the model you should be on. Third, remember that IRSA and Kubernetes RBAC are two separate knobs, one for the AWS side and one for the Kubernetes side, and you need both tuned correctly. And finally, treat AMI upgrades as an application change, not just infra. They can change behavior in very real ways: how things authenticate, what's allowed by default, how security boundaries look. If you approach them with that mindset, you are much more likely to see issues early in non-production rather than in the middle of a production rollout.
[27:24] Bart Farrell: And for teams who are currently running on AL2 and haven't started planning their migration yet, what would be your advice on how to approach this proactively before the deprecation deadline?
[27:34] Dilshan Wijesooriya: My main advice is: start with identity, not AMIs. I would suggest auditing which pods rely on the node IAM role today, and not just the autoscaler. It could be external-dns, the AWS Load Balancer Controller, cert-manager, some backup tools, custom controllers, you name it. For each of those, plan a move to IRSA or Pod Identity while you are still on AL2; that way you change one thing at a time. Then test those changes in a non-production environment. Break them there, not in production. Once your critical components have a clean, explicit identity story, the move to AL2023 becomes much more like a traditional OS upgrade, because you are not discovering IAM surprises at the same time. And importantly, don't wait until late 2025 hoping it will be a small thing. It's doable, but it touches a fundamental part of how your cluster talks to AWS, so give yourself time to iterate, which makes the whole experience much less stressful.
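As a starting point for that audit, here is a sketch that walks every pod and flags the ones whose service account carries no IRSA annotation; any pod in that list that talks to AWS is almost certainly leaning on the node role today. It assumes kubeconfig access and the standard eks.amazonaws.com/role-arn annotation.

```python
"""Sketch: flag pods whose service account has no IRSA role annotation.

Any pod in the output that calls AWS APIs is most likely relying on the
node IAM role today and is a candidate for IRSA or Pod Identity.
"""
from kubernetes import client, config

IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"

config.load_kube_config()
core = client.CoreV1Api()

# Cache service accounts so we only read each one once.
sa_has_irsa = {}
for sa in core.list_service_account_for_all_namespaces().items:
    annotations = sa.metadata.annotations or {}
    sa_has_irsa[(sa.metadata.namespace, sa.metadata.name)] = IRSA_ANNOTATION in annotations

for pod in core.list_pod_for_all_namespaces().items:
    sa_name = pod.spec.service_account_name or "default"
    if not sa_has_irsa.get((pod.metadata.namespace, sa_name), False):
        print(
            f"{pod.metadata.namespace}/{pod.metadata.name} "
            f"-> serviceaccount {sa_name} (no IRSA annotation)"
        )
```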
[28:51] Bart Farrell: Dilshan, what's next for you?
[28:53] Dilshan Wijesooriya: On the job side, a lot of my energy is going into making our Kubernetes platform a bit more boring, in the good way. That means hardening the foundation, things like identity, autoscaling, and observability, and reducing the amount of custom glue that developers have to think about. We are also looking more seriously at things like Pod Identity and where that fits in our environment, so I expect there will be more experiments there. And on the Kubernetes side, I like to keep sharing what I learn. The article about the AL2023 migration started as an internal note, and it was nice to see that it gained interest from other teams. So I want to do more things like that: more blog posts, maybe some talks or podcasts like this, especially around "we broke it, then we fixed it, and here's the honest story." That's usually where the most useful lessons are.
[30:02] Bart Farrell: And if people wanted to get in touch with you, what's the best way to do that?
[30:06] Dilshan Wijesooriya: The easiest way is LinkedIn. You can find me there by searching my name, and I also have my personal website, dilshanwijesooriya.me. Or you can just email me at info@dilshanwijesooriya.me.
[30:22] Bart Farrell: Fantastic. Thank you so much for joining us today and for sharing your experience. I hope our paths cross in the future. Take care.
[30:29] Dilshan Wijesooriya: Thank you. Bye.
[30:30] Bart Farrell: (instrumental music plays) You're tuned in to KubeFM.
