Scaling the Kubernetes Control Plane with EKS

Mar 30, 2026

Guest:

  • Alex Kestner

At scale, the Kubernetes control plane stops being invisible — and most teams only find out when things start failing.

Alex Kestner, Principal Product Manager for Amazon EKS, explains when the control plane becomes the bottleneck, how to read the signals early, and what pre-provisioned control plane capacity means for teams running at scale.

In this interview:

  • What signals tell a platform team they've outgrown the standard control plane — API server request latency, API priority and fairness metrics, and etcd database size

  • How pre-provisioned EKS control plane capacity works and why AI/ML workloads and disaster recovery are the top use cases

  • What EKS Auto Mode does and why Alex thinks the infrastructure powering Kubernetes will become increasingly invisible over the next decade

Transcription

Bart Farrell: So first things first, who are you? What's your role and where do you work?

Alex Kestner: Hi, I'm Alex Kestner. I'm a Principal Product Manager with the Amazon EKS team.

Bart Farrell: What three emerging Kubernetes tools are you keeping an eye on?

Alex Kestner: So I'm biased. I always keep an eye on the Karpenter project because it's something that I launched a few years ago. So whether that's the new features that are coming out through the project or the various ways that we're seeing it becoming more and more embedded in the community, it's one that I very closely keep track of. The others are more emerging proposals from various special interest groups in the broader Kubernetes community. Specifically the workload API or pod group API proposals that will help with large AI training workloads or batch processing, gang scheduling jobs. And then also the dynamic resource allocation proposals, which will be critical for how customers leverage all of the various kinds of accelerated infrastructure that's available.

Bart Farrell: Most developers don't think about the Kubernetes control plane. It just works. But when does it become a bottleneck?

Alex Kestner: So the control plane is absolutely meant to be just invisible and in the background for the vast majority of customers. But when you start to hit the scale of thousands of nodes in your clusters, tens of thousands of requests through the API server, the control plane becomes a real challenge to operate.

Bart Farrell: What signals tell a platform team they've outgrown the standard control plane?

Alex Kestner: So I think the one that's the most critical to watch is API server request latency. As that starts to grow, you're getting that initial signal that there's something that might be worth paying attention to. Other things I like to look at are around the API priority and fairness metrics. These are things that help you understand whether or not different kinds of requests are struggling to be completed effectively. Other obvious signals are around the etcd database size. If that's starting to get filled up, you know that you're going to have a problem relatively soon.
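The three signals Alex lists can be read from the API server's Prometheus-format `/metrics` output. What follows is a minimal sketch, not an EKS-provided tool: the sample values are invented, the 1-second latency cutoff and the 8 GiB etcd limit are assumptions chosen for illustration, and the storage metric name varies by Kubernetes version (`apiserver_storage_size_bytes` on recent releases), so check what your cluster actually exposes.

```python
import re

# Invented sample in Prometheus text format; label sets are illustrative only.
SAMPLE = """\
# HELP lines and blank lines are skipped by the parser
apiserver_request_duration_seconds_bucket{verb="LIST",le="0.5"} 900
apiserver_request_duration_seconds_bucket{verb="LIST",le="1"} 950
apiserver_request_duration_seconds_bucket{verb="LIST",le="+Inf"} 1000
apiserver_flowcontrol_rejected_requests_total{priority_level="workload-low",reason="queue-full"} 12
apiserver_flowcontrol_rejected_requests_total{priority_level="global-default",reason="time-out"} 3
apiserver_storage_size_bytes{storage_cluster_id="etcd-0"} 6442450944
"""

LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name, **wanted):
    """Sum every sample of `name` whose labels match all `wanted` pairs."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == val for key, val in wanted.items())
    )


def slow_request_fraction(samples, le):
    """Fraction of requests slower than the histogram bucket bound `le`."""
    name = "apiserver_request_duration_seconds_bucket"
    all_requests = total(samples, name, le="+Inf")
    fast_requests = total(samples, name, le=le)
    return 0.0 if all_requests == 0 else 1.0 - fast_requests / all_requests


def control_plane_signals(text, latency_le="1", etcd_limit_bytes=8 * 1024**3):
    """Summarize the three signals; both default thresholds are assumptions."""
    samples = parse_metrics(text)
    return {
        "slow_fraction": slow_request_fraction(samples, latency_le),
        "apf_rejected": total(samples, "apiserver_flowcontrol_rejected_requests_total"),
        "etcd_fill_ratio": total(samples, "apiserver_storage_size_bytes") / etcd_limit_bytes,
    }


if __name__ == "__main__":
    for signal, value in control_plane_signals(SAMPLE).items():
        print(f"{signal}: {value:.3f}")
```

The counters here are cumulative, so a real check would compare two scrapes (or use a rate in a Prometheus query) rather than read a single snapshot. The point of the sketch is only to show which metric families correspond to the signals named in the answer above.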

Bart Farrell: EKS recently announced the ability to pre-provision control plane capacity. What does that mean in practice and what changes for teams running at scale?

Alex Kestner: So the provisioned control plane is a key tool for customers that are operating at particularly large scale, so like a large number of nodes or really busy clusters where there are a lot of things coming and going. What this lets you do is set a baseline for the infrastructure that we use to power the EKS cluster control plane so that you always know you're going to have the capacity needed to meet the demand of your cluster at any time.

Bart Farrell: What use cases have you seen where teams realized they needed pre-provisioned control plane capacity?

Alex Kestner: So it's 2026. Of course, AI/ML is the top use case that we see this being a really critical capability for customers. But there are a lot more interesting ones that maybe you wouldn't think of as being really good candidates for provisioned control plane. The one that I think is the most interesting is disaster recovery. Let's say you're running an active-passive setup across regions through different clusters, and suddenly something has gone wrong and you need to flip over to that passive region. The amount of load on the control plane and the API server will really dramatically increase at that moment. Provisioned control planes will make sure that you have the capacity needed to complete that recovery successfully.

Bart Farrell: Okay, bit of a bonus question. You just had a fireside chat with Nana. Can you just give us a summary of the things you were speaking about?

Alex Kestner: So I spoke with Nana about EKS Auto Mode. Auto Mode is a feature that we launched at re:Invent in 2024 that dramatically simplifies the operational work required to run Kubernetes clusters at production scale. EKS has always offered a managed control plane for Kubernetes clusters. That's now complemented with EKS Auto Mode's managed data plane. So all of the things that are essential for running Kubernetes clusters, all of the controllers for compute, storage, and networking, and then also the instances themselves are now something that we can take responsibility for so that you can focus on shipping your applications.

Bart Farrell: Kubernetes turned 10 years old about two years ago. What should we expect in the next 10 years to come?

Alex Kestner: So I wish I could tell you that. If I did, I might have a bit of a different job. The reality is that I expect Kubernetes and particularly the infrastructure powering Kubernetes clusters to become more and more invisible and backgrounded as customers get to focus more on the applications and things that really delight their users.

Bart Farrell: What's next for you, Alex?

Alex Kestner: I'm here at KubeCon hearing all about the problems that customers are having, which is the most useful thing for me to hear as a product manager. Meeting customers, talking to them about how they're using EKS and the things that we could bring to them in the future to make their lives better.

Bart Farrell: How can people get in touch with you?

Alex Kestner: So LinkedIn is the best way to reach out. Search for Alex Kestner at AWS on LinkedIn.
