Kubernetes at scale: Emerging tools, AI, and invisible infrastructure
In this interview, Mike Stefaniak, Head of Product for Kubernetes and Registries at AWS, discusses:
Three emerging Kubernetes tools to watch: Kube Resource Orchestrator (KRO), Karpenter, and specialized schedulers like Volcano and Kueue.
Scaling AI/ML workloads on Kubernetes infrastructure: Technical challenges of running massive foundational models, including AWS's recent achievement of supporting 100,000-node EKS clusters.
The future of Kubernetes as an invisible platform: A vision where Kubernetes becomes a hidden infrastructure layer, with developers interacting through internal developer platforms that abstract complexity.
Relevant links
Transcription
Bart: So, first things first: Who are you? What's your role? And where do you work?
Mike: Sure. Thanks, Bart. My name is Mike Stefaniak. I currently work at Amazon Web Services (AWS). I lead the product team for Kubernetes at AWS, which includes Elastic Kubernetes Service, our cloud-hosted Kubernetes service, as well as EKS Anywhere, our on-premises distribution of Kubernetes. More recently, I also lead product for Amazon Elastic Container Registry, our OCI-compliant container registry service at Amazon. I really consider it the unsung hero of container services because it's hosting everybody's images and is the backbone of the rest of container services here at Amazon.
Bart: And what are three emerging Kubernetes tools that you're keeping an eye on?
Mike: There are a lot of tools coming out in Kubernetes, seemingly new ones every week. I have three in mind.
One is Kube Resource Orchestrator, or KRO for short. This is a tool AWS open sourced at KubeCon last year that helps manage and coordinate Kubernetes resources, such as deployments, services, config maps, and even resources representing cloud infrastructure. It acts as a conductor for your Kubernetes resources, understanding relationships, creating them in the correct order, verifying successful deployment, and handling rollbacks.
In a real-world example, if you need to deploy a full application with a database, an API server, and a web front end, the orchestrator handles the entire process: deploy the database, make sure it's ready, deploy the API server once the database is healthy, and then deploy the front end, all in the correct order. We've open sourced this tool, and other cloud providers have taken an interest and started working together on the project. We're even considering donating it to SIG Cloud Provider in the Kubernetes world.
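To make that ordering concrete, here is a small conceptual sketch in Python of a dependency-ordered rollout like the one Mike describes. It is purely illustrative: the resource names come from the hypothetical app in the example, and this is not KRO's actual API or manifest format.

```python
# Conceptual sketch of the ordering problem KRO solves: create resources in
# dependency order and wait for each to become ready before moving on.
# Illustrative Python only, not KRO's real API.
from graphlib import TopologicalSorter

# Each resource lists the resources it depends on (hypothetical app from the example).
dependencies = {
    "database": set(),
    "api-server": {"database"},
    "web-frontend": {"api-server"},
}

def deploy(name: str) -> None:
    # In a real controller this would create the Kubernetes objects
    # (Deployment, Service, ConfigMap, ...) for this component.
    print(f"deploying {name} ...")

def wait_until_ready(name: str) -> None:
    # Placeholder for a readiness check, e.g. availableReplicas == desired replicas.
    print(f"{name} is ready")

for resource in TopologicalSorter(dependencies).static_order():
    deploy(resource)
    wait_until_ready(resource)
# Prints the rollout in order: database, api-server, web-frontend.
```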
Another tool is Karpenter, which is close to becoming the new standard for compute management in Kubernetes. AWS open sourced this project a couple of years back to address challenges of traditional VM auto scaling. Instead of pre-provisioning multiple node groups, Karpenter consolidates everything into a single node pool with just-in-time provisioning. It evaluates workload requirements and cloud provider options to select the most performant and cost-effective compute resources.
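As a rough illustration of that just-in-time selection step, the following Python sketch picks the cheapest instance type that can fit a set of pending pods. The instance types and prices are made up, and Karpenter's real logic weighs many more signals (capacity type, architecture, availability, consolidation); this is only the core idea.

```python
# Conceptual sketch of Karpenter-style just-in-time provisioning: given the
# resource requests of pending pods, choose the cheapest instance type that fits.
# Instance names and prices are placeholders, not real AWS pricing.

pending_pods = [
    {"cpu": 2, "memory_gib": 4},
    {"cpu": 1, "memory_gib": 8},
]

instance_types = [
    {"name": "small",  "cpu": 2,  "memory_gib": 8,  "hourly_usd": 0.10},
    {"name": "medium", "cpu": 4,  "memory_gib": 16, "hourly_usd": 0.20},
    {"name": "large",  "cpu": 16, "memory_gib": 64, "hourly_usd": 0.80},
]

needed_cpu = sum(p["cpu"] for p in pending_pods)
needed_mem = sum(p["memory_gib"] for p in pending_pods)

# Keep only instance types that can hold all pending pods on one node,
# then take the cheapest option.
candidates = [
    t for t in instance_types
    if t["cpu"] >= needed_cpu and t["memory_gib"] >= needed_mem
]
best = min(candidates, key=lambda t: t["hourly_usd"])
print(f"launch one {best['name']} node for {needed_cpu} vCPU / {needed_mem} GiB")
```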
The third area we're exploring is Kubernetes schedulers. The built-in Kubernetes scheduler works well for traditional online web workloads, but new schedulers are emerging for specialized workloads. These include Volcano (a specialized job manager for complex batch processing), Kueue (a lightweight job queuing system), and YuniKorn (designed for migrating Hadoop YARN environments to Kubernetes). These new schedulers are emerging because Kubernetes is attracting workload types the default scheduler wasn't originally designed to handle.
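To give a feel for what a job-queuing layer adds on top of the default scheduler, here is a conceptual Python sketch of quota-based admission, which is roughly the idea behind a system like Kueue: jobs wait in a queue and are only admitted when enough shared quota is free. The numbers are illustrative and this is not Kueue's API.

```python
# Conceptual sketch of quota-based job admission: jobs queue up and are admitted
# only while the shared GPU quota has room. Illustrative only.
from collections import deque

gpu_quota = 8
gpu_in_use = 0
queue = deque([
    {"name": "train-a", "gpus": 4},
    {"name": "train-b", "gpus": 4},
    {"name": "train-c", "gpus": 2},
])

admitted = []
# Admit jobs in order while they still fit within the quota.
while queue and gpu_in_use + queue[0]["gpus"] <= gpu_quota:
    job = queue.popleft()
    gpu_in_use += job["gpus"]
    admitted.append(job["name"])

print("admitted:", admitted)                          # train-a, train-b
print("still queued:", [j["name"] for j in queue])    # train-c waits for quota to free up
```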
Bart: Now, our podcast guest John McBride said that Kubernetes is the platform of the future for AI and ML, particularly for scaling GPU compute. Do you agree with this assessment, and what challenges do you see in running AI workloads on Kubernetes?
Mike: Based on what I see from our customers, trends in the industry, and activity in the open source community, I would agree with that assessment. Over the last 10 to 15 years, we've seen patterns with rapidly evolving technology like AI, where the activity and innovation happens in open source. Contributions come from companies, both small and large, as well as research labs and hobbyists.
Kubernetes is a natural fit for AI workloads because of its open source roots, extensibility, and portability. Its core primitives are built to support varying AI workload patterns at massive scale. We see evidence of this with numerous Kubernetes-native projects for AI workloads and GPU infrastructure vendors investing in Kubernetes compatibility.
Today, you cannot release a new AI tool or be an AI company without Kubernetes support on day one. The Kubernetes community is embracing this trend, as demonstrated by the recently established Kubernetes AI conformance working group.
Bart: Let's dig into that. What infrastructure challenges come up when customers are training large models on Kubernetes at that kind of scale?
Mike: The scaling demands keep growing. We've seen firsthand the challenges of running some of the world's largest foundational models on EKS. Some challenges are unique to model training workload patterns, but it really comes down to the nature of longer-running distributed jobs and tuning the system to sustain performance as the workload reaches massive scale.
At AWS, one of the jobs of the EKS team is making sure we're bringing AWS infrastructure innovation to Kubernetes customers. This includes support for multiple network cards and CNI plugins, and support for Elastic Fabric Adapter to do OS-bypass networking between nodes. We've worked deeply with NVIDIA and EC2 on newer instance types to ensure they work in Kubernetes. If EC2 releases new NVIDIA-based instance types to handle larger scale and they don't work with Kubernetes, people simply won't use them.
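As a rough sketch of how that infrastructure surfaces to a workload, the pod spec below (expressed as a Python dict) requests both GPUs and an EFA interface. It assumes the NVIDIA device plugin and the AWS EFA device plugin are installed on the node; the extended resource names (`nvidia.com/gpu`, `vpc.amazonaws.com/efa`), quantities, and image are placeholders to verify against current AWS documentation.

```python
# Illustrative pod spec for a distributed training worker that requests GPUs
# plus an EFA interface for OS-bypass networking between nodes. Assumes the
# NVIDIA and AWS EFA device plugins are installed; names and values are placeholders.
import json

training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "distributed-training-worker"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "my-registry/trainer:latest",  # placeholder image
            "resources": {
                "limits": {
                    "nvidia.com/gpu": "8",         # GPUs on the node (placeholder count)
                    "vpc.amazonaws.com/efa": "1",  # EFA interface, name per the AWS EFA device plugin
                },
            },
        }],
    },
}

print(json.dumps(training_pod, indent=2))
```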
From the Kubernetes side, the control plane is probably the first bottleneck when running workloads at this scale. At the AWS New York Summit, we recently announced EKS support for clusters of up to 100,000 nodes, which is industry-leading scale. Significant work went into this, detailed in a deep-dive technical blog.
This included offloading the consensus layer of etcd to an internal Amazon service that allows horizontal scaling, various Kubernetes optimizations such as streaming list responses and improved cache hits for paginated reads, scheduler tuning for gang scheduling and preemption, and work in Karpenter to support higher scale.
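For a sense of what paginated reads look like in practice, here is a small example using the official Kubernetes Python client to list nodes in pages. It assumes kubeconfig access to a cluster, and the page size of 500 is arbitrary; at tens of thousands of nodes, how the API server serves these continuation requests (from its cache rather than etcd) becomes a real scaling concern.

```python
# List all nodes in pages of 500 using the official Kubernetes Python client
# (pip install kubernetes). Assumes a reachable cluster via kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

continue_token = None
total = 0
while True:
    # Each call returns at most `limit` items plus a continue token for the next page.
    resp = v1.list_node(limit=500, _continue=continue_token)
    total += len(resp.items)
    continue_token = resp.metadata._continue
    if not continue_token:
        break

print(f"listed {total} nodes in pages of 500")
```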
We've been pushing the boundaries of scale in Kubernetes and are proud of our recent announcement. However, the demands keep coming. Scheduling workloads across multiple clusters is probably next because, at some point, the scale demands become so high that even a single cluster is not big enough.
Bart: Multi-cluster is a challenge. We used to ask folks a cliché question: What's easier to learn, multi-cluster or surfing? People had significant doubts about how to answer that.
Mike: I've tried surfing three times and haven't made it up yet.
Bart: With GPU costs being so high, what strategies are most effective for maximizing GPU utilization in containerized environments?
Mike: Cost optimization and ROI for AI workloads are top of mind for nearly every customer. We expect GPU costs to normalize over time, but it's important to ensure customers have the right levers to control and optimize costs. This is a big effort for both the service team I own and the technical solutions architect community at AWS, who have developed best practices, guides, and workshops to help customers save money.
These strategies include optimizing hosted images by pre-installing drivers, plugins, and other components; the ECR and EKS teams have also been working together on optimizing image pull time. Other techniques involve loading models directly onto the GPU to bypass storage, and using GPU instance features like time slicing or Multi-Instance GPU (MIG) from NVIDIA to ensure workloads use just the resources they need.
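As one concrete example of GPU sharing, the sketch below shows the general shape of a time-slicing configuration for the NVIDIA Kubernetes device plugin, which advertises each physical GPU as several schedulable replicas. The field names are believed to match the plugin's sharing config, but treat them as an assumption and verify against NVIDIA's current documentation.

```python
# Conceptual sketch of GPU time slicing: configure the NVIDIA device plugin to
# advertise each physical GPU as multiple replicas so small workloads can share
# a card. Field names should be verified against NVIDIA's docs before use.
import json

time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                # One physical GPU is advertised as 4 schedulable nvidia.com/gpu units.
                {"name": "nvidia.com/gpu", "replicas": 4},
            ],
        },
    },
}

# Pods still request nvidia.com/gpu: 1, but several of them can now land on one
# card, which raises utilization for small or bursty inference workloads.
print(json.dumps(time_slicing_config, indent=2))
```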
Various technologies are emerging in GPUs to improve efficiency. For example, EKS recently released an auto-repair agent, which is crucial because broken GPU instances are very expensive. Quick identification and replacement of faulty hardware is key. We also recommend custom scheduling strategies and optimizing existing infrastructure investments, such as EKS's Hybrid Nodes feature, which allows connecting on-premises GPU infrastructure to a Kubernetes control plane in the cloud.
Cost optimization remains a critical focus for our customers, field teams, and services at AWS.
Bart: Mike, Kubernetes turned 10 years old last year. What should we expect in the next 10 years?
Mike: If I'm still actively managing a big Kubernetes product team in the next 10 years, I think I've failed. The future of Kubernetes isn't about exposing more complexity, it's about hiding it. I think we'll see widespread adoption of internal developer platforms that abstract Kubernetes entirely for application teams.
Developers aren't going to be interacting with things like deployments and stateful sets; they're going to be interacting with concepts like applications and databases. Self-healing, self-optimizing systems will become the norm. AI agents are going to handle troubleshooting. Kubernetes became popular because of its hands-off, self-healing approach, and that's going to become more prominent with AI agents becoming involved.
We're already seeing expansion in workload types with AI/ML, edge workloads, and global multi-region applications. People want to standardize all of their workload types on a single platform. The most significant shift will be that Kubernetes will become less visible, just like you don't really think about Linux anymore. It's just a layer in the stack.
My hope is that in 10 years, Kubernetes will be just another layer, powering applications while letting developers focus on delivering value.
Bart: What's next for you?
Mike: It's getting into busy season at AWS. There's a countdown clock when I walk into the office every day: a little over three months to go, around 100 days, until re:Invent, our annual conference. We have some pretty exciting features coming in the EKS and ECR teams, which will keep me and my team very busy over the next couple of months.
KubeCon is also coming up. I've been to every KubeCon in the last five or six years and certainly don't intend to miss this one. We tend to highlight our open source work at KubeCon, so a number of engineers from the EKS team and I will be at the conference. If you happen to see me there, please come say hi. In the short term, it'll be a busy couple of months for us here at AWS.
Bart: Great. I look forward to seeing you at KubeCon for sure and possibly at re:Invent. But if people want to get in touch with you, what's the best way to do so?
Mike: LinkedIn. I'm probably the most active there. Message me on LinkedIn. I'm pretty responsive. If you happen to be at KubeCon, come find me.
Bart: Perfect. Mike, thanks so much for your time today. Look forward to crossing paths with you soon. Take care.