Bart Farrell: So, first question, who are you, what's your role, and where do you work?
Tsahi Duek: I'm Tsahi Duek, a Principal Solutions Architect at AWS covering Amazon EKS and AI infrastructure, based out of London. I work as part of our go-to-market team, which is basically about helping customers use our services efficiently, gathering signals from customers and feeding them back into the product teams, and finding solutions to bridge the gap when things are not yet implemented or not yet on the roadmap.
Bart Farrell: And what three emerging Kubernetes tools are you keeping an eye on?
Tsahi Duek: In terms of Kubernetes tools, I would start with our own open source tooling, KRO and ACK. KRO and ACK bridge the gap when you want to architect your deployment so that it takes care of not only the Kubernetes resources but everything else around them. For example, we just had a Dynamo talk at the booth; Dynamo introduces some CRDs, and we might want to include them as part of our deployment. This is where KRO as a composition engine and ACK as an extended interface to the AWS API come in really handy. Second, in the AI infrastructure space, Kueue has been very popular for things like gang scheduling and placing pods in close proximity to each other, again because with GPUs we want to benefit from proximity and network topology. Third, I would say everything around autoscaling. It's not new, it's good old autoscaling, but tools like Cluster Autoscaler and Karpenter, which we open sourced and which is basically being used by the community. In the AI space, autoscaling helps you a lot with getting capacity, whether you want to leverage Spot Instances or other preemptible capacity across different providers, or use capacity reservations to have the capacity exactly when you need it.
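(Editor's note: gang scheduling, as mentioned above, means a distributed job is admitted only when all of its pods can start at once, so a half-placed job never wastes scarce GPUs. A toy Python sketch of that all-or-nothing admission logic; the names and greedy placement are illustrative only, not Kueue's actual API or algorithm.)

```python
# Toy sketch of gang (all-or-nothing) admission, the scheduling model Kueue
# applies to distributed training jobs. All names here are illustrative.

def admit_gang(job_pods, free_gpus_per_node):
    """Admit a job only if every pod in the gang can be placed at once.

    job_pods: list of GPU counts, one entry per pod in the gang.
    free_gpus_per_node: dict of node name -> free GPU count.
    Returns a pod->node placement, or None if the whole gang cannot fit.
    """
    free = dict(free_gpus_per_node)          # work on a copy
    placement = {}
    for i, gpus_needed in enumerate(job_pods):
        # Greedy: pick the node with the most free GPUs.
        node = max(free, key=free.get)
        if free[node] < gpus_needed:
            return None                      # one pod can't fit -> admit nothing
        free[node] -= gpus_needed
        placement[f"pod-{i}"] = node
    return placement

# A 4-pod job needing 2 GPUs each fits across two 4-GPU nodes...
print(admit_gang([2, 2, 2, 2], {"node-a": 4, "node-b": 4}))
# ...but an 8-pod gang is rejected outright instead of starting partially.
print(admit_gang([2] * 8, {"node-a": 4, "node-b": 4}))
```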
Bart Farrell: We're seeing a fundamental shift in how organizations build AI systems. Kubernetes has evolved from running stateless web services to becoming the unified platform for data processing, model training, inference, and now autonomous AI agents. What's driving this convergence, and why are so many organizations standardizing on this approach?
Tsahi Duek: I think the main reason is the investment that companies already have in managing complex, multiple Kubernetes clusters. Even if you just started with web services, with simple apps being deployed, you have a CI/CD mechanism, you have cluster autoscaling that handles your node provisioning, you have your observability system. In that sense, and it might be a simplification, deploying an AI agent or a model for inference is just another app. Of course it comes with its own complexity, but the benefit you get out of Kubernetes is that you already have the investment of running all of your apps, with observability and everything else. That's one aspect. The other is being able to shift workloads, or shift capacity, easily between types of workloads. For example, I work with a couple of customers that train their own AI models, but when the training ends, they need that capacity to be used for inference, whether for reinforcement learning or for serving those models to their customers. With Kubernetes, they can reuse the same infrastructure, deployed with the same configuration and the same device plugins for AI, and just shift it to a different type of use case or workload. So the standardization on existing investments, plus the flexibility to shift capacity between types of workloads, are the two main reasons customers keep investing in Kubernetes for the new AI era.
Bart Farrell: As teams build production AI systems, they're working with an entire ecosystem of open source frameworks: tools for ML pipelines, model serving, GPU scheduling, and distributed inference. Which of these frameworks are becoming essential, and how do they work together to support AI workloads?
Tsahi Duek: I'd like to start with something I think a colleague of mine coined. He called it RECON, R-E-C-O-N, which is basically a framework for building AI systems: R for routing, E for engine, C for caching, O for orchestration, and N for nodes. When you break it down, you have to think about the components you need to operate at each of these levels. For routing, for example, we have the inference extensions in the Kubernetes project, and you'd want to think about LLM routers, where you shift between the different model deployments you already have, or even simple things like load balancing between endpoints based on their GPU capacity and GPU utilization. Then there's the engine; the most common ones are vLLM, SGLang, and TensorRT-LLM, which users run to host and serve their models. For caching, there are different caching tools, like LMCache and those kinds of things. For orchestration, this is where tools like Ray, Ray Serve, and Ray Train, which we've seen a lot, are used to take those single model deployments and spread them across clusters. And of course, NVIDIA just announced that its open source Dynamo tooling has reached GA with the 1.0 release, which also lets you orchestrate different model endpoints across the Kubernetes cluster. Finally, when it comes to nodes, I already mentioned Karpenter and Cluster Autoscaler, things that get you the node capacity when you need it. I'd say that's the overall ecosystem of tooling needed to support AI agents and AI deployments.
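(Editor's note: the routing layer described above can be as simple as steering each request to the model replica with the most headroom. A minimal sketch of that idea; the endpoint URLs and utilization numbers are made up, and a real setup would scrape these metrics from the serving engines and likely use an LLM router or the Kubernetes inference extensions instead.)

```python
# Minimal sketch of utilization-aware routing across model endpoints,
# the "R" in RECON. Endpoints and metrics below are hypothetical.

def pick_endpoint(endpoints):
    """Route the next request to the replica with the lowest GPU utilization.

    endpoints: dict of URL -> current GPU utilization (0.0-1.0),
    e.g. as scraped from each serving engine's metrics endpoint.
    """
    return min(endpoints, key=endpoints.get)

replicas = {
    "http://vllm-0:8000": 0.92,   # nearly saturated
    "http://vllm-1:8000": 0.35,   # plenty of headroom
    "http://vllm-2:8000": 0.61,
}
print(pick_endpoint(replicas))    # routes to the least-loaded replica
```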
Bart Farrell: So we've moved from microservices to data-intensive AI, and now to autonomous AI agents with reasoning loops. What new challenges are organizations facing with these agentic workloads, especially around orchestration, state management, and security?
Tsahi Duek: I think the number one thing when it comes to AI agents is security. Given that these agents will execute actions on behalf of users, and sometimes even run code, you want the agent locked down and confined in a special environment. This is why advancements like the Agent Sandbox project in the community matter, as do services that let you lock down the agent deployment in an isolated environment so it can't breach the network, with other security measures out of the box. That's one thing about agents. The other is that the whole AI agent space is shifting toward understanding which services you want to consume from providers like AWS and which you want to deploy in your clusters: things like memory, short-term and long-term storage for memory, caching requests and responses from those agents, and also controlling what those agents communicate with. You don't want to give an agent broad access to your systems. You want to confine it so it only gets security tokens for a subset of services. So you might model your AI agents so that each agent can interact with one specific system, or a handful of systems, rather than giving it access to your whole AWS account or infrastructure. Otherwise, when an agent goes loose, you might find your S3 bucket has been deleted by mistake.
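(Editor's note: the least-privilege idea above, where each agent is confined to one system or a small set of systems, can be enforced with an allowlist gate in front of every tool call. A hypothetical sketch; the agent and tool names are invented for illustration, and production systems would enforce this with scoped credentials such as IAM policies rather than in-process checks alone.)

```python
# Sketch of least-privilege tool access for an AI agent: every action the
# agent proposes is checked against a per-agent allowlist before it runs.
# Agent and tool names are invented for illustration.

ALLOWED_TOOLS = {
    "billing-agent": {"read_invoice", "create_credit_note"},
    "support-agent": {"read_ticket", "post_reply"},
}

class ToolDenied(Exception):
    pass

def execute_tool(agent, tool, action):
    """Run `action` only if `tool` is allowlisted for `agent`."""
    if tool not in ALLOWED_TOOLS.get(agent, set()):
        raise ToolDenied(f"{agent} may not call {tool}")
    return action()

# The support agent can read tickets...
print(execute_tool("support-agent", "read_ticket", lambda: "ticket #42"))
# ...but deleting an S3 bucket is simply not in its allowlist.
try:
    execute_tool("support-agent", "delete_s3_bucket", lambda: "gone")
except ToolDenied as err:
    print("blocked:", err)
```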
Bart Farrell: Given everything we've discussed, distributing training, GPU optimization, multi-cluster orchestration, where should teams start? What's the most practical path for organizations building production AI systems on Kubernetes?
Tsahi Duek: We work a lot in that space at AWS, so I can speak to AWS: we have an open source project called AI on EKS. You can search for it; it's a GitHub project alongside a website with a collection of infrastructure templates to get started, to get your infrastructure set up with everything you need. We call them an inference-ready cluster for inference and a training-ready cluster for training. On top of that, we have what we call blueprints for deployments. We have inference charts where you can easily deploy vLLM, or vLLM with Ray Serve, with any open-weights model like Qwen3 or Llama 7B. Or, if you want to get really specific, we have implementations with NVIDIA Dynamo, where you reuse KV cache across different machines. The last piece is what we usually call optimization techniques. This is where you really want to optimize cold starts for your model: you don't want to wait minutes for the model weights to be pulled from wherever you store them, and you want to speed up your container image startup time. You want to benchmark your model and compare it across different GPU types and different models that might satisfy your business requirements. There's also GPU utilization: I might want to take a single GPU and shard it so I can fit more workloads on it, because not every workload runs constantly, so I might want to fit different types of incoming requests. This is where things like time slicing, or MIG, which partitions a single GPU into different-sized slices of GPU memory, together with concepts like DRA, dynamic resource allocation, again coming from the Kubernetes community, are really helpful for getting started.
So the journey we try to simplify for customers is: get the infrastructure set up, then bring the actual workload, the model deployment and any business logic you want to add on top of it, and then treat it as a continuous loop where you keep optimizing, going back to the infrastructure and the model deployment, and iterate over time.
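(Editor's note: the GPU-sharing techniques above, time slicing and MIG, come down to fitting several smaller workloads into one card's memory. A back-of-the-envelope sketch of that sizing; the memory figures are illustrative assumptions, not measured numbers, and real sizing depends on batch size, sequence length, and quantization.)

```python
# Back-of-the-envelope sizing for GPU partitioning (the MIG/time-slicing
# idea above): how many model replicas fit on one card? Numbers below are
# illustrative assumptions, not measurements.

def replicas_per_gpu(gpu_memory_gb, weights_gb, kv_cache_gb, overhead_gb=1.0):
    """Each replica needs its weights, KV-cache headroom, and runtime overhead."""
    per_replica = weights_gb + kv_cache_gb + overhead_gb
    return int(gpu_memory_gb // per_replica)

# A 7B model in fp16 is roughly 14 GB of weights; assume 4 GB of KV cache
# and 1 GB of runtime overhead per replica (19 GB total each).
print(replicas_per_gpu(80, weights_gb=14, kv_cache_gb=4))  # 80 GB card -> 4
print(replicas_per_gpu(24, weights_gb=14, kv_cache_gb=4))  # 24 GB card -> 1
```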
Bart Farrell: Kubernetes turned 10 years old about two years ago. What should we expect in the next 10 years?
Tsahi Duek: Oh wow, that's a really challenging question. I think AI will shape everything Kubernetes has to offer. Maybe we'll see AI reasoning tooling for Kubernetes: where we're used to metrics, observability, and scheduling being hard-coded parts of a predefined system, we might see more LLM and AI tooling being used to make cluster-wide decisions. The other thing is that I think Kubernetes will become an implementation detail. Kelsey Hightower used to make this point, and the further we go, the more sense it makes to me. It will be so incorporated, like TCP/IP or any other protocol we use, that it will sit on the back end of services. It will be even more important than it is today, but it will be folded into every deployment out there.
Bart Farrell: What's next for you, Tsahi?
Tsahi Duek: For the last year and a half I've been focusing more on infrastructure, really getting infrastructure optimized specifically for model deployments and model training. That's where I'm keeping my edge: continuing to advance in that space of GPU consumption, GPU utilization, and how to make all of this efficient across different deployments and different types of workloads.
Bart Farrell: And if people want to get in touch with you, what's the best way to do that?
Tsahi Duek: I think LinkedIn. I'm pretty active on LinkedIn and respond pretty fast, so you can reach me there, or you can try me on the Kubernetes Slack organization. Same name, pretty easy to find.