Kubernetes Tools for AI/ML Workloads

Feb 27, 2026

Guest:

  • Nick Eberts

Multi-cluster scheduling, gang scheduling, dynamic resource allocation — Kubernetes is evolving fast, but which tools actually matter for AI/ML workloads?

Nick Eberts, PM on the GKE team at Google, shares the three emerging Kubernetes technologies he's watching most closely and explains why they're critical for the next wave of AI infrastructure.

In this interview:

  • Why topology-aware scheduling (TAS) and gang scheduling are moving upstream into Kubernetes

  • How dynamic resource allocation (DRA) goes beyond CPU and memory to manage GPUs, InfiniBand, and specialized hardware

  • Why AI agents aren't ready to autonomously control clusters — and what guardrails are needed around MCP servers and agent-to-agent communication

Nick makes a clear case: AI tools should assist, not act — at least until the ecosystem catches up.

Transcription

Bart Farrell: Who are you? What's your role? And where do you work?

Nick Eberts: I'm Nick Eberts. I'm a PM on the GKE team, and I work for Google.

Bart Farrell: And a Phish fan.

Nick Eberts: And a Phish fan.

Bart Farrell: Kind of a big deal, right? So, Nick, what are three emerging Kubernetes tools that you are keeping an eye on?

Nick Eberts: I do a lot of work in the multi-cluster world, so I'm keenly aware of, and looking forward to, Multi-Kueue's integration with ClusterProfile. It's a bit specific, but the idea with cluster profiles is that I can have one list to rule them all and not have to manage 15 different lists. We're bringing that feature to Multi-Kueue, so, for example, you can use the same cluster list and authentication plugin to manage both Argo and Multi-Kueue on a centralized cluster. So that's super cool. Pumped for that. Another thing that I'm pretty excited about is in the AI/ML space, and the problem is kind of fun: gang scheduling, and moving it into upstream Kubernetes. The work is called topology-aware scheduling, so if you see the TAS acronym floating around upstream, it's a pretty cool project in which they're trying to figure out how to deal with situations where you need to schedule massive amounts of work, but you might not have the computers to do it. And if you don't have the computers to do it, you probably shouldn't schedule any of it at all. So they're implementing that in upstream Kubernetes. It's already sort of in Kueue today, but they're formalizing it and moving it upstream. So that's going to be pretty cool.
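As a rough illustration of what the topology-aware scheduling Nick describes looks like in Kueue today: a `Topology` object describes the datacenter hierarchy, a `ResourceFlavor` points at it, and a Job annotation asks Kueue to place the whole gang within one rack. This is a minimal sketch against Kueue's alpha TAS API; the topology label names, images, and queue names are illustrative, so check the Kueue docs for your version.

```yaml
# Sketch of Kueue topology-aware scheduling (alpha API; names illustrative).
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: dc-topology
spec:
  levels:                          # hierarchy, from widest to narrowest
  - nodeLabel: example.com/block
  - nodeLabel: example.com/rack
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
spec:
  topologyName: dc-topology        # ties this flavor to the topology above
---
apiVersion: batch/v1
kind: Job
metadata:
  name: training
  labels:
    kueue.x-k8s.io/queue-name: team-queue   # hypothetical LocalQueue
spec:
  suspend: true                    # Kueue admits the Job when capacity fits
  parallelism: 16
  completions: 16
  template:
    metadata:
      annotations:
        # ask Kueue to place the entire gang inside a single rack
        kueue.x-k8s.io/podset-required-topology: example.com/rack
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # illustrative image
        resources:
          requests:
            nvidia.com/gpu: "1"
```

Because the Job is suspended until Kueue finds a placement that satisfies the topology constraint, the "don't schedule it at all if you don't have the computers" behavior falls out naturally: either the whole gang fits in one rack, or nothing starts.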

Bart Farrell: Now, our podcast guest, Fabián, thinks that dynamic resource allocation or DRA is fascinating in a post-AI, post-ML world and believes that it's crucial to keep an eye on how we can better schedule these types of resources. How do you see DRA evolving for AI/ML workloads?

Nick Eberts: AI/ML seemed to be the catalyst for DRA, which is an acknowledgement that, hey, we probably need to deal with more than just CPU, memory, and volumes in Kubernetes. We have to deal with specialized hardware. We have to deal with InfiniBand networking. There are all kinds of drivers that need to be installed and managed, and we want to be able to take all of the Kubernetes goodness and use it to manage other types of resources. So for AI/ML specifically, it's going to make sharing GPUs, and using the topology-aware scheduling that I mentioned earlier, a lot more efficient. But it will also benefit whatever comes next, whatever technology comes about that you might need to take into account while you're scheduling the pod, right? So I think it's going to benefit AI/ML first, but eventually it'll benefit all the workloads in the world.
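To make the DRA idea concrete, here is a minimal sketch of how a pod claims a device through dynamic resource allocation rather than through the classic `nvidia.com/gpu` resource counter: a `ResourceClaimTemplate` requests a device from a class, and the pod references the claim. The API shown is the `resource.k8s.io/v1beta1` version; the device class name and image are assumptions, and the exact group/version depends on your Kubernetes release.

```yaml
# Sketch of DRA usage (resource.k8s.io/v1beta1; names illustrative).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # hypothetical DeviceClass from a driver
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # illustrative image
    resources:
      claims:
      - name: gpu            # container consumes the claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

The point of the indirection is exactly what Nick describes: the driver, not the kubelet, decides what a "device" is, so the same machinery can model GPU partitions, NICs, or whatever hardware comes next.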

Bart Farrell: Our podcast guest, Mai, stated that AI for Kubernetes operations is amazing, but not ready to be deployed in production without guardrails. What guardrails do you think are necessary for AI tools?

Nick Eberts: This is really important. What we're talking about is troubleshooting and making changes in your clusters and your workloads, and I don't think AI agents are to be trusted with that yet. I do think they provide value, though. Using an AI assistant to troubleshoot in a read-only way, to give you hints as to what's going wrong, or even to tell you exactly what's going on, is great. But I still think we're in the mode where you should have a human review it and make those changes, if there are actually any changes to be made. So maybe the AI agent's place in life, if you're in a world of GitOps, is making pull requests or merge requests, and then human beings review and approve them before they're implemented in production. I don't think we're ready for AI agents to autonomously control clusters right now and fix them when they think they're broken. So, not there yet. And then in the agent space, where you have a scenario in which you want to host these agents for your operations team or maybe your developers, there's some work to be done in securing agent-to-agent communication, as well as the communication between a client agent and MCP servers that may be running inside or outside the cluster. And there's work to be done on extending capabilities that exist now, like Istio authorization policies, further up the stack into agent space, so you can start to control which MCP servers you have access to and what kinds of questions you can ask them.
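One guardrail in the spirit of what Nick describes can already be expressed with today's Istio primitives: a mesh `AuthorizationPolicy` that only lets a specific agent identity reach an in-cluster MCP server, and only with read-only HTTP methods. This is a hedged sketch, not a complete answer to his point about agent-aware policy; the namespace, labels, and service account names are all hypothetical.

```yaml
# Sketch: restrict which workload identities may call an MCP server,
# and limit them to read-only requests (Istio AuthorizationPolicy).
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: mcp-server-readonly
  namespace: agents                # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: mcp-server              # hypothetical label on the MCP server pods
  action: ALLOW
  rules:
  - from:
    - source:
        # only the assistant agent's service account identity may connect
        principals: ["cluster.local/ns/agents/sa/assistant-agent"]
    to:
    - operation:
        methods: ["GET"]           # read-only: no mutating calls
```

As Nick notes, this operates at the workload and HTTP layer; deciding which *tools or questions* an agent may invoke on an MCP server still needs policy further up the stack.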

Bart Farrell: Whether it's AI workloads or anything else we can expect in 2026, what are you looking forward to working on next year?

Nick Eberts: So it's fun. I'm in a unique position in which you asked me, at the front end of this, what I'm most excited about, and I was like, oh, these two cool technologies, TAS and Multi-Kueue. And I think that's what I'm going to be focusing on for the next year. So I'm pretty pumped to continue working on multi-cluster problems, multi-cluster scheduling problems, and gang scheduling problems. And then I might dip my toe into reinforcement learning, because why not?

Bart Farrell: If people want to get in touch with you, what's the best way to do that?

Nick Eberts: Best way to get me? Go to a Phish show in the Southeast. But for real, I'm on all the social things. You can feel free to DM me. I'm always happy to help if I can. LinkedIn, Twitter, Bluesky, what have you.

Bart Farrell: If folks are looking for some good music, when will the DadBeats be performing next?

Nick Eberts: Oh my God. You know, it's kind of a big deal. We are performing for my kids. So I have two daughters, and we're performing at their teachers' faculty party. And one of the songs I have to learn is Hey Ya! by OutKast. There's nothing that's going to be more cringe than me singing Hey Ya! at a faculty party. Just read the lyrics. It's going to be fantastic.

Bart Farrell: How much do we have to pay to get the bootleg video of that, Nick?

Nick Eberts: I'll put it out there for free, man.

Bart Farrell: Cool. Well, we got DadBeats, we got Kubernetes, we got MCP, we got all kinds of stuff. Nick, always great to catch up with you. Take care. Have a good one.

Nick Eberts: All right. Take it easy, Bart. Thank you.
