AI agents and Kubernetes: from troubleshooting to infrastructure

Dec 10, 2025

Guest:

  • Mike Stefaniak

This interview examines the intersection of AI tooling and Kubernetes operations, as well as emerging trends that are shaping the container orchestration landscape.

In this interview, Mike Stefaniak, Head of Product for Kubernetes and Registries at AWS, discusses:

  • Enhanced security and open source trust as differentiators, the resurgence of service mesh communication for multi-agent cluster environments, and the need for fine-grained authorization to support AI agents taking automated actions

  • AWS's open-source MCP server for context-aware cluster insights, the current limitation of read-only AI operations due to security concerns and hallucination risks

  • A vision where clusters become invisible infrastructure, similar to how Linux faded into the background, with platforms handling cluster management so developers can focus solely on application deployment

Transcription

Bart: So, first things first: Who are you? What's your role? And where do you work?

Mike: My name is Mike Stefaniak. I'm a Senior Manager of Product Management at AWS. I cover EKS, Elastic Kubernetes Service, and ECR, Elastic Container Registry, which I consider the unsung hero of container services. Every time you're pulling an image in your cluster, you're using ECR. These are the two services I cover, as well as various open source projects that we contribute to.

Bart: What are three emerging Kubernetes tools or trends that you're keeping an eye on?

Mike: I've noticed three key trends at KubeCon. First, security and open source trust will be differentiators, especially when anybody with a credit card and an IDE can write code. The projects that are trusted and secure will stand apart from those that anyone can put together in a night.

Second, service mesh communication is starting to make a comeback. It seems to ebb and flow, but the fact remains that you might have multiple agents running in a cluster that need to communicate with each other. I've seen some projects and talked to people discussing how to route models and find the best way to do that.

Third, when you have MCP tools and agents taking action, you need more fine-grained authorization in Kubernetes and its supporting projects. I really don't think the current level of authorization in Kubernetes and its supporting projects is good enough to constrain the actions an agent might take.
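The granularity gap Mike describes can be illustrated with a minimal sketch. Kubernetes RBAC authorizes at the verb-and-resource level (get, list, delete on pods, deployments, and so on), so the coarsest safe policy for an agent today is an allowlist of non-mutating verbs. The function and action tuples below are hypothetical, not part of any AWS or Kubernetes API:

```python
# Hypothetical sketch: a policy gate that restricts an agent to read-only
# Kubernetes verbs. This mirrors the verb/resource granularity RBAC offers
# today; finer-grained intent-level policies ("may restart, but never delete,
# deployments during an incident") have no native equivalent yet.

READ_ONLY_VERBS = {"get", "list", "watch"}

def is_action_allowed(verb: str, resource: str, namespace: str) -> bool:
    """Permit only non-mutating verbs, regardless of resource or namespace."""
    return verb.lower() in READ_ONLY_VERBS

# Actions an agent might propose, filtered before anything reaches the API server:
proposed = [
    ("list", "pods", "prod"),
    ("delete", "pods", "prod"),
    ("get", "events", "prod"),
]
allowed = [action for action in proposed if is_action_allowed(*action)]
```

Note that the gate cannot distinguish a harmless `delete` (an experiment pod) from a destructive one; that distinction is exactly the fine-grained authorization Mike argues is still missing.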

Bart: How might an AI assistant or language model tool get live, context-aware information from a Kubernetes cluster? For example, retrieving information about pods, events, and metrics. What are the technical challenges in building such a system?

Mike: At EKS, we open-sourced our MCP server back in April as a preview in AWS Labs, which you can take and run yourself. We've learned a lot and received significant feedback. However, the current version only has access to local troubleshooting.

When trying to debug an issue, it might be a problem in the cluster or a supporting service. Perhaps a different dependency has gone down, or the issue isn't active within the cluster itself. You have to follow a chain of actions to figure out what's happening.

We're working on hosting the MCP server to provide access to internal metrics that support agents and teams might have—information not available when running it locally. Our goal is to create an AWS troubleshooting agent that provides comprehensive visibility, giving you not just local cluster insights, but the entire AWS ecosystem.

We've started this initiative, but we still have substantial work ahead to make it easier to discover the data sources needed for effective troubleshooting.

Bart: What are the trade-offs in security, consistency, and control when giving an assistant the ability to manage Kubernetes resources, creating, updating, deleting, rather than just observing them?

Mike: For the most part, from users and customers I've talked to at this conference, everybody's in read-only mode. I haven't talked to anybody who's completely hands-off and willing to let the agent troubleshoot and take immediate action. Customers have agents that take the first crack: a ticket gets filed, the agent does an initial investigation, gives its synopsis, maybe sends a Slack message, and then a human takes over.

They're very much in a read-only production state. This is likely because of the security challenges I mentioned before—things aren't fine-grained enough. Many users and customers are hesitant to let AI take over. Hallucinations are another challenge that isn't fully solved yet. I've heard some techniques here that people are working on, but for the most part, we're at a read-only phase.

I think getting better at removing hallucinations and creating more fine-grained security will be the bar that has to be hit before people start using these tools and taking mutating actions.
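The human-in-the-loop pattern described above (agent investigates, drafts a synopsis, notifies a human, takes no mutating action) can be sketched as a small pipeline. All function and ticket names here are illustrative assumptions, not a real AWS or EKS API:

```python
# Hypothetical sketch of the read-only triage pattern: the agent does the
# first pass and reports, but never mutates cluster state itself.

def triage(ticket: str, investigate, notify) -> dict:
    """Agent runs a read-only investigation and hands findings to a human."""
    findings = investigate(ticket)          # read-only cluster queries only
    notify(f"Ticket {ticket}: {findings}")  # e.g. a Slack message to the on-call
    return {"ticket": ticket, "findings": findings, "action_taken": None}

# Stand-ins for a real investigator and notifier:
messages = []
result = triage(
    "INC-123",
    lambda t: "pod stuck in CrashLoopBackOff",
    messages.append,
)
```

The key design point is that `action_taken` stays empty by construction: any remediation happens outside the agent, after a human reads the synopsis.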

Bart: In a multi-cluster, multi-environment setup, how could you design a unified interface so that tooling or humans can operate across clusters in a consistent way, without requiring custom scripting or multiple Kubernetes configurations? What patterns or abstractions would help?

Mike: The local version we released of our MCP server doesn't solve that problem because it requires the user to set up their kube config and point to particular clusters. It's a starting point where we've learned a lot, but we knew it wasn't the solution we actually wanted to build. Once we host and run it in AWS, we're going to have context on our side of all the clusters you have across accounts and regions. The difference is between running locally, where you have to set up everything yourself, versus AWS hosting an agent and MCP server. This is when it becomes really powerful to troubleshoot across your entire fleet, not just a single cluster.
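The local-versus-hosted distinction Mike draws can be sketched abstractly: locally, you must enumerate kubeconfig contexts yourself and run the same query per context; a hosted service can discover the fleet across accounts and regions for you. The cluster names and `fetch_pods` helper below are purely illustrative:

```python
# Hypothetical sketch of a unified fleet query: the same read-only
# operation applied across every known cluster context, so neither a
# human nor an agent needs per-cluster scripting.

def troubleshoot_fleet(contexts, fetch_pods) -> dict:
    """Run one read-only query against every cluster context and collect results."""
    return {ctx: fetch_pods(ctx) for ctx in contexts}

# Stand-in for real per-context API clients, keyed by context name:
fake_api = {
    "prod-us-east-1": ["web-1", "web-2"],
    "prod-eu-west-1": ["web-3"],
}
report = troubleshoot_fleet(fake_api, lambda ctx: fake_api[ctx])
```

In the local setup, the `contexts` list is whatever the user wired into their kubeconfig; in the hosted setup Mike describes, the provider supplies it, which is what makes fleet-wide troubleshooting possible.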

Bart: Kubernetes turned 10 years old last year. What should we expect in the next 10 years?

Mike: I personally hope that we're not talking about clusters anymore in 10 years. Just as Linux has faded into the background, you don't really talk about Linux anymore when creating Kubernetes clusters, even though they're most often running Linux. I hope Kubernetes just becomes the next layer in the stack. In EKS, we're thinking about what that next layer is: can you just bring us your application? We'll go figure out the clusters to run it for you. Clusters are not something I want our end customers having to think about in the next 10 years.

Bart: What's next for you, Mike?

Mike: re:Invent is two weeks away. KubeCon is interesting timing for us at AWS. When I'm not talking to people at the conference, I'm going back and checking my laptop to make sure our launches are on track. It'll be a busy next week leading up to re:Invent.

Bart: And how can people get in touch with you?

Mike: I would say LinkedIn is one way, and that's probably the easiest. The other option is our open source containers roadmap. I check every single comment and issue opened on there every morning. Another way you could get in contact is by opening issues and leaving your feedback on EKS or other container services.