AI-Powered Incident Response on Kubernetes

Mar 31, 2026

Guest:

  • Greg Eppel

When a Kubernetes incident hits, how long does it take your team to trace the root cause across metrics, logs, traces, pipelines, and pull requests?

Greg Eppel, Principal GenAI Specialist at AWS, explains how the AWS DevOps Agent acts as an agentic orchestrator that builds a knowledge graph of your entire system — and uses it to automatically accelerate root cause analysis.

In this interview:

  • How OpenTelemetry, Cilium with Hubble, and Karpenter work together for deep Kubernetes observability

  • What AWS DevOps Agent actually does — topology mapping, signal correlation, and code-to-cloud investigation

  • The proactive mode: detecting recurring patterns across incidents to improve long-term operational excellence

The future of Kubernetes operations isn't humans reading dashboards — it's agents doing the work and bringing humans in only when needed.

Transcription

Bart Farrell: So who are you, what's your role, and where do you work?

Greg Eppel: So, I'm Greg Eppel. I work at AWS, and I basically run the go-to-market for AWS DevOps Agent. I've been with the company about 10 years, and I was a principal SA before that, focused a lot on observability and cloud operations.

Bart Farrell: Fantastic. What are three emerging Kubernetes tools that you're keeping an eye on?

Greg Eppel: I would probably say the number one is OpenTelemetry. I think it's just a really important piece of observability. It's finally bringing together metrics, logs, and traces into one solution. DevOps Agent uses a lot of that in terms of root cause analysis. Probably the second one would be Cilium with Hubble. I just think that deeper insight into the actual runtime is really important in terms of observability. And then probably the third would be Karpenter. Once you've actually observed what's gone wrong in your infrastructure, in your Kubernetes environment, you have to get that system up and running as quickly as possible. And so I think those three things just tie together really naturally in terms of operations and responding to incidents.

Bart Farrell: What is AWS DevOps Agent and why does it matter for Kubernetes operations?

Greg Eppel: So, AWS DevOps Agent: think of it as an observability orchestrator. It's building a topology, really a knowledge graph. And if you think about Kubernetes, there's a lot of complexity within Kubernetes, a lot of service-to-service and pod communication. DevOps Agent itself is able to take a massive number of relationships, graph them together, and really accelerate root cause analysis. So it's really important for very distributed systems and highly ephemeral workloads.

Bart Farrell: How does AWS DevOps Agent handle incident investigation in containerized environments?

Greg Eppel: So, like I was saying, DevOps Agent is really an orchestrator, an agentic orchestrator across a bunch of tools. But the fundamental piece is building that topology and that knowledge graph. And so in a containerized environment, when you have a lot of ephemeral resources and a lot of changes happening at different levels, DevOps Agent is able to stitch those relationships together and, using that graph, accelerate the root cause analysis. It's very time consuming for a human to look at metrics and logs and traces and the different events happening within Kubernetes, and we're able to use DevOps Agent to accelerate that. So it's establishing those relationships, traversing that graph, and getting to root cause much quicker than a human ever could.
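
As a rough illustration of the idea of traversing a topology graph toward a root cause, here is a minimal Python sketch. It is not AWS DevOps Agent's implementation: the resource names, the `TOPOLOGY` and `ANOMALIES` dictionaries, and the `find_suspects` function are all hypothetical, and a plain breadth-first search stands in for whatever traversal the agent actually performs.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources it depends on.
TOPOLOGY = {
    "service/checkout": ["deployment/checkout"],
    "deployment/checkout": ["pod/checkout-7d9f", "pod/checkout-2b1c"],
    "pod/checkout-7d9f": ["node/ip-10-0-1-12"],
    "pod/checkout-2b1c": ["node/ip-10-0-2-34"],
    "node/ip-10-0-1-12": [],
    "node/ip-10-0-2-34": [],
}

# Hypothetical signals (from metrics, logs, or events) keyed by resource.
ANOMALIES = {
    "pod/checkout-7d9f": "OOMKilled 3 times in 10 minutes",
    "node/ip-10-0-1-12": "memory pressure",
}


def find_suspects(start, topology, anomalies):
    """Walk the dependency graph breadth-first from the alerting resource.

    Returns (hops, resource, signal) tuples for every reachable resource
    that has an anomaly attached, in order of distance from the start.
    """
    seen = {start}
    queue = deque([(start, 0)])
    suspects = []
    while queue:
        resource, hops = queue.popleft()
        if resource in anomalies:
            suspects.append((hops, resource, anomalies[resource]))
        for dependency in topology.get(resource, []):
            if dependency not in seen:
                seen.add(dependency)
                queue.append((dependency, hops + 1))
    return suspects


if __name__ == "__main__":
    for hops, resource, signal in find_suspects("service/checkout", TOPOLOGY, ANOMALIES):
        print(f"{hops} hops away: {resource} ({signal})")
```

Starting from the alerting service, the walk surfaces the anomalous pod and the node underneath it, which is the kind of relationship a human would otherwise piece together by hand across several tools.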

Bart Farrell: What integrations does AWS DevOps Agent support for DevOps toolchains?

Greg Eppel: So, we support a variety of DevOps tools within the toolchain. Right out of the box today, we support Dynatrace, Splunk, Datadog, and New Relic. In terms of communication channels, we hook into things like Slack, so you can get updates within Slack around your investigations. We also integrate directly into ServiceNow, and so we'll keep tickets up to date, or we can launch an investigation based off a ticket. We also look at your CI/CD pipelines and code. We hook up to GitHub and GitLab. And what that basically means is we can do a code-to-cloud investigation. We can look at your Kubernetes cluster running in prod, trace that back through the pipeline to the actual pull request, and make that correlation across the whole SDLC. So it's a pretty powerful thing. It's not just looking at what's in prod, it's going all the way back to the code that was deployed to production. Obviously, we're going to add more integrations, but what we also give customers is the ability to bring an MCP server. So we have a series of integrations for the CI/CD toolchain, but if we don't support something that you have in your toolbox, you can bring that by MCP and bring that context directly into DevOps Agent.
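
To make the code-to-cloud correlation described above a little more concrete, here is a small Python sketch that walks from a running container image back to a pull request. Everything in it is an assumption made for illustration: the `PipelineRun` and `PullRequest` record shapes, the `trace_to_pull_request` function, and the sample digests and commit SHAs. It does not call GitHub, GitLab, or any AWS API, and it is not how DevOps Agent performs the lookup.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical record of a CI/CD run that built and pushed an image."""
    image_digest: str
    commit_sha: str


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record of a merged pull request."""
    number: int
    title: str
    merge_commit_sha: str


def trace_to_pull_request(
    image_digest: str,
    runs: Iterable[PipelineRun],
    pull_requests: Iterable[PullRequest],
) -> Optional[PullRequest]:
    """Follow image digest -> pipeline run -> commit -> pull request."""
    commit = next(
        (run.commit_sha for run in runs if run.image_digest == image_digest),
        None,
    )
    if commit is None:
        return None  # the image was not built by any known pipeline run
    return next(
        (pr for pr in pull_requests if pr.merge_commit_sha == commit),
        None,
    )


# Made-up sample data standing in for pipeline and repository history.
RUNS = [
    PipelineRun(image_digest="sha256:aaa111", commit_sha="c0ffee1"),
    PipelineRun(image_digest="sha256:bbb222", commit_sha="deadbe2"),
]
PULL_REQUESTS = [
    PullRequest(number=101, title="Raise cache size", merge_commit_sha="c0ffee1"),
    PullRequest(number=102, title="Lower memory limit", merge_commit_sha="deadbe2"),
]

if __name__ == "__main__":
    # Digest of the image running in the failing pod (hypothetical).
    print(trace_to_pull_request("sha256:bbb222", RUNS, PULL_REQUESTS))
```

The point of the sketch is only the chain of joins: each hop uses an identifier that the previous system already records, which is what makes the correlation across the SDLC possible.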

Bart Farrell: And how does AWS DevOps Agent prevent future incidents proactively?

Greg Eppel: So, this is really the second capability within DevOps Agent. Outside of doing an investigation for a particular incident, we look at all the investigations that have happened over the past week. We do this automatically, or you can run this process ad hoc. And we're basically looking for patterns. We're looking for things that keep happening. Maybe there are resource contention issues in your EKS cluster that keep happening on a recurring basis. There's a pattern there. And so that preventive capability is looking at those patterns and then giving you a longer-term operational improvement that you should make to increase resiliency and performance over time. So it's not just looking at point-in-time issues, it's looking at those patterns over a period of time, and then giving you the recommendations, the prescription, on how you actually improve your operational excellence.
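
The pattern-finding step described above can be sketched in a few lines of Python. This is an illustrative assumption, not the agent's method: the investigation record fields (`cluster`, `root_cause`, `closed_at`), the `recurring_patterns` function, and the thresholds are all hypothetical, and simple counting stands in for whatever analysis DevOps Agent really runs.

```python
from collections import Counter
from datetime import datetime, timedelta


def recurring_patterns(investigations, now, window_days=7, min_count=2):
    """Count (cluster, root_cause) pairs seen within the recent window.

    Returns the pairs that occurred at least `min_count` times,
    most frequent first.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        (inv["cluster"], inv["root_cause"])
        for inv in investigations
        if inv["closed_at"] >= cutoff
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Made-up investigation history for illustration.
NOW = datetime(2026, 3, 31, 12, 0)
INVESTIGATIONS = [
    {"cluster": "eks-prod", "root_cause": "resource contention", "closed_at": NOW - timedelta(days=1)},
    {"cluster": "eks-prod", "root_cause": "resource contention", "closed_at": NOW - timedelta(days=3)},
    {"cluster": "eks-prod", "root_cause": "bad deploy", "closed_at": NOW - timedelta(days=2)},
    {"cluster": "eks-prod", "root_cause": "resource contention", "closed_at": NOW - timedelta(days=20)},
]

if __name__ == "__main__":
    for (cluster, cause), n in recurring_patterns(INVESTIGATIONS, NOW):
        print(f"{cluster}: '{cause}' occurred {n} times in the last 7 days")
```

In this toy data, only the resource contention issue crosses the threshold within the week, so it is the one that would be flagged for a longer-term fix rather than another one-off investigation.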

Bart Farrell: And how does AWS DevOps Agent fit into modern cloud-native architecture?

Greg Eppel: Basically, it fits into your modern cloud-native architectures because it's looking across the whole toolchain. And like I was saying before, it connects into your code repos and into your pipelines. It's looking at those metrics, logs, and traces that you have within your Kubernetes platform. It's stitching all those signals together and establishing those relationships, taking what is typically a very distributed system and simplifying it through agentic AI.

Bart Farrell: Now, Kubernetes turned 10 years old a couple of years ago. What should we expect in the next 10 years to come?

Greg Eppel: I think with agentic AI and with DevOps Agent, we're going to see more autonomous operations, right? Where this will evolve within the Kubernetes platform, I think, is that you're going to see a simplification of operations and more autonomous actions being taken by the agent. And really, the human gets brought into the loop only when absolutely necessary. So I think it's just going to simplify the management and operations of running Kubernetes overall.

Bart Farrell: And what's next for you, Greg?

Greg Eppel: I'm going to continue to work on DevOps Agent. My team also covers AgentCore, so I'm really involved in both creating and running agents in AWS, and also in observing and operating them using DevOps Agent.

Bart Farrell: And how can people get in touch with you?

Greg Eppel: I'm on LinkedIn. You can just search for my name, Greg Eppel, and I'll show up.
