Bart Farrell: So first things first, who are you? What's your role? And where do you work?
Bill Shelton: My name is Bill Shelton. I'm responsible for the product organization at the Observability Business Unit within Palo Alto Networks. This is the team that, up until about two months ago, was Chronosphere. We still hold on to that name, but now under new ownership.
Bart Farrell: So Bill, what are three emerging Kubernetes tools that you're keeping an eye on?
Bill Shelton: Well, one, if you give me some license, is not emerging. It's been around a little bit, but I find it super provocative because it really gets at evolving the architecture itself: Wasm. It's so lightweight that I'm interested to see how it plays out. That's one I definitely keep my eye on. I'm also interested in what's taking place on the edge. Some unified control planes, things like Tigera's Calico cluster mesh, are very interesting when you think about Kubernetes clusters that are on the edge and very large in scale. I find that super interesting and track it always. And then there's the emergence of more and more agent-focused management platforms. SentinelOne is one, and there are various others providing a larger set of endpoint detection and response capabilities as containers are used more and more for agents. Those three. We could go on here for a bit.
Bart Farrell: I'm sure we could. No shortage of tools in the Kubernetes ecosystem. Managed Kubernetes services keep promising to remove operational burden. At what point does abstraction stop being helpful because your team can no longer explain or debug what happened in production?
Bill Shelton: It's a great question. Here are a couple of thoughts on that. In the heat of the moment, when you're on a high-severity call, maybe an outright incident, and you need to work through the data to get to root cause, one of the most frustrating things is an abstraction layer that prevents you from getting to that next level. So one strategy we lean on extensively at Chronosphere is not having hard abstractions, but having aggregations that let someone see things at a higher level, then very quickly break through the abstraction, cut through, and redimensionalize the data. We have a tool, we call it our DDX tool, that provides a guided path. Behind the scenes it's running threads to analyze where there are correlations around the error condition: an increase in 500s here, an anomaly on CPU in various places, an anomaly in a container-level diagnostic. At that point we redimensionalize the data to bring focus on the intersection of all those dimensions, which sit at different layers of the abstraction, and that can accelerate your path to the issue. So we give you a default schema you can walk through: you can start at the cluster, look at the Kubernetes services, and maneuver around. But we accelerate you through that by finding common correlations and then redimensionalizing the data by them. So it's less about forcing an abstraction on people and more about searching around the abstraction and then creating a new kind of model or schema for them to walk through to get to the issue. That's our thinking on it right now and one of the strategies we use.
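To make the correlation step concrete, here's a minimal sketch of the idea as described above: flag anomalous windows per metric, then surface the candidate metrics whose anomalies overlap the error condition's. The function names, the z-score test, and the threshold are illustrative assumptions, not Chronosphere's actual DDX implementation.

```python
# A minimal sketch of DDX-style correlation (illustrative, not Chronosphere's
# actual implementation): flag anomalous points per metric, then keep the
# candidate metrics whose anomalies overlap the error condition's.
from statistics import mean, stdev

def anomalous_points(series, threshold):
    """Indices where a sample deviates more than `threshold` std devs from the mean."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(series) if abs(v - mu) > threshold * sigma}

def correlated_metrics(error_series, candidates, threshold=1.5):
    """Candidate metrics whose anomalies coincide with the error metric's."""
    error_anomalies = anomalous_points(error_series, threshold)
    hits = {}
    for name, series in candidates.items():
        overlap = error_anomalies & anomalous_points(series, threshold)
        if overlap:
            hits[name] = sorted(overlap)
    return hits

# Example: a spike in HTTP 500s lines up with a CPU anomaly on one node only.
http_500s = [2, 3, 2, 2, 40, 41, 3, 2]
candidates = {
    'node_cpu{node="a"}': [10, 11, 10, 12, 95, 97, 11, 10],
    'node_cpu{node="b"}': [10, 11, 12, 11, 10, 12, 11, 10],
}
print(correlated_metrics(http_500s, candidates))
# -> {'node_cpu{node="a"}': [4, 5]}
```

The point of the example is the intersection: node "b" is steady, so it drops out, and the engineer is pointed only at the dimension that actually co-moves with the errors.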
Bart Farrell: One of our podcast guests, Fernando, describes discovering that his GKE Autopilot proof of concept was costing close to $1,000 a month due to balloon pods, minimum CPU to memory ratios, and unexpected metrics charges. How do you handle cost visibility in managed Kubernetes?
Bill Shelton: A couple of things. We do the basics: we integrate with Kubecost and try to take advantage of the ecosystem around us, where people already have investments that help them manage costs. But that's really just the start of what we do. We view cost control and acceleration of troubleshooting as complementary problems, two sides of the same coin. So we've spent a lot of time on it, and we provide a rich tool set, which I'll talk about in a moment, to minimize the data you have. Stepping back for a moment, I think we've all had this discovery where you take traditional legacy observability tools and drop containers in, and because of the order-of-magnitude increase in the number of code paths emitting metrics, driven by microservice architecture and large horizontal scale-out, all of a sudden your data volume explodes. So the first thing we do is give you a whole vocabulary to set up what we call drop rules. For all the data that's loaded, we track every access point to every metric. And you can imagine, in high-cardinality, very large environments, you're dealing with hundreds of thousands of metrics. Because we track every single access point to those metrics, right off the bat we can assign a utility score. This utility score becomes a big part of how you look at your data and immediately identify all the metrics that are never used: they're not part of a dashboard, they haven't been used in any ad hoc interrogation over the past 30 days, so this is a very low-value metric, drop it. We let people manage things aggressively that way. With this score at hand, they can set up drop rules, which means the data is never even brought into the system. And there are other rules for data you want to hang on to but only use occasionally, or only use in aggregate: when you do bring it in, an aggregation rule applies, and now we can aggregate away some of those finer-grained windows to further reduce the size. The result is that we find two things taking place. People save a large amount of money, because most vendors out there charge by some form of throughput or stored data. And at the same time, this really accelerates their troubleshooting: you have a lot less noise to sort through, and you're dealing with higher-value data. So that's our approach. We load people up with a lot of tooling to help them set policies to optimize their environment.
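For flavor, here's a toy sketch of the utility-score idea described above: count accesses to each metric over a trailing 30-day window, drop what's never used, and pre-aggregate what's rarely used. The rule shapes, thresholds, and function names are made up for illustration; Chronosphere's actual drop and aggregation rules will look different.

```python
# A toy sketch of utility scoring and drop/aggregation rules (the rule format,
# thresholds, and names are illustrative, not the product's real syntax).
from datetime import datetime, timedelta

def utility_score(access_times, now, window_days=30):
    """Count dashboard/ad hoc accesses to a metric in the trailing window."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for t in access_times if t >= cutoff)

def propose_rules(metric_accesses, now):
    """Never used -> drop before ingest; rarely used -> aggregate on ingest."""
    rules = []
    for metric, access_times in metric_accesses.items():
        score = utility_score(access_times, now)
        if score == 0:
            rules.append({"metric": metric, "action": "drop"})
        elif score < 5:
            rules.append({"metric": metric, "action": "aggregate", "interval": "5m"})
    return rules

now = datetime(2025, 1, 31)
accesses = {
    "http_requests_total": [now - timedelta(days=d) for d in (1, 2, 3, 9, 15, 20)],
    "go_gc_pause_seconds": [now - timedelta(days=2)],
    "legacy_debug_counter": [],
}
for rule in propose_rules(accesses, now):
    print(rule)
# {'metric': 'go_gc_pause_seconds', 'action': 'aggregate', 'interval': '5m'}
# {'metric': 'legacy_debug_counter', 'action': 'drop'}
```

Because most vendors bill on throughput or stored bytes, a drop rule applied before ingest is where the cost saving lands; the aggregation rule trades resolution for retention on the occasionally-used metrics.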
Bart Farrell: Now, another podcast guest of ours, Molly, believes many people initially think they need their own cluster because they're scared of multi-tenancy, but then learn that the operational overhead of maintaining multiple clusters is significant. What's your experience with teams being scared of multi-tenancy versus dealing with operational overhead?
Bill Shelton: It's a good question. I'll lean back a little, because I think there are lessons we can learn from history. Obviously, clusters have been a part of our world for a long time; Kubernetes uses them, and many previous technologies did too. I did a fair amount of work at VMware, and we studied this problem. We had the benefit of a very large customer base: we could look at how many clusters were being used, how many were at capacity, and really start to unpack the problem. In that context, we learned a couple of things. One is that, frequently, additional clusters on the margin were not created because the previous cluster was at capacity. It was some organizational or business issue driving the additional cluster. Sometimes it was a boundary on budgets, or the org chart, or something like that. So the result is that there are going to be a lot of clusters out there. And what we adhere to, and I see other vendors adhering to, is that the key management capabilities, your identity and access management, cost management, observability, need to rise above the cluster level, be able to bring data in at scale, and have a schema that fully expects multiple clusters to come in, then give someone the ability to problem-solve, manage, and track SLOs across clusters. So that's generally what I see playing out: part of the industry is moving to that higher level, working across clusters, while Kubernetes is still doing great things, mostly within a cluster.
Bart Farrell: Kubernetes turned 10 years old about two years ago. What should we expect in the next 10 years?
Bill Shelton: I go back to, well, I already played my hand here a little bit. I think serverless and WebAssembly are somewhat lighter-weight models. I can't even imagine everything, but definitely over the next four to five years, as agents start to play a greater role in the applications out there, where you have very transient, short-running, independent processes on independent code paths, I think that side of Kubernetes, Knative and Wasm and things of that sort, will play a much more dominant role as a proportion of the containers running out there. I think we'll see an increase in that form.
Bart Farrell: And Bill, what's next for you in terms of projects, events, plans? What's going on next?
Bill Shelton: We're spending a lot of time on, and are very excited about, something I think the industry has framed up but I haven't seen fully realized yet: what's sometimes called AI SRE. We don't believe this is a substitutive play. Our assessment is that there's still such a rich cognitive process taking place, particularly in the troubleshooting phase, that we don't view this as a substitutive moment in the industry. We definitely see it as augmentative, and as an accelerator in those critical times when someone is working an issue. So we have interesting projects and specialized agents that, upon an incident being triggered, immediately go off and curate data, do research, hit previous incident reports to maybe glean insights, and bring back a lot of data, and then help that person process through it. We're not just heaving unlimited data onto someone and almost making their situation more stressful; we're helping prioritize and synthesize it, to really optimize that game-day experience when there's an issue that needs to be tended to.
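As a thought experiment, here's one way that curation step might look in miniature: gather candidate findings at incident time, boost the ones that echo prior incident reports, and hand the engineer a short prioritized list instead of a raw dump. Everything here, the scoring, the data shapes, the 0.2 boost, is a hypothetical illustration, not the product's actual design.

```python
# Hypothetical sketch of incident-time curation: rank fresh findings, boosting
# any whose keyword echoes a past incident report, and return a short list
# rather than heaving all the data at the on-call engineer. All names and
# weights here are illustrative assumptions.

def curate(findings, past_reports, limit=5):
    """Return the top findings, favoring those seen in prior postmortems."""
    def score(finding):
        echoes = sum(finding["keyword"] in report for report in past_reports)
        return finding["confidence"] + 0.2 * echoes
    return sorted(findings, key=score, reverse=True)[:limit]

findings = [
    {"keyword": "oom", "confidence": 0.6},
    {"keyword": "cert expiry", "confidence": 0.5},
    {"keyword": "cpu throttling", "confidence": 0.4},
]
past_reports = [
    "2024-11 outage: pod oom after deploy",
    "2024-07 incident: oom on the batch node pool",
]
print(curate(findings, past_reports))
# "oom" jumps to the top: 0.6 + 0.2 * 2 = 1.0
```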
Bart Farrell: And if people want to get in touch with you to continue the conversation and see what's going on next, what's the best way to do that?
Bill Shelton: Go to chronosphere.io. Maybe at some point that will redirect over to Palo Alto Networks, maybe it does today, but anyway, chronosphere.io. You'll learn about some of the work we're doing for some of our largest customers. Go from there.