Efficient observability in Kubernetes: from data collection to troubleshooting

Guest:

  • Julia Blase

In this interview, Julia Blase, Product Management Lead at Chronosphere, discusses:

  • How to implement data-driven monitoring by analyzing usage patterns and removing unused metrics, cutting storage costs by up to 60% while improving troubleshooting efficiency.

  • The shift from hero-based troubleshooting to systematic debugging by documenting expert workflows and making them accessible to all team members.

  • Why unified observability platforms are essential in modern infrastructure, focusing on tools that preserve context and reduce the complexity of debugging Kubernetes environments.

Transcription

Bart: Who are you? What's your role and where do you work?

Julia: My name is Julia Blase. I'm a product management lead and I work at Chronosphere.

Bart: What are three Kubernetes emerging tools that you're keeping an eye on?

Julia: I'm looking into Perses for data visualization on top of open source monitoring tools like Prometheus that are designed for Kubernetes. It's great to have more open source visualization tools; it gives Grafana a competitor, and that makes everyone better in the space. Fluent Bit 3.2 has a lot of really important performance enhancements and efficiencies; we were watching the keynotes and Observability Day, and I'm definitely keeping an eye on that. And OTel: I know it's been around for a while, but it keeps maturing, and we keep seeing more and more customers demand OpenTelemetry for metrics and traces. I think that's going to continue to grow.

Bart: One of our guests, Mat, shared that most teams store logs and metrics in Kubernetes without considering the implications of the data they collect. Consequently, they end up paying a hefty price for data that is not actually used. What's your advice on ingesting, storing, and querying metrics in Kubernetes?

Julia: Absolutely. My biggest piece of advice is to only keep the data that you need, and do that in a data-driven way. You can look at all the data you collect and see where people are using it - in a monitor, in a dashboard, if people are writing individual queries, or if your API is calling that data all the time. You can also see what's never getting called. When we talk to customers, sometimes they end up cutting their data by up to 60% when they work with us because we give them some of these tools and insights. However, you can do this even without Chronosphere. You can go in there today and see what people are never accessing and drop it. Not only does that save you a ton of costs, but it also saves people a ton of time debugging, as developers aren't digging through a ton of dashboards and line charts and entropy where people have tried and failed to capture what they needed. If you just get rid of all that data, they can get the insights they need much faster. This approach saves both cost and time if you can really analyze what people are not using and just get rid of it.
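As a rough illustration of the usage audit Julia describes, here is a minimal Python sketch, not a Chronosphere tool, that pulls every metric name from a Prometheus server and flags the ones that never appear in a local export of dashboard and alert-rule definitions. The endpoint URL, the definitions directory, and the simple substring match are all assumptions to adapt; it also ignores ad hoc queries and API callers, so treat the output as candidates to review rather than metrics to drop outright.

```python
# Minimal sketch: flag metrics that are ingested but never referenced in any
# exported dashboard or alert-rule file. URL and paths are assumptions.
import json
import pathlib
import urllib.request

PROM_URL = "http://localhost:9090"                        # assumed Prometheus endpoint
DEFINITIONS_DIR = pathlib.Path("./observability-config")  # exported dashboards + rules

# 1. Every metric name the Prometheus server currently knows about.
with urllib.request.urlopen(f"{PROM_URL}/api/v1/label/__name__/values") as resp:
    all_metrics = set(json.load(resp)["data"])

# 2. Naive usage check: a metric counts as "used" if its name appears anywhere
#    in the exported dashboard JSON or alert-rule YAML files.
config_text = " ".join(
    p.read_text(errors="ignore")
    for p in DEFINITIONS_DIR.rglob("*")
    if p.is_file() and p.suffix in {".json", ".yaml", ".yml"}
)
unused = sorted(m for m in all_metrics if m not in config_text)

print(f"{len(unused)} of {len(all_metrics)} metrics have no dashboard/alert reference:")
for name in unused[:50]:
    print("  ", name)
```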

Bart: Troubleshooting tips and the learning path. Our guest, Julia Blase, spent several weeks troubleshooting an issue with Kubernetes, which required the team to explore the kernel code. She stressed the importance of learning while troubleshooting. Is there any practical advice that you've learned during the years regarding debugging?

Julia: Yes. What I've learned over and over is that you cannot rely on a handful of experts to handle all of your debugging. Our services are complex, and Kubernetes introduces microservices, rapid rates of change, infrastructure scaling up and down, and user volume fluctuations. As things change quickly, developers become specialized in one endpoint or one service. They end up calling in experts, often referred to as "heroes." I've heard customers use this term, and I'm sure many listeners can think of who the hero is in their organization. However, these heroes burn out. They're on every incident call, which is exhausting. They're not spending their time using their expertise to build fantastic code and great features for customers.

To address this, we need to take the heroes' workflows, document them, make them repeatable, or even embed them in the product. This way, anyone, including brand new developers, contractors, or those unfamiliar with the system, can follow these workflows and do the work of the hero. This allows teams to focus on building great code and new features instead of spending a lot of time troubleshooting.

Bart: Alerts, visualization, monitoring, and alerting. Our guest Miguel discussed how visualizations can help consume and make sense of vast amounts of data, aiding in better decision-making and system management. What's your thought process for collecting, alerting, and visualizing data on a Kubernetes cluster?

Julia: I think that you absolutely have to have all of that information relevant to your system in a single place. You have to have the right context and you have to have it at the right time. Any visualization tool that says, "I can only look at this type of data over here," and then you have to make multiple hops, apply new filters, and go back to find that specific slice in time to see this other data, is going to lead you into failure. Or at least it's going to lead you into spending a lot more time trying to understand your system, fine-tune your performance, or debug an incident. You need visualization tools that give you the right set of data in the right place and preserve your context as you go deeper into that data. I don't want to have to write a new PromQL query every time I want to investigate a specific nuance of the feature I'm looking into today. I need easy-to-iterate-on visualizations that help me go deep and provide fast and performant results on the high scale of my Kubernetes data. Otherwise, I'm going to get frustrated because I don't want to spend time in the data visuals. I want to build code. The visuals are there to help me, and they should be powerful enough to do that for me today.
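To make the context-preservation point concrete, here is a small Python sketch, my illustration rather than anything Julia endorses, of what carrying context forward can look like when you do fall back to raw PromQL: the label filters that define the investigation live in one place and are reused as the questions get deeper, so only the expression changes. The Prometheus URL, metric names, and labels are assumed placeholders.

```python
# Sketch of preserving investigation context across PromQL queries: keep the
# label filters in one place and reuse them, instead of rewriting each query.
# The Prometheus URL, metric names, and labels below are illustrative.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed Prometheus endpoint

def instant_query(expr: str) -> list:
    """Run an instant PromQL query and return the raw result vector."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

# The "context" of the current investigation, carried forward as you go deeper.
context = {"namespace": "checkout", "cluster": "prod-eu-1"}
selector = ",".join(f'{k}="{v}"' for k, v in context.items())

# Same context, progressively deeper questions: only the expression changes.
error_rate = instant_query(f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))')
p99_latency = instant_query(
    "histogram_quantile(0.99, "
    f'sum(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])) by (le))'
)
print(error_rate, p99_latency)
```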

Bart: Kubernetes turned 10 years old this year. What should we expect in the next 10 years?

Julia: I think Kubernetes introduced us to a world where infrastructure can be dynamic and scale with load. In the observability space, I see monitoring and observability needing to be as dynamic as infrastructure. This means being able to scale up and down what you've instrumented and captured on demand, when needed, to save costs and get the insights required at the right time.

Bart: What's next for you?

Julia: Well, we have just launched a big product, actually, Differential Diagnosis, and that's been kind of my baby. My team and I have been working on it very intensely with all of our customers for the last several months. I'm excited to see it go live and get feedback. My husband and I have a three-week trip to South America planned, which I'm looking forward to, as I've never been. I'm looking forward to taking a little break before getting back to help customers solve problems.

Bart: How can people get in touch with you?

Julia: I'm on LinkedIn, Julia Blase.
