Observability in Kubernetes: insights on data collection and alerting

Guest:

Stéphane Estevez

Discover the future of observability in Kubernetes environments and gain insights on optimizing your data collection and alerting strategies.

In this interview, Stéphane Estevez discusses:

The importance of not just fast data collection but also the speed of alerting and reporting, suggesting streaming approaches as the future.
Starting with OpenTelemetry before choosing observability tools, ensuring flexibility and avoiding vendor lock-in.
The necessity of simplifying management infrastructure now while acknowledging that true automation with LLMs is still distant.

Transcription

Bart: Who are you? What's your role? Who do you work for?

Stéphane: Hello, I'm Stéphane Estevez, working for Splunk. I'm an observability market advisor for EMEA. As you can hear, I'm French, but I represent EMEA globally.

Bart: What are three emerging Kubernetes tools that you're keeping an eye on?

Stéphane: I mean, the two usual suspects, to be honest. The first is eBPF. It's always a nightmare to understand the network when you work with a cloud provider, so we have big hopes for eBPF and for understanding the network. The second one, not directly related, is OpenTelemetry. That's maybe one of the keywords I've heard most people talking about at this edition.

Bart: One of our guests, Matt, shared that most teams store logs and metrics in Kubernetes without considering the implications of the data they collect. Consequently, they end up paying a hefty price for data that is not actually used. What's your advice on ingesting, storing, and querying metrics in Kubernetes?

Stéphane: I will not mention the cost because it depends on the solutions and so on. But I want to mention something about the metrics when we talk about Kubernetes. When you try to make Kubernetes environments observable, one of the key issues you will have is the speed at which you collect data. Most vendors have the same approach, which is to try to get the data as fast as you can when you're talking about metrics. The problem is this data is stored in a time series database, and then there's a batch process. Imagine that from the moment you see an incident to the moment you apply some auto-remediation, there's a batch process in the middle. If you collect the data, let's say metrics, in a few seconds, and then you have to wait one minute for a batch process, that doesn't make sense in the cloud-native world. This is something that most people don't talk about. They just mention the first part, which is, "Hey, I collect the data super fast." But more interesting is how fast you can alert and report on it. There are new approaches like streaming and things like that. That's what we believe will be the future for collecting metrics for Kubernetes environments.
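
The latency gap Stéphane describes can be made concrete with a small simulation. This is only a sketch under stated assumptions: the 5-second collection interval, the 60-second batch interval, the threshold, and the function names are all hypothetical and do not describe any particular product.

```python
import math
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def streaming_alert_time(samples: Iterable[Sample], threshold: float) -> Optional[float]:
    """Evaluate the alert rule on every sample as it arrives."""
    for ts, value in samples:
        if value > threshold:
            return ts  # alert fires the moment the breaching sample is seen
    return None


def batch_alert_time(samples: Iterable[Sample], threshold: float,
                     batch_interval: float) -> Optional[float]:
    """Evaluate the alert rule only at fixed batch boundaries over stored samples."""
    for ts, value in samples:
        if value > threshold:
            # The breach is only noticed by the first batch run at or after ts.
            return math.ceil(ts / batch_interval) * batch_interval
    return None


# Hypothetical CPU metric collected every 5 seconds; it spikes at t = 65 s.
SAMPLES = [(float(t), 95.0 if t >= 65 else 40.0) for t in range(0, 181, 5)]

streaming = streaming_alert_time(SAMPLES, threshold=90.0)
batch = batch_alert_time(SAMPLES, threshold=90.0, batch_interval=60.0)
print(f"streaming alert at {streaming}s, batch alert at {batch}s")
# prints "streaming alert at 65.0s, batch alert at 120.0s"
```

In this toy setup the data is collected every 5 seconds in both cases, yet the batch evaluation fires 55 seconds later than the streaming one, which is the delay before any auto-remediation could start.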

Bart: When it comes to observability tooling, a lot of people can get overwhelmed because there's just so much out there. What's the advice that you find yourself often giving folks when they're trying to troubleshoot this vast ecosystem?

Stéphane: In fact, before even starting to talk about observability, I will start talking about OpenTelemetry. The first thing is you need to get the data, and then you will find which tools you need to use this data. Usually, an observability platform has the same tools: real user monitoring, synthetic monitoring, distributed tracing, some kind of infrastructure monitoring with metrics, and then logs. That's a classic. The three pillars of observability are logs, metrics, and traces, and I know that soon profiling will be added. But my advice is, before anything else, just invest in OpenTelemetry. Make sure that you can properly collect the data and put that in your pipelines, so you don't have to work too much to get the telemetry done. Then you have the freedom to select the right kind of tools you want in the backend to send the data to. What I like about OpenTelemetry is, first, you're not locked in with a vendor. If you make a mistake, you can switch more easily because the instrumentation is already done. If you work in security, you can know exactly what you're sending to whom. You can do a lot of pre-processing in there and so on. So my first advice is just start with the data. Then you will find the right tool on the backend. But everybody does the same thing. And back to my previous comment, don't trust the vendors. Just test the technology. Very often, most people do the same thing; they just don't deliver it the same way. Take the batch approach for metrics that I mentioned, or sampling, which is an old way of collecting traces. Nowadays, when we're talking about AI and so on, you need to find early signals. You need the data. Stop sampling, because now we're storing everything in the cloud and it's quite cheap. So there's no more reason for sampling traces and things like that.
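
The "start with the data, pick the backend later" idea can be sketched in a few lines. This is deliberately not the OpenTelemetry API: the names `Span`, `Pipeline`, `redact`, `InMemoryExporter`, and `handle_checkout` are hypothetical stand-ins, used only to illustrate instrumenting once, pre-processing in the pipeline, and swapping the backend without touching application code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


@dataclass
class Span:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


class Exporter(Protocol):
    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Hypothetical stand-in for any observability backend."""

    def __init__(self, backend_name: str) -> None:
        self.backend_name = backend_name
        self.received: List[Span] = []

    def export(self, span: Span) -> None:
        self.received.append(span)


Processor = Callable[[Span], Span]


def redact(keys: List[str]) -> Processor:
    """Build a pre-processing step that masks the given attribute keys."""
    def _process(span: Span) -> Span:
        cleaned = {k: ("[REDACTED]" if k in keys else v)
                   for k, v in span.attributes.items()}
        return Span(span.name, cleaned)
    return _process


class Pipeline:
    """Runs every span through the processors, then hands it to one exporter."""

    def __init__(self, processors: List[Processor], exporter: Exporter) -> None:
        self.processors = processors
        self.exporter = exporter

    def emit(self, span: Span) -> None:
        for process in self.processors:
            span = process(span)
        self.exporter.export(span)


def handle_checkout(pipeline: Pipeline) -> None:
    """Application code: instrumented once, unaware of which backend is used."""
    pipeline.emit(Span("checkout", {"user.email": "a@example.com", "cart.items": "3"}))


# Switching backends changes only the pipeline wiring, not handle_checkout.
vendor_a = InMemoryExporter("vendor-a")
vendor_b = InMemoryExporter("vendor-b")
for exporter in (vendor_a, vendor_b):
    handle_checkout(Pipeline([redact(["user.email"])], exporter))

print(vendor_b.received[0].attributes)
# prints "{'user.email': '[REDACTED]', 'cart.items': '3'}"
```

Because the redaction step lives in the pipeline rather than in the backend, the team controls exactly what leaves the cluster regardless of which vendor receives it.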

Bart: Kubernetes is turning 10 years old this year. What should we expect in the next 10 years to come?

Stéphane: There's a lot of work being done to simplify the management and infrastructure. There are a lot of buzzwords about LLMs and stuff like that for SREs and so on. I don't believe that will really happen soon. I mean, it still requires human intelligence. But still, over the next 10 years, we will work a lot on simplifying the management of that. Then again, in 10 years, we'll have moved to something else, to be honest. Things move way too fast. But I don't know. That's the future. For the moment, we already have a lot to do to make sure that everybody starts using Kubernetes. That's a major preoccupation in the short term.

Bart: What's next for you?

Stéphane: Waiting for all the consolidation and clarification around observability, because even the word "observability" is understood in different ways. I'm not usually a big fan of analysts, but I think they have at least one great value: they make everybody agree on definitions. And that's what we're expecting nowadays, that people finally agree on terminology. That will help simplify the conversations between everybody working around observability.

Bart: How can people get in touch with you?

Stéphane: I think LinkedIn will be the easiest way.

Podcast episodes mentioned in this interview

Foolproof Kubernetes with GKE

with Mathew Duggan