Making sense of Kubernetes observability at scale
A deep dive into observability and monitoring challenges in Kubernetes environments, with insights on scaling and simplifying telemetry data collection.
In this interview, Roman Khavronenko, VictoriaMetrics co-founder, discusses:
Why observability is crucial for Kubernetes operations and how it helps teams understand system behaviour through metrics collection, alerting, and visualization.
The technical challenges of managing time series databases at scale, particularly the impact of high churn rates where metrics are frequently created and destroyed.
The need to simplify Kubernetes monitoring, highlighting how only 25% of collected metrics are used while the remaining 75% consume resources without providing value.
Transcription
Bart: Who are you? What's your role? And who do you work for?
Roman: Hello, my name is Roman Khavronenko. I'm an engineering manager at VictoriaMetrics. VictoriaMetrics is the company behind the open-source project of the same name, which focuses on observability, providing a time series database and a set of tools for collecting and delivering metrics.
Bart: What are three emerging Kubernetes tools you work with right now?
Roman: I'm not a big fan of Kubernetes, but I have to work with it because that's where monitoring mostly applies right now. I really like distributed systems and observability, and all the problems connected with them. It turns out that both of these interests apply to Kubernetes. That's my relationship with Kubernetes - I like to monitor the things that run in it and solve the problems that come up.
Bart: One of our guests, Miguel, discussed how visualization helps teams consume and make sense of vast amounts of data, aiding better decision-making and system management. Why is observability crucial for Kubernetes operations?

Roman: My view is that collecting, alerting on, and visualizing data from a Kubernetes cluster is crucial. Observability gives you an understanding of what's happening to the services in Kubernetes and to Kubernetes itself. Without observability, the cluster is like a black box - the best you can do is check logs or maybe ask your users whether they're okay with the quality of service.
Observability opens a window and makes these processes transparent. You can check how the system behaves at any moment in time, right now or in the past. Depending on which metrics you collect and how you collect them, you can get a better understanding of what's happening: why errors start to pop up and which events they correlate with.
Visually representing this data in tools like Grafana is also very important. You can open a dashboard, see what's happening, and share it with your colleagues or other engineers, so you're all on the same page. However, having alerting is even more important, as it doesn't depend on a human looking at the dashboard. Alerting can catch your attention immediately if something is wrong, even when you're asleep or not looking at the dashboards. Alerts can notify you and draw your attention to the problem.
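To make that concrete, here is a minimal sketch (not taken from the episode) of what alerting boils down to: a check that runs without anyone watching a dashboard. It queries the Prometheus-compatible HTTP API, which VictoriaMetrics also implements; the endpoint URL, the query, and the threshold are illustrative assumptions, and in practice tools such as Prometheus rules with Alertmanager, or vmalert, do this evaluation and notification for you.

```python
import requests

PROM_URL = "http://localhost:8428"  # hypothetical VictoriaMetrics single-node endpoint
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'  # illustrative query
THRESHOLD = 10.0  # errors per second; purely illustrative

def check_error_rate() -> None:
    # /api/v1/query is part of the Prometheus HTTP API, which
    # VictoriaMetrics exposes as well.
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else 0.0
    if value > THRESHOLD:
        # A real setup would notify via Alertmanager, Slack, PagerDuty, etc.
        print(f"ALERT: 5xx rate {value:.1f}/s exceeds {THRESHOLD}/s")
    else:
        print(f"OK: 5xx rate {value:.1f}/s")

if __name__ == "__main__":
    check_error_rate()
```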
Bart: It's also been explained that while monitoring deals with problems we can anticipate, for example running out of disk space, observability goes beyond that and addresses questions you didn't even know you needed to ask, by analyzing telemetry signals. Does this match your experience in adopting observability in your stack?
Roman: If you have metrics that mean something, like those that describe the internal processes of your application, you can draw better conclusions or new conclusions that you couldn't have anticipated before. For example, you can spot weirdness in those metrics, such as backups slowing down for unexpected reasons. By looking at those metrics, you can form a guess or even a theory, and then by adding new metrics, you can prove or disprove why that slowdown happened or why any other process behaved differently.
I recall a case from a previous company where a Kubernetes service that constantly served a ton of requests experienced latency degradation for unknown reasons at a specific hour on certain days. Thanks to the metrics we collected from that system, we discovered that the degradation was caused by the fragmentation of SSDs on that instance. Without those metrics, we would never have guessed why the service degraded. However, the correlation between the degraded service and other anomalies that happened at the same time gave us a clue. We noticed that these two systems had anomalies at the same time, which led us to suspect a correlation and eventually understand why it happened.
Bart: One of our guests, Ferris, who maintains a 900-node, 15,000-pod cluster, explained their transition from Prometheus to Thanos, and eventually their decision to adopt VictoriaMetrics for metrics. What's your experience with monitoring and managing clusters at that scale?
Roman: I have a lot of experience with this topic because we're working on a time series database that is meant to solve monitoring-at-scale problems. Cases like that one show that our work on the VictoriaMetrics project pays off, as people choose it for better reliability or scalability.

In my experience, Kubernetes is expensive in terms of telemetry data because of its model and frequent label changes. There's a term called "churn rate" in metrics: some time series exposed by Kubernetes services, and by Kubernetes itself, don't live long. These time series are constantly being created and dying, resulting in a high churn rate. That makes it hard for time series databases to process these metrics efficiently, because they maintain inverted indexes that map unique time series to their actual IDs in the system. If you have a high churn rate, these indexes consume more memory than expected. This issue affects Prometheus, Thanos, and their relatives, as well as VictoriaMetrics, because they all maintain inverted indexes. The difference between these systems is that some handle it more efficiently or have better techniques to tackle it.

In general, monitoring Kubernetes services is expensive because of the churn rate and frequent label changes. If there were a way to avoid this churn, you would save a lot of resources on monitoring and wouldn't need as much memory.
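As a rough illustration of why churn is expensive, here is a toy model of an inverted index. This is my own sketch, not how VictoriaMetrics or Prometheus actually implement theirs, and the pod names are hypothetical: the point is only that each unique label set is a separate series, so every rollout that renames pods turns the same metric into a fresh batch of series and index entries.

```python
from collections import defaultdict
from itertools import count

series_ids = {}                      # frozen label set -> series ID
inverted_index = defaultdict(set)    # (label, value) pair -> set of series IDs
next_id = count()

def register(labels: dict) -> int:
    """Return the series ID for a label set, creating index entries if it is new."""
    key = frozenset(labels.items())
    if key not in series_ids:
        sid = next(next_id)
        series_ids[key] = sid
        for pair in key:             # every label pair points back at the series
            inverted_index[pair].add(sid)
    return series_ids[key]

# A Deployment rollout replaces pod names: same metric, new label value,
# so each rollout creates a whole new batch of series and index entries.
for rollout in range(3):
    for replica in range(100):
        register({
            "__name__": "http_requests_total",
            "namespace": "default",
            "pod": f"api-5f7d{rollout}-{replica:03d}",  # hypothetical pod name
        })

print("unique series:", len(series_ids))   # 300 series for what is logically one metric
print("index entries:", sum(len(ids) for ids in inverted_index.values()))
```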
Bart: Kubernetes is turning 10 years old this year. What should we expect in the 10 years to come?
Roman: My personal philosophy, and our engineering philosophy at VictoriaMetrics, is to prefer simplicity above all else. I expect that Kubernetes will simplify the way you work with its services and API, and especially simplify monitoring. Currently, we observe, and other vendors report, that most metrics exposed by Kubernetes are not used anywhere. If I'm not mistaken, Grafana reported that only 25% of the metrics exposed by Kubernetes services are actually displayed in dashboards or used in other tools, while the remaining 75% are just dead weight that still has to be collected, stored, and paid for in resources, which is not ideal.

I expect Kubernetes to become more streamlined in this respect. Maybe this dead weight of metrics is not really needed. Maybe we can create better exporters that expose only meaningful metrics - not everything that describes Kubernetes internals, but something actionable for users: something that can be turned into an alerting rule or into meaningful panels in Grafana. If we can create a new standard for how Kubernetes should expose metrics, and telemetry signals in general, in a meaningful way, we can probably cut much of the complexity of monitoring Kubernetes. Everyone would benefit from that: observability vendors would spend fewer resources on collecting those metrics, users would spend less money because they wouldn't need as many resources for on-premises setups, and it would be much easier to navigate, support, and learn the tooling.
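As a sketch of what such a leaner exporter could look like, the idea is to expose only a handful of actionable metrics instead of everything measurable. This is my illustration, not a proposed standard or an existing VictoriaMetrics component; the metric names and the HTTP port are assumptions, and the example uses the Python prometheus_client library.

```python
import random
import time

# prometheus_client is the standard Python library for exposing metrics
# in the Prometheus text exposition format.
from prometheus_client import Gauge, start_http_server

# Only metrics that can back an alert or a dashboard panel; names are illustrative.
PENDING_PODS = Gauge("demo_pending_pods",
                     "Number of pods stuck in the Pending state", ["namespace"])
NODE_READY = Gauge("demo_node_ready",
                   "1 if the node reports the Ready condition", ["node"])

def collect_once() -> None:
    # In a real exporter these values would come from the Kubernetes API;
    # random values stand in for that here.
    PENDING_PODS.labels(namespace="default").set(random.randint(0, 3))
    NODE_READY.labels(node="node-a").set(1)

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        collect_once()
        time.sleep(15)
```

The design choice is the point rather than the code: every exposed series should map to an alerting rule or a dashboard panel someone actually looks at, so the 75% of dead-weight metrics never get scraped in the first place.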
Bart: What's next for you?
Roman: For us, we want to create a logs database that is Kubernetes-native and applicable to all services running in Kubernetes in a streamlined way. We have VictoriaMetrics, a time-series database that is already compatible with Kubernetes service discovery and metrics exposition format, so it can be used by default for collecting all metrics from Kubernetes. We want to do the same for logs because, from my perspective, there are three main telemetry signals: traces, logs, and metrics. Logs and metrics generate the most volume. Currently, we support metrics, and we want to support logs because logs are probably the first priority in terms of data volume that needs to be transferred, processed, and stored.
If people want to get in touch with us, the best way is to open a ticket on the VictoriaMetrics GitHub. You can also check the VictoriaMetrics documentation on GitHub or on our website to find our Slack community, our Stack Overflow community, or our email - all those links are in the documentation. We have about three thousand people in the VictoriaMetrics Slack, so I guess that's the most popular platform right now. Just join our public Slack, ask questions, and we will be happy to help.
Bart: Thank you very much for your time today.