Black box vs white box observability in Kubernetes

Host:

  • Bart Farrell

Guest:

  • Artem Lajko

Platform Engineer Artem Lajko breaks down observability into three distinct layers and explains how tools like Prometheus, Grafana, and Falco serve different purposes. He also shares practical insights on implementing the right level of monitoring based on team requirements and capabilities.

You will learn:

  • How to implement the three-layer model (external, internal, and OS-level) and why each layer serves different stakeholders

  • How to choose and scale observability tools using a label-based approach (low, medium, high)

  • How to manage observability costs by collecting only relevant metrics and logs

Transcription

Bart: In this episode of KubeFM, I had the chance to speak to Artem, who has been on the podcast previously, discussing surviving multi-tenancy in Kubernetes. In this episode, we focus on the intricate challenges of observability in a Kubernetes environment, examining why traditional observability frameworks often fail in the context of Kubernetes. We delve into the inherent complexity of Kubernetes, the limitations of relying solely on metrics, logs, and traces, and the necessity of context-aware observability tailored to dynamic, ephemeral workloads. Our conversation touches on the relationship between observability and control planes, the impact of Kubernetes' distributed architecture on data collection, and the tools and techniques needed to improve system visibility and reliability, such as Prometheus, Grafana, Loki, and Jaeger. This episode is sponsored by LearnK8s, a training organization that has helped engineers level up since 2017. LearnK8s offers instructor-led training, both online and in-person, which is 60% hands-on and 40% theoretical, and students have access to the material for the rest of their lives. For more information, visit LearnK8s.io. Now, let's get into the episode. Artem, welcome back to KubeFM. It's a pleasure to have you with us. First question: which three emerging Kubernetes tools are you keeping an eye on?

Artem: Hey Bart, thanks for having me again. Another day, another topic. I'm happy to be here. If I remember correctly, the first time I mentioned Kyverno, CertManager, and ArgoCD. They are still my top choices, but I've been trying to replace ArgoCD with Sveltos for some customers. I discovered Sveltos this year after a lot of struggles with ArgoCD, particularly when dealing with a large number of applications. Sveltos is an add-on controller that allows you to deploy applications across different clusters, similar to ArgoCD, but based on cluster profiles. The biggest difference is that Sveltos deploys agents on the managed clusters, and the agent notifies the management cluster if something goes wrong or if there's a delta that needs to be synced. Sveltos also allows you to define custom templates and patches, which I found useful for a customer who needed to implement Pod Security Standards. ArgoCD does not let you template a Helm chart and then patch the result, but Sveltos does: you can template a Helm chart and then patch it with customizations. So Sveltos is a good alternative for some customers, and I've been experimenting with it.

Bart: Now that we know you were on a previous episode, can you tell us a little bit more about who you are, what you do, and where you work?

Artem: My name is Artem Lajko. I currently hold the title of Platform Engineer, which has changed over time. I have also recently earned the title of Kubestronaut by obtaining the five necessary certificates this year. My goal is to build open-source solutions that simplify people's lives. I achieve this by packaging solutions into manageable blocks and generating traffic for these tools to encourage contributions. Currently, I work as a Platform Engineer at IITS Consulting, a service provider that specializes in software development, platform engineering, and artificial intelligence. We focus on real AI, not the hype that started a couple of years ago, and our team includes actual AI engineers and PhD holders.

Bart: And in your case, how did you get into Cloud Native?

Artem: It started at the university. I had a project to make GPUs usable for different users for deep learning. When I came into contact with Docker, I wrote a wrapper script around it to make that possible. After that, I worked with Minishift, OpenShift, and managed Kubernetes, and somehow stayed in the cloud native world, feeling like a prisoner. It's a bit like Stockholm syndrome.

Bart: Good. What were you doing before cloud native, before the Stockholm syndrome kicked in?

Artem: I was basically born into the cloud native world. I was a student in the hardware engineering field, working with microcontrollers, FPGAs, and other topics that involve limited memory and how to handle it. Then I joined the cloud native world, often referred to as the 'dark side'. I didn't come from a software engineering background, but rather transitioned directly from being a student into the cloud native world.

Bart: Now you recently changed jobs. In your previous role, you were fighting multi-tenancy. How does that compare to your current role?

Artem: It depends a lot on the customer I'm working with and the challenges they need solved. One current topic I'm working on is the internal developer platform, which is a hot topic: how to build it, whether it's necessary, and what value it would generate. I'm also exploring how to integrate AI into the everyday life of developers to solve various challenges and troubleshoot new infrastructure. We have our own product for this: German companies can use OpenAI, but then they need to integrate a sovereign solution. We try to integrate it into their everyday work. For example, if you have cluster issues, tools like HolmesGPT can help. It needs a backend, so you can ask it for help with your cluster, and it will identify the issues you need to solve. We provide the backend based on your documentation, since ChatGPT doesn't know your infrastructure, but you can provide that context. This is also something I'm working on, and it's really fun because it's fast and the answers are very good.

Bart: Now, as part of our monthly content discovery, we found an article you wrote titled "Black Box vs White Box Observability in Kubernetes". We want to dive into this topic deeper. Let's start with the basics. Observability is a term often heard in the Kubernetes world, but it seems to mean different things to different people, despite being frequently mentioned. How would you define observability and why is it so crucial in modern systems?

Artem: This is a good question, and a tough one to crack. In my opinion, observability is a buzzword that is often used when you don't know exactly what you want to cover, similar to GitOps or DevOps. I try to describe it as the ability to see what happens in your cluster through metrics, understand why it happens through logs and traces, and learn how to prevent it through profiling, which is an advanced topic that many people are not familiar with. Observability also informs you when something goes wrong, which is the alerting part. This is how I explain observability when someone asks me what it means to me. The key points I want to achieve with observability are understanding what's going on in my system and being able to prevent issues. However, many people think that tools like Grafana and Prometheus are enough, but they are not.

Bart: You mentioned Prometheus and Grafana. There is a common perception that setting up these tools is sufficient for observability. Can you explain why this might be an oversimplification?

Artem: If you deploy Grafana and Prometheus, you can only see what's happening in the system. You only have metrics and a default dashboard, or maybe a few default dashboards. So you may see that the CPU utilization of your application was high, but you don't know the cause or the context. You just have a Grafana dashboard with metrics that says CPU utilization was high. I think this does not solve any problems. Because of that, it's an oversimplification to use just Grafana and Prometheus for most use cases. Of course, there are edge cases where they are enough, but for most use cases, you will not create value.
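
To make that gap concrete, here is a minimal sketch of pulling the same "CPU was high" signal straight from the Prometheus HTTP API. The in-cluster URL, namespace, and labels are assumptions; the point is that the numbers arrive without any of the surrounding context.

```python
# Minimal sketch: querying Prometheus directly only tells you *what* happened
# (e.g. CPU was high), not *why*. Assumes a Prometheus server is reachable at
# PROM_URL (placeholder) and that the standard cAdvisor metric labels exist.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address

# Per-pod CPU usage over the last 5 minutes (standard cAdvisor metric name).
query = 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="shop"}[5m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    pod = result["metric"].get("pod", "<unknown>")
    cpu_cores = float(result["value"][1])
    # The number alone gives no context: which request, which code path, which
    # downstream dependency caused it. That context has to come from logs and traces.
    print(f"{pod}: {cpu_cores:.2f} cores")
```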

Bart: In your article, you introduce a three-layer observability model. Could you walk us through these layers and explain why you've structured it in this way?

Artem: I've described the top layer as an external layer because it's where external calls happen. For example, checking whether a website like Google is still available when I call it: this is the external layer for me. Then I described the second layer as internal because it represents the view from inside the application, including logs and metrics. I also mentioned OS-level observability, which focuses on monitoring system calls and kernel-level events and can be achieved with tools like eBPF and Falco. This is crucial for security and deep performance insights for different teams.

When a service owner comes to me and says they want observability, I know they usually mean the accessibility of their application from the outside to fulfill SLAs, which is the external observability layer. When a developer comes to me and says they want observability to see when their application is no longer running properly, I know they often mean they want logs and metrics. This is the internal layer.

If someone from the platform team or security team comes to me and says they want observability to see if unwanted activities are being carried out, I know they mean something at the OS level, such as monitoring system calls. As you can see, everybody comes to me and says they want observability, but I need to adjust how I speak with each group of stakeholders. Establishing this model for myself has saved a lot of time and helps me understand their needs and deliver the right solution.

Bart: Let's take a closer look at the first layer, external observability. What does this layer encompass and why is it considered separate from internal monitoring?

Artem: For me, the focus of this layer is monitoring the availability and performance of application services from the perspective of an external user. For example, when opening a site like Google: is the service reachable from the outside, and how quickly does it respond? Response time is a key metric, so measuring how quickly the application responds to a request is crucial. To cover this, tools like Uptime Kuma can be used. Uptime Kuma is a user-friendly, free tool that offers a wide range of checks, including HTTP, TCP, pinging a site, and resolving a DNS name. Such checks may involve regularly pinging a website, like Google, every few minutes to verify its availability from the outside, measuring response time, and triggering an alert if it's not available or if the certificate has expired. This is what I mean by the external observability layer.
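
As a rough illustration of what a tool like Uptime Kuma automates, the following sketch performs the same kind of external check by hand: probe an endpoint, measure response time, and look at certificate expiry. The URL, thresholds, and "alerting" (just printing) are illustrative assumptions, not a replacement for the tool.

```python
# Hand-rolled external check: availability, response time, and TLS certificate
# expiry from the user's perspective. Thresholds are assumed example values.
import ssl
import socket
import time
from datetime import datetime, timezone

import requests

URL = "https://www.google.com"          # external endpoint to probe
HOST, PORT = "www.google.com", 443
MAX_RESPONSE_SECONDS = 2.0              # assumed SLA threshold
MIN_CERT_DAYS_LEFT = 14                 # assumed renewal warning window

# 1) Availability and response time.
start = time.monotonic()
resp = requests.get(URL, timeout=10)
elapsed = time.monotonic() - start
if resp.status_code >= 400 or elapsed > MAX_RESPONSE_SECONDS:
    print(f"ALERT: {URL} status={resp.status_code} time={elapsed:.2f}s")
else:
    print(f"OK: {URL} responded in {elapsed:.2f}s")

# 2) TLS certificate expiry.
ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
days_left = (not_after - datetime.now(timezone.utc)).days
if days_left < MIN_CERT_DAYS_LEFT:
    print(f"ALERT: certificate for {HOST} expires in {days_left} days")
else:
    print(f"OK: certificate for {HOST} valid for {days_left} more days")
```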

Bart: You mentioned the importance of working with different stakeholders, different teams within an organization, who often have varying interests in observability data. Who are the primary stakeholders for external observability and how does this layer serve their needs?

Artem: That's a good question. I can tell you about the stakeholders I deal with for external observability. Usually, these are SRE teams interested in improving application performance, such as response times. Then there are service product owners who want to understand the overall user experience, particularly how often and for how long users face access issues, in order to fulfill SLAs. I also have a lot of contact with customer support, which includes service desks. They want to stay informed about current outages or performance issues so they can react faster if a customer calls or creates a ticket. The stakeholders I work with most at this level are at IITS Consulting.

Bart: Moving on to the second layer, internal observability seems to be where most traditional monitoring tools operate. Can you elaborate on what this layer covers and the tools typically used, such as Prometheus, Grafana, Loki, and Jaeger?

Artem: I focus on the logs, metrics, and traces generated from within the application and infrastructure, which offer an inside view of how services operate and interact. This is what I define as internal observability. To illustrate this, I can mention some tools. For metrics, Prometheus can be used to gather time series data on resources like CPU and memory usage, enabling teams to spot performance trends. For logs, Loki, as part of the Grafana stack, can be used to collect and query log data, making it easier to identify errors and failures in specific services. For tracing, Jaeger can be used to follow requests across different services, providing insights into dependencies and latency issues. This is how I define the internal observability layer, and these are the tools used for it.

Combining these tools allows you to see, for example, that the CPU is currently high, the logs will show that there are many requests being processed, and then, if you look at the tracing, you can see where packets are being lost. In my opinion, combining all these tools provides a good internal observability stack. If you are smart, you can bring it all together in a Grafana dashboard and see the connections at a glance. For example, when an alert comes in, you only need to look at the dashboard, and it helps you troubleshoot: if the CPU is high, you can see it in the logs and then see where packets are lost. This is how I define internal observability, and I think this is also the layer most people think about when speaking about observability.
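
The sketch below shows what the internal layer looks like from inside an application, assuming the Python prometheus_client library: a counter and a histogram exposed for Prometheus to scrape, plus JSON log lines on stdout that an agent such as Promtail could ship to Loki. Metric names, labels, and the simulated request handler are made up for illustration.

```python
# Minimal "internal layer" instrumentation: Prometheus metrics plus structured
# logs from the same code path, so the metric (what) and the log (why) line up.
import json
import logging
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("shop_requests_total", "Handled requests", ["endpoint", "status"])
LATENCY = Histogram("shop_request_duration_seconds", "Request latency", ["endpoint"])

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("shop")

def handle_checkout() -> None:
    start = time.monotonic()
    status = "200" if random.random() > 0.1 else "500"   # simulated outcome
    duration = time.monotonic() - start + random.random() / 10
    REQUESTS.labels(endpoint="/checkout", status=status).inc()
    LATENCY.labels(endpoint="/checkout").observe(duration)
    # JSON log line: the context that the metric alone cannot give you.
    log.info(json.dumps({"endpoint": "/checkout", "status": status,
                         "duration_s": round(duration, 3)}))

if __name__ == "__main__":
    start_http_server(8000)          # Prometheus scrapes http://<pod>:8000/metrics
    while True:
        handle_checkout()
        time.sleep(1)
```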

Bart: Tracing is a relatively new concept in the observability space. Some argue that it's just glorified structured logging. What's your take on this? Do you think tracing with a dedicated system like Jaeger is necessary if we already have comprehensive logging, perhaps with a tool like Loki?

Artem: To be honest, I don't have much experience in this area, as it's a very new concept for most companies. From my understanding, while tracing and logging both capture details about application events, tracing offers unique insights into the flow of a request across distributed systems. This makes it more than just structured logging. Unlike logging, which records individual events at specific points, tracing follows a single request as it moves to different services, providing a detailed view of latency and dependency in a chain of requests.

For example, in some projects, the response time for services was extremely slow. The slowness could be caused by delays anywhere in the chain, and without tracing, you are potentially blind to these issues. In one instance, we were lucky to find that the checkout process in a shop was the cause of the delay. We discovered this by examining database logs, which showed that the query was inefficiently written. As new data records were added, the total response time in the chain grew. However, if we had been able to trace the request, we would have seen immediately that it was delayed at the database level and that the response time had increased compared to before the data was added.

With tracing, it would be easier to identify the root cause of such issues. I believe tracing makes a lot of sense, but it also requires a lot of practice and expertise to implement correctly in a company.
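
To show what that checkout example could look like with tracing, here is a minimal sketch using the OpenTelemetry Python SDK, with a parent span for the request and a child span for the database query so the slow query stands out. The span names are invented, and the console exporter stands in for a real OTLP/Jaeger backend.

```python
# Minimal tracing sketch: a slow database query shows up as a long child span
# inside the checkout request, without digging through database logs.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# ConsoleSpanExporter keeps this self-contained; a real setup would use an
# OTLP exporter pointed at a collector or Jaeger.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("shop.checkout")

def slow_db_query() -> None:
    # Simulates the inefficient query from the anecdote: the delay sits here,
    # and the child span makes it visible in the trace.
    time.sleep(0.8)

def checkout() -> None:
    with tracer.start_as_current_span("POST /checkout"):
        with tracer.start_as_current_span("db.query SELECT order_items"):
            slow_db_query()
        with tracer.start_as_current_span("payment.authorize"):
            time.sleep(0.1)

if __name__ == "__main__":
    checkout()
    provider.shutdown()   # flush spans to the exporter
```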

Bart: Data volume is a significant challenge in observability. Some experts suggest that the cost of storing and processing observability data can exceed the cost of running the actual application. How do you approach managing this vast amount of data, possibly using tools like Prometheus, Grafana, Loki, or Jaeger?

Artem: I totally agree that collecting metrics and logs costs a lot of money and increases data volume. In many projects, companies collect metrics and logs over weeks and months, which can be costly. What I find interesting is that 95%, sometimes even 99%, of the logs and metrics are collected but never used. I often ask myself: what is the use of metrics from a month ago, and why is the entire stack logged when I need to troubleshoot something and can't find the relevant application logs? If you can't get an answer to a simple question like that, you usually know that you can compress the metrics, or keep them for only two to four weeks and then throw them away. And if people say they only use metrics from the last month or two, you can compress everything older and throw it away, reducing costs.

I have done this twice in projects, separating the application's system logs from its user logs. User logs can be kept for several months due to regulatory requirements, which can be three, four, five, or six months or longer. However, the application's system logs are not needed that long and can be thrown away after less than three months. But you also have to be careful and know the domain. For instance, when I worked for the Hamburg Port Authority, they had sensors measuring water levels, providing metrics used to draw historical graphs over years. If you throw these metrics away, you will have a problem. So you must save them separately and understand your domain.

In most cases, metrics like CPU and RAM utilization can be thrown away because they do not provide significant value. If you ask what the CPU utilization was a month ago, it may not be relevant. But there are cases where you need to keep an eye on certain metrics, such as those used for historic graphs, which can be achieved with tools like Prometheus, Grafana, and Loki.
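
A quick back-of-the-envelope calculation illustrates why retention is the main cost lever; every number in it is an assumed placeholder, not a measurement.

```python
# Back-of-the-envelope sketch of how retention drives storage volume. All
# numbers are illustrative assumptions; ~2 bytes per compressed sample is a
# commonly quoted rough figure for Prometheus, not a guarantee.
SAMPLES_PER_SECOND = 10_000          # active series / scrape interval, assumed
BYTES_PER_SAMPLE = 2                 # rough compressed size, assumed
LOG_GB_PER_DAY = 50                  # assumed log ingest across the cluster

def metrics_gb(days: int) -> float:
    return SAMPLES_PER_SECOND * BYTES_PER_SAMPLE * 86_400 * days / 1e9

for days in (14, 30, 180):
    total = metrics_gb(days) + LOG_GB_PER_DAY * days
    print(f"{days:>3} days retention: ~{metrics_gb(days):6.1f} GB metrics "
          f"+ {LOG_GB_PER_DAY * days:>5} GB logs = ~{total:8.1f} GB")
```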

Bart: Given the complexity and potential costs discussed, some might argue that comprehensive observability is overkill for smaller teams or projects. Do you think there is a minimum viable observability setup, perhaps just Prometheus and Grafana, that would suffice in these cases?

Artem: I completely agree with that statement. The worst thing is when you roll out the entire monitoring stack and nobody uses it, wasting a lot of resources and money. This is also a challenge: every time you implement observability, you will be challenged on it. We chose a label-based approach to address this issue. There are three levels of labels: low, medium, and high. With the low label, we only deploy the Grafana and Prometheus stack, which is sufficient for many projects, teams, and applications. If a team needs more, such as logs and alerting, we have the medium label. When a cluster is labeled as medium, logging via Loki and alerting via Prometheus Alertmanager or other tools will be rolled out. If users need tracing and a deeper understanding, we assign the high label, and Jaeger will be deployed. This approach allows us to scale and dynamically adapt the observability stack using GitOps, with ArgoCD for example. We can deploy the stack based on labels, which is a nice approach that works well. We only provide teams with what they need and can handle from a capability standpoint. If they increase their knowledge, we can assign another label, for example moving from low to medium so they can collect logs. This approach is working very well. I like it, but I am open to learning about better approaches if they exist.
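
The tier logic itself can be summarized in a few lines. The sketch below only illustrates the mapping, with assumed label keys and component names; in practice the selection would live in the GitOps tooling (for example ArgoCD ApplicationSets keyed on cluster labels) rather than in a script.

```python
# Illustration of the low/medium/high tiering logic. Label key and component
# names are assumptions, not the actual pipeline described in the episode.
from typing import Dict, List

STACK_BY_TIER: Dict[str, List[str]] = {
    "low":    ["kube-prometheus-stack"],                        # metrics + dashboards
    "medium": ["kube-prometheus-stack", "loki", "alertmanager-config"],
    "high":   ["kube-prometheus-stack", "loki", "alertmanager-config", "jaeger"],
}

def components_for(cluster_labels: Dict[str, str]) -> List[str]:
    tier = cluster_labels.get("observability", "low")   # hypothetical label key
    return STACK_BY_TIER.get(tier, STACK_BY_TIER["low"])

# Example: a team that has grown into logs and alerting gets the medium tier.
print(components_for({"observability": "medium"}))
# ['kube-prometheus-stack', 'loki', 'alertmanager-config']
```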

Bart: Now, let's move on to the third layer, OS-level observability. Some listeners might be unfamiliar with this, so can you explain what this layer involves and its importance in the overall observability strategy?

Artem: I think it's okay that only a few listeners are familiar with this layer, because most people don't need to know about it. However, it's good to have heard of it. The implementation is usually done by certain stakeholders like the platform team or security teams. This layer is critical because it provides deep insights into security, resource utilization, and performance at the infrastructure level, helping to detect issues like unauthorized access attempts or abnormal system behavior on the host, at the OS layer. One popular tool for OS-level observability that I use is Falco with eBPF, which allows for high-performance monitoring by attaching to kernel events and capturing data without significantly impacting system performance. So you don't have to worry about putting a tool on the host that will slow it down; Falco works very well. The simplest way to explain how it works is to consider a scenario where a shell is opened in a container and writes to the host file system. Most users have probably opened a shell in a container at some point. In this situation, you should ask yourself why a shell is being opened in the container and, more importantly, why it is writing to the file system. Falco will recognize these events and then act based on the rules you have established or deployed. For example, it can simply observe, or it can deny the action. I think most people are not familiar with this layer because they are only users of a platform.
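
As a small illustration of how such events could be consumed downstream, the sketch below reads Falco's JSON output and flags the shell-in-container case. It assumes Falco is running with json_output enabled and writing events to a file, and the rule names follow the default ruleset but may differ between versions.

```python
# Sketch of consuming Falco's JSON event stream. The file path is a hypothetical
# file_output destination; rule names are taken from the default ruleset and
# may differ in your deployment.
import json

EVENTS_FILE = "/var/log/falco/events.json"   # hypothetical file_output path
WATCHED_RULES = {
    "Terminal shell in container",           # interactive shell spawned in a container
    "Write below root",                      # unexpected writes below /
}

def handle(event: dict) -> None:
    rule = event.get("rule", "")
    if rule in WATCHED_RULES:
        # In a real setup this would page someone or feed a response pipeline;
        # here we just surface the event.
        print(f"[{event.get('priority')}] {rule}: {event.get('output')}")

with open(EVENTS_FILE) as fh:
    for line in fh:
        line = line.strip()
        if line:
            handle(json.loads(line))
```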

Bart: OS-level observability seems quite technical. Who are the main stakeholders interested in this layer, and why is it particularly important for their work, considering concepts such as eBPF and tools like Falco for cloud-native runtime security?

Artem: There are usually two stakeholders that I work with. One is a platform engineering team, similar to mine, which is responsible for the platform and aims to monitor its health, performance, and efficiency. This involves analyzing system calls and kernel events to identify resource bottlenecks, optimize performance, and ensure platform stability. We use this approach for our own purposes, and InfoSec teams work towards similar goals. The second stakeholder is the security engineering team, or DevSecOps, which protects the company's data and systems. They utilize OS-level observability, similar to traditional VM setups, to detect security threats by tracking system calls and kernel events. This enables them to quickly identify suspicious activities, such as unauthorized access, unusual network connections, or attempts to escalate privileges, which could indicate potential attacks or vulnerabilities. These two stakeholders are the key players in this area, and they might use tools like Falco for cloud-native runtime security, eBPF (extended Berkeley Packet Filter), or Pod Security Standards for secure configuration.

Bart: With the increasing adoption of serverless and managed Kubernetes services, some might argue that OS-level observability is becoming less relevant. What is your perspective on this?

Artem: It would be nice if a managed Kubernetes solution could cover all of this. As you already said, this is a specialized topic and requires a lot of skill to implement. If you had a managed Kubernetes solution where you could just click and it works out of the box, that would be ideal. However, in reality, you will need a team to establish a usable observability stack like this. Having a serverless or managed solution that only requires providing an endpoint for alerting would make it easier: if something happens, you receive an alert, maybe with a classification of how urgently you should act and a base case of what you should do. But in my projects, the reality is different. You have to decide for yourself and for the company what is important. Because of that, I think it's still relevant. With different managed solutions, maybe in two or three years, a more streamlined way will be possible. However, you also have to consider customers' customized infrastructure, and a solution like that would need a way to integrate custom regulations. For example, the HPA (Hamburg Port Authority) is a critical infrastructure provider and has its own regulations, and other companies may have different focuses. You need an API to integrate their needs, which is not easy to do.

Bart: We've covered a lot of ground in this conversation regarding the layered approach to observability. As we wrap up, could you summarize why you believe this model particularly benefits Kubernetes environments?

Artem: In my opinion, the biggest advantage is that you can talk to different stakeholders and understand their needs better when they discuss observability. This helps developers or platform teams identify gaps in their understanding and recognize what they may not have considered. It assists me in my daily work, and I believe it could help others as well. I think it's a useful approach to work with. Although I presented it as a model, I don't really think of it that way; I simply try to explain things in a way that's easy for different stakeholders to understand.

Bart: Finally, what advice would you give to those looking to implement or improve their observability strategy? If you were starting from scratch, for example, what tools or stack would you consider for instrumenting and observing a cluster and its applications, including tools like Prometheus, Grafana, Loki, and Jaeger?

Artem: The first thing to do when implementing a solid observability strategy for a new customer is to clearly define the observability and business goals and try to match them. This involves determining what needs to be monitored and who needs the data. Starting with a tool stack that covers all three layers at once can be difficult. A more feasible approach is to begin with the internal layer and roll out the Kube Prometheus Stack, as it brings a lot of value. After that, tools like Uptime Kuma or Checkly can be used to add external visibility; the Kube Prometheus Stack also provides black box monitoring. Additionally, tools like Falco can be used to cover the OS level, and tracing with Jaeger can be considered. The key is not to try to solve everything at once, as it takes time and skill to deal with each layer properly. Successfully setting up observability tools does not necessarily equal value generated for the company. Therefore, it is best to start small and increase the scope over time to avoid being overburdened.

Bart: The last time you joined the podcast, you were in the process of writing a book, Implementing GitOps with Kubernetes. Tell us about it.

Artem: It was completed and published a month ago. The book is called Implementing GitOps with Kubernetes, written by Pietro and me, and it covers topics like the one we discussed today. In addition to the many hands-on activities in the book, we focus strongly on approaches, because tech stacks come and go and are deprecated faster than a book can be published. Similar to clean code, we use programming languages to demonstrate how the approaches work, especially with GitOps. By combining these, we hope to help readers understand the concepts.

Bart: What's next for you?

Artem: I try to contribute more to open source projects by looking at tools, improving them from the user's point of view, and then promoting them via blogs. I believe open source contributions deserve more recognition; projects suffer a lot from big companies that use them without paying for the contributions. It's a difficult field if nobody pays for the work. I try to continue helping these projects and give them some visibility.

Bart: Very good. Last but not least, how can people get in touch with you?

Artem: The easiest way is a direct message over LinkedIn. It's still the preferred method.

Bart: That worked in our case, so I can speak to it from personal experience. Artem, thank you for joining us again on KubeFM. I really liked hearing your insights. As we mentioned at the beginning, observability means different things to different people. I enjoyed taking a look at this and understanding more about the stakeholders. Keep up the amazing work. We'll be in touch. Take care.

Artem: Thank you for having me. Have a nice day, Bart. Bye.

Bart: Cheers. You too.