Effective Kubernetes troubleshooting: from telemetry to team dynamics
This interview discusses advanced strategies for debugging, team management, and testing in modern cloud-native environments.
In this interview, Adnan Rahić, Staff Developer Advocate at Tracetest, discusses:
The importance of quality telemetry in troubleshooting, emphasizing OpenTelemetry as the gold standard for tracing, logging, and metrics in complex systems.
Effective team structure for Kubernetes environments, advocating for specialized roles within platform teams to enhance developer experience and system reliability.
The need for improved observability in testing, particularly for distributed and event-driven architectures, to reduce time-to-resolution for test failures.
Transcription
Bart: Alex spent several weeks troubleshooting an issue with Kubernetes, which required the team to explore the kernel code. He stressed the importance of learning while troubleshooting. Is there any practical debugging advice you have picked up over the years?
Adnan: First, you need a quality telemetry setup so you know exactly what's happening within your system. OpenTelemetry is the gold standard for this, and having an agreed standard, with quality telemetry properly instrumented in your code, will significantly help with after-the-fact troubleshooting. Whenever you're paged or need to fix something quickly, you need detailed data on what went wrong. OpenTelemetry is the best tool for this, primarily because of its tracing, which is the best on the market today. With the recent addition of logs and metrics support, particularly log support, which has reached a stable state, it's now a one-size-fits-all tool that doesn't cut corners.
Having a set process within your team is also crucial. This includes tying issues to specific teams and having a pipeline that assigns tasks to specific developers. This approach eliminates the blame game, where team members argue about who is responsible for an issue. With a set process and a way to establish what went wrong, how it went wrong, and who should fix it, you can resolve issues quickly.
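As a rough illustration of tying issues to teams, the sketch below routes an incident to an owning team through a simple lookup. Everything in it is an assumption made for illustration: the service names, team names, the `Incident` fields, and the fallback team are hypothetical, and a real setup would usually live in an on-call or ticketing tool rather than in application code.

```python
from dataclasses import dataclass

# Hypothetical ownership map: service name -> owning team.
# All names here are invented for illustration.
OWNERS = {
    "checkout": "payments-team",
    "auth": "identity-team",
}

# Assumed catch-all so that no incident is left without an owner.
FALLBACK_TEAM = "platform-on-call"


@dataclass
class Incident:
    service: str   # service that raised the alert
    summary: str   # short human-readable description
    trace_id: str  # pointer back to the telemetry showing what went wrong


def route(incident: Incident) -> str:
    """Return the team that should pick up the incident."""
    return OWNERS.get(incident.service, FALLBACK_TEAM)


if __name__ == "__main__":
    incident = Incident("checkout", "p99 latency spike", "hypothetical-trace-id")
    print(route(incident))
```

Keeping the mapping explicit, with a fallback owner, is one way to make sure every alert lands somewhere without a debate about responsibility.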
Bart: Bob Wise has been with Kubernetes since the beginning, and he mentions that one of the biggest challenges with Kubernetes is how people are managed. What do you think about that?
Adnan: I would say the structure of a team has to be laid out so that everybody who specializes in a certain area can stay highly focused on that area. Let's say you have a platform team. Usually, platform teams have SREs, developers, and QAs, a mix of everything. The team itself isn't focused on the product, but rather on developer experience, reliability, performance, and the quality of the code and features the product team can generate. I think that if you can specialize within the team itself, where one part of the team works only on performance, that part can set up profiling tools and observability tools for that purpose. Then you have other parts of the team that focus only on cost, making sure the infrastructure cost is as low as possible. Having a clear structure in your team, knowing who's best at what, allows for further specialization and makes it easier for everybody to collaborate.
Bart: What's the biggest mistake that people are making regarding testing?
Adnan: The biggest mistake people are making is that they don't have enough insight into what went wrong in the test. Usually all they get is a test failure, and then they have to figure out the rest. This needs to change. We need a quicker time to resolution for what exactly went wrong within the test, because we're not testing monoliths anymore. We have microservices, serverless, cloud-native, and a bunch of different architectures that are highly distributed and event-driven. You never really know what went wrong; you just know something broke, and then you're spending hours or even days trying to figure out why it happened. So, we need to get better observability within the test results themselves.
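To make the idea of observability inside test results concrete, here is a minimal Python sketch that records a timeline of named steps during a test, so a failure can report which step broke instead of only that something broke. It is a toy stand-in built on the standard library, not the OpenTelemetry or Tracetest API; `StepRecorder`, `run_checkout_test`, and the step names are hypothetical, and the failure is simulated.

```python
import time
from contextlib import contextmanager


class StepRecorder:
    """Collects a timeline of named steps so a failing test can show where it broke."""

    def __init__(self):
        self.steps = []  # list of (name, status, elapsed_ms) tuples

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception as exc:
            # Record the error on the step, then let the test fail as usual.
            status = f"error: {exc!r}"
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.steps.append((name, status, elapsed_ms))

    def report(self):
        """Render the recorded steps as one line per step."""
        return "\n".join(
            f"{name}: {status} ({ms:.1f} ms)" for name, status, ms in self.steps
        )


def run_checkout_test(recorder):
    """Hypothetical event-driven test with two steps; the second one fails."""
    with recorder.step("publish order event"):
        pass  # stand-in for producing a message
    with recorder.step("consume order event"):
        raise TimeoutError("no message received")  # simulated failure


if __name__ == "__main__":
    recorder = StepRecorder()
    try:
        run_checkout_test(recorder)
    except TimeoutError:
        # The report names the failing step, not just the fact of a failure.
        print(recorder.report())
```

In a real distributed system the step timeline would come from traces emitted by the services under test, but the principle is the same: the test result carries enough context to point at the failing hop.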