The evolution of Kubernetes: from tooling to troubleshooting

Guest:

Itiel Shwartz

In this interview, Itiel Shwartz, CTO of Komodor, discusses:

The rise of new Kubernetes tooling trends, including GenAI solutions, Backstage for platform engineering, and running databases on Kubernetes.
Why troubleshooting Kubernetes requires understanding root causes rather than quick fixes, especially with complex issues like CoreDNS and networking problems.
How Kubernetes' future might mirror Linux's evolution, becoming a foundation layer with more abstracted interfaces for developers.

Transcription

Bart: Who are you? What's your role? And who do you work for?

Itiel: I'm Itiel Shwartz, the CTO of Komodor.

Bart: Which three Kubernetes emerging tools are you keeping an eye on?

Itiel: I'll mention three main tools. Firstly, GenAI and Kubernetes. It's not really a tool, but a set of tools. I'm seeing a lot more adoption of GenAI to solve Kubernetes problems, and there's a very popular open-source project dedicated to that. Secondly, Backstage, which I think has already won the fight. It's the most popular open-source project on top of Kubernetes, and we see all of our customers trying to adopt Backstage or one of the hosted solutions. Thirdly, databases on Kubernetes: I'm seeing much more RDS, MongoDB, and Apache ZooKeeper running on top of Kubernetes, and I think that's where things are heading. And, obviously, there's Komodor, where we help make Kubernetes simple.

Bart: These questions are from podcasts that we've done. This first one is about availability and platform engineering. One of our guests, Hans, compared delivering software now to 20 years ago. He mentioned that while downtime was acceptable in the past, it isn't today. Hence, building platforms on top of Kubernetes requires more tooling than ever. Is it possible to keep tools from sprawling out of control? What kind of tools are essential for building mission-critical platforms?

Itiel: That's a great question. It is possible to keep things under control, but it's super hard. The easiest thing to do when you have a problem in Kubernetes is to add another plugin, add-on, or something that will magically solve your problems. In the end, you're left with so many different moving pieces that you don't really know what's happening. You need a very strict team leading the vision of the product and the technology to ensure that while solving problems, you're not digging yourself a bigger hole. I will say that it is now easier to run on top of Kubernetes in a low-latency environment with little tolerance for faults. Kubernetes is more stable, and it is easier, but still tricky.

Bart: Troubleshooting tips and the learning path. One of our guests, Alex, spent several weeks troubleshooting an issue with Kubernetes, which required the team to explore the kernel code. He stressed the importance of learning while troubleshooting. Is there any practical advice you've learned over the years regarding debugging?

Itiel: Debugging kernel code is not a common troubleshooting experience for most people. The best recommendation I can give is to learn from past incidents and understand not only the issue itself, but also how things worked before the issue arose. One of the most common issues in Kubernetes is networking problems. People often solve these issues by making small changes to their nginx configuration without fully understanding the scale and impact of how networking works in Kubernetes and what caused the problem they're experiencing. Learn from past incidents and understand why they happened, rather than just solving the immediate problem.

Bart: Monitoring in CoreDNS. One of our guests, Faris, described an incident where CoreDNS pods were not scaling, degrading connectivity for the rest of the cluster. They had monitoring set up in the cluster, but this didn't prevent the issue. With so many metrics you could observe, how do you make a mental model for what's really important to measure?

Itiel: I'll admit I'm a bit biased: Komodor does help with managing core services. When it comes to Kubernetes, you need to manage two main areas: the infrastructure itself, including nodes and clusters, and what we call core services, such as CoreDNS, the API server, and kube-proxy. If you're running Istio, that counts as well, along with any other service that has an impact on your whole system, even if it appears to be an application. In reality, these are infrastructure code in disguise. Make sure to monitor them with the same level of scrutiny that you apply to your infrastructure.
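
As a rough illustration of treating core services as infrastructure, the sketch below flags any core service that has fewer ready replicas than desired. It is not Komodor's method: the CORE_SERVICES set, the Workload type, and the sample numbers are all hypothetical, and a real check would read this state from the cluster rather than from hard-coded values.

```python
from dataclasses import dataclass

# Hypothetical list of workloads treated as core services; adjust per cluster.
CORE_SERVICES = {"coredns", "kube-apiserver", "kube-proxy", "istiod"}


@dataclass
class Workload:
    name: str
    namespace: str
    desired: int  # replicas requested
    ready: int    # replicas currently ready


def core_service_alerts(workloads):
    """Return one alert string per core service that is not fully ready."""
    alerts = []
    for w in workloads:
        if w.name in CORE_SERVICES and w.ready < w.desired:
            alerts.append(f"{w.namespace}/{w.name}: {w.ready}/{w.desired} ready")
    return alerts


# Sample data (made up): CoreDNS is degraded, the application is ignored here.
sample = [
    Workload("coredns", "kube-system", desired=3, ready=1),
    Workload("kube-proxy", "kube-system", desired=5, ready=5),
    Workload("checkout", "shop", desired=4, ready=2),
]
print(core_service_alerts(sample))
```

The application workload is skipped on purpose: the point is only that core services get their own, stricter check instead of being lost among application alerts.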

Bart: Kubernetes turned 10 years old this year. What should we expect in the next 10 years to come?

Itiel: Much more Kubernetes. I think Kubernetes on edge and Kubernetes in retail are becoming increasingly popular. In 10 years, I'll be honest, I think Kubernetes by itself is going to be a bit less visible. It's becoming like Linux, basically, which means someone will build something on top of Kubernetes to make life easier. I don't think in 10 years we will interact directly with the API server, even if it is Kubernetes that is running all of those things.

Bart: Right now, what would you say is your least favorite Kubernetes feature?

Itiel: It's hard to choose only one, but I'll go with the lack of history. Kubernetes is super ephemeral, and once issues happen, it's very hard to trace back and understand what happened and why. You spend a lot of time monitoring, but Kubernetes isn't really troubleshooting-focused at its core. I think that's one of its main problems.
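
One way to picture the missing history: Kubernetes events are only retained for a short time (one hour by default), so keeping a timeline means copying them somewhere else as they happen. The sketch below is a minimal in-memory version under that assumption. EventRecord and EventHistory are hypothetical names, not part of Kubernetes or Komodor, and a real setup would write to durable storage.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EventRecord:
    timestamp: str  # ISO 8601 in UTC, e.g. "2024-06-01T10:00:00Z"
    kind: str
    namespace: str
    name: str
    reason: str
    message: str


class EventHistory:
    """Keeps every recorded event, grouped by the object it refers to."""

    def __init__(self):
        self._by_object = defaultdict(list)

    def record(self, event):
        key = (event.kind, event.namespace, event.name)
        self._by_object[key].append(event)

    def timeline(self, kind, namespace, name):
        """Return the events for one object, oldest first."""
        events = self._by_object.get((kind, namespace, name), [])
        # Timestamps in the same UTC ISO 8601 format sort correctly as strings.
        return sorted(events, key=lambda e: e.timestamp)

    def to_jsonl(self):
        """Serialize all events as JSON lines, ready to ship to storage."""
        lines = []
        for events in self._by_object.values():
            for e in events:
                lines.append(json.dumps(asdict(e), sort_keys=True))
        return "\n".join(lines)
```

With something like this in place, the question "what happened to this pod before it failed?" becomes a lookup by object rather than a race against event expiry.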

Bart: What's next for you?

Itiel: More Komodor. What we do is simplify Kubernetes at scale. Luckily for us, Kubernetes is becoming more popular and more complex by the day, so we keep trying to find the places where we can empower enterprises to get more from Kubernetes.

Bart: How can people get in touch with you?

Itiel: Over LinkedIn, email, Slack, the Komodor Slack channel, or pull requests on our open-source projects. I'm very available.

Podcast episodes mentioned in this interview

Troubleshooting a validation webhook all the way down to the kernel
with Alex Movergan
CoreDNS will fail you at scale (with default settings)
with Mohamed Hamdan Faris S M
Platform engineering: learning from the Kubernetes API
with Sven Hans Knecht