The future of Kubernetes: from container orchestration to AI platform

Guest:

Itiel Shwartz

Kubernetes evolution, AI workloads, and the future of container orchestration take center stage in this technical discussion.

In this interview, Itiel Shwartz, CTO and co-founder of Komodor, discusses:

Emerging Kubernetes trends including GPU integration, eBPF for cluster communication, and Cluster API.
Challenges of running AI workloads on Kubernetes, particularly the difficulty in troubleshooting and optimizing expensive GPU resources.
Komodor's approach to Kubernetes troubleshooting using advanced root cause analysis with agentic AI models, addressing day two operations with the philosophy to "start small."

Transcription

Bart: Who are you, what's your role, and where do you work?

Itiel: I'm Itiel Shwartz, CTO and co-founder of Komodor.

Bart: Which emerging Kubernetes tools are you keeping an eye on?

Itiel: The first trend I'll highlight is GPUs on top of Kubernetes, which is one of the biggest trends we see across the industry. The second is eBPF, mainly for cluster-to-cluster communication: that has been a huge pain point, and the technology is now gaining popularity. And for the third I will go for Cluster API, or clusters as cattle, a movement that is starting to take off.

Bart: Do you agree with John McBride's assessment that Kubernetes is the platform of the future for AI and ML, particularly for scaling GPU computing? What challenges do you see in running AI workloads on Kubernetes?

Itiel: I think the industry is going to train everything and do everything on top of Kubernetes. The biggest challenge is Kubernetes troubleshooting and optimization, which is super tricky. People are purchasing GPUs from AWS or physical hardware that cost tens and hundreds of thousands of dollars, but they are unable to utilize them because of various issues and failures. We need to make sure that this very valuable resource—the GPU—is utilized to its maximum potential, which is currently not the case.

Bart: One of our guests said, "I see Kubernetes as a platform, not just a container orchestration tool." What's your perspective on Kubernetes evolution beyond container orchestration?

Itiel: Container orchestration is the base that allows us to build platforms on top of Kubernetes. Looking at the future, five years from now, I think Kubernetes is going to become so dominant that it's not really going to be about Kubernetes itself, much like how Kubernetes is not prominent on top of Linux. The focus will be on giving developers the best platform to achieve business logic. As Kubernetes becomes more successful, it will become less obvious and less prominent, similar to how Linux has evolved.

Bart: One of our guests stated that if you deploy Grafana and Prometheus, they only allow you to see what's happening in the system, and that this alone does not solve any problems. What do you consider a complete observability solution beyond Grafana and Prometheus?

Itiel: Grafana and Prometheus are one part of it. APM is another one. Error tracking is another one. The code in Git and GitOps is usually where everything originates. To have a really good experience when it comes to troubleshooting and solving issues, you need to be in all of those different layers (from the source code to Jenkins or GitHub Actions, Argo CD, the application itself, Datadog, Prometheus) and then use a tool to turn all of these data points into one coherent story. Being in all of the different layers is the only way.
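Editor's illustration: the idea of turning data points from several layers into one story can be sketched, very roughly, as merging timestamped events from each layer into a single timeline. This is not Komodor's implementation. The `Event` type, the `build_timeline` helper, the source names, and the sample events are all hypothetical and made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
import heapq


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # hypothetical layer name, e.g. "git", "argo-cd", "prometheus"
    message: str


def build_timeline(*streams):
    """Merge per-source event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


# Made-up sample events, one list per layer.
git_events = [Event(datetime(2024, 1, 1, 10, 0), "git", "merged change to memory limits")]
deploy_events = [Event(datetime(2024, 1, 1, 10, 5), "argo-cd", "synced deployment")]
metric_events = [Event(datetime(2024, 1, 1, 10, 7), "prometheus", "alert: pod restarts rising")]

timeline = build_timeline(metric_events, git_events, deploy_events)
for e in timeline:
    print(f"{e.timestamp:%H:%M} [{e.source}] {e.message}")
```

Read top to bottom, the merged timeline places the code change before the deployment and the deployment before the alert, which is the kind of ordering someone troubleshooting would look for.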

Bart: Kubernetes turned 10 years old last year. What should we expect in the next 10 years?

Itiel: My prediction is everyone is going to move to Kubernetes. I've been saying this for the last five years, and it keeps happening. Stateful applications, GPUs, everything will be on top of Kubernetes. In 10 years, Kubernetes will be very similar to Linux in nature—extremely popular, with everyone running on top of it. But it will be much more mature. The question will be: What will be developed on top of Kubernetes? What is the next abstraction maybe on top of Kubernetes?

Bart: Itiel, I want to do a bit of an experiment, something we've actually never done on KubeFM. Can you say the most technically nerdy thing possible in 30 seconds?

Itiel: Komodor is currently developing RCA, or root cause analysis, for Kubernetes. We're using the agentic approach with AWS Bedrock and Claude 3.5 by Anthropic. What's interesting is that our RAG is more than a normal RAG: it uses agents that can call other agents as part of the process. The goal is to take a generative model running in the cloud that calls other generative models to work out the full root cause analysis, helping our users mimic what a good SRE would do when solving an issue.
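Editor's illustration: the "agents that can call other agents" pattern can be sketched as a top-level function that delegates to specialised functions and combines what they return. This is a toy under stated assumptions, not Komodor's system. No model or Bedrock API is called; plain Python functions stand in for model calls, and every agent name and finding below is hypothetical.

```python
from typing import Callable, Dict, List

# An "agent" here is just a function from a question to a list of findings.
Agent = Callable[[str], List[str]]


def pod_agent(question: str) -> List[str]:
    # Stand-in for a model call that inspects pod state (made-up finding).
    return ["pod restarted 5 times (OOMKilled)"]


def deploy_agent(question: str) -> List[str]:
    # Stand-in for a model call that inspects recent rollouts (made-up finding).
    return ["memory limit lowered in the last rollout"]


# Hypothetical registry of sub-agents the top-level agent may call.
SUB_AGENTS: Dict[str, Agent] = {"pods": pod_agent, "deploys": deploy_agent}


def root_cause_agent(question: str) -> str:
    """Top-level agent: delegates to each sub-agent, then combines their findings."""
    findings = []
    for name, agent in SUB_AGENTS.items():
        for finding in agent(question):
            findings.append(f"{name}: {finding}")
    return f"Findings for '{question}':\n" + "\n".join(findings)


report = root_cause_agent("Why is the checkout service failing?")
print(report)
```

In a real agentic setup the top-level agent would typically decide which sub-agents to call and in what order; here it calls all of them so the example stays deterministic.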

Bart: What's next for you?

Itiel: What's next for me? Komodor. Komodor and Kubernetes. We spent the last five years building the best tool for Kubernetes troubleshooting. We expanded to day two Kubernetes management. Going forward, Kubernetes will become more complex, and a tool like ours that helps to simplify will become even more dominant. The future for me is more Kubernetes, more issues, and more customers.

Bart: You mentioned Kubernetes and stateful workloads earlier, as they relate to AI and ML. I got to know your company when I was running the Data on Kubernetes community back in 2020-21. With that in mind, you said that Komodor is now tackling these day two challenges. How has that process been? What recommendations would you give companies facing those day two problems?

Itiel: The most important thing is to start small: start with something where it is very easy to tell whether you got it right, and only then build on top of that. I think a lot of companies try to tackle a huge task as their first task, and they simply fail again and again. Start small, and make sure that the first iteration actually works. Building something with AI is super easy. Building something good and reliable is super hard, and you need to understand that.

Bart: How can people get in touch with you?

Itiel: LinkedIn, Twitter, GitHub, phone call—I'm available and happy to talk about Kubernetes.

Podcast episodes mentioned in this interview

Black box vs white box observability in Kubernetes
with Artem Lajko
Simplifying Kubernetes deployments with a unified Helm chart
with Calin Florescu
Saving 10s of thousands of dollars deploying AI at scale with Kubernetes
with John McBride