Kubernetes autoscaling strategies: from production risks to AI

Dec 8, 2025

Guest:

  • Zbyněk Roubalík

Explore the intersection of auto-scaling, AI workloads, and the future of Kubernetes with insights from a KEDA project maintainer.

In this interview, Zbyněk Roubalík, CTO and Founder at Kedify, discusses:

  • How to properly tune auto-scaling at both cluster and application levels

  • Why Kubernetes is becoming the platform of choice for AI/ML applications, leveraging recent improvements in Dynamic Resource Allocation (DRA) for GPU scheduling

  • Future developments in AI-based scaling that could dynamically adjust workloads using multiple metrics rather than strict schedules

Transcription

Bart: So who are you, what's your role, and where do you work?

Zbyněk: Hi, my name is Zbyněk. I'm a maintainer of the KEDA Project and also the CTO and founder of Kedify, the enterprise version of KEDA.

Bart: What are three emerging Kubernetes tools that you are keeping an eye on?

Zbyněk: I like vLLM, KEDA, and OpenTelemetry.

Bart: One of our podcast guests, Thibault, believes that autoscaling is great because it provides the relevant trade-offs, allowing you to fine-tune for good stability, efficiency, and response time. What's your advice on auto-scaling Kubernetes clusters?

Zbyněk: If auto-scaling is done right, you can save a lot of resources and money while also improving performance. To autoscale effectively, you need to configure scaling carefully at both the cluster level (how nodes are scaled out) and the application level. That means using the right metrics to scale your applications, not just CPU and memory.
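
To make the application-level part concrete, here is a minimal sketch of a KEDA ScaledObject that scales on request rate instead of CPU; the Deployment name, Prometheus address, query, and threshold are placeholder assumptions, not values from the interview:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: my-app-scaler               # hypothetical name
    spec:
      scaleTargetRef:
        name: my-app                    # hypothetical Deployment to scale
      minReplicaCount: 1
      maxReplicaCount: 20
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring.svc:9090     # assumed Prometheus endpoint
            query: sum(rate(http_requests_total{app="my-app"}[2m]))  # requests per second across pods
            threshold: "100"            # target requests per second per replica

At the cluster level, a node autoscaler such as Cluster Autoscaler or Karpenter then adds or removes nodes as the resulting pod count changes.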

Bart: Our guest also warned that auto-scaling needs to be finely tuned. If it starts to fail, you'll probably experience significant issues. How do you approach the risk of auto-scaling in production?

Zbyněk: You need to tune auto-scaling properly, constantly checking metrics and observability tools to see how the application performs. Additionally, you should ensure that when auto-scaling your applications, you don't cause problems in downstream services. Imagine you configure auto-scaling, and while front-end applications handle a sudden surge of traffic well, the back-end—such as the database or other services—may suffer. Therefore, make sure to test all pieces of the architecture, including indirect components.
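
One way to keep a sudden surge from flooding the database or other downstream services is to cap how quickly the autoscaler may add replicas. KEDA exposes the underlying HPA scaling behavior for this; the limits below are illustrative assumptions, not recommendations from the interview:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: my-app-scaler
    spec:
      scaleTargetRef:
        name: my-app                    # hypothetical Deployment
      advanced:
        horizontalPodAutoscalerConfig:
          behavior:
            scaleUp:
              policies:
                - type: Pods
                  value: 4              # add at most 4 pods...
                  periodSeconds: 60     # ...per minute, so back-end services can keep up
            scaleDown:
              stabilizationWindowSeconds: 300   # wait 5 minutes of low load before removing pods
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring.svc:9090     # assumed Prometheus endpoint
            query: sum(rate(http_requests_total{app="my-app"}[2m]))
            threshold: "100"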

Bart: Our guest John McBride expressed that Kubernetes is a platform of the future for AI and ML, particularly for scaling GPU compute. Do you agree with this assessment, and what challenges do you see in running AI workloads on Kubernetes?

Zbyněk: One of the reasons why Kubernetes is the platform of the future for AI workloads is that we are already running standard non-AI workloads on Kubernetes. We would like to run AI workloads on the same infrastructure, so we don't need to manage multiple different environments.

The first step is that Kubernetes provides flexibility. With recent improvements in Dynamic Resource Allocation (DRA) and other features, it allows scheduling the right amount of resources and placing applications on particular GPU instances. That demand needs to be properly captured through GPU metrics and the incoming requests to your models.
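
As a sketch of what that looks like in practice, a workload can request a GPU through DRA with a ResourceClaimTemplate and a device class rather than a fixed extended resource. The API is still maturing, so the group/version and exact fields depend on the Kubernetes release (this follows the v1beta1 shape), and the device class and image names are placeholders:

    apiVersion: resource.k8s.io/v1beta1       # DRA API; version varies with the cluster release
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        devices:
          requests:
            - name: gpu
              deviceClassName: gpu.example.com   # assumed class published by the GPU driver
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: model-server
    spec:
      containers:
        - name: inference
          image: registry.example.com/model-server:latest   # placeholder image
          resources:
            claims:
              - name: gpu               # consume the claim allocated for this pod
      resourceClaims:
        - name: gpu
          resourceClaimTemplateName: single-gpu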

Bart: Kubernetes turned 10 last year. What should we expect in the next 10 years?

Zbyněk: I expect that we will have a Kubernetes that nobody will need to manage—it will be some AI-driven system. No, I'm kidding. I think there are exciting things, especially around AI workloads. We might see possibilities we couldn't imagine a couple of years ago, given the maturity and stability Kubernetes already provides.

What's next for us? We would like to improve KEDA even more, build more enterprise features, and have satisfied users. With the AI hype, we currently have predictive scaling, but we want to build AI-based scaling. Imagine feeding various metrics to an AI that can help us dynamically scale workloads, not just based on a strict schedule.
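
For a sense of what "more than a strict schedule" looks like with KEDA today, one ScaledObject can combine several triggers; each is evaluated continuously and the workload scales to whichever demands the most replicas, so a schedule becomes a floor rather than the only signal. The times, query, and numbers below are invented for illustration:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: my-app-scaler
    spec:
      scaleTargetRef:
        name: my-app                    # hypothetical Deployment
      triggers:
        # Baseline capacity for a known busy window (the strict schedule).
        - type: cron
          metadata:
            timezone: Europe/Prague
            start: 0 8 * * *            # from 08:00...
            end: 0 18 * * *             # ...until 18:00 every day
            desiredReplicas: "10"
        # Live demand signal that can push above the schedule at any time.
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring.svc:9090     # assumed Prometheus endpoint
            query: sum(rate(http_requests_total{app="my-app"}[2m]))
            threshold: "100"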

Bart: And how can people get in touch with you?

Zbyněk: Reach out to me on LinkedIn, or via the channels on the Kubernetes and CNCF Slack workspaces.
