Building Smarter Kubernetes Clusters with Event-Driven Autoscaling

Oct 22, 2025

Guest:

  • Zbyněk Roubalík

In this interview, Zbyněk Roubalík, CTO at Kedify and founding maintainer of Project KEDA, discusses:

  • Why traditional HPA falls short - CPU and memory metrics work well for stable workloads but fail for workloads with variable traffic, such as HTTP services

  • The shift toward predictive autoscaling - Moving beyond reactive scaling to anticipate incoming demand using trend analysis of workload metrics

  • AI's impact on Kubernetes infrastructure - How frameworks like vLLM are enabling unified AI model serving on Kubernetes, and the need for specialized autoscaling approaches for AI workloads

Transcription

Bart: So, who are you? What's your role? Where do you work?

Zbyněk: My name is Zbyněk. I'm the CTO at Kedify, a company that specializes in Kubernetes autoscaling. I'm also a founding maintainer of Project KEDA.

Bart: Now, what are three emerging Kubernetes tools that you are keeping an eye on?

Zbyněk: I will start with KEDA, the Kubernetes Event-Driven Autoscaler. Over the past four years it has already solved many use cases for users and customers. However, there are still areas to be explored, such as storage scaling, stateful workload scaling, and AI workload scaling. There are many possibilities and improvements we can still make. It's my number one favorite project in the Kubernetes ecosystem, so definitely check it out if you haven't heard about it.
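
For readers new to KEDA, here is a minimal sketch of what event-driven scaling looks like in practice. The Deployment name, queue name, and connection string are hypothetical placeholders; the ScaledObject resource and the RabbitMQ trigger type come from KEDA's publicly documented API.

```yaml
# Minimal KEDA ScaledObject: scale a worker Deployment on queue depth
# instead of CPU. All names and the connection string are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker-scaler
spec:
  scaleTargetRef:
    name: orders-worker        # Deployment to scale
  minReplicaCount: 0           # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: orders
        mode: QueueLength      # target backlog per replica
        value: "20"
        host: amqp://guest:guest@rabbitmq.default.svc.cluster.local:5672/
```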

The second project I would highlight is OpenTelemetry. Even though it might not be emerging, I like the direction it is going because it helps unify infrastructure in an open way. You have metrics coming from all places, and I believe there is significant potential. It can also be connected to autoscaling: we can expose OpenTelemetry metrics about autoscaling, providing observability and understanding of the system. Moreover, we can use OpenTelemetry metrics to drive autoscaling itself. I usually try to convince people not to use Prometheus metrics for autoscaling, but instead use OpenTelemetry metrics due to some downsides with Prometheus.

The last tool I would mention is the vLLM project. While not directly a Kubernetes project, it's a framework for running AI models on Kubernetes, a unified approach to serving models. Many projects are emerging that build on this framework for higher-level model serving. We also want to incorporate autoscaling, because scaling out AI model serving is crucial and the two are closely connected.

Bart: Now we're going to get into some questions based on topics mentioned in our podcast. Our guest, Thibault, believes that autoscaling is great because it provides relevant trade-offs, allowing you to fine-tune for good stability, efficiency, and response time. What's your advice on autoscaling Kubernetes clusters?

Zbyněk: Configuring autoscaling on Kubernetes is important because, if done well, you benefit from both a cost and a performance perspective. I see many people simply using Karpenter with HPA, but this is not enough. To properly utilize Kubernetes and its ecosystem, you should configure autoscaling at both the cluster level (using Karpenter or the cluster autoscaler) and the application level, since applications create the pressure that drives the cluster autoscaler's scaling decisions. It's crucial to spend time configuring your workloads and to continuously fine-tune the settings so they autoscale properly.
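
As an illustration of what that application-level tuning can involve, here is a hedged sketch of a KEDA ScaledObject that sets replica bounds, a target utilization, and scale-down behavior. The workload name and the chosen numbers are hypothetical; the fields themselves (minReplicaCount, maxReplicaCount, the CPU trigger, and the advanced HPA behavior passthrough) are part of KEDA's documented ScaledObject spec.

```yaml
# Sketch of application-level tuning on a KEDA ScaledObject.
# Names and numbers are placeholders to be adjusted per workload.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  scaleTargetRef:
    name: api                      # Deployment to scale
  minReplicaCount: 2               # keep a warm baseline for latency-sensitive traffic
  maxReplicaCount: 50
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:                    # standard HPA scaling behavior, passed through by KEDA
        scaleDown:
          stabilizationWindowSeconds: 120   # dampen flapping on noisy metrics
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "60"                # target average CPU utilization (%)
```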

Bart: Jorrick discovered that they needed a unified metric, such as CPU or RAM, to compare node resources and trigger scaling decisions. What's your advice when it comes to scaling Kubernetes clusters?

Zbyněk: At the moment, the cluster autoscaler and Karpenter are well suited to scaling based on demand from the pod autoscaler. As new pods are added, they signal the need for more cluster resources, and this works well with the CPU or memory requests that applications declare. However, I would love to see a way to connect different types of metrics, including custom metrics that signal a big incoming load. Ideally, we could tell the cluster autoscaler to provision nodes in advance, essentially using KEDA metrics to drive cluster-level autoscaling, which is currently very difficult to achieve.

Bart: Our guest identified two problems with traditional HPA. First, CPU and memory metrics aren't always the best proxies for demand. And second, in HPA these metrics react more slowly than event-based triggers. What are your thoughts on using CPU and memory metrics for scaling decisions?

Zbyněk: CPU or memory metrics are very good for stable workloads. But as soon as you have more variability in your workloads, they are simply not enough, especially for HTTP traffic or gRPC services. In such cases, you need to react closer to real time. This is where you should really consider ditching the HPA completely and using something smarter that can predict the incoming demand. We should be moving towards predicting events, not just reacting to them.
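
To make the contrast concrete, here is a hedged sketch of scaling an HTTP service on a demand signal (request rate pulled from Prometheus) rather than on CPU. The service name, metric name, Prometheus address, and threshold are all hypothetical; the prometheus trigger with serverAddress, query, and threshold fields is part of KEDA's documented scaler set.

```yaml
# Scale an HTTP service on request rate instead of CPU utilization.
# The metric name, service label, and threshold are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: http-api-scaler
spec:
  scaleTargetRef:
    name: http-api
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(rate(http_requests_total{service="http-api"}[1m]))
        threshold: "100"           # target requests per second per replica
```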

Bart: Kubernetes turned 10 years old last year. What should we expect in the next 10 years?

Zbyněk: There are a million and a half possible answers. I would say everybody's answer would be AI, so more AI readiness, starting with the serving layer, where we can serve different kinds and sizes of models using frameworks like vLLM. I also see bigger involvement of AI in Kubernetes itself, maybe not directly in the control plane, but something closely connected. We will see more autonomous Kubernetes clusters. They might be more difficult to debug, but that is a question for the future. We will see it in a few years, definitely.

Bart: And what's next for Kedify?

Zbyněk: At Kedify, we provide enterprise features on top of KEDA, and we are building some really cool ones. One of them is predictive autoscaling. We would like to anticipate the demand coming to an application: because KEDA already receives all the metrics from the workloads, we can analyze the trends in those metrics and start scaling out a little bit in advance. Scaling out is an expensive operation from a time perspective, especially if you have a large application that starts very slowly, so we would like to be prepared for the load.

Predictive autoscaling is one interesting feature, and the other is AI readiness, which helps our users scale out their workloads more efficiently.

Bart: And if people want to get in touch with you, what's the best way to do that?

Zbyněk: The best way to connect is to reach out to me on Kubernetes Slack. We have a KEDA Slack channel, so feel free to chat, ping me, or send a direct message. Connect with me on LinkedIn. I'm happy to talk about not just Kubernetes, but also open source. I often get questions from new people in open source asking for advice on how to contribute. Feel free to reach out, and I will be at KubeCon. Stop by our booth and meet me after our maintainer track on Thursday.

Bart: Thanks so much for joining us. I look forward to speaking to you soon. Take care.

Zbyněk: Thank you. Bye.
