GPUs in Kubernetes don't share workloads like CPUs do
Feb 2, 2026
GPUs in Kubernetes don't share workloads like CPUs do. There's no Linux scheduler balancing demand — your GPU gets what it gets, often sitting idle between inference requests.
Andrew Hillier, CTO at Kubex, explains why GPU requests in inference workloads often end up 2-3x higher than actual consumption, and what you can do about it.
In this interview:
Why GPUs are harder to optimize than CPU/memory — rigid partitioning vs. flexible scheduling
MIGs vs time slicing vs dedicated GPUs — when to use each approach
How in-place resizing (Kubernetes 1.33+) changes resource optimization
The tradeoffs between safety (MIGs) and density (time slicing)
Transcription
Bart Farrell: So first things first. Who are you, what's your role, and where do you work?
Andrew Hillier: I'm Andrew Hillier. I'm the CTO of Kubex. Yeah. That's where I work, and that's who I am.
Bart Farrell: Fantastic. What are three emerging Kubernetes tools that you're keeping an eye on?
Andrew Hillier: Well, that's a good question. The biggest one is probably in-place resizing. It's been around a while now, and I think it's a very powerful strategy for resource optimization, which we can leverage very effectively. So it's not exactly new, it's been out since 1.33, but we're only now starting to see our customers upgrade to it. There's always a lag before you see it out there. But I think that's one of the biggest things, because it gives you so much more freedom without having to evict pods to resize them. So that's a big one. Things like OpenTelemetry we're always keeping an eye on. Data collection and integrating into existing frameworks and observability tools is something we watch very closely as well. So those are the top ones in my mind that we're keeping an eye on.
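To make the in-place resizing point concrete, here is a minimal sketch of what a resize looks like on a 1.33+ cluster, where pod resources are changed through the resize subresource instead of evicting the pod. The pod name, namespace, container name, and resource values are placeholders, and it assumes kubectl 1.33+ with in-place resize available on the cluster.

```python
# Sketch: resize a pod's CPU allocation in place (Kubernetes 1.33+),
# where resource changes go through the pod's "resize" subresource.
# Pod name, namespace, and container name below are placeholders.
import json
import subprocess

POD = "inference-worker-0"   # hypothetical pod
NAMESPACE = "ml-serving"     # hypothetical namespace
CONTAINER = "server"         # hypothetical container name

patch = {
    "spec": {
        "containers": [
            {
                "name": CONTAINER,
                "resources": {
                    "requests": {"cpu": "500m"},
                    "limits": {"cpu": "1"},
                },
            }
        ]
    }
}

# kubectl 1.33+ exposes resize as a pod subresource; this changes the
# allocation without evicting the pod the way a restart-based resize would.
subprocess.run(
    [
        "kubectl", "patch", "pod", POD,
        "-n", NAMESPACE,
        "--subresource", "resize",
        "--patch", json.dumps(patch),
    ],
    check=True,
)
```

Whether the container is restarted when resized is governed by its resizePolicy (NotRequired vs. RestartContainer, per resource), so in practice you would also check that the target pod declares the policy you expect before resizing it live.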
Bart Farrell: So in clusters that are running inference, what are the biggest reasons GPU requests end up two to three times higher than actual consumption? And how would you systematically reduce that gap?
Andrew Hillier: Well, I think the challenge with inference is that if you look at training, things run hot for long periods of time; it's almost like a batch job. But inference is more transactional, so we see a lot of ebb and flow in the utilization of the resources. The problem with a GPU is that you have to size it to roughly the high watermark of utilization if you want good performance. If it's non-production, you can probably push that, but if you want good performance, that's what you size to. So we see a lot of cases where maybe at peak times the GPU is fully utilized, but at off-peak times it's not. It really depends on the nature of the chat or the end-user interaction that's happening. Like any transactional workload, you can't really run it too hot. And that's made a bit worse by the fact that GPUs don't really share between workloads the way a CPU would. On a CPU, running with the Linux kernel, lots of things will get scheduled to share, and you can fill the gaps in utilization. With GPUs, that's much harder to do because they're usually running more of a monolithic model, and so what you get is what you get. That kind of leads you to have to look at other ways to optimize them. We see a lot of MIG use, being able to divide up a GPU into MIG instances so it more rightly reflects the requirement. Time slicing comes up too; it's a little bit different, maybe not as good for production workloads. But there are various strategies you can use to get more yield out of those GPUs by sharing them.
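One rough way to quantify the gap Andrew describes is to compare how many GPUs pods request against how busy the GPUs actually are. The sketch below assumes a Prometheus server scraping kube-state-metrics and the NVIDIA DCGM exporter; the endpoint URL, metric names, and label values are assumptions about that particular setup and may differ in yours.

```python
# Sketch: contrast requested GPUs with observed GPU utilization.
# Assumes Prometheus scrapes kube-state-metrics and dcgm-exporter;
# the URL, metric, and label names are assumptions about that setup.
import requests

PROM = "http://prometheus.monitoring:9090"  # placeholder endpoint

def prom_scalar(query: str) -> float:
    """Run an instant query and return the first scalar result (or 0)."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Total GPUs requested by pods (kube-state-metrics resource-request metric).
requested = prom_scalar(
    'sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})'
)

# Average GPU utilization over the past day (DCGM reports 0-100 per GPU),
# converted into "equivalent busy GPUs".
busy = prom_scalar("sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1d])) / 100")

print(f"GPUs requested: {requested:.0f}")
print(f"Equivalent GPUs busy (1d avg): {busy:.1f}")
if busy > 0:
    print(f"Request-to-use ratio: {requested / busy:.1f}x")
```

A ratio of 2-3x in that last line is the kind of gap being discussed here: GPUs reserved at the high watermark while off-peak traffic leaves them mostly idle.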
Bart Farrell: And what are the hard technical limitations of GPUs in Kubernetes compared to CPU and memory, and how do those limitations change your optimization approach?
Andrew Hillier: Well, like I mentioned, there's not really a Linux scheduler running inside the GPU balancing all the different loads; it kind of runs the loads that are fed to it. So it's much more rigid, much more partitioned. For CPU and memory, especially in a Kubernetes environment, you can run up the CPU utilization by scheduling more jobs, and maybe you can set priorities, and if things get too busy you just get throttled, which may or may not be a good thing. But you can do that much more effectively than you can with a GPU, because there aren't a lot of ways to run more and more work through it. You kind of have to reduce the supply, because it's not as easy to increase the demand. Now, there are ways of doing that: if you can run more activity through your model, you can drive up utilization. But otherwise, you're kind of partitioned off into a fixed thing. Memory is the same; it's pretty partitioned. With the CPU, the OS, and memory, there's also buffering, so you can drive up memory utilization and actually sacrifice buffering if you want higher density, and it'll affect performance a bit. But in a GPU it's not the same thing. Your memory is your memory. If you can make the model use less memory, that's great, but otherwise you're stuck having to give it enough memory or you'll get an out-of-memory kill. So it's a much more rigid model, without the ability to share the way you can for a general-purpose workload.
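That rigidity shows up directly in how GPUs are requested in a pod spec. As a rough illustration, the sketch below contrasts CPU (which can be requested low and burst up to a limit, with the kernel sharing and throttling) with nvidia.com/gpu, an extended resource that is allocated as whole devices with request equal to limit and no overcommit. The names, namespace, and image tag are placeholders, and it assumes a cluster with the NVIDIA device plugin installed.

```python
# Sketch: CPU can burst between request and limit, but an extended resource
# like nvidia.com/gpu is whole-device, request-equals-limit, no overcommit.
# Pod name, namespace, and image below are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-demo", "namespace": "ml-serving"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "server",
                "image": "nvcr.io/nvidia/tritonserver:24.01-py3",  # placeholder image
                "resources": {
                    # CPU/memory: request the typical load, allow bursting to the limit.
                    "requests": {"cpu": "500m", "memory": "2Gi", "nvidia.com/gpu": 1},
                    # GPU: whole integers only, and it must match the request exactly.
                    "limits": {"cpu": "2", "memory": "2Gi", "nvidia.com/gpu": 1},
                },
            }
        ],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="ml-serving", body=pod)
```

In other words, there is no "GPU burst" to absorb idle capacity the way the CPU scheduler does, which is why the conversation turns to MIG, time slicing, or smaller cards.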
Bart Farrell: And when would you use MIG versus time slicing versus dedicated GPUs? And what decision framework would you apply to choose the right model for cost and performance?
Andrew Hillier: So, first of all, there's the big split between training and inference, and let's just talk about inference, because with training it makes a lot more sense to run the whole GPU flat out until you're done your job. But with inference, because it's transactional, you do need to make sure it's well utilized. MIGs are a great way to do that, because you have a lot of flexibility in how you divide up the GPU and you can schedule your workloads onto a piece of it. That's good when safety is a requirement because it's a critical workload: you don't have the noisy neighbor problem with MIGs, you get a partition. The expense is that it might cost a little more, because you have to size it to the high watermark of your workload. Time slicing lets you cram them in a little more tightly, but it's a little more dangerous because there's no memory isolation, so if you have a noisy neighbor or something runs up the memory, you can have out-of-memory kills that are beyond your control. That might be better for less critical workloads. It really depends on the models and whether they respond okay to being restarted and answering again; for some kind of long-running transaction, you don't want that happening and having it get killed. The other thing with time slicing is that you can lose visibility as well, because it's harder to get telemetry back on which container is doing what once you start time slicing, as opposed to MIGs, where it's still clearly measured. So they have slightly different profiles. A full GPU is the simplest; it's a great way to go if you're using the whole GPU. We see customers using smaller GPUs, like a T4 or an L4; they don't need to buy the big ones. You can buy smaller ones, which are much more cost-effective for certain workloads. So in some cases, using a whole GPU makes complete sense if it's a good fit for your workload. But if you have a big workload or a complicated workload, then maybe you want to go to one of the more sophisticated models that allow MIG and partitioning.
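The decision framework described here boils down to a few questions: training or inference, how critical the workload is, whether it needs memory isolation, and whether it actually fills a whole device. Here is a toy sketch of that logic; the workload attributes and the 0.8 threshold are illustrative assumptions, not anything from Kubex or Kubernetes itself.

```python
# Toy sketch of the MIG vs. time-slicing vs. dedicated-GPU decision above.
# The Workload fields and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Workload:
    is_training: bool
    is_production: bool           # critical / latency-sensitive?
    needs_memory_isolation: bool  # cannot tolerate noisy-neighbor OOM kills?
    peak_gpu_fraction: float      # observed high watermark, as a share of one GPU (0-1)

def gpu_sharing_strategy(w: Workload) -> str:
    if w.is_training:
        # Training runs hot for long stretches; give it whole GPUs.
        return "dedicated GPU(s)"
    if w.peak_gpu_fraction >= 0.8:
        # It already fills a device; consider a smaller, cheaper card
        # (e.g. T4/L4) before sharing a big one.
        return "dedicated GPU (possibly a smaller card)"
    if w.is_production or w.needs_memory_isolation:
        # MIG gives a hard partition: no noisy neighbors, clean telemetry,
        # at the cost of sizing each slice to its high watermark.
        return "MIG slice sized to the high watermark"
    # Non-critical, bursty inference: time slicing packs tighter, but with
    # shared memory and fuzzier per-container visibility.
    return "time slicing"

# Example: a bursty internal chatbot peaking at ~30% of one GPU.
print(gpu_sharing_strategy(Workload(False, False, False, 0.3)))
```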
Bart Farrell: You know, we're seeing a lot of news, a lot of buzz around the topic of Kubernetes and AI. What are the coolest things that you've seen recently when it comes to those two technologies coming together?
Andrew Hillier: Well, I think I can say one of the coolest things is what we brought out at KubeCon, because we brought out the optimization of GPUs using AI within our product. So it's kind of like AI optimizing AI. I know that's a shameless plug, but I think it actually is a really interesting trend, a confluence of both a trend in the management approach, going to an agentic-based management approach, and the need to optimize GPUs in customer environments. So I'm a bit biased there, and that's maybe not the answer you're looking for, but (laughs) I think it's one of the coolest things I've seen in a while. Beyond that, there's a lot of stuff going on out there, like DRA advancing and the ability to manage the GPUs in an environment and schedule them properly. A lot of great advancements going on there. There are some good advancements in 1.35 that we're tracking that will really help make this a more dynamic solution to the problem, where we can optimize the GPUs in a very powerful way, which just saves a ton of money. These things are very expensive, and we see them not well utilized in a lot of environments. So the tools are coming into place in the ecosystem to be able to really effectively solve this problem.
Bart Farrell: Andrew, what's next for you?
Andrew Hillier: Well, we're doing a lot of work on the agentic side right now. Like I mentioned, we did a big push to move to our next-gen architecture, which we call Kubex.AI: all the analytics and all the raw data available to AI, without training on customer data. That's very important. It's a very clean way to do it, where it's very safe to run but it's very, very smart. And that's the springboard into our whole next generation of features, because now we can crank out features effectively with a prompt. We're working on things like bin packing, node pre-scaling, and HPA optimization. We can do some pretty advanced functions now, all within the AI, and we can do it very safely. A lot of these are human-in-the-loop type use cases, but some of them can be completely automated; it depends on the nature of the use case. But our finding is that once we've gone to this agentic framework that's woven into everything we do, we can greatly accelerate what we're doing. And it's really exciting. Almost every call I do with customers now, we come up with a new agent idea, a pod spreader instead of a bin packer, all these things. And once you have these ideas, it's a very short connection from the idea to the actual implementation, because it's prompts. All the pieces are basically there; you just have to express the pure business logic of what you want, and you get it back. So it's a really exciting time for us, because we've laid a lot of the foundation to springboard forward and really advance both GPU and general Kubernetes optimization.
Bart Farrell: And what's the best way for people to get in touch with you?
Andrew Hillier: Well, Kubex.AI is our website; that, of course, is the fastest way. And I'm sure we can post my coordinates with this post. It is free to try out, so for everything I'm describing, you can simply go to the free trial page and spin it up if you want to see what it looks like in your environment and play with the agent. So, yeah, our website, I think, is one of the best ways to do that.
Bart Farrell: Fantastic. Thanks so much for joining us. Speak to you soon.
Andrew Hillier: Thanks, Bart.
