Kubex Announces E-Book on Right-Sizing GPUs in Kubernetes
Feb 10, 2026
Most teams running AI inference on Kubernetes don't know how much GPU capacity they're wasting — and even when they have visibility, the complexity of GPUs and workloads makes it unclear what to fix.
Kubex is releasing an e-book on right-sizing GPUs in Kubernetes, covering practical strategies to improve GPU utilization without sacrificing performance: from choosing the right GPU type based on model precision and compute engines, to leveraging MIG partitioning, time slicing, and supply-side optimization.
Andrew Hillier breaks down why GPU optimization is still a do-it-yourself market — the tools exist, but teams are left stitching them together — and how Kubex's agentic framework automates what most solutions leave as manual work.
Transcription
Bart Farrell: As always, we'd like to know, who are you, what's your role, and where do you work?
Andrew Hillier: I'm Andrew Hillier. I'm the CTO and one of the co-founders of Kubex.
Bart Farrell: Great. And Andrew, what news are you bringing to our audience today?
Andrew Hillier: Well, Bart, we're releasing an e-book that gives a pretty comprehensive view of optimizing GPUs and AI workloads. We're excited about this because it's a big area for our customers: there's a lot of spend going into hosting AI, mostly inference, and effective ways of optimizing and automating that are, I think, very important going forward.
Bart Farrell: And so with this e-book, you mentioned the challenges that customers are facing. Can you give us a little more detail about those specific pain points?
Andrew Hillier: Well, I like to use the word yield. GPUs are expensive, and when you have an expensive asset, you want to make use of it as much as possible. What we're seeing is that if you don't use them properly, your yield isn't very high. It's like renting a piece of equipment and not using it; it's just a waste of money. So the e-book is really focused on that, especially in the context of inference: when you're running these workloads, how do you drive up utilization and make use of those assets effectively without affecting performance? Because what we see in a lot of cases, especially for inference workloads, is that some GPUs are very well utilized and some much less so. A lot of organizations may not even have visibility into that, and even if they do, it's not clear what to do to fix it, because it's a very complicated area. The GPUs are complicated and the workloads are complicated. So that's what we're tackling, and that's a big focus of the e-book.
Bart Farrell: And in the process of creating this e-book, how does it compare to other resources already out there in the landscape? What changes, before and after its arrival, for folks who are struggling with the things you mentioned?
Andrew Hillier: Well, I think there are a lot of tools out there you can use to gain visibility, like the NVIDIA DCGM exporter, and there are different schedulers available, so a lot of the pieces exist. But I think it's still a bit of a do-it-yourself market when it comes to figuring out how to use them to drive up efficiency. People are very effective at standing up AI workloads and getting transactions going through them; they succeed on that front. Then they look at the cost and say, wow, okay, that's pretty big, and so we see FinOps for AI rising. What we're trying to do here is give very effective, practical strategies to drive efficiency up: not science projects, but things you can actually do and automate to make that problem go away. It's just like Kubernetes. I think it took a while, as it matured, before people turned their attention to saying, okay, this is working well, but it's costing a lot. With GPUs the same thing is happening, maybe a bit faster because they cost so much. We're just trying to address that and give some practical guidance on how to solve the problem.
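For readers who want the visibility piece Andrew mentions, here is a minimal sketch of Prometheus recording rules built on metrics the NVIDIA DCGM exporter exposes by default (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE). The rule names and the Hostname label grouping are illustrative and may need adjusting to your exporter's label set.

```yaml
# Sketch: Prometheus recording rules over DCGM exporter metrics.
# Rule names are illustrative; label names (Hostname, gpu) may vary
# with your dcgm-exporter configuration.
groups:
  - name: gpu-utilization
    rules:
      # Average GPU core utilization (0-100) per node.
      - record: node:gpu_util:avg
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
      # Fraction of framebuffer (GPU memory) in use per GPU.
      - record: gpu:fb_used:ratio
        expr: >
          DCGM_FI_DEV_FB_USED
          / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```

Tracking compute utilization and framebuffer usage side by side is what exposes the "well utilized versus barely utilized" skew discussed above.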
Bart Farrell: And for folks that may not be familiar with Kubex, can you break down Kubex's business model and pricing structure for teams that are evaluating the kind of work that you're doing?
Andrew Hillier: Sure. It's resource optimization, priced per virtual CPU, so you license it based on the size of your environment, the amount of horsepower you're running on. It uses machine learning to look at all the patterns of activity: the workloads, the replication patterns, the node utilization. And it comes up with recommendations on how to drive up efficiency, for example by optimizing the size of the containers or optimizing the node types, while also driving down risk. We're very effective at eliminating out-of-memory kills and throttling and the other things you see plaguing these environments. Those things go hand in hand: in one pass you can fix the cost problem and fix the risk problem. It's very powerful because there's an agentic framework built into it that lets us do more advanced use cases, which we're just starting to release now, things like bin packing and node pre-warming, functions that are being asked for in the market. People say, hey, my performance isn't great when I have to spin up nodes; when my load goes up, I have to wait to schedule pods. Well, we have predictive models of when that happens, and we can pre-warm nodes, as an example. So there are lots of very interesting use cases made possible through the AI agentic framework built into the product. It's a SaaS-based model: you deploy a simple forwarder via a Helm chart that feeds the data back, you get all these rich recommendations, and there's an automation controller that will just apply them as well. If you want to go full auto, you flip the switch and it fixes everything automatically. The idea is to get rid of the cost problem and the risk problem, and the GPUs are done at the same time, so CPU, memory, and GPU are all optimized together.
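As a rough illustration of the deployment model Andrew describes (a Helm-installed forwarder feeding a SaaS backend), an install might look something like the sketch below. The repository URL, chart name, and values key are assumptions for illustration only; check the Kubex documentation at kubex.ai for the actual commands.

```bash
# Hypothetical sketch only: the repo URL, chart name, and values key are assumed,
# not taken from Kubex documentation.
helm repo add kubex https://charts.kubex.ai        # assumed repository URL
helm repo update
helm install kubex-forwarder kubex/forwarder \
  --namespace kubex --create-namespace \
  --set apiKey=<YOUR_API_KEY>                      # assumed key linking the forwarder to your SaaS tenant
```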
Bart Farrell: When people are exploring this space, which alternative solutions might they be considering alongside Kubex?
Andrew Hillier: There are a number of players in this market. Of course there are people like CAST AI or ScaleOps, and we see them out there quite a bit. They're great products, but we take a different approach. We take an agent-first approach at the core, which is very flexible and extensible, so it lets us hit a lot of use cases, pretty much anything you can dream up. I've been on calls with customers where they say, hey, I need it to do X, and we can make it do X, because the foundation of our product is an agentic framework. We use machine learning for the core, what we call the deterministic answers, the ones you generally automate, where you want the same answer every time. Then there's a whole series of other use cases that the agentic framework is very good at, and it makes us very responsive to customer needs. The way I see it, it allows us to really expand our capabilities on a genuinely new-school framework. So there are a number of solutions out there, and people do look at different ones, but I'd like to think we have the most advanced agentic framework for this problem. When people try us out, our ability to bring AI to the problem is a really big point.
Bart Farrell: And for an organization out there that's looking to host AI workloads, what criteria should they use to decide which type of GPU or other processor they should use?
Andrew Hillier: I think the big dividing line is training versus inference, and we focus a lot on inference. The main point is that certain chips are designed for certain use cases. On the inference side, it then gets into: how big is your workload, and what's the model precision? That's very important. You can be running an AI workload at 64-bit floating point, 8-bit integer, or anything in between, FP32, FP16, and so on. That matters because different compute engines are more performant at certain precisions, so looking at what your existing workload is using from a compute-engine perspective might dictate whether you go to a B200 versus a T4, as an example. Then there's whether the GPU is MIG-able or not. Some GPU models can be divided up into smaller partitions, and for certain workloads that's very advantageous because you can fine-tune the size of the GPU partition to match your workload. Other GPUs aren't partitionable, but they're cheaper, so if you can run on a T4 and it's cost-effective, do it. There are a number of parameters: your performance requirements, how monolithic your workload is, the model precision, even which cloud provider you're running in, since different offerings are available from different providers or on-prem. All of these factor into choosing the right one. What we do is apply analytics to the problem: analyze the existing workloads, recommend your options, and rank them, saying this is the best one, or you're already on the right GPU and it will work if you just run it differently. We can give a number of different answers like that, either full auto or recommendations, like, if you went to this other provider, you could get the same horsepower for less. So there's a variety of use cases that come to bear to optimize GPU utilization.
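To make the MIG point concrete, here is a minimal sketch of a pod requesting a single MIG slice as exposed by the NVIDIA device plugin, assuming the cluster advertises MIG devices using the "mixed" strategy; the image name is a placeholder for your own inference workload.

```yaml
# Sketch: an inference pod pinned to one MIG slice instead of a whole GPU.
# Assumes the NVIDIA GPU Operator / device plugin advertises MIG devices
# with the "mixed" strategy; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: inference-mig-example
spec:
  containers:
    - name: inference
      image: my-registry/my-inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice of an A100 40GB
```

A workload that genuinely needs a full GPU would instead request nvidia.com/gpu: 1, which is also the form used on non-partitionable cards like the T4.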
Bart Farrell: And if an AI workload isn't fully using a GPU, what options are there for driving up efficiency?
Andrew Hillier: Well, you can either change the supply side or change the demand. We focus on changing the supply side, for example by recommending MIGs: you should be running on half a GPU, or an eighth of a GPU, or on a different GPU altogether, which means running the workload differently. Things like time slicing also come to bear. Time slicing shares workloads within a GPU, but it's a more primitive model; there's not a lot of isolation, so you can have noisy-neighbor problems and out-of-memory kills, although it's useful for certain workloads. So there are a lot of ways to change how you're running things to make better use of the existing supply. The other option, of course, is to just run more transactions through your model, which may or may not be possible depending on the nature of your workload. If your AI model is serving one of your customers, you can't make them do more at any given time; you get the transactions you get. That's when it's best to flex the supply side: okay, I've never seen you use more than a quarter of a GPU, so I'm going to run you there, watch you, trend what you're doing, and move you up when I need to. I'm going to right-size the supply to meet the demand. That's the strategy we generally see out there.
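The time-slicing option Andrew mentions is typically configured through the NVIDIA Kubernetes device plugin (usually via a ConfigMap referenced by the plugin or GPU Operator). Here is a minimal sketch of that sharing config; the replica count is an example value, and note that, as he says, there is no memory or fault isolation between the pods sharing the GPU.

```yaml
# Sketch: NVIDIA k8s-device-plugin time-slicing configuration.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources;
# sharing pods have no memory or fault isolation from each other.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```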
Bart Farrell: And Andrew, how does GPU memory factor into efficiency?
Andrew Hillier: Well, it's often the constraint. It's funny, when you look at actual environments, we've seen cases where it's very skewed: on some GPUs the compute is fully utilized and the memory is barely used, and on others the GPU memory is fully utilized and the compute isn't. It really depends on the model and the nature of the transactions. If you have to load up a big model and wait for someone to ask it a question, you're going to be sitting there using all that memory. GPU memory is often the constraint on which GPU you need to use. Now, we've seen some customers say, well, maybe that model doesn't need to be using so much memory, so there is tuning that can be done, and if you can bring memory usage down it has a drastic impact, because memory is often the primary constraint. But usage is usually skewed one way or the other; the ideal workload would be balanced in its use of both, and often workloads strand memory. Unlike general-purpose CPU and memory, where you're sharing a node among a bunch of containers, stacking memory up, dictating how you stack it, and maybe sacrificing a bit of buffering to get the density you want, GPUs are more partitioned: whatever memory your workload needs, you either have it or you don't, and you have to meet it. It's a pretty hard constraint. Providers will often give you multiple options for GPUs with different amounts of GPU memory, so getting that right is pretty important, and it's one of the things we do. You can get a 40-gig or an 80-gig A100, as an example, and getting that right and dividing it up into the right MIGs is the key to right-sizing all the GPUs and their memory for the workloads you have.
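A quick way to see the compute-versus-memory skew Andrew describes on a given node is nvidia-smi. The queries below are standard; the exact MIG profiles listed will depend on the GPU model (a 40 GB A100 offers smaller slices than an 80 GB A100, for instance).

```bash
# Per-GPU compute utilization vs. memory usage (the skew discussed above).
nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv

# On MIG-capable GPUs, list the instance profiles (slice sizes) available.
nvidia-smi mig -lgip
```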
Bart Farrell: Looking ahead, what developments can our audience expect from Kubex?
Andrew Hillier: Well, we're doing a lot of work on agentic use cases, really accelerating things on that front: general-purpose things like the bin packing and node pre-warming I mentioned, and on the GPU side, things like trending forward to see when you're going to need more GPUs, which our AI framework is very good at. We're seeing a lot of new use cases popping up that are ideal for agentic, human-in-the-loop workflows and can rapidly deliver business value: forecasting, spreading workloads, scheduling, pre-warming. Pre-warming is one that comes up a lot: I don't want to wait for my nodes to warm up, because I can't schedule the pod until the node is up, and that might take a few minutes. We're seeing that as particularly important for AI workloads, because the models can be pretty large and spinning up a new node can take quite a while. I talked to a customer a couple of weeks ago where that was the main challenge; in fact, it was hard to use Kubernetes for AI at all because the node spin-up time was so long. That's one of the things we're addressing through the agentic framework: because we have all the predictive models, we can predictively warm nodes ahead of the load. I think that's really important for AI, because it can be a game changer, and otherwise it can be a showstopper for using Kubernetes for AI workloads.
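Kubex's predictive pre-warming is its own agentic capability, but for readers who want a feel for the underlying idea, a common do-it-yourself pattern is a low-priority "balloon" deployment that holds warm GPU nodes and is evicted the moment real workloads need the capacity. The names, replica count, and resource sizes below are illustrative only.

```yaml
# Sketch of the static "balloon pod" pattern for keeping warm GPU capacity.
# This is a generic DIY approach, not Kubex's predictive pre-warming;
# names, replica count, and sizes are illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-balloon
value: -10             # lower than real workloads, so balloons are evicted first
preemptionPolicy: Never
globalDefault: false
description: Placeholder pods that hold warm GPU nodes.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-balloon
spec:
  replicas: 1          # roughly one spare GPU node's worth of headroom
  selector:
    matchLabels: {app: gpu-balloon}
  template:
    metadata:
      labels: {app: gpu-balloon}
    spec:
      priorityClassName: gpu-balloon
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            limits:
              nvidia.com/gpu: 1   # reserves a GPU so the node stays provisioned
```

When a real inference pod arrives at normal priority, it preempts the balloon and schedules immediately; the autoscaler then replaces the evicted balloon's node in the background, so the node warm-up cost is paid ahead of demand rather than in the request path. A predictive approach like the one Andrew describes aims to size and time that headroom from forecasted load instead of holding a fixed buffer.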
Bart Farrell: And Andrew, what's the best way for our listeners to get in touch with you?
Andrew Hillier: Well, kubex.ai is our website, and I think that's a great starting point. There's a lot of information there on what I just mentioned. Of course, you can contact us from there, and you can also spin up a free trial. We find a lot of people just want to dive in; it's very easy to stand it up in your environment, play with the agent, look at the analytics and your GPUs, and see what it's doing. So that's the starting point. It's pretty simple to navigate, and it gives quite a bit of information on the things I just described, as well as letting you try it and get started.
Bart Farrell: Great. Well, folks, you'll have the e-book available for download, and we'll have all the links included here. Andrew, thanks so much for your time. Looking forward to talking to you soon. Take care.
Andrew Hillier: Thanks, Bart.
