Saving tens of thousands of dollars deploying AI at scale with Kubernetes
Host:
- Bart Farrell
This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input - existing cloud environments, IaC or application code.
Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.
John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.
You will learn:
- How to deploy VLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and daemon sets
- Why smaller models (7-14B parameters) can reach roughly 95% of the effectiveness of larger commercial models for many tasks, given proper prompt engineering
- How running inference workloads on your own infrastructure with T4 GPUs can cut costs from tens of thousands of dollars to a couple of thousand dollars per month
- Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues
Relevant links
Transcription
Bart: In this episode of KubeFM, we're joined by John McBride from the Linux Foundation to break down the challenges and solutions behind running AI inference workloads efficiently on Kubernetes. We'll cover how his team deployed VLLM to serve open-source LLMs while optimizing GPU utilization, the role of dynamic resource allocation (DRA) in managing GPU nodes, and why Kubernetes native scaling capabilities made it the right choice for inference workloads.
John shares real-world lessons from running large-scale AI infrastructure, including overcoming GPU driver issues, reducing cold start times for massive models, and designing cost-effective cloud deployments that save tens of thousands of dollars compared to managed AI services. If you're working with Kubernetes and AI, this episode is packed with technical insights on optimizing inference pipelines, selecting the right models, and keeping costs under control while maintaining performance.
This episode of KubeFM is powered by StackGen. StackGen helps development teams work faster and more securely in the cloud. Whether you're moving to a new cloud platform or trying to improve your current setup, StackGen makes it simple. Their tools automate the complex parts of cloud infrastructure so your team can focus on building great products. Ready to make your cloud journey easier? Click the link in the description to learn more.
Now, let's check out the episode. Hi, John. Welcome to KubeFM. What are three emerging Kubernetes tools that you are keeping an eye on?
John: Three emerging Kubernetes tools that I'm keeping an eye on: First, DRA (Dynamic Resource Allocation), which is coming in beta around Kubernetes 1.32. This will help get drivers and different resources onto Kubernetes nodes dynamically, without relying on cloud provider-specific plugins. For example, you won't need AKS's NVIDIA driver plugin to get GPUs working for AI Kubernetes platforms.
The second tool is Dapr, an event-driven platform for real-time events across distributed cloud and edge environments. You could have edge Kubernetes clusters in retail that phone home to a control plane, or large clusters with multiple nodes performing event-driven tasks.
The third tool is Kraken from Uber, a peer-to-peer Docker registry focused on scalability and availability. In a peer-to-peer network, you can more efficiently share image layers, potentially reducing container startup times. This is particularly valuable in the AI era, where image layers can be massive—typically 5-10 gigabytes for models with LLMs or ML jobs, compared to smaller 2-3 gigabyte images.
Bart: Now, for folks who don't know you, what do you do and who do you work for?
John: Great question. Who am I? I currently work for the Linux Foundation. I was at a company called OpenSauced that was acquired by the Linux Foundation. We're focused on making the LFX insights platform better and bringing insights and understanding to open source ecosystems and program offices.
At OpenSauced, I was deep in the Kubernetes backend infrastructure, building intellectual property by deriving insights from big data consumed off GitHub. This included health metrics, contribution checks like the lottery factor, and deep dives into individual contributions. We also built AI jobs to understand natural language semantics for describing pull requests, issues, and to enable more intelligent search.
Before OpenSauced, I was at AWS working with Amazon Linux and Bottlerocket. Prior to that, I was at VMware doing upstream Kubernetes work on the Tanzu Kubernetes platform. Earlier, I worked on Cloud Foundry—and to anyone who has worked on Cloud Foundry, my heart goes out to you.
I've found myself working in Kubernetes and loving every moment of it.
Bart: How did you get into Cloud Native?
John: I feel like it happened by accident, which maybe many people say. I was working at Pivotal, which predated the Kubernetes ecosystem. At Pivotal, we had a product called Cloud Foundry—the open source version of Pivotal Cloud Foundry. Cloud Foundry was somewhat cloud native. It exists in the Linux Foundation and CNCF ecosystem today.
The commercial offering was for enterprises with data centers: you have a bunch of VMs, and Cloud Foundry helps spin up and spin down those VMs for workloads. That's a gross oversimplification, but the core paradigm inside Cloud Foundry was virtual machines. Cloud Native and Kubernetes revolve around containers, which are a higher-order abstraction compared to a VM with its own kernel, resources, and virtual memory.
The market decided that wasn't going to be the primary approach, and it was obvious the shift would be to Kubernetes. For technology historians, Pivotal was acquired by VMware, and much of it got wrapped into the VMware Tanzu ecosystem of tools. Tanzu became VMware's opinionated platform for Kubernetes on vSphere, deployable on any cloud or metal.
I joined the team developing the first open source variant at VMware called Tanzu Community Edition. We were all in on Kubernetes, shipping Helm repositories, building Kubernetes nodes, and deep in templating work. Some tooling, like ytt and the rest of the Carvel suite, originated within that project ecosystem. It was almost by accident—or by acquisition—that I moved from Cloud Foundry into the Kubernetes ecosystem.
Bart: You've been in the Kubernetes ecosystem for a while. It's an area that moves very quickly. How do you stay updated? How do you keep track of all the different things that are going on? What are your go-to resources?
John: People often ask me this question. Sometimes I feel like I don't have a great answer because it's easy to feel like I'm falling behind. This is less a practical answer and more philosophical.
The really important stuff inevitably surfaces. You'll hear about it and start to get signals if you're building Kubernetes, controllers, or building on top of it. There's a lot of noise—especially in the AI infrastructure side, with GPUs, VLLM, Dapr, and other AI technologies.
I would tell people to use resources they like, such as Hacker News and great communities like the r/kubernetes subreddit. Be aware of the noise and be very aggressive about filtering it out. You can get lost trying to keep up with all the cool and sexy things happening, but a lot of it is just noise—and that's okay.
Bart: If you could go back in time and share one career tip with your younger self, what would it be?
John: First principles have been really important for my career. Cloud Foundry was a paradigm around virtual machines, and containers were the future. In hindsight, I wish I had focused more on learning about containers, building for containers, controllers, Kubernetes, and Docker much earlier.
Stripping that back, first principles thinking is crucial. This applies not only to the cloud native ecosystem—containers, the 12-factor app, stateless applications on Kubernetes—but also to working effectively. A key first principle for me is filtering noise and being aware of information sources without getting overwhelmed.
If I could advise my younger self, I would emphasize being very aware of these first principles: learn from them, continue to adopt them, and avoid getting caught up in transient distractions.
Bart: That's good enough. As part of our monthly content discovery, we found an article you wrote called "How We Saved Tens of Thousands of Dollars Deploying Low-Cost Open Source AI Technologies at Scale with Kubernetes" - no pressure with that title. To dive in further, before we get into technical details, I'd like to understand the problem you were solving. What exactly is StarSearch, and what challenge were you addressing with this AI feature?
John: At OpenSauced, we wanted to bring an AI feature to the market. Part of this was the buzz around AI and the opportunity to demonstrate our capability to build innovative solutions. OpenSauced was primarily a business-to-business platform targeting open source program offices and executives to analyze their investment in open source ecosystems.
A great example that resonated with me was the VMware story. They were investing significant engineering resources into upstream Kubernetes and had built custom tooling to understand their return on investment—tracking Kubernetes releases, bug fixes, and the value their engineers were creating for the downstream Tanzu ecosystem.
Customers expressed interest in an AI feature focused on natural language semantic understanding. The concept was to enable queries like "Tell me about people who built Kubernetes 1.28" and have the system parse GitHub releases, usernames, and contributions through natural language questions.
We envisioned an advanced automation tool using AI generation to extract insights from semantic queries. We planned to use LLM inference similar to ChatGPT, employing Retrieval Augmented Generation (RAG) techniques to pull relevant information based on input queries.
Our approach involved using embeddings and vector-search techniques like cosine similarity and Hierarchical Navigable Small World (HNSW) graphs to retrieve relevant content. The goal was to understand complex queries and pull in the appropriate contextual information.
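Below is a minimal sketch of that retrieval step, assuming embeddings are plain NumPy vectors; the names are illustrative, and a production setup would use an HNSW-backed vector store rather than this brute-force scan.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, corpus: dict[str, np.ndarray], k: int = 5) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours over summarized issues/PRs; a real
    deployment would use an HNSW index (e.g. in a vector database) instead."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Usage: embed the user's question, retrieve the closest document summaries,
# and hand them to the LLM as context for the final RAG answer.
```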
We aimed to scale this to the top 40,000+ starred GitHub repositories, including major projects like Kubernetes, React, Vue, and Golang. However, we encountered challenges with content variability and potential hallucinations.
To mitigate these issues, we developed a data pipeline that first used a large language model to summarize content before vectorization. This helped standardize and clean the input data, making subsequent processing more reliable.
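As a rough illustration of that ordering (summarize first, then vectorize), here is a hedged sketch of an ingest step. The embedding model name is an illustrative placeholder, and `summarize` can be any LLM call, such as the OpenAI-compatible client shown later.

```python
from typing import Callable

from sentence_transformers import SentenceTransformer

# Illustrative embedding model; swap in whatever embedder your stack uses.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ingest(doc_id: str,
           raw_body: str,
           summarize: Callable[[str], str],
           store: dict[str, list[float]]) -> None:
    """Summarize first, then vectorize: embedding a clean LLM-written summary
    instead of raw issue/PR text (logs, templates, noise) keeps the inputs
    uniform and makes downstream retrieval more reliable."""
    summary = summarize(raw_body)
    store[doc_id] = embedder.encode(summary).tolist()
```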
The primary challenge was cost. Running inference on hundreds of thousands of issues and pull requests using third-party providers like OpenAI or Anthropic would have cost around $300,000 annually.
As a scaling engineer, I explored alternative approaches using open-source large language models like Mistral and Llama, leveraging Kubernetes for compute scaling. We utilized tools like VLLM, llama.cpp, and Ollama to enable inference on our own infrastructure.
By deploying a small GPU-enabled cluster in the cloud, we dramatically reduced costs, making the solution far more economically viable compared to third-party services.
Bart: Once you selected VLLM, you needed infrastructure to run at scale. What made Kubernetes the right platform for this deployment? Walk me through the process of how you ended up deciding on Kubernetes.
John: What was really tantalizing about using Kubernetes wasn't only the compute scaling capabilities, where we could dynamically scale GPU resources up and down in the cloud. When there are many more inference jobs, we can scale to tens more GPUs, and when there's not a lot happening, we can scale down. These are pretty well-understood patterns in Kubernetes these days.
The other appealing aspect was running a service internally inside the cluster—an internal Kubernetes service behind Kubernetes DNS and load balancing. It would be something like vllm.default.svc.cluster.local that our microservices could hit internally without going through a third-party provider. It was local and simple to prove out and ship.
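For context, VLLM serves an OpenAI-compatible HTTP API, so hitting that internal service can be as simple as pointing the standard OpenAI client at the cluster DNS name. The hostname, port, and model name below are illustrative assumptions, not the exact values from this deployment.

```python
from openai import OpenAI

# Point the standard OpenAI client at the in-cluster VLLM Service instead of
# api.openai.com; VLLM's API server listens on port 8000 by default.
client = OpenAI(
    base_url="http://vllm.default.svc.cluster.local:8000/v1",  # illustrative Service DNS name
    api_key="unused",  # VLLM ignores the key unless one is configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the model VLLM is serving
    messages=[{"role": "user", "content": "Summarize this pull request: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```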
We could use standard Kubernetes services and potentially expand into a better service mesh or use internal load balancing with different algorithms. Admittedly, Kubernetes' internal load balancing isn't the greatest, but in startup mode, it worked well for us. We saw the potential to build a small middleware for internal ingress and load balancing much more simply.
Essentially, we were hand-rolling a GPU platform and understanding the operational cost. I was willing to take on that burden versus running something at a much more expensive cost to the detriment of our budget. It was an economies of scale issue: we wanted to scale to such a high degree that using a third-party provider would have been untenable.
We would have needed an enterprise agreement with these companies to ensure the availability required to scale our product. When operational problems arose, I could fix them immediately, unlike relying on a third-party service that might be down for hours. There seems to be a mood shift where third-party services are more volatile than they used to be.
We wanted to run the service ourselves, scale costs as needed, and not be at the mercy of a third-party provider. Kubernetes worked really well for us. At its core, it's an amazing platform for scaling compute, especially GPU compute. In my opinion, Kubernetes is the platform of the future for AI and ML. When you're at the economies of scale where you're using or leasing GPUs from the cloud, it's great to scale node pools up and down and have it just work.
Bart: Now, the VLLM daemon set configuration seems critical to making this work efficiently. How did you configure the VLLM daemon set to properly utilize these GPU resources?
John: I think that was one of the more challenging things we ended up walking through: an exercise in how to solve this for the Kubernetes we had, so that we could run things dynamically, bringing resources up and down. I would say we didn't solve it well. In startup mode, we were on Azure, on AKS. AKS has a hand-built solution for getting the NVIDIA drivers onto the nodes.
It ended up being their controller on our cluster that could watch for nodes labeled with a specific NVIDIA GPU SKU. It would watch for that label and then deploy a daemon set with the NVIDIA driver. This is partly why I'm excited about DRA. We were fine going with Azure and AKS, but the Cluster API person in me from the VMware days was a bit sad, because relying on a proprietary cloud provider component in your Kubernetes cluster reduces portability.
I would have loved a solution where we get NVIDIA nodes and it just works—the driver is there, and we can immediately start doing inference on those GPUs without much worry. That's one of the reasons I'm very excited about DRA: being able to dynamically get resources based on node types, SKUs, and different things like NVIDIA GPU drivers.
The NVIDIA driver part is a challenge, not well solved outside of what the upstream cloud providers offer. The next challenge was getting VLLM onto each node. Our approach was to run a daemon set targeting each node that had a single GPU, labeled appropriately. VLLM would handle pulling in the model and start serving it behind an OpenAI-compatible API.
The challenging part was VLLM getting the model. We explored pre-baking images and even building our own node images, which we quickly abandoned because we didn't want to build an OS for a GPU. VLLM queries Hugging Face for the specified model, but this means waiting one to two minutes for the node to come up while downloading the model. Some models can be up to 10 gigabytes, which takes time.
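To make that startup behavior concrete: VLLM resolves the model by its Hugging Face ID and downloads it on first start. The deployment described here ran VLLM's OpenAI-compatible server inside the daemon set; the offline Python API below is just a compact way to show the same model-pull and generation path, with an illustrative model name.

```python
from vllm import LLM, SamplingParams

# The model is fetched from Hugging Face by ID on first start, which is the
# one-to-two-minute download delay described above.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize this GitHub issue: ..."], params)
print(outputs[0].outputs[0].text)
```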
This is why I've looked at solutions like Kraken—peer-to-peer networks for faster sharing of node layers or container image data across the node network, which is much faster than querying Hugging Face. In our startup mode, we just pushed forward, thinking "this will work for now" and shipped it.
The two main challenges were the NVIDIA GPU driver setup and the VLLM runtime.
Bart: Now, this sounds like a significant infrastructure investment. After implementing the solution, what kind of cost savings did you see compared to using OpenAI?
John: The economies of scale were astronomical. Instead of tens of thousands of dollars every month, we were spending maybe a couple thousand at the biggest scale. In startup mode, we utilized credits from cloud provider agreements. As a small eight-person company, we leveraged existing relationships with Microsoft through their partner program. Our founder Brian had great connections.
For anybody thinking about joining an early-stage startup as a founding engineer or starting their own thing: one surprising aspect is the amount of relationship building in Silicon Valley. It's a lot of networking, which is just business.
The cost for a production-grade cluster on AKS was minimal—maybe a couple hundred dollars a month—with the control plane fully managed by Azure. The significant cost center was GPUs. We chose the cheapest option: T4s, which are considered edge compute-grade but still enterprise-grade.
We had to be mindful of which models we loaded, as anything larger than 7-14 billion parameters would overload the VRAM. When that happened, the GPU would fail and the node would restart—not ideal for a service.
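As a rough back-of-the-envelope check on why that limit exists: a T4 has 16 GB of VRAM, and fp16 weights alone take about two bytes per parameter before counting the KV cache and activations. The numbers below are approximate.

```python
# Approximate fp16 weight footprint versus a T4's 16 GB of VRAM; the KV cache,
# activations, and CUDA overhead all need headroom on top of the weights.
def fp16_weight_gib(params_billion: float) -> float:
    return params_billion * 1e9 * 2 / 1024**3  # 2 bytes per parameter

for size in (7, 13, 14):
    print(f"{size}B params -> ~{fp16_weight_gib(size):.1f} GiB of weights in fp16")
# 7B  -> ~13.0 GiB: tight but workable on a T4 (quantization gives more headroom)
# 14B -> ~26.1 GiB: does not fit in fp16; needs quantization or a larger GPU
```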
We found a middle ground: GPUs that were cheap, scalable, and could use spot instances on Azure. The goal was to balance GPU capabilities with model performance. The Mistral models were amazing—permissively licensed and perfect for these platforms. I'll always root for Mistral because of that.
Bart: Now, AI workloads can be unpredictable in production. What monitoring approaches have you implemented to ensure your system runs smoothly?
John: That's a great question. Running a hand-rolled GPU platform with daemon sets and GPU drivers can be challenging. Even though Azure has a proprietary controller, it would occasionally fail. After researching, I discovered that GPU workloads in the cloud are actually quite unpredictable.
It's not uncommon for a GPU in a data center to suddenly die, requiring physical removal from the rack and network. Suddenly, your cluster goes from 10 T4s to 8 GPUs. The control plane would typically find new nodes within an hour or two, helping to restore capacity.
We benefited from using AKS, which made many operational challenges easier. Even when resources were unreliable or getting ejected, we could usually maintain enough scale by scaling the node pool back up. That's one of the advantages of managed Kubernetes—you don't have to handle every operational detail.
I've dealt with provisioning metal clusters on vSphere, Tanzu, and Bottlerocket, which can be complex. While running metal clusters can work if you have the right hardware and operational scale, managed Kubernetes simplifies many challenges.
Even when nodes weren't completely failing, we'd encounter issues like VRAM exhaustion or intermittent failures where a pod would restart and then continue processing jobs. Sometimes these might have been related to the VLLM layer encountering unexpected issues.
We implemented solid observability practices. GPUs emit great metrics, and depending on your cloud provider and observability stack, you can monitor VRAM usage, track long-running inference jobs, and understand the overall platform health.
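As one concrete example of those signals: NVML, via the nvidia-ml-py package, exposes per-device VRAM and utilization. In a cluster you would more likely scrape NVIDIA's dcgm-exporter with Prometheus, but this sketch shows the underlying metrics.

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(
            f"gpu{i}: vram {mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB, "
            f"utilization {util.gpu}%"
        )
finally:
    pynvml.nvmlShutdown()
```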
OpenTelemetry is making progress in thinking about GPU and AI workloads, including considerations for end-user evaluations and ensuring AI outputs remain reliable and safe.
Ultimately, managed Kubernetes worked well for us, and being aware of the nuances of running large GPU compute environments is crucial.
Bart: Model selection seems to be a key factor in making this approach work. What advice would you give to teams trying to select the right open source models for their specific use cases?
John: That's a great question. Things are changing so quickly right now. If you thought cloud native, or the programming world with its new languages and frameworks, was evolving fast, AI, ML, and especially large language models are moving at an even more incredible pace. With geopolitical developments, companies like DeepSeek making things cheaper to run openly, and OpenAI releasing new models with seemingly PhD-level performance (though at a premium price), I still deeply believe you can accomplish most tasks with a relatively small model.
I believe you can handle most tasks at about 95% effectiveness using a 7 or 14 billion parameter model from Mistral or Llama. The Qwen models are pretty good. Anthropic even wrote about building agents, suggesting you probably don't need complex agent workflows. Most tasks can be accomplished with a good one-shot prompt—providing a single instruction to generate text—rather than by creating complex multi-agent loops.
There are good practices for evaluating what works for your use case: download a model (Ollama works well for smaller models), and craft a good prompt, which remains critically important. Prompt engineering is huge, even with increasingly powerful models.
In our data pipeline consuming GitHub content, we invested significant effort in creating prompts to summarize content effectively. We learned it's crucial to avoid "garbage in, garbage out" by carefully curating input and designing precise prompts. Good prompt engineering can go an incredibly long way and is becoming its own art form.
I recommend trying these models, crafting a good prompt, and seeing how a one-shot approach works for your workflow. You might expand this by implementing tool calling for the LLM or integrating the one-shot into a data pipeline.
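A minimal sketch of that "small model plus a good one-shot prompt" loop, assuming Ollama running locally: Ollama exposes an OpenAI-compatible API, so the same client code also works against an in-cluster VLLM service. The prompt and model name are illustrative, not the exact ones used in the OpenSauced pipeline.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the API key is required by the client
# but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

ONE_SHOT_PROMPT = """You summarize GitHub pull requests.
Write two or three plain sentences describing what changed and why.
Do not speculate beyond the text provided.

Pull request:
{body}
"""

def summarize(pr_body: str, model: str = "mistral") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ONE_SHOT_PROMPT.format(body=pr_body)}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

print(summarize("Adds retry logic to the node controller when GPU nodes fail to register."))
```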
The evaluation problem remains challenging due to the non-deterministic nature of these models. Companies like Braintrust, and products like Microsoft's Azure AI Studio, are developing evaluation tooling. These tools use machine learning to "fuzz" models with prompts, providing assurance about performance.
In our case, small Mistral models worked well for our simple one-shot task of summarizing text, which is relatively straightforward for current language models. For more complex tasks, thorough evaluation practices become essential.
Bart: We've asked many people about how they see the next 10 years of Kubernetes, given that we celebrated 10 years of Kubernetes last year. It's very common that AI comes up in their answers, although I will say that many times they have difficulties making that answer concrete or giving examples. I imagine this might be different when speaking with you. So, thinking about the AI landscape, which is evolving very rapidly, how do you see your architecture adapting as both open source models and Kubernetes tooling advance? What do you see the next 10 years of Kubernetes looking like as more AI and ML workloads make their way onto it?
John: That's a really good question. I have a very hot take: the next 10 years of Kubernetes, especially in the AI ML space, could see another player entering the open source world. Maybe Nvidia would open-source a compute platform for people to use.
While Kubernetes has had an amazing tenure, other systems like Docker Swarm and HashiCorp's platform have emerged. It wouldn't surprise me if a big company like Nvidia, OpenAI, or Google (with Gemini) addresses some current challenges. For instance, downloading large images onto a cluster and getting nodes to handle 10-15 gigabyte workloads is not a good Kubernetes paradigm. Containers were meant to be small, ephemeral, and easy to spin up and down, which is painful for AI ML workloads with big LLMs.
I wouldn't be surprised if another paradigm evolves or Kubernetes adopts a new approach. If I had to critique Kubernetes, it would be its pace in reacting to emerging technologies. GPU drivers and GPU workloads on Kubernetes have been challenging since its inception. Only now, with increased demand for NVIDIA GPUs, are upstream companies being pushed to address these issues.
Having come from VMware, where we sold Kubernetes platforms to vSphere users, I can imagine how difficult it would be to sell a solution for GPU AI ML workloads. My challenge to upstream companies is to continue supporting open source work and dedicate engineering resources to making Kubernetes the platform of the future for AI ML, which I believe it could be.
Bart: What's next for you? And how can people get in touch with you?
John: I'm going to continue building open source AI ML at the Linux Foundation and operating the open source platform. People can get in touch with me at johncodes.com. You can also email me at [email protected].
Bart: Perfect. Thank you, John, for spending time with us today on KubeFM.