Bart Farrell: In this episode of KubeFM, we're getting into one of the hardest problems in modern cloud infrastructure: running GPU workloads securely, efficiently, and at scale. My guest is Landon Clipp, a software engineer with a background in high-performance compute who's worked on building GPU cloud platforms at companies like Lambda and is now heading to CoreWeave. In this episode, we talk about what it actually takes to build a GPU-based containers-as-a-service platform, from isolating tenants without sharing kernels, to making GPUs visible to Kubernetes without NVIDIA drivers, to partitioning NVLink fabrics, booting GPU VMs fast, and why a lot of existing tooling simply breaks in multi-tenant environments. If you're interested in confidential computing, Kata Containers, GPU virtualization, and what it really means to run AI workloads safely on Kubernetes, this one is definitely for you. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped thousands of Kubernetes engineers all over the world level up through their training. Courses are instructor-led and are 60% practical, 40% theoretical. They are given to groups as well as individuals, in person and online. Students have access to course materials for the rest of their lives. For more information about how you can level up, go to learnkube.com. Now, let's get into the episode. Landon, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?
Landon Clipp: I'm looking very closely at Kata Containers, first of all, and also confidential containers. I almost see them as the same project. I'm also looking at Edera. Edera.dev is a really interesting technology related to confidential workloads that's coming out. And also Kubernetes dynamic resource allocation.
Bart Farrell: Fantastic. Landon, for people who don't know you, can you tell us a little bit more about what you do and where you work?
Landon Clipp: I'm a software engineer. I'm actually in between jobs right now. I used to work at Lambda, which is a neocloud for AI compute. I worked there for two years, and I'm going to be moving to CoreWeave very imminently, which is another neocloud that a lot of people have probably heard of. My background has always been in high-performance compute. I started off at a high-frequency trading firm here in Chicago called Jump Trading, where I did system administration, system design, software engineering, all that kind of jazz. When I went to Lambda, I did a lot of the same thing, except helping to build out the public cloud platform. And I'm just going to be doing even more interesting stuff at CoreWeave.
Bart Farrell: And how did you get into cloud native?
Landon Clipp: It was actually an accident, I guess. Lambda is a public cloud company. They have a public cloud, and they also have private deployments as well. That's just how I naturally got into it: working for them and building out their back-end platforms. But I also got more directly involved with cloud native when I started to research the Kata Containers project. That's how this whole thing started.
Bart Farrell: The Kubernetes ecosystem moves very quickly. What resources work best for you? How do you stay up to date?
Landon Clipp: Going to conferences, honestly. I don't really look at a lot of publications online. Obviously there's LinkedIn, which is a great resource just to see what people are talking about. But KubeCon is the big one. I actually went to my first KubeCon in Atlanta, and that was just phenomenal. The amount that I learned was breathtaking. So conferences are great.
Bart Farrell: Okay. If you could go back in time and share one career tip with your younger self, what would it be?
Landon Clipp: I would say that leverage is better than money. Don't chase money; chase leverage, chase things that solve business problems. Don't go down the path of solving technically hard problems that don't actually solve any problems for the business or for people. Try to find what the business is trying to solve, get really good at that, and everything else will follow from there.
Bart Farrell: As part of our monthly content discovery, we found an article that you wrote titled GPU-Based Containers as a Service. So we want to get into this topic in a little more detail. You wrote about building a platform where multiple customers can submit a container, say an AI training job or an inference workload, and the platform automatically finds a server with available GPUs, spins up that container in an isolated environment, and manages the whole lifecycle. It's essentially what cloud providers do with virtual machines, but at the container level and specifically for GPU workloads. What problem were you trying to solve? And why is this harder than the regular container platforms that already exist?
Landon Clipp: Traditionally, the way that AI researchers have gotten access to GPU clusters is through some kind of scheduling system like Slurm or Grid Engine or Kubernetes. Most often these workloads would be running on supercomputing clusters bare metal, with no virtualization. And that's really easy, because the hosts will have the NVIDIA kernel drivers installed, so a lot of the existing tooling out there, like GPU Operator, works just fine with that. But the problem with this model is that in order for a customer to get access, they have to commit to a ginormous contract that locks them in for a certain amount of time, and they have to spend a lot of money. Not every customer is going to want that. Some of them want access to a small amount of compute without needing to sign on to these contracts. So the question I was trying to solve was: is there a way that we can run a multi-tenant supercomputing cluster for AI workloads using Kubernetes, and do it in a way that isolates the tenancies from each other safely? And do it in a way that protects their intellectual property and also protects the performance that they expect? This is really hard, because when you involve virtualization, when these tenants have to run inside of virtual machines, we as a cloud provider don't get access to the NVIDIA kernel drivers anymore. So we lose a lot of visibility into the GPUs themselves. And the problem to solve there is: how do you do that when all of the existing tooling out there in the open source community expects NVIDIA kernel drivers? That's really what I was trying to solve in this article.
Bart Farrell: The setup you're working with is pretty substantial. 100 servers, each with eight high-end NVIDIA GPUs. Before getting into how you built this, what were the ground rules? What did the platform need to do? And what did you explicitly decide not to worry about?
Landon Clipp: The main thing I wanted to do was give the users of this platform access not to the Kubernetes control plane, but to a very minimalistic containers-as-a-service API. They're not getting their own control plane. They're getting just an API that says: give me an OCI container, give me some metadata, what's the size and shape of your workload, how many resources do you need, and then work back from there. Another thing I wanted to do was provide access to different shapes of VMs, so I wanted to be able to partition these GPU servers in various different ways. There are two different classes of hardware that you can buy with NVIDIA GPUs. There's NVIDIA DGX, which is the reference platform that NVIDIA itself builds, or you can buy an HGX system, which is made by OEMs like Supermicro, Dell, etc. All of these platforms usually have eight GPUs on every server, connected together through NVLink, and I wanted to see if there was a way we could provide different shapes of VMs, like 1x, 2x, 4x, and 8x instances. How would you do that on these HGX platforms? And also be able to boot these containers really quickly. One of the other issues with virtual machines, historically speaking, is that when you attach GPUs that have really large BARs, the VMs take a long time to boot. So I wanted to see if there's a way we could boot in maybe one to three minutes, if possible, just to minimize the startup time of these workloads.
Bart Farrell: You chose Kubernetes to orchestrate everything, which makes sense. It's built for scheduling containers across the fleet. But almost immediately you hit a wall. The standard NVIDIA tooling that tells Kubernetes this server has GPUs didn't work for your setup. So what's the issue?
Landon Clipp: The issue primarily comes from us as the neocloud provider saying: how do we protect our resources from this untrusted workload? One of the immediate issues that comes into play when you're hosting multiple tenancies is that you don't want to be sharing a Linux kernel, because there are many CVEs out there where, if you're just running in a regular runc container, people have been able to use exploits to escape out of the container, or in some rare instances, escape into other workloads running on the host. The NVIDIA drivers that talk to the GPUs run in the kernel, hence the name kernel-mode drivers, and they are beholden to the same exploits as anything else in the Linux kernel. So you have to give every single tenant their own kernel and their own drivers, which means that, like I was saying before, we as a neocloud provider don't have access to their kernel, and we don't have access to their drivers. Now we can't use anything that expects NVIDIA drivers to talk to the GPUs to answer questions like: what's the status of the GPU? What's its health? Even basic questions like how many GPUs are on this system, because a lot of people will use nvidia-smi, which talks to the driver, and we can't use any of that. It's also been a problem, like I was saying, with NVIDIA's GPU Operator, which expects these drivers to exist. So one of the first issues is that you still have to answer: how many GPUs are on this host? What's their status? And how can you expose that to Kubernetes in a similar way that GPU Operator does?
Bart Farrell: So Kubernetes can't see the GPUs through the normal path. You had to build something custom to make them visible. How did you solve that?
Landon Clipp: What GPU Operator does is it has this thing called a CDI plugin. CDI stands for Container Device Interface. It's basically this thing that talks to the kubelet running on every host and announces the presence of hardware. In GPU Operator, it would talk to the kernel driver to ask what's available. But now we can't do that. So the next place you can look is the PCI topology itself. You can use common tools like lspci to see what devices are there, and you can look for NVIDIA 3D controllers, which is what GPUs are called in the PCI topology. Then you can filter down further by specific NVIDIA devices. For example, if you're looking for a GPU, you would look up NVIDIA devices, which have a vendor ID of 0x10de, and then you would look up your specific GPU, which also has its own device code. I forget what it is for H100s, but every single device has a different code. So that's a pretty easy way to find what exists: you just ask the PCI bus what's there. Then you talk to the kubelet through a custom CDI plugin, which I wrote one of, but they're pretty easy to make. And then you just say to the kubelet: hey, I have eight GPUs, this is the name of them, and you go on with your day.
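The discovery path Landon describes can be sketched in a few lines. This is a hedged illustration, not the article's actual CDI plugin: it reads the same vendor and class codes that lspci reports, straight from the PCI sysfs tree, and the filtering rule is split out as a pure function so it can be exercised on sample data.

```python
# Sketch: discover NVIDIA GPUs by scanning the PCI sysfs tree, the same
# information lspci reads. The filtering logic is separated out so it can
# run against sample vendor/class pairs without real hardware.
import os

NVIDIA_VENDOR = 0x10DE   # NVIDIA's PCI vendor ID
GPU_CLASS = 0x030200     # "3D controller" class code

def is_nvidia_gpu(vendor: int, dev_class: int) -> bool:
    """A device counts as a GPU if NVIDIA made it and it's a 3D controller."""
    return vendor == NVIDIA_VENDOR and (dev_class >> 8) == (GPU_CLASS >> 8)

def scan_sysfs(root: str = "/sys/bus/pci/devices") -> list:
    """Return PCI addresses of NVIDIA 3D controllers on this host."""
    gpus = []
    for addr in sorted(os.listdir(root)):
        with open(os.path.join(root, addr, "vendor")) as f:
            vendor = int(f.read(), 16)
        with open(os.path.join(root, addr, "class")) as f:
            dev_class = int(f.read(), 16)
        if is_nvidia_gpu(vendor, dev_class):
            gpus.append(addr)
    return gpus
```

A real plugin would then report these addresses to the kubelet as named devices; here the sysfs scan is the whole point.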
Bart Farrell: Once a customer's container is scheduled onto a server, it still needs to actually run in isolation. You can't just run it directly on the host for the security reasons you just described. How do you isolate each customer's workload?
Landon Clipp: Like I was telling you before, you can't use the runc container runtime by itself, because it has the problem of sharing a single kernel amongst different tenancies. So what do you do? You have to run them inside of a virtual machine. The virtual machine solves the problem of sharing a host kernel, but it doesn't solve the problem of isolating hardware between tenancies. For these superpod systems, like I said, there are eight GPUs on them, and they're connected together through two different buses: there's the PCI bus, and there's also the NVLink fabric. The NVLink fabric typically has four NVSwitches in it that allow all-to-all communication between all eight GPUs on the host. You want to make sure that not only can the GPUs not talk to each other on the PCI bus, but also that they can't do it on the NVLink fabric. NVIDIA provides the Fabric Manager, which runs as a systemd service that you can talk to to program these NVSwitches to partition the GPUs in a specific way. For example, say you have one server and four 2x instances that you can schedule on it. You would talk to Fabric Manager and say: hey, I want to enable these partitions that correspond to 2x instances. It will then go off, talk to the NVSwitches, and partition them so that GPUs outside these partitions cannot talk to each other. It's a similar thing with PCI: you need to make sure the GPUs can't do peer-to-peer communication. So what do you do? You enable ACS, Access Control Services. That way, every single PCI transaction has to go to the root complex to figure out if it has permission to talk to the other device, which incurs a latency penalty, but it's just what you have to do in this case. That's the cost you incur. So there's that side of it. But there's also the network side of it.
So we're talking about PCI and NVLink, which you can think of as a network in a way. But you also have the Ethernet network. That's where something like Cilium comes into play, with container network interfaces. When you schedule the virtual machines that have containers running on them, you also need to create a network policy in Cilium that says: I'm not going to allow cross-namespace traffic. When a packet comes from the VM, Cilium will inspect the packet, figure out where it's going, and look at its policies to decide if it's allowed to continue. If it is, it allows the packet through. If it's not, the packet just gets blocked. And it's the same for the other direction, for packets coming into the virtual machine. So those are the things you have to think about. Obviously, there are also considerations with any storage mediums you have attached. I didn't look at that specifically; I'm leaving that for storage engineers to talk about. But these are the considerations you have to make.
Bart Farrell: And running every container inside its own VM sounds safe, but it also sounds slow. You had a requirement of booting in under two minutes. What makes GPU VMs specifically hard to start quickly?
Landon Clipp: It's a misconception that VMs are slow. It's true in specific scenarios sometimes, but it's not a hard and fast rule. The biggest problem you run into is the VM boot time. Because, like we said before, the host machine does not have the NVIDIA kernel drivers installed, what basically happens is the virtual machine boots up, and QEMU has to perform DMA isolation on the GPU, because the GPU wants to be able to write into host memory directly. The way QEMU does that is it talks to the VFIO kernel driver, which is attached to the GPU on the host, and asks VFIO to program a specific component on the CPU itself called the IOMMU. It tells the IOMMU: this is the physical memory I have allocated for my virtual machine; I want you to isolate the GPU so that it can only DMA into this specific range of memory. So it has to go off and do that, and that typically involves four-kilobyte pages that the IOMMU maps. That's something that's slow, and it has to happen before the virtual machine boots up. Once you do that for all the GPUs, they get exposed to the guest kernel. Then the guest kernel has to walk the PCIe topology and discover these GPUs, and it has to attach the NVIDIA kernel driver to them. The NVIDIA kernel driver then talks to the GPUs and does initialization, firmware initialization, all this stuff. All these different things caused the VM to boot really slowly, historically speaking. This is a problem that has been solved, or is being solved, on multiple different fronts. A lot of it is being solved inside of the Linux kernel itself. There were a lot of inefficiencies, historically speaking, in the host-side kernel, where it was doing ridiculous things when it was doing these DMA mappings, and also when it was setting up page tables for the guest itself.
This is just a high-level overview of it. But once you get the VM booted and the guest processes have access to the GPUs, for the most part you're going to get performance identical to bare metal, if you do it right, if you configure all the virtualization correctly. That's a big if. A lot of cloud providers don't do that, and there are a lot of pitfalls that can come into play there. But if you do it right, you shouldn't really incur a noticeable "virtualization tax," as people call it.
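To make the DMA-mapping cost concrete, here is some back-of-the-envelope arithmetic. The numbers are illustrative, not measurements from the episode: with classic 4 KiB mappings, a large guest means hundreds of millions of IOMMU entries, which is exactly the work huge pages and newer kernel paths cut down.

```python
# One IOMMU mapping per page: the smaller the page, the more work the
# host kernel does while the VM is starting.
KIB = 1024

def iommu_mappings(guest_ram_bytes: int, page_bytes: int = 4 * KIB) -> int:
    """Number of page-sized DMA mappings needed to cover guest RAM."""
    return guest_ram_bytes // page_bytes

one_tib = 1024 ** 4
four_k = iommu_mappings(one_tib)                  # ~268 million 4 KiB mappings
two_meg = iommu_mappings(one_tib, 2 * KIB * KIB)  # ~524 thousand 2 MiB mappings
```

Roughly a 512x reduction in mapping operations just from larger pages, which is one reason those kernel-side improvements matter so much for boot time.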
Bart Farrell: You also looked at an alternative from Google that doesn't use VMs at all. It intercepts system calls instead. Why didn't that approach work for GPU workloads?
Landon Clipp: That was gVisor. I looked at a lot of different implementations out there, and gVisor was one of the most interesting because it's hard to categorize. What gVisor does is run the process, technically speaking, on the host kernel, but gVisor intercepts all of the syscalls that this process makes. Most of the time, gVisor will intercept the syscall and serve it itself; it doesn't have to do a VM-exit scenario. Sometimes it'll just forward the syscall to the host kernel. In this way, gVisor can control everything that the process sees, because for the most part you're not even involving the host kernel. But this is a big problem specifically when you're talking to GPUs, because what a process does when it talks to a GPU is submit an ioctl syscall to the NVIDIA kernel driver. Inside of this ioctl is a struct that has pointers, file descriptors, all sorts of different things in it. In order to translate from the process's understanding of pointers and file descriptors to the host's understanding, gVisor has to inspect this struct and do translations, which sounds easy enough, right? But the problem is that NVIDIA doesn't guarantee ABI stability between their different kernel driver versions. The struct that may have worked for version 575 might not work for 580. That means the gVisor developers have to update their code every single time a new kernel driver comes out with new struct definitions.
And gVisor is written and maintained by Google. Google is now technically a competitor to NVIDIA, because they have their TPUs. From a business perspective, you don't want to be putting yourself behind Google, a competitor to NVIDIA, in order to use NVIDIA products. They also show in their documentation that they only support a small number of GPUs, like the H100s and some A10s, I think. So it was just a hard block. It's just not something you can rely on, unfortunately.
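The ABI hazard can be illustrated with a toy struct. The layouts here are invented for the example; they are not NVIDIA's actual ioctl definitions. If a newer driver swaps two fields, a translator built against the old layout silently decodes garbage:

```python
# Toy illustration of the ABI-stability problem: the same 12 bytes decoded
# with an outdated struct layout yield the wrong fields.
import struct

# Hypothetical "v575" layout: u32 handle, then u64 pointer.
# Hypothetical "v580" layout: u64 pointer first, then u32 handle.
payload_v580 = struct.pack("<QI", 0x1000, 42)  # pointer=0x1000, handle=42

# A translator still assuming the old v575 layout misreads both fields:
bad_handle, bad_ptr = struct.unpack("<IQ", payload_v580)
# bad_handle picks up the pointer's low bits; bad_ptr is nonsense.
```

This is the treadmill Landon describes: every driver release with changed struct definitions forces a matching gVisor update, or translations silently break.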
Bart Farrell: These GPU servers have something most servers don't, a high-speed fabric connecting all eight GPUs to each other directly on the motherboard, separate from the main bus. When you're splitting a server between customers, how do you prevent one customer's GPUs from talking to another's over that fabric?
Landon Clipp: I actually alluded to this a little bit already; I was getting ahead of myself. It's the NVLink fabric. Like I said, there are typically four switches on this fabric. That's actually not the case for the new superpod systems like NVL72, where the NVLink fabric exits the servers and there's a separate rack full of switches. But you just talk to Fabric Manager. Fabric Manager needs access to the NVIDIA kernel drivers, and we just said the host can't install them. So what do you do? Well, there are three different ways you can run Fabric Manager, and these are all described in NVIDIA's Fabric Manager documentation. There's what's called the full passthrough model, where the guest VM has access to all of the GPUs and all of the NVSwitch devices, and they can partition it however they want, because you're not actually partitioning the GPUs. The model I used is called the shared NVSwitch virtualization model. This is what you have to use, for the most part, if you want to slice up your server amongst different tenancies. What you do there is you have a trusted service VM that only the cloud provider has access to. Inside of the service VM, you have the kernel drivers and Fabric Manager running. Then you have a service in there, or maybe some kind of pipe between the virtual machine and the host, that talks to Fabric Manager, which is basically exposed as a Unix domain socket, and says: hey, these are the partitions I want to enable. It will talk to the NVSwitches and partition them. From that point, you have a very firm hardware block for these partitions, so that the GPUs can't talk to anything outside of their partition.
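On the control-plane side, the shared NVSwitch model boils down to bookkeeping like the sketch below. The partition table here is hypothetical: the real set of activatable partitions comes from Fabric Manager itself, and the activation request goes to its Unix domain socket rather than a Python lookup.

```python
from typing import Optional

# Hypothetical partition table for one 8-GPU HGX host: which physical GPU
# indices each Fabric Manager partition covers.
PARTITIONS = {
    0: {0, 1, 2, 3, 4, 5, 6, 7},                  # one 8x partition
    1: {0, 1, 2, 3}, 2: {4, 5, 6, 7},             # two 4x partitions
    3: {0, 1}, 4: {2, 3}, 5: {4, 5}, 6: {6, 7},   # four 2x partitions
}

def pick_partition(shape: int, gpus_in_use: set) -> Optional[int]:
    """Return a partition id covering `shape` free GPUs, or None if none fits."""
    for pid, gpus in PARTITIONS.items():
        if len(gpus) == shape and not (gpus & gpus_in_use):
            return pid  # the partition we'd ask Fabric Manager to activate
    return None
```

Once a partition id is chosen, the scheduler marks its GPUs as in use and sends the activation request to the service VM; the hardware block Landon describes then comes from the NVSwitches themselves.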
Bart Farrell: Deploying that fabric management service turned out to be harder than you expected. What went wrong?
Landon Clipp: I was originally using Kata Containers to do this. Kata Containers instantiates QEMU virtual machines, also other kinds of virtual machines, but primarily QEMU. I naturally gravitated toward the idea of running Fabric Manager as a DaemonSet within Kata Containers. At the time, that did not work because it just wasn't implemented in Kata Containers, though I would have to look at this again; it might actually work nowadays. So what I did instead, because that didn't work at the time, was spawn my own QEMU virtual machine with libvirt. You can create a configuration file in an XML format called domain XML. You give it the VM image, which has been pre-built with Fabric Manager installed inside of it, and I spawned this virtual machine with QEMU and passed through the right NVSwitches, which you can also find in the PCI topology. You just look for NVIDIA devices that are NVSwitches, which historically show up as bridge devices, although for newer generations of hardware they show up as InfiniBand cards. Once I did that, everything worked. There weren't a lot of issues when I went that route. But the goal is to be able to run this inside of Kata natively, and it might actually work now. I haven't looked at it recently.
Bart Farrell: You were working with servers from a third-party manufacturer rather than NVIDIA's own reference design, and that turned out to matter a lot. What kind of differences did you run into?
Landon Clipp: There was one minor problem I remember running into with Kata. Okay, I'll step back a little bit. The way that these devices are physically wired together on the PCIe bus matters. It's literally just a tree. Inside this tree you have devices, which you can think of as leaf nodes, and then a lot of times you're going to have some number of bridges or switches that connect two devices together. Eventually you have the root complex at the top, which is basically the CPU itself. If you have two devices that are behind a switch or a bridge, they could possibly do peer-to-peer without having to go to the root complex; they can literally just instantiate a peer-to-peer connection through the switch. What the Linux kernel has to do, when it boots up, is walk the PCIe topology and figure out which devices can talk to each other without going through the root complex. When it finds these devices, it puts them into what's called an IOMMU group. When you have devices that are all within a single IOMMU group, the Linux kernel will not allow you to pass through just one device from that group. It'll force you to pass through everything in that group, because it's a danger: you might think the devices are isolated from each other, but they're really not. I found some minor differences in how Supermicro created their PCI topology, where all of the NVSwitches were behind a single IOMMU group, which is fine, but NVIDIA DGX systems actually had separate IOMMU groups for every NVSwitch. And Kata assumed that every single device would have its own IOMMU group. That was an issue because it was just a false assumption. That's just one example. There are a lot of other examples, like NVIDIA DGX systems typically having one InfiniBand card for every two GPUs.
And that's another issue for cloud providers, because if your IB card goes down, or something about the link goes down, you have two GPUs that have to die with it. Whereas Supermicro, and probably Dell as well, I'm not sure, have one InfiniBand card for every GPU. That just makes it easier to maintain as a cloud provider, because it reduces the blast radius.
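The kernel's all-or-nothing rule for IOMMU groups can be sketched as a small check. The device names and group layouts below are illustrative stand-ins for the Supermicro and DGX wirings Landon contrasts:

```python
def can_passthrough(requested: set, iommu_groups: dict) -> bool:
    """VFIO rule: passing a device through requires taking its whole IOMMU group."""
    needed = {iommu_groups[d] for d in requested}
    for device, group in iommu_groups.items():
        if group in needed and device not in requested:
            return False  # a group member would be left behind; the kernel refuses
    return True

# DGX-style wiring: each NVSwitch in its own group, so one can go alone.
dgx_groups = {"nvswitch0": 0, "nvswitch1": 1}
# Supermicro-style wiring: all NVSwitches share one group, so it's all or nothing.
smc_groups = {"nvswitch0": 0, "nvswitch1": 0}
```

This is the assumption Kata got wrong: it expected the DGX-style case, where one device per group is always valid.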
Bart Farrell: Beyond GPU and driver isolation, you also need to keep tenants separated at the network level. How do you prevent one customer's containers from talking to another's?
Landon Clipp: There are two different ways you can go about this. One of the historic ways is to isolate the tenancies at the fabric layer. You can have some kind of service that talks to the Ethernet switches and says: hey, I want you to create routing tables so that only packets between these physical ports are allowed to transfer between them. This comes into play when you have things like virtual private clouds, where you need hard physical isolation. But this is really complicated, because it involves distributed VRFs, you have synchronization problems that can come into play, and you need some pretty good network engineers to maintain the thing. The easier, Kubernetes-native way is to use the container network interface, CNI for short. What I chose was Cilium, which implements the CNI. It works at the eBPF layer. eBPF is the extended Berkeley Packet Filter, a Linux kernel feature that allows you to hook into lots and lots of different parts of the Linux kernel, where you can inspect not just packets but any flow of information. Cilium creates a BPF program and attaches it to the network devices going into and out of these virtual machines. Before a packet ever makes it to the virtual machine, Cilium inspects it, like I was saying before when I was getting ahead of myself, and figures out where this packet is allowed to go. If it's allowed to go to the virtual machine, or come out of the virtual machine, Cilium allows it. Otherwise, it just drops it. Cilium is really nice for a number of different reasons, but one of them is that it understands Kubernetes identities, namespaces, and network policies. It speaks Kubernetes, and that just makes management really easy. There are a lot of other CNIs out there that only work on IP prefixes, like subnets and various things like that. But that's really fragile for a number of different reasons. Understanding Kubernetes identities just makes it a lot easier. So that's what I did.
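A tenant-isolation policy of the kind described can be generated as a plain manifest dict, the way a controller using the Kubernetes API might build it. This is a sketch rather than the article's exact policy; it relies on the Cilium behavior that, in a namespaced CiliumNetworkPolicy, an empty endpoint selector is scoped to that namespace, which is what gives the same-namespace-only effect.

```python
def tenant_isolation_policy(tenant_ns: str) -> dict:
    """Allow pods in `tenant_ns` to talk only to peers in the same namespace."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": "tenant-isolation", "namespace": tenant_ns},
        "spec": {
            "endpointSelector": {},                # every endpoint in the namespace
            "ingress": [{"fromEndpoints": [{}]}],  # empty selector = same namespace
            "egress": [{"toEndpoints": [{}]}],
        },
    }
```

Because the policy matches on Kubernetes identity rather than IP prefixes, it keeps working as pods and VMs are rescheduled and their addresses change, which is the fragility Landon calls out with subnet-based CNIs.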
Bart Farrell: When a brand new customer shows up for the first time, the platform needs to create their entire environment from scratch. How does that onboarding work?
Landon Clipp: It depends on what your product actually is, but there are a number of things that have to happen when a tenant comes into a data center for the first time. The first thing you need to do is make your Kubernetes cluster aware of this new customer. The request comes in, and you tell Kubernetes about the new customer. In my article, I made a CRD called Tenant, and I post that to the API server. From there, a lot of other things happen. One of the first things that needs to happen is that you need to make a network policy that says this tenant is only allowed to talk to itself, and maybe the public internet as well. If your product allows the customer to accept incoming connection requests, maybe you need to instantiate some of that: create some DNS names, some IP addresses, maybe load balancing, all the different network things you might need. Then there are all kinds of things that might need to be done on the host itself. Maybe you need to spin up an NFS mount to their storage volume, whatever startup tasks you need. Once that's done, your data center is ready to accept this tenant. Everything has been isolated properly, and you can start scheduling their containers. I talk a little bit about how I implemented this, like I said, with a Tenant CRD. That allows you to extend the startup process as much as you want, so if you have new features in your product, you can add them to the onboarding process.
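The onboarding fan-out can be sketched as a tiny reconciler: a Tenant custom resource comes in, and a controller derives the concrete setup tasks from it. The API group and field names here are invented for illustration; the article's actual schema may differ.

```python
# Hypothetical sketch of a Tenant custom resource and the onboarding steps a
# controller would reconcile it into. Names and fields are illustrative.

def make_tenant(name: str, shapes: list) -> dict:
    return {
        "apiVersion": "example.com/v1alpha1",  # hypothetical API group
        "kind": "Tenant",
        "metadata": {"name": name},
        "spec": {"allowedInstanceShapes": shapes},
    }

def onboarding_steps(tenant: dict) -> list:
    """Fan a Tenant out into the setup tasks the episode describes."""
    name = tenant["metadata"]["name"]
    return [
        f"create namespace {name}",
        f"create network policy isolating {name}",
        f"provision storage mounts for {name}",
    ]
```

The point of routing everything through one CRD is extensibility: a new product feature becomes one more step in the reconciler, not a new onboarding pipeline.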
Bart Farrell: There's a subtle trust problem here. Tenants are running their own GPU drivers inside their VMs, which means they could theoretically tamper with the GPU firmware. How do you protect against that?
Landon Clipp: NVIDIA says that their cards will not accept any firmware that does not have a valid signature on it. I haven't looked into this, so I don't know exactly how it works. It's probably some kind of public key infrastructure with public and private keys, where when NVIDIA publishes new firmware, they sign it with a private key that only they know, and the GPUs themselves can verify the signature with a known, trusted public key from NVIDIA. That process is very well understood; it's how TLS works. Any time you log into a website, your browser does this. But it would be unwise to trust that the process always works every single time. The other issue is that you want to avoid your customers seeing random firmware versions every time they boot up their VM. So we as a cloud provider have to do firmware leveling. What that basically looks like is: the VM boots up, the customer does all their work, and maybe they flash new firmware onto the GPU because they just want something different, which is fine. But when the VM dies, you need to have a kind of service VM attach to the GPU and install whatever default firmware you have for that data center. That way you fix two things: you fix the firmware leveling problem, and you fix any potentially malicious firmware they may have installed if they were somehow able to hack the device. Then your service machine exits, and you can accept a new customer. Another thing you probably want to do is clear the VRAM as well, which can actually take quite a bit of time depending on how large the VRAM on the device is. You just want to make sure there's no IP that could possibly be leaked. IP meaning intellectual property, not the network kind. Just do your due diligence as the cloud provider.
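As a rough sense of the VRAM-scrub cost (illustrative figures, not benchmarks from the episode): zeroing device memory is bandwidth-bound, so the path you scrub through matters a lot.

```python
def scrub_seconds(vram_gb: float, write_gb_per_s: float) -> float:
    """Lower bound on the time to overwrite VRAM at a given effective bandwidth."""
    return vram_gb / write_gb_per_s

# An 80 GB card scrubbed on-device at an assumed 200 GB/s: under half a second.
fast = scrub_seconds(80, 200)
# The same card scrubbed over a slow host-driven path at an assumed 2 GB/s:
# 40 seconds per GPU, which adds up across eight GPUs between tenants.
slow = scrub_seconds(80, 2)
```

Multiplied across a server's eight GPUs, the slow path adds minutes to tenant turnover, which is why the scrub step deserves attention in the reclaim workflow.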
Bart Farrell: You tested giving a single customer all eight GPUs in one VM. The results were pretty eye-opening. What happened?
Landon Clipp: When I first tried this, I didn't really know what I was doing. I started with a single GPU attached to the virtual machine, and that booted really quickly, in about two minutes. Then I kept attaching more and more GPUs, and the boot time seemed to increase exponentially. I don't know if it was actually exponential, but it felt like it. The effect was that when I attached eight GPUs to the virtual machine, it was taking over 30 minutes to boot. Kubernetes would time out, say, I don't know why this thing isn't booting, kill the pod, and get into an infinite retry loop. So I did a lot of digging and eventually realized I'd made some pretty simple mistakes. The host Linux kernel version was pretty old and didn't have a lot of the new performance improvements that have been implemented in VFIO specifically, but also in a lot of other subsystems of the kernel. There were kernel features I needed to enable, like IOMMUFD. I won't go into the details, but it's very helpful for shaving off a lot of the DMA mapping time. There were also QEMU-specific things I had to look into. Once I did all that, the VM was able to boot very quickly, and this was over a process of about three months of me trying to figure out what was wrong. There were also Kubernetes- and Kata-specific things that had to be done. One of them was that Kata would hot plug these devices into the virtual machine, which was really slow because it hot plugged the GPUs one after another, and there's no technical reason for that. So now you can do cold plugging, which means all the GPUs are available to the virtual machine immediately.
Then, if you're doing direct kernel boot, the kernel sees them immediately when it walks the PCI tree, or if you're booting through firmware like a BIOS or OVMF, it sees them immediately, and you're not waiting for Kubernetes to hand you the next one. So there were all kinds of things that went into it, but that's the high-level overview.
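The hot plug versus cold plug switch Landon mentions lives in Kata's runtime configuration. A hedged sketch of the relevant excerpt, assuming a recent Kata 3.x release with the QEMU hypervisor (exact option names and accepted values vary by version, so check the configuration.toml shipped with your release):

```toml
# Excerpt from a Kata configuration.toml (illustrative values; verify
# against the file shipped with your Kata release).
[hypervisor.qemu]
# Cold plug VFIO devices (the passed-through GPUs) behind PCIe root
# ports so the guest sees all of them on its PCI tree at boot, instead
# of waiting for them to be hot plugged one after another.
cold_plug_vfio = "root-port"
hot_plug_vfio  = "no-port"
```

Separately, the kernel-side fixes Landon describes require a reasonably new host kernel with IOMMUFD support compiled in, so the VFIO DMA-mapping improvements actually apply.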
Bart Farrell: After spending months building and testing all of this, where do you think GPU containers as a service actually fits? Is this ready for production and who's it for?
Landon Clipp: This is the million dollar question. There are two different workloads to consider when you're talking about AI jobs. There's the training job, where you have a model ingesting tons and tons of input data, running backpropagation, figuring out how much error there is, and updating the matrices. And then there's the inference side. With inference, you don't necessarily care as much about high-performance interconnects because you're not doing those all-to-all distributed matrix operations. So inference is a natural fit, because the model has already been created. For the most part, those users don't really care about getting access to an entire virtual machine. What they really want is to run the model in production: it takes input tokens and spits something out. That's really what they care about. So containers as a service is a really nice interface for those users, because all they have to do is build an OCI image and give that image to us. They don't have to say, I want this virtual machine, and then once it boots, run Ansible and deal with all the craziness that goes along with that. They don't want to be system admins. They just want to be AI researchers and put things into production. So that's a really great fit. It's also useful for the kind of customer that, like I said before, doesn't necessarily want to spend millions and millions of dollars on contracts for access to their own private cluster. And that model works
fine, because if you have a team of people who know how to manage these supercomputing clusters, you can run your own inference on that cluster just fine. But my bet is that most customers don't want to do that. They don't really care about that; they just want the smallest interface possible. The other benefit of containers as a service is that you can provide autoscaling. If you have excess capacity that's not being used, you can let your customers autoscale their inference jobs so they automatically spawn on new hardware, and the customer doesn't have to worry about, well, I didn't sign up for enough GPUs in my contract, so I'm out of luck. Instead, we as a cloud provider have excess capacity that you can just use; you don't have to sign an additional contract for it.
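The autoscaling idea, sizing the inference deployment to demand but capping it at the provider's spare capacity, can be sketched as a toy decision function. This is an illustration, not any real autoscaler's algorithm; the function name and parameters are invented:

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int,
                     current: int, spare_gpus: int,
                     gpus_per_replica: int = 1) -> int:
    """Toy autoscaling rule: size the inference deployment to the request
    backlog, but never scale beyond the provider's spare GPU capacity."""
    # How many replicas the backlog calls for (at least one stays up).
    want = max(1, math.ceil(queue_depth / target_per_replica))
    # Hard cap: current replicas plus whatever spare GPUs can support.
    cap = current + spare_gpus // gpus_per_replica
    return min(want, cap)

# Backlog of 50 requests, each replica comfortably serves 8 concurrently;
# 4 replicas running, 10 spare GPUs available beyond the contract.
print(desired_replicas(50, 8, current=4, spare_gpus=10))  # 7
```

The customer never sees the `spare_gpus` term; from their side the job simply scales, which is exactly the "smallest interface possible" Landon is describing.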
Bart Farrell: Landon, what's next for you?
Landon Clipp: I'm in between jobs right now, like I said at the beginning, and I'm excited that I'm going to CoreWeave. I think everybody listening to this podcast will know about them, so I'm just super excited to start working with them. I can't share any details on what we're going to be doing, but it'll be using my strengths as an engineer, and I'm really excited to build with that team and learn a lot from them.
Bart Farrell: And how can people get in touch with you?
Landon Clipp: You can find me on LinkedIn. My name is Landon Clipp, and there aren't many Landon Clipps out there, so you can just search for me. Or there's my blog, topofmind.dev, where I post a lot of my articles on technical stuff and on life stuff. You can reach out to me through either of those.
Bart Farrell: Fantastic. Well, Landon, thank you so much for sharing your time and experiences with us today. I look forward to speaking with you again in the future. Best of luck.
Landon Clipp: Thank you very much.
Bart Farrell: Take care. Cheers.
Landon Clipp: Cheers.