From GPU optimization to background orchestration
Nov 25, 2025
In this interview, Dan Mattox, Senior Director of Engineering at Exostellar, discusses:
GPU optimization beyond utilization metrics - How teams should focus on allocation, activity, queue times, and workload lifecycle
Dynamic Resource Allocation (DRA) - How DRA in Kubernetes 1.32 will make GPUs first-class citizens instead of bolt-on additions
Strategic build vs buy framework - Build in-house when the functionality is core to your business and deep knowledge matters; lean on open source and CNCF community solutions for everything else
Relevant links
Transcription
Bart: So, who are you, what's your role, and where do you work?
Dan: I'm Dan Mattox, and I run engineering at Exostellar.
Bart: In terms of the Kubernetes ecosystem, what are three emerging Kubernetes tools that you're keeping an eye on?
Dan: The most exciting thing from the conference was the AI conformance tests coming out of the CNCF, and seeing where those could go to get AI platforms ready in the Kubernetes landscape. Separate from that, DRA is super exciting; I think it's a really foundational change to how Kubernetes operates. Everything in the eBPF space for network routing is also exciting, and I'm watching to see where it goes.
Bart: Cilium is now overtaking OpenTelemetry as the second most popular project in the CNCF. Our podcast guest Dave observed that GPUs are such energy-intensive devices that energy was considered from the ground up, and NVIDIA DCGM gives pretty good estimates of full-card power. What GPU metrics do teams overlook that matter for optimization? Is utilization even the right primary metric anymore?
Dan: Largely, what we've seen internally and from talking to many customers is that the scarcity and cost of GPUs are driving other metrics to emerge as just as important. These include allocation (GPUs are assigned and available to be used), activity (they're actively running workloads), and utilization (the resources they expose are actually being exercised). On top of that, the scorecard focuses on queue times and the overall workload lifecycle: how long it takes for a job to get scheduled and run. The goal is to improve the efficiency of clusters and jobs, ensuring end users, AI practitioners, and developers get a high-quality experience.
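As a rough illustration of that scorecard, the sketch below pulls cluster-level allocation and activity numbers from a Prometheus endpoint. The Prometheus address and the metric names (kube-state-metrics' kube_pod_container_resource_requests and kube_node_status_capacity, dcgm-exporter's DCGM_FI_DEV_GPU_UTIL) are assumptions about a typical monitoring stack, not anything Dan prescribes; verify them against your own exporters.

```python
# Sketch: a minimal GPU scorecard pulled from Prometheus.
# Assumes dcgm-exporter and kube-state-metrics are scraped by a Prometheus
# instance reachable at PROM_URL; metric names vary by exporter version.
import requests

PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical in-cluster address

def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first result as a float (0.0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Allocation: how many schedulable GPUs pods have actually requested.
requested = instant_query('sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})')
capacity = instant_query('sum(kube_node_status_capacity{resource="nvidia_com_gpu"})')

# Activity: average utilization reported by dcgm-exporter across all GPUs.
activity = instant_query("avg(DCGM_FI_DEV_GPU_UTIL)")

print(f"allocation: {requested:.0f} of {capacity:.0f} GPUs requested")
print(f"activity:   {activity:.1f}% average GPU utilization")
# Queue time would come from scheduler or job-queue metrics (how long pods sit
# Pending before binding), which depends on the queueing system in use.
```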
Bart: Another guest of ours, Alessandro, thinks you should check the CNCF landscape before building custom tools to stand on the shoulders of giants. How do you approach the build versus buy decision in Kubernetes?
Dan: This is a great question. In my opinion, if it's part of your core business and core responsibilities, building is always the answer. Deep knowledge and truly understanding how the product and platform work is crucial to ensure you can support your users and use cases. If it's not part of your core business, lean on open source, leverage what's in the CNCF community, and see what you can either buy or implement to alleviate that burden on yourself.
Bart: Our guest mentioned that dynamic resource allocation, coming in Kubernetes 1.32, will help dynamically allocate drivers and resources to nodes without provider-specific plugins. How do you see DRA changing the way we manage specialized hardware like GPUs in Kubernetes?
Dan: DRA is a major shift in the way GPUs are managed within Kubernetes, making them first-class citizens. The device plugins that came before were more of a bolt-on that required accelerator providers and chip manufacturers to build specialized, customized implementations. Moving to DRA lets us take better advantage of accelerators overall, including GPUs, SmartNICs, and similar technologies. It points us in a more GPU- and AI-focused direction, prioritizing getting work done efficiently, and it changes the foundation of how resources get allocated. Exostellar has been working on a DRA plugin since the feature was in alpha and is in the early stages of preparing it for adoption, waiting for customers to be ready to implement and deploy the technology.
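To make the first-class-citizen point concrete, here is a minimal sketch of what requesting a GPU through DRA looks like. The field names follow the resource.k8s.io v1beta1 API that shipped as beta in Kubernetes 1.32, and the gpu.example.com device class and trainer image are placeholders; the real device class names are defined by whichever DRA driver your cluster runs.

```python
# Sketch: requesting a GPU through DRA rather than the classic device plugin.
# Builds a ResourceClaim and a Pod that references it, then prints them as YAML.
import yaml  # pip install pyyaml

claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu"},
    "spec": {
        "devices": {
            "requests": [
                # deviceClassName is driver-defined; gpu.example.com is a placeholder.
                {"name": "gpu", "deviceClassName": "gpu.example.com"}
            ]
        }
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},
    "spec": {
        # The pod references the claim by name instead of asking for a
        # vendor-specific extended resource in resources.limits.
        "resourceClaims": [{"name": "gpu", "resourceClaimName": "single-gpu"}],
        "containers": [
            {
                "name": "trainer",
                "image": "example.com/trainer:latest",  # placeholder image
                "resources": {"claims": [{"name": "gpu"}]},
            }
        ],
    },
}

print(yaml.safe_dump_all([claim, pod], sort_keys=False))
```

Once objects like these are applied, the scheduler and the DRA driver negotiate which physical device satisfies the claim, rather than the pod asking for a vendor-specific extended resource such as nvidia.com/gpu.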
Bart: Kubernetes turned 10 years old last year. What should we expect in the next 10 years?
Dan: In the next 10 years, Kubernetes as a platform should become something most users forget about. It should be a behind-the-scenes technology that makes your work get done faster, your servers run longer, your workloads run better, and your applications run more effectively. I'm really excited to see Kubernetes drift into the background over the next 10 years while still being a very core part of our infrastructure.
Bart: What's next for you, Dan?
Dan: We're pushing forward on our AI infrastructure management product. We hit GA this week, which was a huge milestone for us. We're looking forward to expanding the product further and accelerating beyond that. If people want to get in touch with me, LinkedIn is probably the best way; you can find me at Dan P. Mattox. You can also find me on GitHub at Exo-Dan, or email me at danm@exostellar.ai.



