The evolution of managed Kubernetes: from infrastructure to workloads
From cluster management strategies to managed Kubernetes services, this interview explores practical approaches to running Kubernetes at scale.
In this interview, Gari Singh, Outbound Product Manager at Google (GKE), discusses:
The trade-offs between shared clusters and dedicated clusters, advocating for an architecture that supports both models depending on workload isolation requirements.
Why managed Kubernetes services like GKE Autopilot are essential for the future, allowing teams to focus on workloads rather than infrastructure management.
His vision for Kubernetes' next decade, where infrastructure becomes completely abstracted, and teams can focus solely on workload management without worrying about nodes.
Transcription
Bart: Who are you? What's your role? And where do you work?
Gari: My name is Gari Singh. I'm an outbound product manager, which is a mystery to most people, and I work at Google on GKE, specifically on containers.
Bart: What are three emerging Kubernetes tools that you're keeping an eye on?
Gari: Some people may be surprised, but I'm actually looking at anything Wasm-related. I think there's possibly something there, whether it be Spin, wasmCloud, etc. I'm also interested in some of the eBPF stuff, specifically Pixie. I think there are some interesting tie-ins to how we can get insights into GPUs and monitoring, and some fast stuff in there. Those are really the main areas of interest for me, to be honest.
Bart: That's all good. Now, on the subject of multi-tenancy: one of our guests, Artem, shared that between a single cluster with multiple environments and dedicated clusters per environment, the latter is easier to manage for a small team. When you serve multiple teams, should you share a single cluster or offer a dedicated cluster per tenant? What does your choice depend on?
Gari: That is a fantastic question. I'll answer it slightly differently. I think you should always plan to be able to support multiple clusters because, at some point, even with huge clusters, you're going to run out of resources, so you should have an architecture that supports that. Generally, namespace-based isolation within a single cluster works if the apps have similar profiles, upgrade cadences, and update requirements, especially for stateless apps, where the blast radius is not as bad. However, people often want to bring in things that aren't namespaced, and namespaces only give you authorization boundaries, not full separation. If you truly need full isolation, want to install your own versions of operators, and limit the blast radius, then you've got to go to a dedicated cluster model. Personally, I think you should have a model that supports both: one that exposes namespace-based isolation as a service but can run on a single cluster or across multiple clusters.
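As a rough illustration of what namespace-based isolation as a service typically involves (all names and limits below are hypothetical, not from the conversation), a tenant namespace is usually paired with a ResourceQuota and a default-deny NetworkPolicy. Note that this only buys you authorization and resource boundaries, which is exactly Gari's caveat:

```yaml
# Hypothetical tenant namespace: authorization and resource boundaries only.
# Cluster-scoped objects (CRDs, operators) are still shared across tenants.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                # hypothetical tenant name
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"        # illustrative limits only
    requests.memory: 64Gi
    pods: "100"
---
# Default-deny ingress so tenants don't reach each other by accident.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}             # applies to every pod in the namespace
  policyTypes:
    - Ingress
```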
Bart: Managed Kubernetes services. Our guest argues that GKE is one of the best managed Kubernetes services because a lot of complexity is taken away from you. For example, he mentioned GKE Autopilot as a way to automatically provision the correct node size and optimize your utilization. What's your opinion on managed Kubernetes services? Are they all created equal? Should they offer a friendlier Kubernetes wrapper, as GKE does, or only the basics without any opinion on how you want your Kubernetes?
Gari: Fantastic question. I'm actually a big fan of the GKE Autopilot model, and not just because I work at Google: prior to coming to Google, I was running about 5,000 clusters, and I thought it would be great if someone could manage my nodes, help with upgrades, and handle resizing. I think things like GKE Autopilot and Node Auto-provisioning make sense for the majority of users. They should be able to focus on their workloads, not the infrastructure. However, there also needs to be an option for people to drill down. If we want Kubernetes to continue to grow in the market, I would say that's essential.
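To make the "focus on workloads, not infrastructure" point concrete: on Autopilot you describe only the workload and its resource requests, and GKE provisions right-sized nodes behind the scenes. A minimal sketch, with a hypothetical app name and image:

```yaml
# On GKE Autopilot there are no node pools to manage: you declare the
# workload and its requests, and nodes are provisioned to fit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                   # hypothetical app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: us-docker.pkg.dev/my-project/app/web:1.0   # hypothetical image
          resources:
            requests:         # Autopilot sizes (and bills) nodes from these
              cpu: 500m
              memory: 512Mi
```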
Bart: Autopilot, being more fully managed, is probably the way to go, with the ability to break out when you need to. Autoscaling and Karpenter. A previous guest preferred Karpenter over Cluster Autoscaler for cluster autoscaling, highlighting its benefits and reliability. In particular, he and his team used Karpenter to consolidate workloads and save around 40% of their cloud bill. Karpenter was also donated by AWS to the Kubernetes project. How do you think this will shape the future of autoscaling in Kubernetes?
Gari: Another good question. There are some real issues in the Cluster Autoscaler itself. Some of the issues Karpenter tried to address, we had already addressed ourselves with things like Node Auto-provisioning and provisioning right-sized nodes, and today we have an alternative in Custom Compute Classes that covers that sort of functionality. I like the fact that Karpenter is now part of SIG Autoscaling. However, I don't like the fact that we diverged a bit; it's kind of like writing your own scheduler. I think that has affected its trajectory and brought back the need to do more work on the Cluster Autoscaler to make it more flexible and add some of that functionality. Hopefully, some of the work we've done with Custom Compute Classes, and the work Karpenter has done, can be brought back into a more uniform stack so that we don't have to rewrite the entire Cluster Autoscaler three different times.
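On the consolidation point above: in Karpenter, both the capacity it may provision and the consolidation behavior are declared on a NodePool. A rough sketch against the v1 API with an AWS-flavored node class (all names and values here are illustrative, not from the conversation):

```yaml
# Illustrative Karpenter NodePool (karpenter.sh/v1). Consolidation is what
# repacks pods onto fewer, cheaper nodes, the source of the savings above.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose       # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # allow both spot and on-demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:           # cloud-specific node settings live elsewhere
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # enables bin-packing
  limits:
    cpu: "1000"               # cap on total provisioned CPU
```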
Bart: Upgrading clusters. Our guest Pierre stressed the significance of tooling and automation in managing Kubernetes clusters at scale. He and his team built tooling and developed procedures to test, manage, and upgrade hundreds of Kubernetes clusters. What's your strategy and process for upgrading a Kubernetes cluster?
Gari: We've been spending a lot of time trying to figure out how to manage a worldwide fleet of clusters on customers' behalf for upgrades. There are many things involved in it. We've tried to automate upgrades to make them seamless and well-tested. Some of the things we've learned come from customers. We see people doing blue-green upgrades across node pools. We see people who want to upgrade only once a year and need maintenance windows and maintenance exclusions. If we can build enough automation and trust into the managed provider system, that's probably the better way to go. However, there will always be people who want to keep control and manage everything themselves.
As you discussed with a previous guest, Nick, regarding fleets, we've tried to do that across as many environments as possible. We try to integrate things like staged rollouts: upgrading your dev environment first, then waiting before your test environment, and then your staging environment. I think we can achieve this with managed services. However, I agree that some people will do it themselves because they have their own specific rules they have to follow.
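One provider-agnostic building block behind seamless automated upgrades is making workloads drain-tolerant. A PodDisruptionBudget tells the upgrade machinery how many replicas it may evict at once while nodes are drained; a minimal sketch for a hypothetical app labeled "web":

```yaml
# Keeps at least 2 replicas of the (hypothetical) "web" app running while
# nodes are drained during an upgrade or a blue-green node pool switch.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```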
Bart: Kubernetes turned 10 years old this year. What should we expect in the next 10 years?
Gari: My hope for Kubernetes is that it will continue to stand the test of time. The API still seems powerful, and it's great that we can run GenAI workloads on it rather than building a new platform. In 10 years, I'd like to see us no longer talking about nodes. Instead, I'd like us to focus on elastic compute that brings the right compute when and where it's needed, without having to discuss the underlying infrastructure. I think we can get to a point where we simply focus on workloads, and I believe we can achieve this in about 10 years.
Bart: What's next for you?
Gari: I'm just going to keep going. Next year, GKE will turn 10; we had 10 years of Kubernetes this year. So I'll be out there trying to spread the word on GKE, pushing things like Cluster Autoscaler and driving containers. My cheesy statement would be: saving the world one container at a time. It's kind of what I do.
Bart: How can people get in touch with you?
Gari: I'm on Twitter, but I don't tweet that much. LinkedIn is probably the best way to find me: Gari Singh. I was the first guy over there. Or you can email me.