Managing 100s of Kubernetes Clusters using Cluster API

Host:

  • Bart Farrell

Guest:

  • Zain Malik

Discover how to manage Kubernetes at scale with declarative infrastructure and automation principles.

Zain Malik shares his experience managing multi-tenant Kubernetes clusters with up to 30,000 pods across clusters capped at 950 nodes. He explains how his team transitioned from Terraform to Cluster API for declarative cluster lifecycle management, contributing upstream to improve AKS support while implementing GitOps workflows.

You will learn:

  • How to address challenges in large-scale Kubernetes operations, including node pool management inconsistencies and lengthy provisioning times

  • Why Cluster API provides a powerful foundation for multi-cloud cluster management, and how to extend it with custom operators for production-specific needs

  • How implementing GitOps principles eliminates manual intervention in critical operations like cluster upgrades

  • Strategies for handling production incidents and bugs when adopting emerging technologies like Cluster API

Transcription

Bart: In this episode of KubeFM, I got a chance to speak to Zain Malik, a platform engineer specializing in large-scale, multi-tenant Kubernetes systems. He shares how his team managed thousands of pods across clusters capped at 950 nodes, and how limitations with Terraform and node pool consistency led them to adopt Cluster API for declarative cluster lifecycle management.

Zain explains how they contributed upstream to improve support for AKS, handled critical bugs in production, and integrated GitOps workflows to remove manual intervention in upgrades. In this episode, we take a closer look at using Kubernetes as a control plane, building custom operators for production-specific needs, and maintaining infrastructure reliability with minimal staff.

Zain also discusses tooling like kubectl blame, emerging support for partial device allocation via DRA (Dynamic Resource Allocation), and the future of efficient GPU utilization in AI workloads. If you're operating Kubernetes beyond the basics, this episode offers insight into what it takes to scale cleanly and build with intent.

This episode of KubeFM is sponsored by Learnk8s. Since 2017, Learnk8s has been training Kubernetes engineers all over the world. Courses are instructor-led, 60% practical, and 40% theoretical. They are taught in person as well as online, to groups as well as individuals. Students have access to course materials for the rest of their lives so they can keep learning. For more information, check out Learnk8s.io.

Now, let's get into the episode. Welcome to KubeFM. So first of all, what are three emerging Kubernetes tools that you are keeping an eye on?

Zain: These days, if I have to mention three tools, I would start with KEDA. It is the Kubernetes event-driven autoscaling tool that is pretty underrated. If used rightfully, it has a lot of potential, especially when it comes to custom metrics autoscaling.
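
For a concrete picture of custom-metric autoscaling with KEDA, here is a minimal sketch of a ScaledObject rendered from Go; the deployment name, Prometheus address, query, and threshold are illustrative assumptions, not from the episode:

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// Minimal KEDA ScaledObject: scale the "orders" Deployment on a
	// custom Prometheus metric instead of plain CPU/memory.
	scaledObject := map[string]interface{}{
		"apiVersion": "keda.sh/v1alpha1",
		"kind":       "ScaledObject",
		"metadata":   map[string]interface{}{"name": "orders-scaler", "namespace": "default"},
		"spec": map[string]interface{}{
			"scaleTargetRef":  map[string]interface{}{"name": "orders"},
			"minReplicaCount": 2,
			"maxReplicaCount": 50,
			"triggers": []interface{}{
				map[string]interface{}{
					"type": "prometheus",
					"metadata": map[string]interface{}{
						"serverAddress": "http://prometheus.monitoring:9090",
						"query":         "sum(rate(orders_pending_total[2m]))",
						"threshold":     "100",
					},
				},
			},
		},
	}
	out, err := yaml.Marshal(scaledObject)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // pipe into `kubectl apply -f -`
}
```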

The second tool I really enjoy is kubectl plugins, particularly kubectl-blame, which leverages Kubernetes field managers to show which client last modified an object, making troubleshooting easy. It helps you quickly see what is happening with a given object and who is modifying it, without needing audit logs.
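
kubectl-blame reads the managedFields metadata that the API server maintains on every object; a minimal client-go sketch of the same lookup (the deployment name and namespace are assumptions):

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load kubeconfig the same way kubectl does.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	deploy, err := clientset.AppsV1().Deployments("default").
		Get(context.TODO(), "orders", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Each managedFields entry records which client (field manager) last
	// applied or updated a set of fields -- the data kubectl-blame renders.
	for _, mf := range deploy.GetManagedFields() {
		fmt.Printf("%-30s %-10s %s\n", mf.Manager, mf.Operation, mf.Time)
	}
}
```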

The last thing I've been keeping an eye on is the development around DRA (Dynamic Resource Allocation). It is progressing at full speed and is really promising because it will change how we mount or attach underlying hardware to our pods. This adjustment was long overdue.

DRA addresses current gaps in areas like consumable devices, allowing us to use partial resources such as fractions of GPUs or network cards. For example, we can draw part of the bandwidth of two NICs, which is a significant step forward for Kubernetes.
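
As a rough sketch of the direction: a DRA ResourceClaim requests a device from a class, and a pod references the claim via spec.resourceClaims instead of going through node-level device plugins. The resource.k8s.io API is still beta and its schema has shifted between releases, and partial-capacity allocation is still being designed, so treat everything here (including the device class name) as illustrative:

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// A minimal DRA ResourceClaim asking for one device from a class;
	// the class name is an illustrative assumption. Verify the schema
	// against your cluster's resource.k8s.io version.
	claim := map[string]interface{}{
		"apiVersion": "resource.k8s.io/v1beta1",
		"kind":       "ResourceClaim",
		"metadata":   map[string]interface{}{"name": "inference-gpu", "namespace": "default"},
		"spec": map[string]interface{}{
			"devices": map[string]interface{}{
				"requests": []interface{}{
					map[string]interface{}{
						"name":            "gpu",
						"deviceClassName": "gpu.example.com",
					},
				},
			},
		},
	}
	out, err := yaml.Marshal(claim)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```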

Bart: Great. For people who don't know you, can you share a little bit more about who you are, what you do, and where you work at Exostellar?

Zain: I'm a software engineer specializing in infrastructure and platform engineering. My work focuses on using Kubernetes as a control plane. I've been building platforms, operators, and controls to solve infrastructure problems, ranging from Kubernetes and cluster management to addressing auto-scaling challenges. I've been driving this work for a few years now.

Bart: Very good. How did you get into Cloud Native?

Zain: My entry into Cloud Native was many years ago, but it was catalyzed when I started working for Mesosphere. Contrary to popular belief, Mesosphere contributed significantly to Kubernetes, Cloud Native, and open source technologies. I was fortunate to be part of that journey. Their influence on the Kubernetes community was obvious from the start, especially in container orchestration with the Mesos scheduler, their work on the Container Storage Interface (CSI), and later the KUDO initiative (Kubernetes Universal Declarative Operator).

I got to see how different Special Interest Groups (SIGs) in Kubernetes work, their development cycles, and the community. It was a great opportunity to be part of the broader Kubernetes ecosystem, working to make infrastructure more efficient and easier to use.

Bart: Okay, and the Kubernetes ecosystem moves very quickly. How do you stay updated? What resources do you use?

Zain: I rely on a mix of resources to stay up to date. I would not say there's only one source I need to read. I read the Last Week in Kubernetes Digest, which is really good for tracking Kubernetes development. I also stay current with a few Substack blogs and collections of articles. This helps me navigate what's happening because sometimes new authors publish interesting things that get missed on traditional channels. I think it's valuable to have aggregators like KubeFM to see what's going on in the community.

Bart: If you had to go back in time and share one career tip with your younger self, what would it be?

Zain: "Ship it" is the most important thing; polishing the result and doing it perfectly come after. The easiest path to career growth is shipping as soon as possible and learning from the feedback, whether constructive or any other kind that helps shape your career.

Early in your career, you have to get rid of the mindset of building something semi-perfect while being afraid of negative feedback. Get the feedback, because we cannot cover all use cases. It's better to cover one use case than to try to cover ten and ship nothing.

Bart: As part of our monthly content discovery, we found an article you wrote titled "Cluster API". We want to dig into this a little more. Your team at Exostellar is managing a large-scale Kubernetes operation. Can you describe the scale and nature of your infrastructure setup for our audience?

Zain: I changed jobs last month, so I'm no longer working on that system, but I'll go through the architecture we had. We were managing multi-tenant Kubernetes clusters. At one point, we had almost 30,000 pods across different clusters running at any time, fluctuating with peak hours.

The clusters came in different sizes for our tenants. Some were around 400 nodes, and some went up to 950 nodes. The 950-node cap existed because the practical limit of Kubernetes on AKS meant we couldn't go beyond 1,000 nodes, so we divided workloads into new clusters at a maximum of 950 nodes.

The pod density varied significantly depending on the workload and how critical the infrastructure was. Some clusters ran just one or two pods per node, while others packed in up to a thousand pods. In the high-density clusters, we knew we couldn't exceed 300 to 500 nodes. On a good day, we could easily run 10,000 pods on 500 nodes—essentially very tightly packed nodes running together.

Bart: What specific challenges did you face with your original cluster management approach that drove you to seek a better solution?

Zain: Our original setup was infrastructure as code, but it was more Terraform-based. There were several challenges that were slowing our speed because we were a small team that had to stay up to date on cluster management. One of the main pain points was the provisioning time, which would take almost one to one and a half weeks to get everything up and ready to offer to users.

The other issue was the fragmented number of tools we were using to manage the cluster. We used Terraform to provision the cluster, but we also had an operator managing node pools. Every time there were synchronization issues between Terraform state and the other systems, we would encounter problems.

By the nature of Terraform, if a power user or sysadmin with privileged access changed something in the cloud, we would not know until the next time we ran Terraform plan for that particular cluster. When these situations occurred, we had two issues to solve: first, to fix the state and identify who made the change, and second, to address the original task we initially wanted to accomplish.

The work became more of a firefighting effort where we were simultaneously trying to fix things and understand the impact of the changes. These inconsistencies not only delayed migrations to different clouds but also required numerous manual interventions, putting significant strain on our small engineering team.

Bart: You mentioned adopting Cluster API for managing these clusters. For those in our audience who might not be familiar with this more advanced technology, what is Cluster API and what made it particularly appealing for your use case?

Zain: Cluster API is a powerful Kubernetes project that allows you to create and manage clusters using declarative configurations. It uses Kubernetes as a control plane where cluster definitions are present inside Kubernetes clusters through different Custom Resource Definitions (CRDs). Depending on the cloud provider, there might be CRDs specific to that provider. However, there's a cluster object that remains the same for all cloud providers, whether it's EKS or AKS.

This is similar to the pattern Crossplane follows, building Kubernetes on top of Kubernetes itself using it as a control plane. What was more interesting for us about Cluster API was how you build these clusters—it's not just about using cloud provider APIs to get a cluster running. There are certain practices and ways of doing things embedded in Cluster API that make it easier to adopt and move towards production readiness.

The key abstraction was our ability to go multi-cloud and migrate from one cloud to another while maintaining the same Kubernetes cluster representation across different clouds. This was the main factor that prompted us to invest in Cluster API.
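
To make that abstraction concrete, here is a hedged Go sketch of the provider-neutral Cluster object as it would look for AKS via CAPZ. The names are illustrative; on another cloud, only the two references change (e.g. to the EKS provider's kinds):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	// The provider-neutral Cluster object delegates cloud specifics to
	// two referenced objects; for AKS those are CAPZ's managed-cluster kinds.
	cluster := &clusterv1.Cluster{
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-a", Namespace: "clusters"},
		Spec: clusterv1.ClusterSpec{
			ControlPlaneRef: &corev1.ObjectReference{
				APIVersion: "infrastructure.cluster.x-k8s.io/v1beta1",
				Kind:       "AzureManagedControlPlane",
				Name:       "tenant-a-control-plane",
			},
			InfrastructureRef: &corev1.ObjectReference{
				APIVersion: "infrastructure.cluster.x-k8s.io/v1beta1",
				Kind:       "AzureManagedCluster",
				Name:       "tenant-a",
			},
		},
	}
	fmt.Printf("%s/%s -> control plane %s\n",
		cluster.Namespace, cluster.Name, cluster.Spec.ControlPlaneRef.Kind)
}
```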

Bart: It sounds like Cluster API wasn't immediately ready for your specific needs. What roadblocks did you encounter, and how did you overcome them?

Zain: Initially, the Cluster API project was designed for self-managed Kubernetes clusters. For example, if you want to spin up a new cluster in Azure on Virtual Machine Scale Sets, AWS, or GCP, you would create clusters with control plane, Kubernetes API server, and everything running in those VMs, fully self-managed.

In our case, we were slightly different. We wanted to use managed Kubernetes clusters like AKS. It was not a self-managed Kubernetes on top of Azure, but AKS. The support was still experimental, with small initial capabilities in place, but a few things were missing.

For example, when creating a Kubernetes cluster for AKS, we wanted the control plane version to be different from node pools. In large-scale deployments, you cannot upgrade the Kubernetes control plane and node pools simultaneously. The typical approach is to upgrade the API server first, ensure no inconsistencies or broken clients, and then slowly upgrade node pools over weeks.

These features were not supported in Cluster API. So we contributed to the Special Interest Group (SIG) for the Cluster API provider. They have an awesome community of maintainers who were very helpful. We explained our requirements and how they would fit into the production readiness roadmap.

The maintainers guided us on how to make contributions, open pull requests, and test features. Most of the missing features were implemented by them, with some PRs opened by us. The maintainers' spirit was to value our skills and help us progress.

Within a month or less, the critical features we needed to run Cluster API in our infrastructure were almost complete.
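
To make the upgrade ordering concrete, here is a minimal sketch, in plain Go with illustrative version numbers, of the kubelet skew check a control-plane-first rollout has to respect:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minor extracts the minor version from a "v1.29.4"-style string.
func minor(v string) int {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) < 2 {
		return 0
	}
	m, _ := strconv.Atoi(parts[1])
	return m
}

// canUpgradeControlPlane reports whether bumping the control plane first
// keeps every node pool within the supported kubelet skew: kubelets may
// lag the API server but never lead it. maxSkew is 3 minors on recent
// Kubernetes releases, 2 on older ones.
func canUpgradeControlPlane(target string, pools map[string]string, maxSkew int) error {
	for name, v := range pools {
		if minor(v) > minor(target) {
			return fmt.Errorf("pool %s (%s) would be newer than control plane %s", name, v, target)
		}
		if minor(target)-minor(v) > maxSkew {
			return fmt.Errorf("pool %s (%s) would exceed the %d-minor skew against %s", name, v, maxSkew, target)
		}
	}
	return nil
}

func main() {
	pools := map[string]string{"general": "v1.28.9", "gpu": "v1.28.9"}
	// nil: safe to upgrade the API server to v1.29 first, then roll pools.
	fmt.Println(canUpgradeControlPlane("v1.29.4", pools, 2))
}
```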

Bart: Even after creating clusters with Cluster API, there seems to be a big gap between having a basic cluster and having one that's truly ready for production workloads. How did you address that?

Zain: Cluster API offers a robust foundation for creating a production-ready cluster, but there are always parts specific to each organization's business needs. Cluster API is slightly opinionated about how to do things, which may not always fit perfectly.

For example, in an Azure AKS cluster, we might want access to multiple container registries with granular access—something that wouldn't fit into the standard Cluster API cluster definition scheme. However, the advantage of choosing Cluster API is its Kubernetes-based control plane. We could write another operator that creates resources alongside the cluster definition custom resource, adding the additional configuration we need.

This approach allows us to create ACR permissions for pulling images to multiple registries without imposing our specific niche requirements on the broader community. The main lesson is that we can leverage the community's direction with Cluster API and then extend it to make it more production-ready and robust for our specific needs.
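
A hedged sketch of that companion-operator pattern with controller-runtime. ClusterAddonsReconciler, ensureACRPull, and the registry list are hypothetical names for illustration, not Cluster API or Azure APIs:

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ClusterAddonsReconciler watches Cluster API Cluster objects and creates
// organization-specific resources next to them (here: registry pull access).
type ClusterAddonsReconciler struct {
	client.Client
	Registries []string // e.g. {"team-a.azurecr.io", "shared.azurecr.io"}
}

func (r *ClusterAddonsReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster clusterv1.Cluster
	if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	for _, registry := range r.Registries {
		// Hypothetical helper: grant the cluster's kubelet identity pull
		// access on one registry via the cloud provider's API.
		if err := r.ensureACRPull(ctx, &cluster, registry); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

func (r *ClusterAddonsReconciler) ensureACRPull(ctx context.Context, c *clusterv1.Cluster, registry string) error {
	return nil // placeholder for Azure role-assignment calls
}

// SetupWithManager wires the reconciler to Cluster events (remember to add
// clusterv1 to the manager's scheme).
func (r *ClusterAddonsReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&clusterv1.Cluster{}).Complete(r)
}
```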

Bart: Node pool management seems to be a particular pain point in Kubernetes operations. For those exploring advanced cluster management, what specific challenges did you face with node pools, and how did your team solve them?

Zain: Node pool management is particularly interesting and challenging. Depending on which part of the project you're working on and the problem you're solving, you might face different challenges. Notably, the node pool concept itself doesn't exist in Kubernetes. There's no native Kubernetes node pool object.

Cloud providers use underlying APIs like virtual machine scale sets (in Azure) or GKE instance groups as node pools inside a cluster. When relying on these underlying APIs, we naturally inherit their limitations. For example, if creating a virtual machine scale set in Azure, it must have the same hardware specifications, and the configuration is typically immutable once created.

If we want to recreate a node pool or change its configuration, we may find ourselves constrained by the cloud provider's API limitations. What appears to be a node pool limitation is actually an abstraction built on top of the cloud provider's API.

To address these challenges, we can develop an automation strategy with a node pool operator that creates a true abstraction matching our specific needs. This might involve properly draining old node pools and creating new ones when making changes to immutable configurations.
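
A minimal sketch of the "replace, don't mutate" half of such an operator, assuming nodes carry a pool label (the label key is an illustrative assumption): cordon every node in the old pool, then evict its pods through the Eviction API so PodDisruptionBudgets are honored, before deleting the empty pool.

```go
package nodepool

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// drainNodePool cordons and drains every node in the given pool.
func drainNodePool(ctx context.Context, cs kubernetes.Interface, pool string) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "example.com/node-pool=" + pool,
	})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		// Cordon: mark the node unschedulable so no new pods land on it.
		patch := []byte(`{"spec":{"unschedulable":true}}`)
		if _, err := cs.CoreV1().Nodes().Patch(ctx, node.Name,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
		// Evict pods gracefully; the Eviction API respects
		// PodDisruptionBudgets instead of deleting pods abruptly.
		pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + node.Name,
		})
		if err != nil {
			return err
		}
		for _, pod := range pods.Items {
			_ = cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, &policyv1.Eviction{
				ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
			})
		}
	}
	return nil
}
```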

Bart: Throughout the article, you mention your commitment to GitOps principles. How central was GitOps to your automation strategy, and how does it fit with the Kubernetes operator pattern you've implemented?

Zain: GitOps is fundamental, not just for cluster management but also for developing our own operators and controllers, which we still deploy through GitOps. Even before we moved from Terraform to Cluster API, the main source of truth was Git. We need a single source of truth where changes are reviewed through pull requests and then automatically deployed to the target or management cluster.

This is particularly important when performing critical procedures, such as node pool upgrades. It cannot depend on someone's laptop or UI actions without a review process. Instead, we open a pull request to change, for example, from Kubernetes version 1.28 to 1.29, and we can review the potential impact. We ensure we don't upgrade all node pools simultaneously, but rather one node pool at a time. This is the fundamental value GitOps provides us as a source of truth.

Bart: Being an early adopter of new technologies often comes with its fair share of risks. Can you share an example of a challenge or incident you faced as an early adopter of Cluster API for AKS?

Zain: I can share a war story about the Cluster API. No matter how well you have tested your end-to-end scenarios or performed proof of concept in staging clusters, things break in unexpected ways. This is particularly challenging for emerging technologies.

In our case, there was a bug in the Cluster API, specifically with the AKS CAPZ implementation, that would only be triggered when scaling down the cluster. During scaling up, the issue wasn't apparent. When migrating from Google to Azure, we started with one or two nodes, gradually increasing to 5%, then 50%, and eventually 80%. Everything seemed stable.

After shifting completely to Azure at 100%, the first scale-down event revealed a critical issue. The bug involved provider IDs of node pools in Cluster API, where node identification could be mixed between different pools. This could trigger a cascade effect, potentially misidentifying and deleting nodes across multiple node pools.

During this scale-down event, approximately 60% of our nodes were deleted without graceful cordon or drain. This caused an outage, triggering immediate alerts. We identified Cluster API as the source and temporarily mitigated the issue by scaling down and manually scaling up the deployment.

Fortunately, when we reported this to the Cluster API maintainers, they responded quickly. Despite the time difference—evening in Europe and morning for them—they joined a call within minutes. In just 15 minutes, they confirmed the bug, and 30 minutes later a pull request with the fix was open.

This zero-day incident doesn't mean we should avoid Cluster API. It's been long fixed, and as early adopters, we helped prevent similar issues for other users and customers.

Bart: Your team managed to double the number of clusters while maintaining the same number of engineers. How did this transformation impact operational efficiency and team dynamics?

Zain: The evolution of our tooling and automation meant that we were doing less firefighting and spending less time on node pool upgrades. Without growing our team, we were able to manage a larger number of workload clusters. Operational efficiency increased dramatically in a short period of time.

The number of engineers dedicated to node pool upgrades was reduced to just one engineer handling the task of upgrading all production clusters. This meant we were using fewer people and could rotate different engineers through the task. Before Cluster API, only one person had comprehensive knowledge of how node pool upgrades would play out.

With Cluster API, we simply needed to create a GitOps pull request, approve it, and ensure things were progressing as expected. Previously, upgrades often caused outages affecting users—workloads would get stuck, causing impatience and potential disruptive actions like deleting workloads, which would then lead to user complaints.

After implementing our automated operations, we established defined rules where we would not disrupt workloads if they followed certain practices and were within specific time windows. Removing human intervention from these operations significantly improved our efficiency in managing Kubernetes clusters and the upgrade process.

Bart: Looking ahead, what's next on your roadmap for scaling Kubernetes management even further?

Zain: Right now, the topic gaining more attention is multi-cluster management. We can manage multiple clusters, but not through the same multi-cluster operators. The focus is on workloads deployed across different clusters, where workloads are scaled and potentially evicted from one cluster and deployed to another.

There is interesting tooling emerging, such as Fleet Manager for AKS, Orchestra from Google, and Karmada. These tools are reaching maturity after being experimental for a few years, which is something I'm keeping an eye on.

Another area I'm constantly working on is Kubernetes cluster efficiency. We've long complained about waste when running things on Kubernetes, and this waste is now catching up with us. The priority is to address efficiency, focusing not only on CPU optimization but also GPU optimization, in an open-source and standardized way without compromising reliability.

Bart: For teams who might be inspired to follow a similar path, what would be your most important advice based on your experience with Cluster API and automating Kubernetes management?

Zain: My advice would be to take bold bets on emerging technologies. There will always be some issues with open-source software, but just because others are not using it yet doesn't mean we shouldn't. Starting from scratch involves a lot of work; by betting on these technologies and contributing fixes where they fall short, we help mature the Kubernetes ecosystem, which benefits teams in the long term.

Do not be afraid of engaging with the community. Attend SIG meetings and get involved—less hating, more appreciating. If someone is not solving something, it's likely because they already have a lot on their plate. See this as an opportunity to contribute rather than criticizing the project's pace.

Lastly, leverage the operator pattern as much as possible. Kubernetes provides this for free—writing operators, controllers, and frameworks built on Kubernetes patterns will eliminate human errors, streamline processes, and help establish a baseline for platforms built around Kubernetes objects.
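
As a minimal sketch of how little boilerplate that takes with controller-runtime (the watched type and reconciler name are illustrative):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// The framework supplies caching, work queues, and retries with backoff
// (leader election is one option away); you supply one idempotent Reconcile.
type NodeReconciler struct{ client.Client }

func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Drive one object toward the desired state; return an error to retry.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Complete(&NodeReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```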

Bart: What's next for you?

Zain: I'm moving toward compute efficiency, which is where the market is trending. There's a lot of waste happening on CPU and GPU, and a significant opportunity to optimize it. Look at China, where there's a shortage of GPU chips: teams manage with less by optimizing every byte of memory used to train their models and frameworks.

Emerging cool operators from China are pushing the boundaries of getting the most out of limited resources. The mentality should shift from thinking we need thousand-GPU clusters or more chips, to using existing resources more efficiently.

We should leverage Kubernetes patterns like Node Problem Detector and KEDA—technologies proven to work at scale—and apply these learnings to the AI and ML communities. I'm looking forward to working on different projects that focus on GPU optimization, and so far, it's looking really promising.

Bart: There's plenty of work to do. What's the best way for people to get in touch with Zain Malik?

Zain: You can reach out to me on LinkedIn. Just search for my name, Zain Malik, and I'm sure I will appear there. I currently work with AccessTrader (Nightpay), and my GitHub profile is also available. Feel free to send me a message about my work. I'm not very active on X anymore, so LinkedIn is the best way to connect these days.

Bart: Well, Zain, thank you for your time today. Thanks for sharing your knowledge, and I look forward to catching up with you in the future. Take care.