How Policies Saved Us a Thousand Headaches
Host:
- Bart Farrell
This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io
Alessandro Pomponio from IBM Research explains how his team transformed their chaotic bare-metal clusters into a well-governed, self-service platform for AI and scientific workloads. He walks through their journey from manual cluster interventions to a fully automated GitOps-first architecture using ArgoCD, Kyverno, and Kueue to handle everything from policy enforcement to GPU scheduling.
You will learn:
- How to implement GitOps workflows that reduce administrative burden while maintaining governance and visibility across multi-tenant research environments
- Practical policy enforcement strategies using Kyverno to prevent GPU monopolization, block interactive pod usage, and automatically inject scheduling constraints
- Fair resource sharing techniques with Kueue to manage scarce GPU resources across different hardware types while supporting both specific and flexible allocation requests
- Organizational change management approaches for gaining stakeholder buy-in, upskilling admin teams, and communicating policy changes to research users
Transcription
Bart: In this episode of KubeFM, Alessandro Pomponio from IBM Research walks us through how his team brought order to two large bare-metal OpenShift clusters that researchers use for AI and scientific workloads. He explains how they built a GitOps-first architecture using Argo CD for configuration management, Kustomize for consistent project setups, Kyverno for policy enforcement, and Kueue for fair GPU scheduling and queuing.
Alessandro shares how they stopped GPUs from being monopolized by idle, interactive pods, how mutating policies transparently add scheduling constraints so large batch jobs stay off GPU nodes, and how resource flavors and cohorts in Kueue let researchers request either a specific GPU type or simply the first one available. He also talks about the organizational side: upskilling the admin team, communicating policy changes to users, and measuring success through the drop in support requests.
For teams managing multi-tenant research clusters with a small admin team, this episode offers real-world strategies for GitOps adoption, policy enforcement, and fair sharing of scarce GPU resources.
Special thanks to Testkube for sponsoring today's episode. Are flaky tests slowing your Kubernetes deployments? Testkube powers continuous testing directly inside your Kubernetes clusters: seamlessly integrate your existing testing tools, scale effortlessly, and release software faster and safer than ever. Stop chasing bugs in production. Start preventing them with Testkube. Check it out at testkube.io.
Now, let's get into the episode. Alessandro, welcome to KubeFM. What three Kubernetes emerging tools are you keeping an eye on?
Alessandro: I would recommend three tools:
First, Kyverno, a Kubernetes-native policy engine made by Nirmata. It truly speaks Kubernetes, uses YAML, and has extensive documentation and example policies.
Second, ArgoCD, probably the most famous GitOps tool alongside Flux. They recently released version 3, so it's worth checking out.
Third, llm-d, a new project announced by Red Hat at Red Hat Summit. Its goal is to make Kubernetes the de facto choice for serving large language models (LLMs) at scale. If you're interested in LLMs, definitely explore this tool.
Bart: So, can you tell us a little bit more about what you do and where you work?
Alessandro: I'm Alessandro Pomponio, a research software engineer at IBM based in Dublin. I'm working on next-generation systems and cloud for science and AI.
Bart: And how did you get into cloud-native?
Alessandro: I always had a passion for cloud-native, the idea of software at scale, but also managing infrastructure at scale. Even when I was in university, I tried to find courses about cloud and distributed systems. With a bit of luck, I ended up where I am.
Bart: Congratulations on being lucky. In terms of the active work you do as a researcher to stay on top of the different changes in the cloud-native ecosystem, what resources work best for you? Blogs, podcasts—what works best?
Alessandro: I like having a feed of things. I usually check LinkedIn; there are many experts you can follow there, as well as pages for official projects and foundations like the CNCF, the Continuous Delivery Foundation, and the Linux Foundation. Just engage with pages like that. Also follow people and experts such as Artem Lajko, one of the guests on your podcast, who is very knowledgeable about platform engineering. There are other folks, like Christian Hood, who share tools from the Kubernetes ecosystem every week. Find what suits you best and the forums you like most.
Bart: If you could go back in time and share one piece of career advice with your younger self, what would it be?
Alessandro: My career isn't that long, but for folks at the start of their career, I would say try to get some hands-on experience. Early in your career, it's basically imposter syndrome city. If you have some experience that you can use to back yourself up, it really helps.
Bart: Good advice. As part of our monthly content discovery, we found an article you wrote titled "Taming the Wild West of Research Computing: How Policy Saved Us a Thousand Headaches". At IBM Research, you manage computing resources for some cutting-edge science. Could you tell us about the environment you're supporting and the challenges you faced?
Alessandro: My team works on a topic called accelerated discovery. The idea is to speed up the scientific discovery process through new tools and technologies. There are many talented people working on this from various backgrounds: AI, HPC, chemistry, mathematics, and other disciplines. This is amazing for science, but many of them don't share our cloud-native background, so for those of us managing the Kubernetes and OpenShift clusters they can sometimes cause challenges.
For them, Kubernetes and OpenShift are simply a way of obtaining resources for scientific work, which is perfectly valid. My team manages two large bare-metal OpenShift clusters for accelerated discovery, and we've definitely encountered issues with misuse and non-cloud-native practices.
For example, we saw people monopolizing GPU resources by spawning interactive pods or launching extremely large batch jobs. Due to the Kubernetes scheduler's behavior, these jobs would be scheduled on nodes with GPUs, often overwhelming the nodes' CPU or RAM capacity and preventing GPU-enabled workloads from running.
Initially, we tried to handle each case manually—logging into the cluster, investigating issues, and providing guidance. However, we quickly realized we couldn't keep up with the demand. We needed to create an automated approach with enforcement rules that we as administrators would define for our users.
Bart: Now, those resource contention issues sound familiar to many cluster administrators. What specific goals did you set to address those challenges?
Alessandro: Our processes were too manual for the volume of requests we started receiving. A few admins were doing things manually on the cluster with no real visibility into each other's actions. For example, if I changed something and forgot to update a Helm chart or inform the others, they wouldn't know what I had done.
We wanted to adopt the GitOps approach, creating a single source of truth for our cluster configuration that would provide everyone visibility and traceability of admin actions. When we give a certain group a resource quota increase, we want to track when it was given so we can reclaim it when needed for other groups to perform their actions.
We also wanted to handle resource starvation, especially near deadlines when people are trying to complete experiments using GPUs for LLMs and other scientific computations. We aimed to ensure everyone has a chance to use these resources.
We wanted to prevent people from creating pods they use like VMs via SSH, as that's not how cloud native works. Instead, we encouraged creating jobs that run their computation and then terminate, freeing resources for others.
We also sought to address the issue of large batch jobs being scheduled on GPU nodes. Previously, we would ask people to add affinity rules to prevent this, but people often forgot, which didn't scale well.
Ultimately, we wanted to provide HPC-like semantics on Kubernetes by implementing job submission queues for a more streamlined experience.
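To make that concrete, here is a minimal sketch of the kind of manifest users would submit instead of an idle pod; the job name, namespace, image, and Kueue queue label are illustrative placeholders, not taken from the episode.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                       # illustrative name
  namespace: team-alpha                   # illustrative namespace
  labels:
    kueue.x-k8s.io/queue-name: h100       # submit through a Kueue LocalQueue instead of running a bare pod
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/team/train:latest   # placeholder image
          command: ["python", "train.py"]                  # runs to completion and frees the GPU
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              nvidia.com/gpu: 1
```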
Bart: And you mentioned adopting open source solutions rather than building custom tools. What technology stack did you end up choosing from the CNCF landscape?
Alessandro: The CNCF landscape is really broad. There's a tool for everything now. Even opening the page can lag your computer, no matter how powerful. But that's the power of open source. Whenever you want to do something in a Kubernetes cluster or in Cloud Native, check the landscape before trying to create something from scratch. By using an established open source tool that has been around for a while, you can stand on the shoulders of giants. There's proven experience, development, extensive usage, and guides that can help you get started quickly.
We started with Argo CD to power our GitOps pipeline. Along with Flux, it's probably the most common GitOps tool. In our case, it's very easy to install because we're using OpenShift, and Red Hat packaged Argo CD as an operator on Operator Hub, so you can do a one-click install. It also has a GUI that makes tracking things easier than using the command line.
For enforcing policies, we looked at both Kyverno and Gatekeeper. We didn't go with Gatekeeper because it uses its own language, Rego, for building policies. As researchers, managing OpenShift clusters is not our primary job, so we wanted something that truly speaks Kubernetes. Kyverno has numerous example policies and uses YAML, making it very easy to get started. It's also powerful. However, if you have specific requirements, always check what tool works best for you; there's no one-size-fits-all solution.
Finally, we chose Kueue for managing HPC-like job scheduling and queuing. It's developed by one of the Kubernetes special interest groups, which provides good backing. It also offers flexible resource semantics that we really appreciate.
Bart: And GitOps, you mentioned this now a couple of times, but we've noticed through our work that it's becoming increasingly popular for configuration management. How did you implement this approach with Argo CD to manage your research projects?
Alessandro: GitOps is becoming very popular. For people on the fence about GitOps, I highly encourage watching one of the talks from GitOpsCon Europe about a month ago by Ryan Edson from Red Hat. He discussed the pros and cons of adopting GitOps, such as the perceived YAML fatigue and potential loss of control. However, he showed that even Fortune 500 companies choose GitOps after proof-of-concept systems because they see a good return on investment.
In our specific case, we wanted to reduce the time spent manually intervening on clusters for simple actions like adding a person to a group, changing resource quotas, or provisioning a project. We also wanted to ensure all projects had a consistent base kept up to date with our latest best practices, as our previous Helm charts weren't maintained.
We chose a GitOps approach and set up a single source of truth repository, converting our Helm charts into Kustomize manifests. This was a personal preference, but I believe Kustomize helps both expert and non-expert users understand what's happening without complex templating. It enables a LEGO-style approach to creating projects and manifests, allowing base configurations with add-ons for specific permissions.
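As a rough illustration of that LEGO-style layout, a project overlay might look like the sketch below; the folder structure, namespace, and patch file names are assumptions for the example, not the team's actual repository.
```yaml
# projects/team-alpha/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: team-alpha
resources:
  - ../../base                 # shared base: namespace, ResourceQuota, LimitRange, RBAC
  - extra-rolebinding.yaml     # project-specific "add-on" granting extra permissions
patches:
  - path: quota-patch.yaml     # e.g. bump the default GPU quota for this project
```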
We leveraged Argo CD and ApplicationSets to streamline our workflow. ApplicationSets allow Argo applications to be created automatically using generators. We used the Git directory generator, which creates an Argo application for each project in a specified folder. This ensures each project inherits our base configuration and can be easily updated.
Argo's auto-sync and self-healing mechanisms cascade changes and reconcile any manual modifications, encouraging users to follow the proper change management process.
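A minimal sketch of such an ApplicationSet, assuming the OpenShift GitOps defaults and a placeholder repository, could look like this:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: research-projects
  namespace: openshift-gitops               # typical OpenShift GitOps namespace
spec:
  generators:
    - git:
        repoURL: https://github.com/example-org/cluster-config.git  # placeholder repo
        revision: main
        directories:
          - path: projects/*                 # one Argo CD Application per project folder
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/cluster-config.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true                     # revert manual changes made outside Git
```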
Bart: Alessandro, one of the key issues that you highlighted was GPU resources being monopolized. How did Kyverno policies help address this problem?
Alessandro: GPUs are definitely among the scarcest resources in any kind of cluster. You can have tens or hundreds of nodes, but only a limited set of GPUs. Especially when dealing with LLMs, which often need multiple GPUs just to be loaded into memory, they become really scarce.
What we observed was that people were creating pods with commands like `sleep infinity` or `tail -f /dev/null` to keep an idle pod running, and then using SSH to log into the pod and use it as an unofficial VM. This behavior, especially near conference deadlines when everybody is trying to access GPUs, was causing us tons of problems. Everybody would reach out and ask us, "Why is my pod in a pending state? I need to run my stuff because my deadline is coming up."
When thinking about how to deal with this, we initially considered preventing people from launching these idle pods by blocking commands like `sleep infinity`. But we realized this was just a symptom of the underlying problem: using pods interactively. Kyverno, as I mentioned before, has a very rich set of example policies, and one of them is about blocking pod exec. We took that policy and made a few changes to support exception mechanisms and allow cluster admins to log in and troubleshoot.
We reached out to users, informing them about the upcoming policies and providing guidance on transitioning to a more Kubernetes-native approach. We also introduced monitoring tools. Kyverno helped with this by providing example policies that audit pod executions. This allowed us to see which namespaces, users, and commands were being used.
Initially, we had a very strict policy where pod executions were only allowed for cluster admins. However, we discovered this was too restrictive, as it also prevented users from running commands like `oc cp` or `oc rsync`. Through troubleshooting and engaging with users, we added exceptions, for example allowing pod executions on non-GPU-enabled pods, which have plenty of room in our clusters. Monitoring was crucial in this process.
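Putting those pieces together, a policy along these lines (adapted from Kyverno's sample policies for blocking pod exec) might look roughly like the following; the rule name, message, and exact exclusions are illustrative:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-pod-exec
spec:
  validationFailureAction: Enforce   # start in Audit mode to monitor usage first
  background: false
  rules:
    - name: deny-exec-for-regular-users
      match:
        any:
          - resources:
              kinds:
                - PodExecOptions
      exclude:
        any:
          - clusterRoles:
              - cluster-admin        # admins can still exec for troubleshooting
      validate:
        message: >-
          Interactive exec into pods is restricted; please run your workload
          as a Job instead, or ask the admins for an exception.
        deny:
          conditions:
            all:
              - key: "{{ request.operation }}"
                operator: Equals
                value: CONNECT       # exec requests arrive as CONNECT operations
# Further exceptions (e.g. for non-GPU pods) can be layered on with extra
# exclude blocks or Kyverno PolicyException resources.
```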
Bart: So, beyond the exec restriction, what other policies did you implement to optimize GPU resource utilization?
Alessandro: In addition to exec restrictions, we tried to set reasonable defaults for the number of GPUs that each namespace could use through resource quotas to avoid cases where one team would monopolize all the GPUs in the clusters.
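As a sketch, such a default GPU cap can be expressed as a namespace ResourceQuota; the namespace name and the quota value here are illustrative:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-alpha              # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"     # default per-namespace GPU cap, adjusted based on monitoring
```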
Monitoring plays a very important role because there are times when GPU scheduling is in high demand, especially near deadlines, where you might want to be more strict. At the same time, when demand is low, you want people to be able to run their GPU workloads at full power. With monitoring, you can choose to increase or decrease resource quotas based on usage.
We also tried to tackle the issue of large batch jobs being scheduled on GPU nodes. Previously, people were asked to use affinity rules to avoid scheduling large batch jobs that weren't using GPUs, but they would often forget. Kyverno's mutating policies allow us to inspect resources being submitted to the cluster. Based on the resources required, we check whether any containers are using GPUs. If not, we automatically add affinity rules and annotations using Kyverno.
We found this approach particularly effective because it runs transparently for users and isn't disruptive. Users who would forget simply don't need to care because we, as administrators, fix it for them.
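The idea can be sketched with a Kyverno mutate rule like the one below; the GPU check and the node label are assumptions for the example and would need adapting to the cluster's actual labels:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: keep-non-gpu-pods-off-gpu-nodes
spec:
  rules:
    - name: add-node-affinity-for-non-gpu-pods
      match:
        any:
          - resources:
              kinds:
                - Pod
      preconditions:
        all:
          # Only mutate pods where no container requests a GPU
          - key: "{{ request.object.spec.containers[].resources.requests.\"nvidia.com/gpu\" || `[]` | length(@) }}"
            operator: Equals
            value: 0
      mutate:
        patchStrategicMerge:
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: nvidia.com/gpu.present   # illustrative node label
                          operator: NotIn
                          values:
                            - "true"
```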
Bart: Fair sharing of resources is crucial in a multi-tenant environment. How did you leverage Kueue to provide equitable GPU access?
Alessandro: Our clusters evolve over time. They are bare-metal clusters, so additional nodes get added and the available resources change. For example, in terms of GPUs we started with T4s, then V100s, and later added A100s and H100s. We wanted to make sure people could access GPUs in a fair way, considering two types of use cases:
- Users needing a specific type of GPU with particular CUDA compute capability or memory requirements
- Users who simply want to access any GPU quickly
By using Kueue, we defined resource flavors and cluster queues for each GPU type. This allowed people to access a specific type of GPU required by their workload by submitting to that queue. Additionally, we took advantage of Kueue's mechanism where queues in the same cohort can borrow resources from each other.
In addition to defining queues for every single GPU type, we created an extra cluster queue, with its own local queues in the namespaces, that has no resources attached. This queue borrows resources from the other queues. For instance, if the H100 queue was fully utilized, submitting to the generic GPU queue would route the request to an available A100. This approach worked really well for us.
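A rough sketch of that setup in Kueue, with a ResourceFlavor and ClusterQueue per GPU type, LocalQueues in the namespaces, and a quota-less "any GPU" queue that borrows from the cohort; all names, labels, and quota numbers are illustrative:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100   # illustrative node label/value
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: h100-queue
spec:
  cohort: gpus                            # queues in the same cohort can borrow from each other
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 8             # illustrative; one such queue per GPU type
---
# A "generic" queue with no quota of its own: workloads submitted here simply
# borrow whatever GPUs are idle elsewhere in the cohort. In practice it would
# list one flavor per GPU type (A100, H100, ...).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: any-gpu-queue
spec:
  cohort: gpus
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 0
              borrowingLimit: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: h100                              # users submit jobs to this queue in their namespace
  namespace: team-alpha
spec:
  clusterQueue: h100-queue
```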
Bart: Many organizations struggle with balancing flexibility for researchers and governance controls. What has been the impact of these implementations on your research environment?
Alessandro: Trying to strike a balance depends on many factors, such as the number of users, how much you trust them, and their resource usage patterns. Monitoring is very important for that. In our experience, the changes we discussed benefited both admins and users.
Users have been able to access their resources with far fewer problems and avoid the long feedback loop of asking us on Slack to do things for them. Now they can simply submit a PR to our GitOps repository with the changes they need. We review the PR, check that everything is done correctly in the right namespace, and then approve it. Argo deploys the changes within seconds of our approval.
For admins, the administrative burden has essentially vanished. We can focus more on research and step in only when actual issues arise. We can also be sure that, by construction, every project adheres to our best practices in terms of quotas, policies, and limits. We don't have to worry about anything when people provision their new projects. Essentially, everybody wins.
Bart: That's great. The article you wrote focuses primarily on technical solutions. But in many of our conversations when we're speaking about the use of Kubernetes, we often run into organizational challenges that can be very important. How did you build support among leadership and ensure alignment with broader research computing strategy in order to get this done?
Alessandro: Hybrid cloud, Kubernetes, and OpenShift have been organizational priorities for IBM for a while. One of the goals that IBM Research and accelerated discovery have had is to look at how Kubernetes native approaches can be adopted for hybrid workloads of HPC and AI.
Enabling this meant we were aligned, so we didn't have to get too much support from our leadership. In our specific case, the clusters had been there for a few years, and we wanted to explore how adopting new technologies could help provide a smoother experience for everybody involved, both users and admins.
We had to make sure that all the admins were in favor of these changes and were able to monitor how these changes affected the users and the admins. Thankfully, measuring these changes was very easy because the number of support requests dropped drastically.
This type of governance required some upskilling in the admin team because not many people were familiar with GitOps. We also had to have clear communication with our users, keeping them in the loop with everything we were doing and making documentation available to them. This meant users had to change their way of doing things.
We engaged them throughout our process, letting them know what actions we were going to take. We applied policies on a subset of users and monitored their progress, checking for any clear problems before applying changes to everybody. We gave them many examples to make everything smoother.
Attending conferences like GitOpsCon really helped us and validated our approach, as we heard more and more people adopting the same strategy.
Bart: For DevOps teams looking to implement similar governance in their Kubernetes environments, what key lessons or advice would you share?
Alessandro: There isn't a one-size-fits-all solution.
Bart: I want a book. I want something on a t-shirt.
Alessandro: I would say start by making sure you have a clear understanding of what you want to achieve. Try to leverage as many open source tools as possible, because most of the time it's proven technology. Choose wisely and start with a smaller proof of concept that you can share with other stakeholders to gain their support.
One important thing is that admins need to be very empathetic with their users, because the users might not know better; you're probably the most knowledgeable people on your team. Make governance as transparent as possible. In our case, that was the automatic addition of affinity rules, but make sure to tell your users what's happening so they aren't surprised.
Adopt monitoring mechanisms to see when things go wrong, because especially at the start they probably will. Be ready to support exceptions, because there are cases where a policy cannot apply. Use automation as much as you can and invest in it early to get the most return. Set up sane defaults, create add-ons with Kustomize or your preferred tool like Helm, and use GitOps all the way.
Bart: Very good. What's next for Alessandro?
Alessandro: We will try to get more engagement with communities like Kyverno, and we're also thinking of open-sourcing some tools we've been working on for large-scale experiments and benchmarking on Kubernetes. Stay tuned for that.
Bart: Excellent. And when you're not doing all this amazing work, what do you do in your free time?
Alessandro: I like going on hikes. I'm heading to the Dolomites in about a month, so I'll get some nature away from all this tech.
Bart: Sounds like a great plan. If people want to get in touch with you, what's the best way to do that?
Alessandro:
Bart: Thank you very much for sharing your time and knowledge with us today. I look forward to speaking with you soon. Take care.
Alessandro: Goodbye.