SaaS with Kubernetes Operators and Garbage Collection

Apr 28, 2026

Host:

  • Bart Farrell

Guest:

  • Alexander Held

A single Kubernetes CRD for every service request turns small changes into full-platform reconciliations.

Alexander Held, former platform engineer at Mercedes-Benz Tech Innovation, describes a production refactor from a 2,000-line CRD to purpose-built resources and controllers. He shows how teams can model business workflows as Kubernetes APIs and then use owner references, finalizers, and events to keep platform operations predictable.

You will learn:

  • Why monolithic CRDs create performance and troubleshooting problems

  • How controllers turn database provisioning and backups into reconciliation loops

  • How finalizers clean up external resources such as S3 backups

  • Why Kubernetes events make platform workflows easier to debug

Subscribe to KubeFM Weekly

Get the latest Kubernetes videos delivered to your inbox every week.

Transcription

Bart Farrell: In this episode of KubeFM, we're joined by Alex, a freelance engineer and former platform engineer at Mercedes-Benz Tech Innovation. This episode is a deep, practical look at building and operating Kubernetes platforms at scale, based on real production experience supporting hundreds of internal teams. We focus on operator-driven architecture, breaking down a monolithic CRD into multiple purpose-built resources, using controllers to encode business logic, and relying on owner references, finalizers, and events to manage lifecycle cleanup and observability in a multi-tenant environment. Alex walks through concrete examples, including database provisioning, backup workflows, garbage collection, and why CRDs are both Kubernetes' most powerful and most dangerous feature. If you're designing internal platforms or running Kubernetes as a service, this episode goes deep into the mechanics that actually make these systems operable over time. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped Kubernetes engineers from all over the world level up through Kubernetes courses. They are instructor-led and are 60% practical, 40% theoretical. Students have access to course materials for the rest of their lives. They are given in-person and online to groups as well as individuals. For more information about how you can level up, go to learnkube.com. Now, let's get into the episode with Alex. Now, Alex, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?

Alexander Held: Hey, Bart. So, right now, that must be Flux. Talos is really promising. I have like a love-hate relationship with it right now. And Falco.

Bart Farrell: And Alex, for people who don't know you, who are you and who do you work for?

Alexander Held: So full disclosure first: right now I'm freelancing and basically just working for myself. That means I don't work for Mercedes-Benz Tech Innovation, which we'll be talking about a little bit today, and my thoughts and opinions are my own, not theirs. Right now, I'm developing with a small team a low-cost cloud provider called Vyrt Cloud, where we want to provide managed Kubernetes services for under five euros and regular VM boxes for even under one euro, plus IP addresses. Another project I'm working on is called Classify. It's an AI document classification and workflow engine where you can ingest documents from various places and execute engines on them. Stay tuned, they will be released soon.

Bart Farrell: Looking forward to that. Maybe we'll have to have another podcast episode about those too. That sounds great. Alex, how did you get into cloud native in the first place?

Alexander Held: So actually, I started my professional career kind of there. I started in a microservice team using Azure Service Fabric, and we built a decentralized ERP system for a medium-sized company. It kind of worked until it didn't. We had some scaling issues, and the ecosystem was pretty obviously moving in a different direction from Azure Service Fabric. We eventually adopted Kubernetes, and I think that was the right decision at the time. And then I just continued from there.

Bart Farrell: Okay. And what were you doing before getting into cloud native?

Alexander Held: I started programming when I was nine years old. My parents were both in IT, and coding was normal for me. I think I started with C#. Then during school I did some Python, later some Swift. But eventually I realized it's not just a hobby, and I started a job in the .NET field, with C#.

Bart Farrell: Very cool. Wow, so you had a very early start. Alex, the Kubernetes ecosystem moves very quickly. How do you stay up to date with all the changes that are ongoing? What resources work best for you? And if you could go back in time and share one career tip with your younger self, what would it be?

Alexander Held: Well, I would say keep the complexity low, Alex, because back then I really preferred those fancy, shiny solutions. Especially in .NET or Java contexts, it's really easy to get distracted by all those generics and factories and patterns, and then you end up writing complex code where it's not even necessary. Most often those patterns don't pay off, and the future me paid the price for that. So today I prefer a simpler approach, because that just scales better mentally.

Bart Farrell: So as part of our monthly content discovery, we found an article that you wrote titled From Dumpster Fire to Sparkling Clean: SaaS with Kubernetes Operators and Garbage Collection. I must say this is probably the coolest title we've come across in the over 80 episodes that we've done of this podcast, but we want to dive into this topic a little bit more. So Mercedes-Benz Tech Innovation has been offering managed services on Kubernetes since 2019, now serving over 500 projects. Can you tell us about the platform's evolution? And what drove the need for a major architectural change?

Alexander Held: Sure. I think by the time we're speaking right now, it might be far over 500 projects. But back then, we had basically two teams. One team was more like a managed Kubernetes for in-house use at Mercedes-Benz, and then our team was built on top of it. We provided those managed services — for example Postgres, MongoDB, the ELK Stack, or Prometheus and Grafana — all the kind of stuff that just needs to work properly in Kubernetes. And for all these different services, we had one operator and one CRD. When the customer said, ah, I need a 50-gigabyte Postgres database, every piece of information would go into that CRD. And it was quite huge. It was extremely huge. I think if you unfolded everything in your code editor, it would be like 2,000 lines of YAML, sometimes repeating itself. It was a mess. So we thought about splitting that up, and in the end it turned out to be a really good decision.

Bart Farrell: So you mentioned having one custom resource definition, or CRD, that contained all services per project, which sounds like it could become unwieldy. What specific problems did this monolithic approach create for your team and customers?

Alexander Held: I mean, the performance issues are obvious. If there is one change in one service, the whole CRD gets reconciled, which means our operator would go through every service they ordered — sometimes that's five databases, one monitoring, one logging — and try to see what we need to do on the cluster. It would take a couple of seconds for this to complete, and with 500 projects, that would stack up a lot. Just for small changes we had extreme effort on the performance side. It was energy-intensive, expensive, and not really necessary. So we had lags in reconciliation. But also on the operations side, it gets confusing sometimes when reconciliations happen in parallel and there are so many of them — it's hard to correlate what goes where and what the system is trying to do, especially when there are errors. Also, it took a lot of time at the end to implement even small features because it was so bloated. We had thousands of lines of code in this operator and different abstractions everywhere. And it worked — our customers were happy — but the troubleshooting side was getting more challenging every day. So we tried to go from this one CRD to multiple ones.

Bart Farrell: And for those who are less familiar with Kubernetes operators, you use the controller pattern extensively in your solution. Could you explain how this pattern works and why it's so fundamental to your architecture?

Alexander Held: Well, the operator pattern basically has three stages. There's the observe part, where we read declaratively what should be there on the cluster — for example, there should be one Postgres instance with 50 gigabytes of storage and the name elephant, just as an example. Then we have a look at the cluster itself: is there a Postgres cluster? Is the StatefulSet deployed? Are there volumes of the correct size? All the configuration that should be there, is it there? And if not, what's the difference? This is the second stage: we basically calculate the actions we need to take. And in the third stage of the cycle, we act upon those changes and deploy a StatefulSet or resize the volume if it changed. And then it just repeats. So we have this cycle of observing, calculating differences, and then resolving those. Traditional apps react to requests: you say you want to increase the volume from 50 to 250 gigabytes; if that fails, you need some retry, and you need to cascade the failure back to the customer. If it doesn't work, what do you do? You need to take care of all those things. Controllers are there to help you enforce reality, and the reality is what you describe declaratively as YAML in Kubernetes.
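A minimal sketch of that observe → diff → act cycle in plain Go — the type and field names are made up for illustration, and a real controller would use controller-runtime and talk to the API server rather than these stubs:

```go
package main

import "fmt"

// PostgresSpec models desired or observed state for a hypothetical
// Postgres instance; the fields are illustrative, not the real schema.
type PostgresSpec struct {
	Name      string
	StorageGB int
}

// observe returns what is currently on the cluster (stubbed here:
// pretend an undersized volume is already deployed).
func observe(name string) *PostgresSpec {
	return &PostgresSpec{Name: name, StorageGB: 20}
}

// diff calculates the actions needed to converge observed onto desired.
func diff(desired PostgresSpec, observed *PostgresSpec) []string {
	if observed == nil {
		return []string{"deploy StatefulSet " + desired.Name}
	}
	var actions []string
	if observed.StorageGB < desired.StorageGB {
		actions = append(actions, fmt.Sprintf("resize volume to %dGi", desired.StorageGB))
	}
	return actions
}

func main() {
	desired := PostgresSpec{Name: "elephant", StorageGB: 50}
	// One pass of the cycle: observe, calculate differences, act.
	for _, action := range diff(desired, observe(desired.Name)) {
		fmt.Println("act:", action) // a real controller would call the Kubernetes API here
	}
}
```

In production the loop repeats forever, so a failed resize is simply retried on the next reconciliation instead of being cascaded back to the caller.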

Bart Farrell: Your architecture involves a workload cluster model, where you run databases for customers who can bring their own Kubernetes clusters. How does this multi-cluster approach work, and what challenges did you need to solve?

Alexander Held: So that's a wonderful point. We have those workload clusters, which basically contain all the controllers, all the CRDs — which basically mean intent or configuration — and all those service instances we were talking about earlier: for example Postgres clusters, MongoDB, some backups, maybe even jobs we need to run. All of those exist on the workload cluster; this is where the actual programs and services are running. And then we have the customer, or remote, clusters. Customers can have multiple ones, and we have a special role on them that gives us access, because they come from a partner team. So we had the opportunity to be like silent administrators — silent, but known. We could bootstrap ourselves or install a log shipper, for example, if the customer ordered logging, or we could create services and private endpoints so that the Postgres instances running on the workload clusters could be accessed by local DNS in the customer or remote clusters.

Bart Farrell: Let's dive into a concrete example with your Postgres operator. Walk us through how a customer request for a new database translates into actual Kubernetes resources.

Alexander Held: Basically, we had a front end, a UI internal to Mercedes-Benz, where customers could say, I want a new Postgres database, and configure everything they want in the UI. Then this request got forwarded through a couple of backends to a place where we created a custom resource for it. Basically it said PostgresCluster as the kind — the kind of the CRD — and let's go with the name elephant, so the database was called elephant. We created that custom resource in our workload cluster and specified all the configuration the user put into the front end, so this was basically a one-to-one mapping of the configuration the user needed. That was the flow, and the response got cascaded back to the customer. This was all the user-interaction part. What happened in the background was that our Postgres operator would pick up that there's a new custom resource of kind PostgresCluster, which signaled intent: we want a new Postgres cluster here with that configuration. So we deployed some Helm charts and everything else necessary to make the Postgres cluster run on the cluster.
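The resulting custom resource might look roughly like this — the API group, version, and field names below are assumptions for illustration, since the actual platform schema isn't public:

```yaml
apiVersion: postgres.platform.example.com/v1   # hypothetical API group
kind: PostgresCluster
metadata:
  name: elephant
  namespace: project-a        # one namespace per customer project, for illustration
spec:
  version: "15"
  storage: 50Gi               # the "50 gigabytes Postgres database" from the UI
  replicas: 1
```

The operator watches for resources of this kind and deploys whatever Helm charts and supporting objects are needed to make the cluster real.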

Bart Farrell: One interesting aspect is how you've translated business requirements directly into custom resource definitions. Can you explain your backup system design and how different CRDs represent different aspects of the backup workflow?

Alexander Held: So the backup workflow was interesting because we had various CRDs — they were all prefixed with Postgres, so I'll skip the prefix from now on. There was the BackupRequest, which signaled intent that a backup should be made, and there was the Backup itself, which contained information about an already finished backup. For example, we stored them on an S3-like object store: where's the path, what encryption do we use, some signatures, and when it happened. And then we had the Schedule, which was used for system-planned backups. Users could create a backup request on demand to create a backup right now, or we would schedule them, say, every couple of hours as the user configured. Also, we had a system backup schedule which would trigger anyway, regardless of what the user configured. So we had this interesting CRD structure, and the CRDs were all a reflection of business processes, right? A user can trigger a backup; we need some system backups; a user can define a period for backups — and we translated all of that into CRDs.
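As a sketch, the request/result pair might look like this — the API group and field names are illustrative, not the actual schema:

```yaml
# Intent: "make a backup of elephant now" — created by the user or a schedule.
apiVersion: postgres.platform.example.com/v1   # hypothetical API group
kind: PostgresBackupRequest
metadata:
  name: elephant-manual-001
spec:
  cluster: elephant
---
# Result: written by the system once the backup has finished.
apiVersion: postgres.platform.example.com/v1
kind: PostgresBackup
metadata:
  name: elephant-backup-001
spec:
  cluster: elephant
  s3Path: backups/elephant/2026-04-28/   # where the dump landed on the object store
  encryption: aes-256                    # illustrative encryption label
  signature: sha256-of-the-archive       # integrity check, illustrative
```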

Bart Farrell: The backup process involves multiple steps from requests to S3 storage. Could you walk us through what actually happens when a customer triggers a backup?

Alexander Held: So let's say the user clicks the button in the front end: hey, I want to create a backup now for elephant. Our API would basically just create a PostgresBackupRequest for the cluster elephant. Our Postgres operator has different controllers, and one of them is the backup request controller, which will pick that up. You remember the cycle, Bart? So it will observe that there's a new backup request and ask: okay, is there already a backup job deployed for the cluster elephant? No, there isn't. So it deploys a job. This job is just another Docker container, another pod, configured with all the necessary credentials and environment information for the elephant cluster. And that's the end of the operator cycle: once the job is deployed, the operator is happy, and the job picks up from there. It creates the backup, and even for large databases we would continuously update the status of the PostgresBackupRequest — let's say it's 10% done, it's 20% done — and even if there was an issue, we would mark that in an events section where you can put status events. So we knew what was going on. And when it was finished, we set the status to completed, filled in all the values of the PostgresBackup, and created that resource, which marked: there is a backup, it's on S3, this is the signature, we used this kind of encryption, here you go. And it would then be available in the front end, because it was available in the cluster — the API could list it there and the user could restore from it.
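Those continuous status updates can be pictured with a tiny Go sketch — the types are made up for illustration, standing in for the job patching the request's status subresource through the API server:

```go
package main

import "fmt"

// BackupRequestStatus models the status of a hypothetical
// PostgresBackupRequest; fields are illustrative only.
type BackupRequestStatus struct {
	Phase   string
	Percent int
	Events  []string
}

// reportProgress mimics the backup job continuously updating status:
// record the percentage, append a human-readable event, flip the phase.
func reportProgress(s *BackupRequestStatus, pct int) {
	s.Percent = pct
	s.Events = append(s.Events, fmt.Sprintf("backup %d%% done", pct))
	if pct >= 100 {
		s.Phase = "Completed"
	} else {
		s.Phase = "Running"
	}
}

func main() {
	status := &BackupRequestStatus{Phase: "Pending"}
	for _, pct := range []int{10, 20, 100} {
		reportProgress(status, pct) // a real job would PATCH .status via the API server
	}
	fmt.Println(status.Phase, status.Percent)
}
```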

Bart Farrell: You leveraged PostgresBackupSchedule for automated system backups, which seems like a clever reuse of the same mechanism. How does the scheduling system work and how do you ensure reliability?

Alexander Held: So the backup schedule, I think, was kind of a clever solution, because we could just reuse what we had — there weren't many lines of code needed to get the scheduling working. We just had a controller for another CRD, the BackupSchedule. For the schedule we had rules like: when the name is "system", it can't be deleted, and all those safety mechanisms for us. But basically, when it got reconciled, it would check: what are the options there? Is it every six hours, every two hours, or every 24 hours? And what is the time right now? When using controller-runtime, you can choose when to reconcile next, so you could calculate when you should reconcile. And if we had this window of one hour, then when we were inside the window, we would create a PostgresBackupRequest. So every time we checked: should we create a backup now? No? Then sleep a little. And at some point we create the backup request, which triggers the flow we discussed earlier.

Bart Farrell: Owner references are a powerful Kubernetes feature that you use extensively for lifecycle management. Can you explain how this helps with resource cleanup and why it's particularly important for a multi-tenant SaaS platform?

Alexander Held: Sure. Owner references are everywhere in Kubernetes by default — every pod that gets created by a StatefulSet has an owner reference to it, and they describe parent-child relationships. They're important when you delete something, for example. In our example with the elephant Postgres instance, we wanted to delete all backups when we deleted the Postgres cluster, because we don't want to keep information about backups for clusters that were deleted long ago. That would just be garbage in our system: it costs money, it's not helpful, and it's not secure. So we decided to leverage this Kubernetes-native concept of owner references and gave one to every other CRD. When we create a PostgresBackupRequest, a PostgresBackup, a schedule, or anything else, they all have a reference to the elephant Postgres instance as the root, the parent. And when we delete that, the deletion cascades to every other resource related to it. This helps prevent orphaned resources and helps us keep the clusters clean, because you don't want to check manually, or run jobs to check, whether resources got destroyed — especially with backups. You just want it to work out of the box, and Kubernetes enforces this at the platform level.
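A child resource carrying such a reference might look like this (the API group, names, and UID are placeholders); deleting the PostgresCluster "elephant" then lets Kubernetes' garbage collector cascade the deletion to this backup:

```yaml
apiVersion: postgres.platform.example.com/v1   # hypothetical API group
kind: PostgresBackup
metadata:
  name: elephant-backup-001
  ownerReferences:
    - apiVersion: postgres.platform.example.com/v1
      kind: PostgresCluster
      name: elephant
      uid: 4f5c6d7e-0000-0000-0000-000000000000   # placeholder UID of the parent
      blockOwnerDeletion: true
```

The `uid` must match the live parent object, which is why owner references are normally set by the controller at creation time rather than written by hand.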

Bart Farrell: While owner references handle Kubernetes resources, you still have external resources like S3 backups to manage. How do finalizers solve this problem and what happens during the deletion process?

Alexander Held: That's a good catch. Of course Kubernetes will help us delete things — when you delete the StatefulSet, it will delete the pod. But in our case, with our custom resources, we have the Backup that can be deleted. It's just a YAML manifest; Kubernetes can delete that instantly. But we have real, actual data stored somewhere else, and we need to delete that too. Inside our operator, we can check if a resource is marked for deletion by the Kubernetes runtime. If that's the case, we check if there is still a finalizer on the resource. You can think of finalizers basically like labels: they're a list of strings. If, for example, the backup should be deleted and the S3 deletion-prevention finalizer is still there, our controller checks that: it should be deleted, the finalizer is still there — then we go to the S3 bucket and delete the backup. Once we've done that, we remove the finalizer from the finalizers array. And when there are no finalizers left, Kubernetes proceeds with the planned deletion of the resource. So we can have this deletion blocker in place while external resources are still pending deletion.
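That deletion path can be sketched in plain Go — the types and the finalizer name are invented for illustration, and a real controller would read the deletion timestamp from object metadata and update the resource via the API server:

```go
package main

import "fmt"

// Backup is a minimal model of a resource with finalizers.
type Backup struct {
	Name              string
	DeletionRequested bool // stands in for a non-nil deletionTimestamp
	Finalizers        []string
}

const s3Finalizer = "example.com/s3-cleanup" // hypothetical finalizer name

// reconcileDeletion mirrors the controller's deletion path: clean up
// the external S3 object first, then remove the finalizer so Kubernetes
// can proceed with deleting the manifest itself.
func reconcileDeletion(b *Backup, deleteFromS3 func(string)) {
	if !b.DeletionRequested {
		return
	}
	kept := b.Finalizers[:0]
	for _, f := range b.Finalizers {
		if f == s3Finalizer {
			deleteFromS3(b.Name) // external cleanup happens before unblocking deletion
			continue
		}
		kept = append(kept, f)
	}
	b.Finalizers = kept
}

func main() {
	b := &Backup{
		Name:              "elephant-backup",
		DeletionRequested: true,
		Finalizers:        []string{s3Finalizer},
	}
	reconcileDeletion(b, func(name string) { fmt.Println("deleting S3 object for", name) })
	fmt.Println("finalizers left:", len(b.Finalizers)) // 0 — Kubernetes may now delete the resource
}
```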

Bart Farrell: Events in Kubernetes are often underutilized, but you mentioned they've helped you find bugs and deadlocks. How do you use events for troubleshooting? And what best practices have you developed?

Alexander Held: As I mentioned, we use events for status reporting while doing a backup, for example. They provide human-readable context that we could display to the user at some point — for example, how many percent of the backup is done — but sometimes also errors that occurred, or warnings, or other things we might need, scoped to an individual resource. Of course we have logs we can use in a platform, but sometimes it's nice to see it directly in your cluster, grouped on the resource itself. You can use kubectl describe to watch the events that happened to the resource, and you can just get or edit the resource directly in the terminal. That makes a great ops workflow, at least in my opinion. It can create an audit trail if you keep your events consistent and have a team commitment to always leave events when you modify a resource. And it honestly helped us find a lot of bugs — timing issues, or why something got deleted before something else. Events create a timeline of what actually happened, not only what the code thought would happen.

Bart Farrell: This kind of operator-native architecture seems complex to implement initially. What advice would you give to teams considering a similar transformation of their Kubernetes platform?

Alexander Held: I would start small. Pick the easiest operator you can think of — best case, you have one specific business domain that you can try to automate first. The important thing is to think about your custom resource definitions: what is your main custom resource definition, what is your main use case from a business point of view? Don't be afraid of having too many CRDs — I don't think that's possible. Just be afraid of the wrong ones. It's like class design in object-oriented programming: don't overcomplicate things, keep each one small and single-purpose, and then I think you're golden.

Bart Farrell: You mentioned this architecture made it easier to support the platform 24/7. Can you elaborate on the operational benefits you've seen since the refactoring?

Alexander Held: Well, the refactoring took a long time. I think we were at it for over one and a half years, splitting this huge CRD into various operators, each with many controllers — one operator can have many controllers itself. We learned along the way and improved our design techniques. But in the end, on-call got a lot quieter, debugging got faster, and we could implement new features more easily. Sometimes it was just: create a new controller. We didn't need to touch anything old; we could start fresh with new business requirements, which is nice. We could reduce manual cleanups using the owner references, and debugging got really easy with all those events we were writing. And we had clear ownership of which team members owned which part of the system — for example, some people would help more with Postgres, others were on monitoring.

Bart Farrell: Looking back at this journey from a dumpster fire to sparkling clean code, what were the most important lessons learned and what would you do differently if starting over today?

Alexander Held: We learned that business logic fits surprisingly well into CRDs if you respect all the Kubernetes-native patterns: if you handle deletion using owner references, if you use finalizers for external dependencies, then it works quite well. And I think that's the way it's supposed to be — if you look at the bigger operators out in the world, and there are many, they leverage that, and it's working for them as it worked for us. I would also invest earlier in operator design: in thinking about which CRDs are our main, core CRDs and which are more individual features. And try to avoid clever shortcuts — they always come back to haunt you.

Bart Farrell: So you did mention that there is the phase of this being a dumpster fire. In that phase, can you estimate how many times was the word Scheiße used?

Alexander Held: Like in total?

Bart Farrell: Just off the top of your head, if you had to guess.

Alexander Held: I think at least hundreds of times.

Bart Farrell: Hundreds, okay. That's fair, good. Any other German words that we should know if we're in a dumpster fire situation apart from Scheiße to diversify?

Alexander Held: I would rather not speak them out loud.

Bart Farrell: That's fine, that's okay. That's for another conversation another day. Now, I do have a more serious question. In the last KubeCon, we interviewed a lot of different people about their favorite and least favorite Kubernetes features. And we got a chance also to speak to Kelsey Hightower. And he said that both his favorite and least favorite Kubernetes feature are custom resource definitions because they allow you to do whatever you want, both in the positive sense and also the negative sense. Do you agree with that? And would you also say that CRDs are your favorite feature, possibly your least favorite feature? What are your thoughts on that?

Alexander Held: Well, I think, I mean, how could I not agree with Kelsey? So, I think he's seen a lot more Kubernetes clusters than I did. But still, I think they're the greatest feature of Kubernetes because they really enable an extensible platform. And of course, there are many downsides of giving so much freedom to developers and platform providers. you can break stuff. You can make it more complex. You can shoot yourself in the foot. But I think the pros outweigh the cons far more. So I love CRDs. Big fan. And also big fan, Kelsey.

Bart Farrell: Good, thank you. Also, because this is not the first time that we've spoken about Postgres operators in this podcast, I just want to know, in terms of the other Postgres operators that are in the landscape, were there any that came up as possible options for this project you were working on? Whether, you know, we're talking about, if we're thinking about Zalando, CloudNativePG, StackGres, the Crunchy Postgres Operator, KubeDB, were there any of these that came up as potential alternatives? Were there any reasons as to why they couldn't work for this specific use case? Can you tell me a little bit about that?

Alexander Held: Yes, some of them were in discussion, but unfortunately, I think I'm not allowed to talk about this.

Bart Farrell: That's totally fine. That's fine. I guess I'm just getting more curious more broadly. We have a lot of operators for Postgres. And also talking about Postgres extensions, kind of relating to what we were talking about with CRDs. How do you see that landscape when it comes to Postgres operators in general for Kubernetes? Is it something that catches, that's got your attention a fair amount? What would you like to see develop in the future in terms of, you know, we've now got CloudNativePG as a sandbox or incubating project in the CNCF, you know, different things along those lines. What do you think is going to happen next when it comes to Postgres and Kubernetes?

Alexander Held: I have no idea. I have no idea, but I'm using CloudNativePG personally in my projects. So I think they're great and doing a great job. I haven't tried the other ones yet. CloudNativePG just worked for me and there was no need to.

Bart Farrell: Fair enough. Now, looking also towards the future, what's next for you, Alex? I know you mentioned a couple of projects in the beginning — if you want to mention them again, that's totally okay. Is there anything else you'd like us to know about what you have on your radar for 2026?

Alexander Held: I think in 2026 I will stay focused on releasing my projects. They're incubating right now, too. And I will see where this leads me, to be honest. Maybe I want to work with enterprise-grade teams with thousands of Kubernetes clusters; maybe I want to build my own companies. It really depends on what 2026 brings for me, but it will be an interesting journey, I think.

Bart Farrell: I have no doubt at all. And I really hope we get to have you back on the podcast. If people want to get in touch with you, what's the best way to do that?

Alexander Held: I think right now it would be LinkedIn. Boring, but it works. I will post there about my projects soon. And if you want to — how do you say on LinkedIn? Follow? Connect, right? Connect. It's Alex Held Consulting.

Bart Farrell: Very good. Well, Alex, thank you so much for joining us today. Really enjoyed hearing about your experience and look forward to future conversations. Take care. Thank you.
