From Fragile to Faultless: Kubernetes Self-Healing In Practice
Host:
- Bart Farrell
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
Discover how to build resilient Kubernetes environments at scale with practical automation strategies from an engineer who's tackled complex production challenges.
Grzegorz Głąb, Kubernetes Engineer at Cloud Kitchens, shares his team's journey developing a comprehensive self-healing framework. He explains how they addressed issues ranging from spot node preemptions to network packet drops caused by unbalanced IRQs, providing concrete examples of automation that prevents downtime and improves reliability.
You will learn:
- How managed Kubernetes services like AKS provide benefits but require customization for specific use cases
- The architecture of an effective self-healing framework using DaemonSets and deployments with Kubernetes-native components
- Practical solutions for common challenges like StatefulSet pods stuck on unreachable nodes and cleaning up orphaned pods
- Techniques for workload-level automation, including throttling CPU-hungry pods and automating diagnostic data collection
Transcription
Bart: In this episode of KubeFM, we're joined by Grzegorz, a Kubernetes engineer at Cloud Kitchens, to break down the complexities of running Kubernetes at scale. We'll discuss self-healing frameworks, multi-cluster management, and optimizing managed services like AKS. He shares how his team mitigates spot node preemptions, automates failure recovery, and deals with issues like unbalanced IRQs causing network packet drops. We'll also examine workload-level automation from CPU-throttling noisy neighbors to capturing heap dumps for debugging. Additionally, we'll explore how Kubernetes' default handling of unreachable nodes impacts stateful workloads and how custom automation can prevent downtime. Finally, we discuss optimizing cluster stability by proactively cleaning up orphan pods and tuning Kubernetes scheduling behavior for high-efficiency environments.
This episode is powered by Learnk8s. Since 2017, Learnk8s has been helping Kubernetes engineers level up all over the world. From Fortune 100 companies to small startups, Learnk8s provides training in person or online to individuals as well as groups. Courses are instructor-led, 60% practical and 40% theoretical, and students have access to course materials for the rest of their lives. For more information, check out Learnk8s.io.
Now, let's get into the episode. So, first question: What are three emerging Kubernetes tools that you are keeping an eye on?
Grzegorz: For me, Coroot is an emerging open source tool for doing APM stuff like New Relic, but completely open source. I really enjoy seeing them grow because I've been experimenting with it. I gave some feedback to the team, created some issues on GitHub, and I see how they are improving it from version to version.
The second thing is the SIG Multicluster and, in general, cluster inventory. Many places need to focus on managing multiple Kubernetes clusters, like with hub-and-spoke architecture or similar. I see repetitive patterns where many companies and people redo this kind of cluster management from a single place, so I'm really interested in how it can simplify these use cases.
The last thing I'm looking at is WASM, which is an interesting technology. I see a lot of potential there. However, I'm more curious to see what actual production use cases look like.
Bart: Can you tell me about who you are, where you work, and what you do?
Grzegorz: I'm a Kubernetes engineer at Cloud Kitchens, and my day-to-day job is creating a reliable, scalable, and cost-efficient platform for the company. I work to enhance the platform's visibility, reliability, and automation while keeping cost and efficiency in mind. Additionally, my team provides technical expertise on Kubernetes and cloud-native technologies to the rest of the company.
Bart: And how did you get into cloud-native?
Grzegorz: I started as a Java developer, and a few years ago, I had a chance to transform an application running on legacy OpenStack to run on our new Kubernetes platform. I really enjoyed this and learned a lot about Kubernetes during that time. I wanted to stay in this area because it was much more appealing to me than writing Java code.
Later, when I had the opportunity, I switched to a newly created tooling team focused on developing tools around our Kubernetes platform, mostly around Prometheus, and writing our first operators in Python. As the team's focus shifted to be more product-specific, I was looking for a way to stay in the Kubernetes area. I found a role at Cloud Kitchens, where I'm currently working.
In the meantime, I took over an abandoned open source project called MetaController, and I still maintain it in my free time. I don't have as much free time for it as I would like, but I'm still keeping an eye on it.
Bart: What did you do before cloud native?
Grzegorz: I was a Java developer and started in 2011. I worked on a variety of projects, from desktop applications related to testing, code coverage, and unit test generation for Java code. Later, I worked on the integration layer between old and new systems using Cassandra and Kafka. To be honest, I don't really use that experience now. It was valuable, but working on platforms and cloud-native technologies is much more appealing to me.
One thing I learned over time is that being in platform engineering gives you a much bigger impact area. You can reach more people, and by working on platform components, whatever I do here can impact the whole application stack, which is really amazing.
Bart: Excellent. Now, the Kubernetes ecosystem moves very quickly. How do you keep updated? Do you read blogs, watch videos, listen to podcasts? What works best for you?
Grzegorz: As this is a very broad ecosystem, I mainly focus on reading Kubernetes release updates to know what's new coming to Kubernetes. I try to stay on top of the Learnk8s Kubernetes newsletter, as well as a few Kubernetes-related channels, usually related to controller runtime or KubeBuilder. I also follow a few Kubernetes SIGs like SIG Multicluster and SIG Node, mainly just reading what people wrote there. It's a really amazing experience that gives you a lot of visibility on what's happening under the hood. I also try to stay updated on what's happening at KubeCon, and when I cannot go in person, I look at the agenda and pick interesting talks to watch later on YouTube.
Bart: If you had to go back in time to share one piece of career advice with your younger self, what would it be?
Grzegorz: For sure, don't stop at the documentation. Do more exploratory and experimental work. Go through open source code, try experiments, and look for opportunities to contribute to open source projects because it's a great experience. It's rewarding to know that your work is used by other people.
Bart: As part of our monthly content discovery, we found an article you wrote titled "From Fragile to Faultless: Kubernetes Self-Healing in Practice". We wanted to explore this further. Many people say that using a managed Kubernetes service like AKS makes everything easier. In your experience, is this really the case?
Grzegorz: Using managed Kubernetes services like AKS makes things easier because it takes away a lot of the work needed to set up and operate a cluster. However, it also comes with its own set of challenges. When we moved to AKS, we focused on making our setup easier. For example, we used open-source projects like Cluster API (CAPI) and CAPZ (Cluster API Provider Azure) to automate our process, which really sped up the pace of onboarding new clusters.
However, we had to contribute to CAPZ and CAPI to enable more customization and fix bugs that impacted our environments. Managed services like AKS need to support a variety of use cases, so they ship with default settings that might not be suitable for every scenario. On our side, we needed to customize the cluster autoscaler and create a custom scheduler for better node bin packing to be more resource-efficient.
For our multi-cluster setup, we brought in a manager to handle node pool updates and make the process easier and more reliable. Despite these efforts, anyone using managed services is vulnerable to issues from underlying infrastructure or cloud providers. While bugs and incidents can happen to anyone, ensuring business continuity often means having control in your own hands.
Cloud providers are continually improving their operations, which reduces some issues and enhances monitoring across different areas, making observability easier. However, at some point you need to work with the cloud provider because you don't have direct access to components like the etcd server. During one incident where a process was filling up etcd, we needed the cloud provider to remove objects directly from etcd, which we couldn't do on our own.
Some decisions by cloud providers like Azure directly affect us and are hard to work around. For instance, we discovered that when you have an App Gateway connecting to an AKS cluster, Azure networking classifies the Kubernetes API server's public IP as an external public IP. As a result, traffic generated by Kubernetes system components or controllers using the watch pattern is classified as external internet traffic and charged accordingly—which is not inexpensive.
Another challenge is the lack of control over beta features in the cluster. If a beta feature isn't enabled by default in Kubernetes, you cannot simply enable it in Azure. You must wait until the feature graduates, which is not optimal. We would like more control over these settings.
Grzegorz: Sure, not all issues are directly related to managed Kubernetes. Some are consequences of Kubernetes' default behavior. For example, look at how Kubernetes handles unreachable nodes. By default, Kubernetes adds a five-minute toleration that lets pods survive a node becoming unreachable. After five minutes, the pods are removed from the API server and rescheduled. However, this does not apply to pods managed by StatefulSets.
In our case, this led to application issues with workloads like CockroachDB or OpenSearch. Another example was an increase in packet drops for our network-sensitive workload. It took us a while to discover this was caused by nodes experiencing a freeze event, where the cloud provider moved VMs between physical hosts. Most people might not have noticed this because they don't spread pods across nodes as extensively as we do.
Not everyone will suffer from the same issues, but in our case, these problems had a noticeable impact on our workloads.
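To make the default behavior described above concrete, here is a minimal sketch (using the Python kubernetes client, which is an assumption on our part, not the tooling discussed in the episode) that prints the not-ready/unreachable tolerations the admission controller injects into pods; the 300-second value is the five-minute window Grzegorz refers to.

```python
# Minimal sketch: inspect the default tolerations Kubernetes injects for unreachable
# or not-ready nodes. Assumes access via a local kubeconfig; illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(limit=50).items:
    for tol in pod.spec.tolerations or []:
        if tol.key in ("node.kubernetes.io/unreachable", "node.kubernetes.io/not-ready"):
            # DefaultTolerationSeconds normally sets this to 300 (five minutes); after
            # that the pod is evicted, unless it belongs to a StatefulSet, in which
            # case (as described in the episode) it lingers until force-deleted.
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                  f"{tol.key} effect={tol.effect} seconds={tol.toleration_seconds}")
```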
Bart: To address these problems, you developed a self-healing framework. Could you give us an overview of how this framework is structured?
Grzegorz: Sure. It was quite a journey. We started by dealing with a problem related to spot nodes. These nodes are cheaper, but the cloud provider can take them back at any time. This led us to design a system using Kubernetes controllers working as deployments to detect and react to these events, rescheduling pods from those nodes. This was the first fundamental part of our platform.
The second part takes a node-centric point of view: certain classes of problems can only be detected, or are much easier to detect, when you have access to the underlying host, because it's really hard to capture them just by watching node metrics or logs. We used this approach to expose low-level metrics or conditions by accessing the host. When a certain condition occurs, we set a label on the node, which allows the other part of the framework to target that node and take appropriate action.
We ended up with a combined design of DaemonSets and deployments that work together to recover from problems automatically. We had two main principles: first, don't rely on higher abstractions. We used Kubernetes-native labels and annotations and the host file system, avoiding dependencies on Istio or other databases and caches that could also fail. Second, the automation should not make the situation worse: do no harm. This approach has worked very well for us, and we continue to onboard new use cases as we find them.
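As an illustration of the node-level half of that design, here is a minimal sketch of a DaemonSet pod that runs a host-level check and, when it trips, patches a label onto its own node so the deployment half can target it. The label key, the NODE_NAME downward-API variable, and the placeholder check are illustrative assumptions, not the names used in Grzegorz's framework.

```python
# Minimal sketch of the DaemonSet side: detect a host-level condition and mark the
# node with a Kubernetes-native label so a cluster-level controller can react.
import os
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()
node_name = os.environ["NODE_NAME"]  # assumed to be injected via the downward API

def host_condition_detected() -> bool:
    # Placeholder for a real host-level check (e.g. parsing /proc or systemd state
    # through a hostPath mount). Always False here.
    return False

if host_condition_detected():
    # Only native labels are used: no external database or cache that could also fail.
    v1.patch_node(node_name,
                  {"metadata": {"labels": {"example.io/needs-remediation": "true"}}})
```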
Bart: Now, one of the first issues you tackled was handling spot node preemptions. How does your automation work in this case?
Grzegorz: Spot nodes are a special kind of node that the cloud provider can take back at any time. However, there is at least a 30-second notification before it happens, and in practice we saw the notification arrive a few minutes before the preemption. The AKS plugin for node-problem-detector sets a condition on the node with a message starting with "preempt scheduled", indicating that the node will be preempted. However, this condition does not trigger any remediation on its own.
Our detection workload watches for nodes with this condition and creates a custom resource, which is then picked up by another part of the system. We delete all the pods on the node with a 10-second grace period to give them a chance to shut down properly. This approach has really helped us survive spot node preemptions, as most pods are shut down cleanly and recreated on healthy nodes.
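A rough sketch of that detection-and-drain loop might look like the following; it matches nodes on the "preempt scheduled" message prefix mentioned above and uses the label AKS applies to spot node pools as a selector. The direct delete is a simplification: as Grzegorz explains, the real framework goes through a custom resource between detection and remediation.

```python
# Sketch: find spot nodes whose node-problem-detector condition message starts with
# "preempt scheduled" and delete their pods with a 10-second grace period.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

spot_nodes = v1.list_node(
    label_selector="kubernetes.azure.com/scalesetpriority=spot").items

for node in spot_nodes:
    preempting = any(
        (cond.message or "").lower().startswith("preempt scheduled")
        for cond in (node.status.conditions or [])
    )
    if not preempting:
        continue
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node.metadata.name}")
    for pod in pods.items:
        # 10 seconds is enough for a clean shutdown before the VM is reclaimed.
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace,
                                 grace_period_seconds=10)
```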
Bart: Spot instances seem to present ongoing challenges. What other issues did you face with spot nodes and how did you address them?
Grzegorz: After going to production with this solution, we discovered that not all signals are captured, because the underlying Azure metadata service endpoint is not always reliable. We observed that, from time to time, the query the Azure plugin makes to this endpoint times out, which means we can miss signals.
To address this, we expanded our detector to also monitor for nodes with the spot label being deleted from the control plane. We created a CRD to record that a node has been deleted and that we need to check the API server for any orphaned pods. If we find such pods, we force delete them, on the assumption that they are no longer running and should be rescheduled. This approach closes the gap and now works quite reliably for spot nodes.
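A simplified version of that orphan cleanup, written as a periodic scan rather than the CRD-driven flow described above, could look like this (Python kubernetes client assumed):

```python
# Sketch: force-delete pods whose node no longer exists in the API server, so their
# controllers can reschedule them. The real system restricts this to spot nodes and
# is driven by a CRD; this is a simplified periodic scan.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

live_nodes = {n.metadata.name for n in v1.list_node().items}

for pod in v1.list_pod_for_all_namespaces().items:
    node = pod.spec.node_name
    if node and node not in live_nodes:
        # grace_period_seconds=0 force-deletes; safe only because the node, and
        # therefore the container, is already gone.
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace,
                                 grace_period_seconds=0)
```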
Bart: Moving on to more general node issues, you mentioned problems with StatefulSet pods on unreachable nodes. Can you explain this issue and your solution?
Grzegorz: It was an interesting challenge. Each pod gets a default five-minute toleration, added by Kubernetes, for the NoExecute taint with the unreachable reason. After that time, the pods are deleted and rescheduled by the Deployment controller, ReplicaSet controller, or similar controllers onto other nodes.
However, this is not the case for StatefulSet pods. StatefulSet pods are special in Kubernetes because they are supposed to have a unique identity. The StatefulSet controller cannot determine whether a pod is actually stopped or just unreachable due to a network split, so it takes no action.
In our case, this was causing problems because the pods were not running, and our StatefulSet services like CockroachDB or OpenSearch were under-replicated. We needed to force delete those pods to enable the StatefulSet controller to recreate them.
After this happened a few times, we created an automation that looks for nodes carrying the NoExecute taint because they are unreachable. If a node stays in this state for longer than five minutes, we force delete every pod on it so the pods are rescheduled on other nodes, allowing our StatefulSets to self-recover. Since then, we haven't needed to take any further manual actions, so we consider it a success.
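The core of that remediation can be sketched as follows; the five-minute threshold and the NoExecute/unreachable taint come from the description above, while the blanket force delete of every pod on the node is this sketch's simplification.

```python
# Sketch: force-delete pods on nodes that have carried the unreachable NoExecute
# taint for more than five minutes, so StatefulSet pods get recreated elsewhere.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
cutoff = datetime.now(timezone.utc) - timedelta(minutes=5)

for node in v1.list_node().items:
    for taint in node.spec.taints or []:
        if (taint.key == "node.kubernetes.io/unreachable"
                and taint.effect == "NoExecute"
                and taint.time_added is not None
                and taint.time_added < cutoff):
            pods = v1.list_pod_for_all_namespaces(
                field_selector=f"spec.nodeName={node.metadata.name}")
            for pod in pods.items:
                # Force delete (grace period 0) so the StatefulSet controller is
                # allowed to create a replacement pod with the same identity.
                v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace,
                                         grace_period_seconds=0)
```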
Bart: Cluster health can also be affected by accumulated pods. How did you approach the problem of finding issues with succeeded and evicted pods?
Grzegorz: At the time, we were struggling with our cluster being affected by a high volume of operations. One issue was that our etcd was growing, so we aimed to make our API server's life easier. We found we had many leftover pods, typically resulting from short-lived Jobs, pods without controllers, or pods evicted due to node pressure such as memory or disk constraints.
The churn of these pods was high, and it was confusing for our users: when they saw ten pods on their side, some completed and some running, they didn't understand what was happening. To mitigate this, we implemented a simple cleanup mechanism that periodically scans for pods in the Succeeded phase or pods evicted due to node pressure. If these pods have been in that state for at least 15 minutes, we delete them. This keeps our cluster tidy and reduces unnecessary load on the API server, which improved our cluster's stability.
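A minimal version of that cleanup loop might look like the sketch below; using the pod's creation timestamp as the age signal is an assumption that simplifies the "15 minutes in that state" check.

```python
# Sketch: delete Succeeded pods and pods evicted under node pressure once they are
# at least 15 minutes old, to keep etcd and the API server lean.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
cutoff = datetime.now(timezone.utc) - timedelta(minutes=15)

def old_enough(pod) -> bool:
    return pod.metadata.creation_timestamp < cutoff

# Completed pods, typically left behind by short-lived Jobs or controller-less pods.
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Succeeded").items:
    if old_enough(pod):
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

# Pods evicted by the kubelet under memory or disk pressure end up Failed/Evicted.
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Failed").items:
    if pod.status.reason == "Evicted" and old_enough(pod):
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
```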
Bart: How about we discuss some of the more complex issues you encountered? Can you tell us about the network packet drops due to unbalanced IRQ and how you addressed it?
Grzegorz: It was challenging to root-cause initially because, at first glance, it looked like an application issue. Users were complaining about increasing packet drops under high-I/O, network-sensitive workloads. We initially thought it was related to user configuration, but it kept happening, seemingly at random.
After deeper investigation, we found it usually occurred after the node hosting those workloads experienced a VM freeze event, which is related to moving the machine between physical hosts to mitigate hardware issues or during update operations.
We noticed that after these operations, network interrupts were handled by just one or two CPU cores out of 16, 8, or 32 on the node. This was causing a bottleneck in processing network traffic, leading to packet drops.
We discovered that restarting the IRQ balance service on the node (a systemd service responsible for evenly distributing hardware interrupts across cores) resolved the issue until the next freeze event.
To create a more permanent solution, we set up node-level automation to detect the number of CPU cores used for network interrupts. We established a condition that if less than half the cores were used for network interrupts, the node would be marked. We then deployed a DaemonSet that would automatically restart the IRQ balance systemd service on the affected node.
This solution has been working for at least a year, and we haven't seen similar issues since then.
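The detection half of that automation can be approximated by parsing /proc/interrupts on the host, as in the sketch below; the NIC-name pattern and the direct systemctl call are assumptions standing in for however the DaemonSet actually identifies network interrupts and reaches the host's systemd.

```python
# Sketch: count how many CPU cores are servicing NIC interrupts and restart
# irqbalance if fewer than half of them are involved.
import re
import subprocess

NIC_PATTERN = re.compile(r"(eth\d|mlx|enP|TxRx)", re.IGNORECASE)  # assumed NIC names

with open("/proc/interrupts") as f:
    header, *rows = f.readlines()
num_cpus = len(header.split())  # the header lists one column per CPU

busy_cpus = set()
for row in rows:
    if not NIC_PATTERN.search(row):
        continue
    fields = row.split()
    # fields[0] is the IRQ number ("24:"); the next num_cpus fields are per-CPU counts.
    for cpu, count in enumerate(fields[1:1 + num_cpus]):
        if count.isdigit() and int(count) > 0:
            busy_cpus.add(cpu)

if busy_cpus and len(busy_cpus) < num_cpus / 2:
    # After a freeze event interrupts can collapse onto one or two cores; restarting
    # irqbalance redistributes them across the CPUs again.
    subprocess.run(["systemctl", "restart", "irqbalance"], check=True)
```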
Bart: Developing these automations must have been a learning process. What are some key lessons you've taken away from going through this?
Grzegorz: For me, it was about going beyond our comfort zone. The first lesson was that while I was comfortable working with Kubernetes, as platform engineers we need to go beyond that. We need to look at our infrastructure, understand how cloud providers work, how users use Kubernetes, and how networking fits into the whole picture. To fully understand and solve problems, you need to dig deep and look beyond your own area of expertise to see the whole picture and debug issues effectively.
The second lesson was that Kubernetes is a foundation for building platforms, but specific use cases or setups require development and customization. It lets you build automation on top of it, but to tailor the platform to your users and use cases, you cannot just rely on defaults or cloud provider offerings.
The final lesson was that with such complexity—managed Kubernetes, add-ons like Istio, networking configurations—you have a complex ecosystem where you cannot predict every issue. What you can do is focus on detecting problems and reducing mitigation time, so whenever issues happen, you can quickly spot and fix them.
Bart: Looking ahead, what are your plans for further developing the self-healing framework?
Grzegorz: The fundamental architecture is done, and what we are doing right now is extending particular parts. Recently, we have been using the framework, especially the DaemonSet part that works at the node level, to solve the problem of noisy neighbors in our clusters. We call a workload a noisy neighbor when one pod on a node tries to use most of the CPU, which in some cases makes the whole node and the other workloads unstable.
In short, our automation detects pods that are using excessive CPU and sets limits on the offending pod through the containerd API, which throttles the problematic pod while keeping the node healthy and letting all other workloads continue operating smoothly.
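As a rough illustration of the idea (not the actual implementation, which talks to the containerd API directly), the sketch below reads the offending pod's container IDs from the API server and lowers their CPU quota with crictl, assuming crictl's update subcommand and its --cpu-period/--cpu-quota flags are available on the node. The pod name, namespace, and quota value are placeholders.

```python
# Sketch: throttle a CPU-hungry pod by capping the CPU quota of its containers.
# The framework described above uses the containerd API; this approximation shells
# out to crictl from a privileged DaemonSet pod on the node instead.
import subprocess
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def throttle_pod(name: str, namespace: str, cpu_quota_usec: int = 100_000) -> None:
    """Cap each container of the pod to cpu_quota_usec per 100 ms period (~1 core)."""
    pod = v1.read_namespaced_pod(name, namespace)
    for status in pod.status.container_statuses or []:
        # status.container_id looks like "containerd://<container-id>"
        container_id = (status.container_id or "").split("//")[-1]
        if container_id:
            subprocess.run(["crictl", "update",
                            "--cpu-period", "100000",
                            "--cpu-quota", str(cpu_quota_usec),
                            container_id],
                           check=True)

# Hypothetical usage: throttle_pod("batch-worker-7d9f", "jobs")
```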
The second issue we recently solved was related to nodes sometimes misbehaving after a freeze event. This mainly manifested as no inbound network traffic being accepted on a given node. Inbound traffic was blocked, yet the pods and processes on the node could still send outbound traffic, which created some weird scenarios for our use case.
We didn't find the exact root cause. However, we noticed that restarting the kubelet on the node fixed the network. Our current automation watches pod status: if pods become unreachable because the kubelet cannot perform liveness probes against them, we restart the kubelet, and that usually fixes the issue.
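A crude sketch of that trigger, with the detection heuristic and threshold as assumptions standing in for the real probe-failure check, might look like this when run from a DaemonSet pod on the node:

```python
# Sketch: if the node is otherwise up but most of its running pods report NotReady
# (probes failing), restart the kubelet as a last-resort fix.
import os
import subprocess
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()
node_name = os.environ["NODE_NAME"]  # assumed to be injected via the downward API

pods = v1.list_pod_for_all_namespaces(
    field_selector=f"spec.nodeName={node_name},status.phase=Running").items

def is_ready(pod) -> bool:
    return any(c.type == "Ready" and c.status == "True"
               for c in pod.status.conditions or [])

not_ready = [p for p in pods if not is_ready(p)]
if pods and len(not_ready) / len(pods) > 0.5:  # threshold is an assumption
    subprocess.run(["systemctl", "restart", "kubelet"], check=True)
```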
At this point, the automation is good enough for us. We don't have specific plans to extend the framework; we're rather watching for new cases that can be automated with it. I believe it's essentially done, and we will continue adding automations where feasible, because you need to balance what you automate against simply creating temporary workarounds.
My main advice would be to focus on optimizing for speed because you cannot foresee all issues. However, our framework allows us to quickly add new automations. Even though we are not sure what can come next, we are confident that we can automate them rather quickly. This approach was more effective in the long run than trying to predict every failure scenario in this complex universe.
Bart: What's next for you?
Grzegorz: We plan to focus more on workload-related automations, which help our users with tasks like collecting heap dumps or memory dumps at the workload level. When we detect an issue with a workload, we implement automation that helps users diagnose the problem. We have the platform part already covered, but we want to focus on the users' side as well. That is mainly our plan.
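For the heap-dump case, a minimal workload-level sketch could exec into the affected container and trigger a dump with jcmd, as below; the JDK being present in the image, the target JVM running as PID 1, and the pod and container names are all assumptions about the workload, not details from the episode.

```python
# Sketch: capture a heap dump from a Java container by exec'ing into the pod.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

def capture_heap_dump(pod: str, namespace: str, container: str,
                      path: str = "/tmp/heap.hprof") -> str:
    # jcmd <pid> GC.heap_dump <file>; assumes the JVM runs as PID 1 in the container.
    cmd = ["jcmd", "1", "GC.heap_dump", path]
    return stream(v1.connect_get_namespaced_pod_exec, pod, namespace,
                  container=container, command=cmd,
                  stderr=True, stdin=False, stdout=True, tty=False)

# Hypothetical usage: capture_heap_dump("orders-api-0", "prod", "app")
```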
Bart: All right, how can people get in touch with you?
Grzegorz: If you're interested in the MetaController project, feel free to look it up on GitHub. If you find it interesting, go to the Kubernetes Slack channel called MetaController and say hi.
Bart: Thank you so much for sharing your knowledge with our community. Look forward to speaking with you soon. Take care.
Grzegorz: Thank you. Take care.