Learned it the hard way: don't use Cilium's default Pod CIDR
Host:
- Bart Farrell
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
This episode examines how a default configuration in Cilium CNI led to silent packet drops in production after 8 months of stable operations.
Isala Piyarisi, Senior Software Engineer at WSO2, shares how his team discovered that Cilium's default Pod CIDR (10.0.0.0/8) was conflicting with their Azure Firewall subnet assignments, causing traffic disruptions in their staging environment.
You will learn:
How Cilium's default CIDR allocation can create routing conflicts with existing infrastructure
A methodical process for debugging network issues using packet tracing, routing table analysis, and firewall logs
The procedure for safely changing Pod CIDR ranges in production clusters
Transcription
Bart: In this episode of KubeFM, I had the chance to speak with Isala Piyarisi, and we explored a real-world migration to Cilium CNI. We discussed how a seemingly minor subnet allocation issue led to silent packet loss in production. The episode breaks down the debugging process, firewall analysis, packet tracing, and routing table deep dives that uncovered an overlooked CIDR overlap. We also discuss eBPF-powered observability, Thanos, security with Tetragon, and lessons learned from real-world DevOps challenges. If you are scaling Kubernetes, optimizing networking, or troubleshooting complex failures, you will definitely want to check out this episode. Today's episode is sponsored by Learnk8s. Learnk8s has been providing Kubernetes training to organizations worldwide since 2017. Their courses are instructor-led, with 60% practical and 40% theoretical content. Students have lifetime access to course materials, and courses can be taken by groups or individuals, either online or in person. To find out more about how you can level up, visit Learnk8s. Let's get into the episode.
Bart: Hello and welcome to KubeFM. To start, which three emerging Kubernetes tools are you keeping an eye on?
Isala: My top pick would be Thanos. We use Thanos for high availability in our Prometheus instances, as well as for its backup and restore functionality. Another tool I'm interested in is Tetragon, a security tool from the Cilium project that analyzes system calls using eBPF. Additionally, I'm watching TimescaleDB, which is an emerging replacement for ELK built on top of Postgres. I'm keeping an eye on it because its vision is really good.
Bart: Now, just as a quick introduction, what do you do and who do you work for?
Isala: My official title is Senior Software Engineer at WSO2, but I mostly work on the DevOps side, doing R&D. I spend most of my time maintaining the network stack of our product Choreo, which is an internal developer platform. It helps organizations avoid ending up with a million microservices and no idea what is going on.
Bart: And how did you get into cloud native?
Isala: I graduated into cloud native. My journey in computer science started in my early teens. I was into the game Call of Duty, and my friends and I played for hours. One day, a few friends asked if we could host our own server, since we were always playing on other people's servers. Because I was a nerdy kid, I looked into how those servers were hosted and how we could host our own. That's where I got into Linux. After a few weeks of playing around, I got a server running on a DigitalOcean VPS. From there, I got in touch with DevOps and developed a passion for it. While working on that, I started coding and creating mods for the game. That's how I started. Eventually, our server became one of the top servers in the world, ranking in the top five.
Bart: Very good. Are you still doing anything with gaming?
Isala: Nowadays, I don't have much time, but when I do, I play League of Legends. When I get together with my friends, we mostly play League of Legends. It's very good.
Bart: How do you keep updated with Kubernetes and the Cloud Native ecosystem? Things move very quickly. What works best for you: books, blogs, or podcasts? What's your favorite type of resource?
Isala: I think most of my updates come through X. I'm very invested in the tech community on Twitter, where I see new and upcoming AI and DevOps-related content. Besides Twitter, I'm a huge fan of Reddit, particularly the Kubernetes subreddit, r/Kubernetes, r/devops, and another subreddit I enjoy, r/homelab, where people share their DIY setups. I'm also very much into DIY things, so I look around and learn from those.
Bart: If you could go back in time and share one career tip with your younger self, what would it be?
Isala: I would say, first of all on a serious note, start blogging. Since I started blogging maybe two years ago, I have learned a lot. Before, I knew the concepts, but when you sit down and start writing an article, you find that there are gaps in your knowledge where you can improve. That's one of the main things I would say to someone starting out: you don't need to be an expert. Just start writing something - you don't even need to publish it. When you write, you try to understand where you can improve.
Bart: As part of our monthly content discovery, we found an article you wrote called Learned it the hard way: Don’t use Cilium’s default Pod CIDR. The following questions explore that story further. Let's start with your background. You mentioned being involved with eBPF for a while. How did this lead to your role in migrating your clusters from Azure CNI to Cilium CNI?
Isala: I got into eBPF during my final year of undergraduate studies. My final year project was inspired by my internship experience and aimed to find anomalies within a Kubernetes cluster, alerting the security team before they became bigger problems. The project passively monitored the cluster and sent updates. While researching, I needed a vendor-neutral and language-neutral solution, which is when I came across eBPF. I found it to be a good fit for my project and developed it on top of eBPF. During the evaluation phase, I reached out to a few people on LinkedIn, including a solution architect at WSO2, the company I now work for. I showcased my demo, and they were impressed with my understanding. The solution architect connected me with someone within the company who is now my lead, Lakhmal. At the time, he was trying to bring eBPF into the product, which led to my recruitment. Initially, I was given the task of using another eBPF-based solution called Pixie to build an observability solution for the Choreo platform. However, it did not work out as planned because Pixie was not production-ready at the time. We then came across Cilium, which has built-in observability capabilities and security features. Although migrating to Cilium was a risky move, we weighed the pros and cons and decided to proceed. That's how I worked my way into the role.
Bart: Migrating to a new CNI is a significant decision. What were the primary factors that led your team to choose Cilium for this migration?
Isala: We were initially using a solution called Pixie to provide observability, and we found it had a lot of good capabilities. However, at that time, early 2022 to 2023, it wasn't very stable. We wanted to provide a framework-agnostic observability solution for HTTP metrics. Since we are working on Kubernetes, we found Cilium to be an excellent choice, offering additional benefits like transparent encryption with WireGuard, load balancing, and Network Policies. We presented this to our product council, weighed the pros and cons, and decided to give it a try. It has paid off significantly by now. That's how we picked Cilium.
Bart: It seems like Cilium offered numerous benefits. However, you mentioned an incident that occurred post-migration. Can you walk us through what happened when your team promoted updates to the staging environment?
Isala: We were running our system in production for more than eight months with virtually no issues, just minor things, and it was running very smoothly. We were doing daily releases, five days a week. On this particular Monday, the SRE team promoted the dev environment to staging. After a while, we started receiving alerts saying the staging site was not working. This was not entirely unexpected, as sometimes pushing two or three days' worth of changes to staging can cause issues. The SRE team performed the natural diagnostic steps and pinned the problem to our load balancer, which was not responding due to an issue with NGINX. They suspected an NGINX configuration issue. The main concern was that we have two clusters, one in the EU and one in the US, and while the EU cluster was working perfectly, the US cluster was not responding. These two clusters have virtually the same configuration, which added to the confusion. The SRE team was stuck on what was causing the issue.
Bart: Complex networking issues often require extensive troubleshooting. What steps did your team take to investigate and understand the problem, using tools like packet tracing?
Isala: They did a lot of investigative work before reaching out to me. Their main suspicion was that the traffic might be getting dropped by the Azure Firewall, which sits between the user and the load balancer. To investigate, they enabled logging on the firewall, created load balancers, and performed node-level packet tracing and packet captures. The data pointed to a strange pattern: packets were coming in through the firewall to the load balancer and then to the NGINX pod, but none of the response packets were getting out of the NGINX pods. This was only observed when traffic came through the firewall. To troubleshoot further, they created small VMs in different subnets to bypass the firewall, and those worked. The problem only occurred when packets came through the firewall; then nothing responded. As they dug deeper into the issue...
Bart: You uncovered an interesting root cause. For our more technically inclined listeners, could you explain what you discovered about the subnet assignments, or more specifically, the CIDRs?
Isala: Since this was an Azure Firewall, we don't have much internal access to it, so we brought in an Azure expert to help debug the issue. While they were checking the different packet flows, we noticed a particular pattern: calls made from within the subnet where the firewall sits didn't work, but calls from anywhere else did. That made us suspicious. It wasn't just the firewall; even if you bypassed the firewall and came directly to the pod from within the firewall's subnet, it still didn't work. To investigate further, we SSHed into the node and listed the routing table, and that's where we found the issue. The root cause was that Cilium, by default, configures the cluster pod CIDR to 10.0.0.0/8, which is a huge IP range. Cilium carves small subnets out of it and assigns one to each node, so each node gets its own pod subnet. In our case, that huge range contained far more addresses than we would ever use. When we checked the routing table, we saw entries like 10.1.0.0 going to node A and 10.2.0.0 going to node B, and at the end there was an entry that directly overlapped with the firewall's subnet. Incoming packets matched the right routes, but when a response packet needed to leave the cluster towards the firewall, the more specific pod-CIDR route took precedence and the packet was routed to another node instead of out of the cluster. On that node there was nothing listening on the firewall's IP, so the packet was simply blackholed, with no error or echo at all. Once we found this, we tested it by removing that node, which immediately fixed the issue: traffic started flowing out again. Later we understood why it took so long to surface: Cilium hands out a new per-node subnet on every scale-up, and every promotion to staging triggered node scaling, so the IP range was consumed gradually until a node's subnet finally overlapped with the firewall.
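To make the routing behaviour Isala describes concrete, here is a minimal Python sketch of longest-prefix-match route selection. All addresses are made up for illustration: the point is that a reply destined for a firewall IP that happens to fall inside a node's pod CIDR is sent to that node's route instead of the default route, and is then blackholed.

```python
# Sketch of longest-prefix-match route selection with made-up addresses.
import ipaddress

# Simplified node routing table: destination prefix -> next hop.
routes = {
    "0.0.0.0/0":    "vnet-gateway",   # default route out of the cluster
    "10.1.0.0/24":  "node-a",         # Cilium per-node pod CIDRs
    "10.2.0.0/24":  "node-b",
    "10.73.0.0/24": "node-z",         # later allocation overlapping the firewall subnet
}

def select_route(dst: str) -> str:
    """Pick the most specific (longest-prefix) route containing dst."""
    ip = ipaddress.ip_address(dst)
    matches = []
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if ip in net:
            matches.append((net, hop))
    # Longest prefix wins, just like the kernel's routing decision.
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

print(select_route("52.168.10.4"))  # ordinary internet IP -> vnet-gateway (leaves the cluster)
print(select_route("10.73.0.4"))    # reply to the firewall IP -> node-z (blackholed)
```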
Bart: It's intriguing that this issue only surfaced after months of running Cilium. What factors contributed to the problem remaining dormant for so long?
Isala: The issue was that we promote releases every day, which sometimes triggers a node scale-up. When new nodes come up, Cilium takes a new CIDR block from the IP range and assigns it to the new node. In this particular case, it took a long time to work through the gigantic IP block that Cilium is given by default. After about eight months, this reached the critical point where a node's block overlapped with the Azure Firewall's subnet.
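As a rough illustration of why the problem stayed dormant, here is a hypothetical Python sketch. It assumes per-node /24 blocks are handed out sequentially from the default 10.0.0.0/8 pool and uses an invented firewall subnet; the real allocation order and addresses will differ, but it shows how many node allocations can pass before one lands on an infrastructure range.

```python
# Hypothetical illustration: how many per-node pod CIDR allocations out of
# Cilium's default 10.0.0.0/8 pool before one overlaps an infrastructure subnet.
# The /24 node mask and the firewall subnet are assumptions for the example.
import ipaddress

cluster_pool = ipaddress.ip_network("10.0.0.0/8")       # default pod CIDR pool
firewall_subnet = ipaddress.ip_network("10.73.0.0/26")  # made-up Azure Firewall subnet

for count, node_cidr in enumerate(cluster_pool.subnets(new_prefix=24), start=1):
    if node_cidr.overlaps(firewall_subnet):
        print(f"allocation #{count} ({node_cidr}) overlaps {firewall_subnet}")
        break
```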
Bart: Once you identified the root cause, how did you resolve the issue in both the staging and production environments?
Isala: After finding the root cause, the fix was straightforward, but there were complications when applying it. The first complication was changing a single value within the Cilium config, specifically the cluster-pool IPv4 CIDR range. According to the Cilium documentation, changing this value is not recommended as it may result in network disruption. However, since there was no alternative, the change was tested in the development environment. The team ran a DaemonSet with client and server daemons on every node that communicated with each other, to verify that every node could talk to every other node. After confirming the test results, a node scale-up was triggered to introduce new nodes with the updated CIDR range. The SRE team then drained the existing node pool, allowing pods to be pushed to new nodes with the updated IP range. The process was monitored, and no packet drops were observed. The findings were presented to the product council, which accepted the risk. The change was then tested on staging and applied to production without any issues. After completing the change, additional tests were run to verify that all nodes were using the new range and that the old range was no longer in use. An extensive analysis was conducted to ensure that the new CIDR range would not conflict with other cloud resources, and the SRE and Helm teams were informed that this range was off-limits for future resource creation.
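The team's DaemonSet-based connectivity test isn't published, but the pattern is easy to sketch. Below is a hypothetical, minimal Python version of the idea: every node runs a small HTTP server, and a client loops over the peers' probe IPs and reports which ones it cannot reach. The port, peer list, and timeout are all assumptions, not the team's actual tooling.

```python
# Hypothetical sketch of the "every node can talk to every node" check run as a
# DaemonSet while rotating the pod CIDR. Port, peer IPs and timeout are assumptions.
import sys
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = 8080  # assumed port for the probe pods

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reply 200 so any peer that can reach us gets a positive answer.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep probe logs quiet

def serve():
    """Run on every node (e.g. via a DaemonSet) so peers have a target."""
    HTTPServer(("0.0.0.0", PORT), PingHandler).serve_forever()

def check(peer_ips):
    """Try to reach the probe pod on every other node and collect failures."""
    failed = []
    for ip in peer_ips:
        try:
            urllib.request.urlopen(f"http://{ip}:{PORT}/", timeout=3)
        except OSError:
            failed.append(ip)
    return failed

if __name__ == "__main__":
    if sys.argv[1:2] == ["serve"]:
        serve()
    else:
        failures = check(sys.argv[1:])
        print("unreachable peers:", failures or "none")
```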
Bart: Challenging experiences often come with valuable lessons. What were the main takeaways for you and your team from this incident?
Isala: The main takeaway from this experience is that we were lucky it happened on our staging environment, so we weren't overly stressed, but we were concerned because we didn't know why it was happening. We went through a methodical debugging process, covering everything step by step. Otherwise, I don't think it would have been easy to find the root cause of the event. Many of us hadn't worked at such a low level on networking before, so tracking packets and finding the exact routing, especially with different network components, was a valuable learning experience. After this process, we all have a better understanding now. The main takeaway is that this was a good learning experience for us, and whenever a similar incident occurs, our team is now well-equipped to handle it.
Bart: Given the complexity of the issue, are there any tools or practices that could have helped prevent or detect it earlier?
Isala: That's a very good question. In a large organization like ours, it's incredibly hard to track every dependency and every resource deployed within our networks. None of our engineers or the SRE team had encountered something like this before. I think the main advantage we have now is experience. As I mentioned, we are now well-equipped to handle something like this; our SRE team can spot a potential issue like this from a mile away. If something similar happens, it could be a routing issue, and we can go straight to the routing table to investigate. We had a similar issue in one of our other clusters last week. The engineer working on it went directly to the routing table and checked what was going on. The issue was at the node level, where a component running in Host Network Mode was unable to talk to CoreDNS running in the cluster. It turned out to be a missing routing rule, which the engineer identified by checking the routing table on the node. Ultimately, there's no single magic bullet; it just comes down to experience.
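On the prevention question, one lightweight practice is a pre-flight check that compares a planned pod CIDR against the ranges already reserved by the surrounding infrastructure before the cluster is installed. Here is a hypothetical Python sketch; the reserved ranges below are invented for the example.

```python
# Hypothetical pre-flight check: fail fast if a planned pod CIDR overlaps any
# range already reserved by the surrounding infrastructure (invented examples).
import ipaddress

reserved = {
    "azure-firewall-subnet": "10.73.0.0/26",
    "vnet-node-subnet":      "10.240.0.0/16",
    "on-prem-vpn-range":     "172.16.0.0/12",
}

def check_pod_cidr(pod_cidr: str) -> list[str]:
    """Return the names of reserved ranges that the planned pod CIDR overlaps."""
    pool = ipaddress.ip_network(pod_cidr)
    return [
        name
        for name, cidr in reserved.items()
        if pool.overlaps(ipaddress.ip_network(cidr))
    ]

print(check_pod_cidr("10.0.0.0/8"))     # Cilium's default -> clashes with 10.x ranges
print(check_pod_cidr("100.64.0.0/16"))  # a dedicated, unused range -> []
```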
Bart: Stressful situations obviously aren't limited to work, like when you face a problem with Cilium and it decides to stop working; there are plenty of stressful situations in life too. What are your strategies for managing pressure?
Isala: Professionally, I'm extremely lucky to be in a team where we treat various learning opportunities as a way to grow, rather than pointing fingers at each other. This mindset gives us a huge incentive to test out new things, learn, and find different ways to improve. Dealing with challenges is not very stressful in our team because it's a matter of finding the root cause and fixing it, rather than worrying about job security. Having a good team really helps. Beyond that, I make it a habit to talk with my lead regularly and sync up so I don't have any potential blockers that he's not aware of. This way, when it comes to massive deadlines, he's always up to date and we can be flexible. Personally, I deal with stress by weightlifting, which helps me de-stress. When I'm working on a huge feature near a deadline, weightlifting gives me mental clarity and a fresh mindset. After a heavy day at the gym, I always come back with a clear mind.
Bart: Now, speaking of frustrating situations, your talk was accepted at KubeCon North America 2024 in Salt Lake City, but you could not attend for reasons outside of your control. Has this stopped you from applying to other conferences or KubeCons, and when are you going to be speaking next?
Isala: I was originally supposed to speak at eBPF Day at KubeCon NA, but unfortunately, due to unavoidable circumstances, my visa was declined. I was eager to meet the people who are building tools I use daily, but it was out of my control. Currently, one of my talks is waitlisted for KubeCon EU, but I'm not sure if I have a good chance. My VP, Kanchan, was very supportive and encouraged me to submit to more conferences and apply for more CFPs. I hope to apply to more conferences, including the next one. If I get selected, I will definitely participate.
Bart: Do you have anything else planned? What's next for you? Whether it's weightlifting, writing articles, video games, what do you have going on next?
Isala: I'm mostly researching how to use AIOps and eBPF to build better tools that make the lives of SREs easier. I've found that current LLM-based agents are powerful and can do a lot to make our day-to-day life easier. In my day-to-day work, I'm mostly focused on building the next generation of features for an observability platform with Prometheus. I have a few articles in mind, but I haven't had the time to work on them. I was also working on a proxyless scale-to-zero project, which uses eBPF instead of a proxy to scale workloads to zero. One of the committee members asked for an update on my progress from last month, and I promised to provide one within two weeks. However, due to other commitments, I couldn't get around to it. I'm hoping to quickly build a solution using maybe KEDA or Cilium.
Bart: How can people get in touch with you?
Isala: If you visit my personal website, you will find a lot of information about me, and all the ways to connect with me. However, if you are on Twitter, the best way to connect with me is through a direct message. I'm happy to connect with anyone who is interested, and I'm always in the mood to share my knowledge.
Bart: Very good. As you've done so with us today, Isala, thank you for joining us on KubeFM. You did a wonderful job responding to our questions. We wish you nothing but the best and hope to see you give a talk at KubeCon very soon. Thanks a lot.
Isala: Thank you.