More Kubernetes Than I Bargained For

Nov 25, 2025

Host:

  • Bart Farrell

Guest:

  • Amos Wenger

Amos Wenger walks through his production incident where adding a home computer as a Kubernetes node caused TLS certificate renewals to fail. The discussion covers debugging techniques using tools like netshoot and K9s, and explores the unexpected interactions between Kubernetes overlay networks and consumer routers.

You will learn:

  • How Kubernetes networking assumptions break when mixing cloud VMs with nodes behind consumer routers, and why cert-manager challenges fail in NAT environments

  • The differences between CNI plugins like Flannel and Calico, particularly how they handle IPv6 translation

  • Debugging techniques for network issues using tools like netshoot, K9s, and iproute2

  • Best practices for mixed infrastructure including proper node labeling, taints, and scheduling controls

Transcription

Bart: In this episode of KubeFM, we're joined by Amos Wenger, developer, writer, and the protagonist of one of the most chaotic and deeply educational Kubernetes stories to hit the internet recently. In his article titled "More DevOps Than I Bargained For," Amos walks us through how a seemingly harmless decision to add a home computer as a Kubernetes node turned into a full-blown production incident.

The story involves TLS certificates failing to renew, cert-manager challenges timing out, pods becoming unreachable depending on their scheduling, unexpected NAT behavior in both IPv4 and IPv6, Kubernetes overlay networks performing unintended translations, CNI plugins behaving in ways only ancient documentation vaguely hinted at, and the bizarre consequences of mixing cloud VMs with a node sitting behind a consumer router.

In this episode, we'll break down, step by step, the root causes behind the outage. We'll explore what actually happens when a cert-manager challenge lands on the wrong node, how Kubernetes networking models mask critical differences between environments, and why Calico, Flannel, and IPv6 NAT66 all collided to create a problem you simply can't reproduce in a normal cluster.

If you care about cluster networking, have wondered what happens when you blend cloud infrastructure with home lab hardware, or enjoy watching Kubernetes expose every assumption you didn't know you were making, this episode is definitely for you.

This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has provided training to Kubernetes engineers worldwide. Courses are instructor-led, 60% practical and 40% theoretical, offered both online and in-person to individuals and groups. Students have access to course materials for life. For more information, check out learnkube.com.

Now, let's get into the episode. Welcome to KubeFM. What are three emerging Kubernetes tools that you're keeping an eye on?

Amos: This is a tough question. I initially didn't want to answer it, but since I've been doing a lot of Kubernetes work, I'm now keeping my eyes on many things. The problem is I'm unsure which technologies are emerging and which have been around for a long time because everything is new to me. If I had to pick three, Argo is not emerging—it has been around for a while.

Bart: It's been around for a while, and there's a strong community behind it in the GitOps world.

Amos: For me, it's new and emerging. Argo CD is a game changer in terms of just clicking around. Before that, it was K9s, another game changer. We went from kubectl in the console to K9s with a terminal UI, and then Argo CD. Does anything else come to mind? I'm keeping an eye on my own SSO software, because I've tried the current options. I tried Keycloak and the one that starts with Z that I forgot. I'm not happy with any of them. I think I'm going to develop my own, for UI and tech-stack reasons.

Bart: Sounds like there's an opportunity there for growth. Can you tell me more about what you do and who you work for?

Amos: I work for myself. I'm lucky that I got fired about a year and a half to two years ago. I had enough patrons and sponsors on GitHub and Patreon to continue my work. I've been writing articles and producing videos about Rust since 2019, becoming more serious over time. When the job ended, I realized I didn't need to look for another one: I could continue my current work, with money on the side, and see whether the revenue would grow and become sustainable long-term. So far, it has.

Bart: Fantastic. How did you get into Cloud Native? What's the story behind that?

Amos: I guess we would have to define Cloud Native. I was just talking about Kubernetes. I have tremendous imposter syndrome coming into this podcast. I was very nervous when you initially reached out because of the buzzwords. I'm constantly Googling things, trying to understand what they mean. It's not that scary when you know the definition. So, Cloud Native—how would you define it exactly?

Bart: Cloud Native means designing and running applications that automatically scale, heal, and evolve by leveraging containers, orchestration, and modern distributed infrastructure. At least that's the official definition. What about you?

Amos: Essentially, we had this monstrous Ansible playbook to deploy a reverse proxy at the edge on a bunch of edge nodes. A colleague of mine who was more senior in these things pushed for using K3S instead—having a K3S cluster on every edge node. I used Spinnaker to do rollouts. I wasn't very attracted to it initially and was resistant, like much of the team. But over time, I was convinced enough that when it came time to maintain my own infrastructure, I decided to try it.

My website is a static Hugo site. I just needed login, some articles for early access, and a full-text search engine that runs on the server because all client-side search options are bad. Suddenly, I was maintaining a server and thought, "Let's try this K3S thing. It can't be that bad." And it's been downhill from there—deeper and deeper into the Kubernetes ecosystem.

Now I'm super into it. I was looking at some vendor offering globally distributed containers that scale to zero, promising low costs and complete management with TLS termination. It sounded super cool and would definitely cost less than my current setup. But I don't want it. My cluster is too comfortable. I know how to do observability. I know how to do everything. I don't want to change.

Bart: The Kubernetes and cloud native ecosystem moves very quickly, and it is often a challenge for people to stay on top of all the changes and stay up to date. Some people prefer blogs, some people prefer videos, some people prefer podcasts. What works best for you?

Amos: I'm going to confess something: ChatGPT worked really well for me. Everybody loves to hate it, and I'm a little ashamed to admit it, but it was actually pretty good. When it got the ability to search the web, it became much more useful.

Treat it as an adversarial conversation: sometimes it just makes things up, and you have to say, "That sounds wrong," and then do more research. It's not a source of truth, but it's a great fuzzy search when you don't know what you should know. It helps you learn the vocabulary and language of what you should be looking at, so you can then go find primary sources.

I don't read a lot of blogs, because I write my own and I just don't like to. I often find documentation lacking, so I end up reading source code directly to figure out how things work, which is something you can do in the open-source ecosystem.

You're hearing my cat, by the way. His name is Sherlock. He's white and wonderful, and he's featured in some of my videos.

Bart: Great! Welcome, Sherlock. We've done about 80 episodes, and this is the first one featuring a cat, so it's very exciting. (Jokingly) We need more cats.

Amos: We don't lock up the cats for the recording.

Bart: If you could go back in time and share one career tip with your younger self, what would it be?

Amos: I made the same mistake several times in a row, because it takes me a while to learn: getting friendly with your boss at a small company, and ending up in the kind of relationship where you think they're your friend because it's a small company without a lot of bureaucracy. I regret not recognizing the power dynamics and letting them take advantage of me. They would say, "Can't you work a little bit more?", taking advantage of the fact that I really like to work a lot. They didn't stop me; they were like, "Great, more work for the same salary."

There were definitely some shady things. It was a long time ago, and I'm not nervous about anyone recognizing themselves in this story. But my advice would be: a job is a job. Even if you like your colleagues and bosses, HR is not on your side, and your boss is not your friend. It's a job.

Now that I work for myself, I see things differently. When I work on my own business, I'm building something for myself. But if you're building something for someone else, count your hours, clock out, and go have a life. That would be my advice.

Bart: Very good. Sound advice. As a freelancer, I strongly agree. A lot of people need to hear that at the end of the day, you have to understand the mechanisms you're a part of. This is about making money, and you should be very transparent about that. Of course, you can have friendships at work and positive relationships based on trust, support, respect, and caring for each other. But it's also important to keep in mind what this is about and be honest about it. We can all have more adult conversations and be respectful by keeping these things in mind. I think that's very sound advice. We'll definitely link the video in the show notes.

Now, we're here today to talk about an article you wrote titled "More DevOps Than I Bargained For". Before we get into the specific incident you mentioned in the article, can you give us some background on your Kubernetes setup and what led to adding a home computer as a node to your production cluster?

Amos: So it's pretty simple. I had a bunch of VMs and had built my own CDN, essentially, but not the fun way: just GeoDNS. I had small Hetzner Cloud VMs in different regions. They started offering ARM64, and back then I was looking to be cost-efficient. ARM64 instances were slightly cheaper, so I thought it was a good time to build my applications for both ARM64 and AMD64.

Unfortunately, I didn't have large enough CI runners to do both. I migrated a bunch of things to ARM64 but still needed x86 because Hetzner doesn't have ARM64 instances in some regions like the US and Singapore. I also realized, somewhat belatedly, that after migrating, some services were down and I suddenly needed an x86 image of my software.

I have a Mac Studio at home that runs 24/7 for other home tasks. I figured I could just use UTM or similar software to run an x86 VM on my Mac Studio and have it join the cluster. It seemed simple enough.

Bart: Alright, so you added this home computer to your cluster and initially everything seemed fine. But what was the first sign that something was wrong, and how did it manifest?

Amos: The TLS certificates stopped renewing. I noticed some domains were broken: certificates had expired and not renewed. Normally, you use cert-manager; I don't know if there are other solutions in Kubernetes. Cert-manager solves HTTP-01 challenges by serving a secret token at a well-known path (/.well-known/acme-challenge/), and Let's Encrypt's servers then hit your server to verify that you actually own the domain name. That's one type of challenge. The rest of the article wouldn't exist if I had figured out that you can do DNS challenges instead, which I switched to yesterday.
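
For readers who want to make the same switch, here is a minimal sketch of a DNS-01 issuer for cert-manager. The DNS provider and all names are hypothetical (the episode doesn't say which provider Amos uses); this assumes a Cloudflare-managed zone and a Secret holding a scoped API token:

```bash
# Hedged sketch: solve ACME challenges via DNS-01 instead of HTTP-01,
# so Let's Encrypt never needs to reach a pod behind your router.
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com                  # placeholder: your contact email
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key   # cert-manager stores the ACME account key here
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:              # Secret with a DNS-edit API token
              name: cloudflare-api-token
              key: api-token
EOF
```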

Bart: When the certificate renewal started failing, how did you narrow the problem down to the home node specifically?

Amos: I used K9s to look at where the challenge pods were running. The ones on cloud VMs were working fine, but the ones on the home node were failing. I hadn't locked down well enough what could get scheduled on the home node. You can specify node selectors, you can specify taints and tolerations; there are many ways to control where pods get scheduled, and I could have tainted the node to prevent this problem (see the sketch below). There are so many ways I could have prevented this from happening. The point was, it was kind of a free-for-all: the scheduler saw some free compute and free resources and thought, "Why not put the challenge over there?"
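
As a rough illustration of those scheduling controls, with a hypothetical node name and label:

```bash
# Taint the home node so nothing lands there unless it explicitly opts in.
kubectl taint nodes mac-vm location=home:NoSchedule
kubectl label nodes mac-vm location=home

# Workloads that *should* run at home then opt in from their pod template:
#
#   spec:
#     nodeSelector:
#       location: home
#     tolerations:
#       - key: location
#         operator: Equal
#         value: home
#         effect: NoSchedule
```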

Bart: Now, let's talk more about why the home node was different. Your cloud VMs have public IP addresses, but your home computer sits behind your router. How did this difference in network setup cause problems?

Amos: I have a video about this called "I've just had three coffees and I'm going to explain how the internet works." The problem is that there aren't enough IPv4 addresses for all the internet-enabled devices on Earth, so we're essentially cheating: all the devices in my house share one public IPv4 address. It works in one direction: my Mac Studio VM was able to join the cluster. However, when Let's Encrypt attempted to validate the challenge, it couldn't, because it hit the router, which wasn't expecting a call on that port and simply blocked it.

Bart: So you started debugging this network path issue, and in the process, what tools did you use to understand how the traffic was actually flowing?

Amos: My go-to is K9s, where you can see nodes, their IP addresses, and information about pods. However, that's not enough. Even though K9s lets you shell into an existing pod, I tend to use slim images that don't even have curl. What you want is a debug pod next to it. This is where netshoot comes in: nicolaka/netshoot is a container image with curl and a bunch of other network tools, and from there you can see what the traffic is actually doing. I mostly use "ip addr" from the iproute2 tools; we don't use ifconfig anymore. On macOS, I used traceroute, though it wasn't actually that helpful. Then just curl and see where it goes.
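
A sketch of that workflow, assuming a hypothetical node name; nicolaka/netshoot is the image Amos refers to:

```bash
# Start a throwaway debug pod pinned to the suspect node
# (--rm deletes it when the shell exits).
kubectl run netshoot --rm -it \
  --image=nicolaka/netshoot \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"mac-vm"}}' \
  -- bash

# Then, inside the pod: inspect what the CNI handed out and probe the path.
ip addr    # the pod's addresses (iproute2, not ifconfig)
ip route   # where traffic leaves
curl -v http://example.com/.well-known/acme-challenge/some-token  # placeholder URL
```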

Bart: At this point, you discovered that your cluster was doing something called NAT, or Network Address Translation, for both IPv4 and IPv6. Can you explain what's happening with the network traffic?

Amos: What I expected was that, because I have real IPv6 support from my home internet service provider, the VM would have its own IP address taken from that prefix and be publicly routable and reachable from the outside. However, that's not how it worked for IPv6. And for IPv4, the node's address was not taken from the CIDR range I had configured for my nodes, but from the LAN, like 192.168.0.100, rather than from the 10.x range.

Bart: Now this IPv6 behavior surprised you because your home internet has real IPv6 addresses. Why was the cluster translating these addresses?

Amos: Kubernetes uses an overlay network. The addresses of the pods are not necessarily public; they exist so pods can communicate with each other, and that communication can be encrypted between pods, depending on how it's set up, including for IPv6. That means some address translation occurs at the edges. In this incident, which was quite confusing, there was also Network Address Translation (NAT) happening inside the VM hypervisor on macOS. As a result, multiple levels of IPv6 translation were in play.

Bart: You mentioned switching from Flannel to Calico as your CNI plugin. What drove this decision? Why did you decide to do this? And what's the difference between these CNI plugins if people out there are faced with the same choice?

Amos: The main reason was definitely that Calico handles IPv6 translation better. I was hoping to at least fix IPv6 with that. The type of translation it does is called NAT66. Whether traffic between nodes is encrypted depends on how your cluster is set up; Calico can do it transparently with WireGuard. You can see it in the pod logs as it boots up and bootstraps that network. It's fun to see.
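
For the curious, enabling that transparent WireGuard encryption in Calico is a small change; this mirrors Calico's documented approach, assuming a recent Calico release:

```bash
# Turn on WireGuard encryption for node-to-node pod traffic.
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"wireguardEnabled":true}}'

# IPv6 traffic has a separate flag in newer Calico versions.
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"wireguardEnabledV6":true}}'
```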

Bart: Once you understood the problem, you needed pods on the home node to use real IPv6 addresses instead of the private overlay addresses. How did you approach solving this?

Amos: With Calico, you can create different IP pools and decide which nodes pick from which pools. This means your cloud nodes can keep private IP addresses while the home node gets a real IPv6 address from your ISP's prefix.
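
A hedged sketch of such a pool, using the IPv6 documentation prefix as a placeholder for the ISP-delegated one and a hypothetical node label. Applying projectcalico.org/v3 resources with kubectl assumes the Calico API server is installed; with a plain CRD install you'd use calicoctl instead:

```bash
kubectl apply -f - <<'EOF'
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: home-public-v6
spec:
  cidr: 2001:db8:abcd::/64          # placeholder: your ISP-delegated prefix
  natOutgoing: false                # globally routable addresses, so no NAT66
  nodeSelector: location == "home"  # only nodes labeled location=home draw from this pool
EOF
```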

Bart: Then you tested your solution and discovered something that was quite odd: the pods were accessible from the internet, but not from your home network. What was going on there?

Amos: This is a fun thing called NAT hairpinning, or a NAT scope problem. I'm not sure of the exact details, but it's documented and has a Wikipedia page, so it exists.

Bart: Throughout this debugging session, you were working at 4am with your website down. How do you decide when to push through versus when to roll back and try again later?

Amos: I like breaking things just to learn about them and then writing about it so I don't forget what happened. At this point, I hadn't really backed up the old images. This is a very bad example of ops. If it was for a business, it would have been terrible. We would do a blameless post-mortem, but people would blame me silently. In this case, it was just me. People can wait to read articles. It's on archive.org.

In general, I'd rather roll forward, probably because of a sunk cost fallacy. But also because every time you have a problem like this, it's an opportunity to learn something new. If you just roll back, then you lose the situation in which you can study. So if it's something that's not too important and you can afford the downtime, it's nice to be able to explore and learn about new things. The main thing I learned is that I never should have added this node, or I should have tainted it.

Bart: This incident revealed many hidden assumptions about node networking in Kubernetes. What should teams consider when mixing different types of nodes, whether they're cloud, on-premises, or edge?

Amos: They don't have all the same network characteristics. I think I've learned my lesson. I now label my nodes religiously and have very strict node selector directives. I have a whole process where Terraform brings up the VMs, then a Rust script reads the Terraform output and generates an inventory. Ansible then generates a K3S template config that assigns labels to the nodes.

I used to have only node types like edge, control, and compute. Now I have five or six different labels which control what you can do. For example, you can have cloud volumes on Hetzner Cloud VMs, but you can't have cloud volumes on dedicated servers.
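
A minimal sketch of that labeling discipline, with hypothetical node and label names:

```bash
# Label nodes by what they can actually do...
kubectl label node hetzner-fsn-1 node-role=compute volumes=cloud
kubectl label node mac-vm        node-role=home    volumes=none

# ...and make every workload state its real requirements, e.g. in a
# Deployment's pod template:
#
#   nodeSelector:
#     node-role: compute
#     volumes: cloud
```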

Would I do it again? Yes, especially if you need a lot of compute. Buying multiple computers can make sense. But I would read this article several times, carefully schedule workloads, and test all the networking. This is not the first time traceroute has been completely useless, because it doesn't actually reflect how packets are routed. I just wanted to sneak that in, because traceroute looks like it's going to solve everything, and it really doesn't.

Bart: Looking back, what would you do differently? And what advice would you give to others who might be considering adding Homelab resources to their production clusters?

Amos: Don't do it. Just make a second cluster. In retrospect, I took for granted that networking worked; I didn't actually read about it. Believe it or not (it may not be obvious from this episode), I'm quite knowledgeable in many areas, but networking was not really one of them. I assumed it would just work and that my surface-level knowledge was enough. In fact, the first two or three times I saw the acronym CNI, I should have stopped, read about it, and compared the options much more than I did.

Bart: We've talked to a lot of folks in the ecosystem about their favorite and least favorite Kubernetes features. When it comes to least favorite Kubernetes features, networking and network policy management are generally at the top. Do you agree with that?

Amos: I've destroyed two clusters that way.

Bart: Fair enough. And on the other side, what would be your favorite Kubernetes feature if you had to choose one?

Amos: It's pretty cool how easy it is to move workloads around: just cordon and drain, and you're done. I like seeing the number of pods drained. It's just neat when it works.
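
For readers who haven't done it, the move Amos describes is just a couple of commands (node name hypothetical):

```bash
kubectl cordon mac-vm     # mark the node unschedulable; nothing new lands there
kubectl drain mac-vm \
  --ignore-daemonsets \
  --delete-emptydir-data  # evict pods so they reschedule onto other nodes
kubectl uncordon mac-vm   # later, let the node accept pods again
```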

Bart: And what's next for you?

Amos: I ran the CDN with GeoDNS and simple networking for a long time. I'm learning about Anycast, getting my own ASN (Autonomous System Number), and leasing some IPv4 address space. I'll learn more acronyms, and we can expect some downtime on my website.

Bart: If people want to get in touch with you, what's the best way to do that? Your GitHub profile or Patreon page?

Amos: Probably go to my YouTube channel, watch something, and then leave a comment. I always read my comments.

Bart: Thanks so much for joining us today on KubeFM. I look forward to speaking with you again in the future. Take care.