Rebuilding my homelab: suffering as a service
Host:
- Bart Farrell
This episode is sponsored by Nutanix — innovate faster with a complete and open cloud-native stack for all your apps and data anywhere.
Xe Iaso shares their journey in building a "compute as a faucet" home lab where infrastructure becomes invisible and tasks can be executed without manual intervention. The discussion covers everything from operating system selection to storage architecture and secure access patterns.
You will learn:
- How to evaluate operating systems for your home lab, from Rocky Linux to Talos Linux, and why minimal, immutable operating systems are gaining traction.
- How to implement a three-tier storage strategy combining Longhorn (replicated storage), NFS (bulk storage), and S3 (cloud storage) to handle different workload requirements.
- How to secure your home lab with certificate-based authentication, WireGuard VPN, and proper DNS configuration while protecting your home IP address.
Transcription
Bart: In this episode of KubeFM, we got a chance to speak to Xe and hear about their journey through Kubernetes, from the evolution of their Homelab to tackling some of the thorniest challenges in the field, such as managing secrets, persistent storage, and choosing the right ingress solutions. Xe brings powerful advice on how to start with Kubernetes and choose the right OS for a Homelab setup. Talos Linux makes an important appearance in this episode. Xe also shared why publishing failures and learning from them is one of the best ways to grow in this space, and how building real connections at conferences has transformed their experience in the community. Whether you're just starting your Kubernetes journey or have been in the ecosystem for a while, Xe's insights are sure to resonate and inspire you. So, grab a coffee, settle in, and let's get into it with Xe Iaso.
This episode is sponsored by Nutanix Cloud Platform. Is there a single platform for running Kubernetes and AI anywhere? Yes, there is. The Nutanix Cloud Platform is the ideal choice for cloud-native and AI applications, supporting all major Kubernetes distributions. With Nutanix and Kubernetes, you can scale your production seamlessly while simplifying your operations. To learn more, you can find the link to the Nutanix website in the comments. Now, let's check out the episode.
All right, so first things first: what do you do and who do you work for?
Xe: I do developer relations, writing, and site reliability work. I currently work for myself, which I claim is a company that totally exists on LinkedIn.
Bart: And how did you get into cloud native?
Xe: When I started my career, I got into cloud native stuff kind of by accident because AWS's free tier existed, and cloud native was kind of cheaper. After I managed to get rid of some of my existing responsibilities that required stateful Linux servers, I realized being an IRC systems administrator wasn't ideal. It's a lot easier to adopt a cloud-native workflow where the compute isn't really persistent, except in cases where it needs to be. Even then, you can limit it if you try.
Bart: And so, what were you before you got into cloud-native?
Xe: A college student where I got a PhD in dropping out.
Bart: The Kubernetes ecosystem moves very quickly. How do you stay up to date? Is it with blogs? Is it with podcasts? What works best for you?
Xe: Well, the Kubernetes ecosystem - I don't really stay up to date with it unless there's something that directly impacts me. I have compute, network, and storage covered, and as far as I care, that's all that really matters. I don't really keep up with it, but big announcements get sent my way and I end up reading about them.
Bart: If you could go back in time and share one career tip with your younger self, what would it be and why?
Xe: Have a blog. Write and publish what you write, what you learn, and what you succeed at. More importantly, publish what failed. Humans learn from failure better than from success. If you really want to learn, publish the stuff that went wrong. This approach is kind of anti-ego, as it involves admitting that something went wrong and you're not perfect. However, it's actually even more important to publish what went wrong than what went right. When I published my entire saga of rebuilding my home lab, I made sure to include what went wrong and what I failed at, because that information is critical.
Bart: And with that in mind, Xe is a very prolific writer. Your blog has more than 400 articles ranging in length and topic from things like meditation to WireGuard. We didn't have the chance to read all of them, but one in particular stood out, which is titled "Do I Need Kubernetes?" It was published in August 2022. The article isn't what we would call short, but let's just read a little bit of it out loud so our listeners can get some context. What was your thinking process then, and how has it evolved in the following two years?
Xe: So, I'd be amazed if you read all the articles, as it's actually closer to 500 at this point. I haven't done the calculation to update the total, and there are some automatically generated articles from templates, such as "Users of language X" or "Users of language Y where this regularly happens." I'm pretty sure it's closer to 500 at this point. I think the only person who has read even close to all of them is my husband.
The general reason I suggested against using Kubernetes for everything is that, at some level, Kubernetes is the most generic tool ever created. It's adaptable, from running on Raspberry Pis to operating complex systems like F-35 planes with Istio, and even being used by companies like Chick-fil-A for their restaurant operations. This is both a blessing and a curse, as Kubernetes is simple at its core but can be complex due to its declarative and eventually consistent nature, which can be confusing for people.
At the time I wrote that, everyone seemed to be building their business on top of Kubernetes by layering random components to achieve an allegedly usable result, resulting in a big pile of configuration YAML spaghetti. I hated it. What really sucks, though, is that Kubernetes sucks all the oxygen out of the room. As a result, Docker Swarm is basically dead at this point. Mesos might be an option if you're building an everything app, but Nomad has a glass ceiling. Once you hit it, you'll run into weird issues where production will just die, and you'll have no idea why. You'll have to throw it out and start over.
Bart: I know you mentioned your home lab already, but as part of our monthly content discovery, we found an article that you wrote, "Rebuilding My Home Lab: Suffering as a Service." Setting up a home lab has become increasingly popular among developers wanting to learn Kubernetes. We've had quite a few guests who have also spoken about their home labs. What made you decide to rebuild yours and why mention the suffering so much?
Xe: The main thing that made me want to rebuild it is that my idea, my ideal for my home lab, is compute as a faucet. Spotify expressed a product vision as "music as a faucet" a while ago - you turn on Spotify when you want music and turn it off when you don't. I kind of want that with my home lab, where I can just say, "go do this somewhere" and it complies. In my previous setup, I had to think about things like what nodes were running, what processors were available, what storage was available locally, and what GPUs were attached to it. That did work, but it was cumbersome, unwieldy, and overall just not worth the effort. So, when I was remaking my home lab, I wanted to create something where I could just give it tasks to do and have it do them without me having to write custom software to make that happen, and ideally without having to SSH into the machines for any reason.
Bart: Now, the choice of operating system is often crucial. Two previous guests that we've had on the podcast, Mircea and Gazal, explored different approaches, such as Talos and BottleRocket. However, we noticed that you started your Homelab journey with Rocky Linux. Tell me more about that.
Xe: When I started out, I researched community management practices of the various tools I was adopting because I had some bad experiences with tools that had subpar community management approaches. Rocky Linux seemed to have a good community management approach. I also chose it because I wanted something based on Red Hat, as Red Hat is well understood, easy to Google, and has extensive documentation. In the worst-case scenario, AI companies have aggregated the documentation, so you can ask a language model to generate something, and it'll probably work. I didn't want to care about the configuration my home lab uses as long as it works.
The machine I tested it on initially was probably a mistake. At the time, my old Mac Pro, the trashcan Mac Pro, was my most idle node, running Prometheus. Since nobody would be affected if my Prometheus server went down, I used it as my testbed. That's when I ran into the first issue where Rocky Linux detects that particular model of Mac Pro as needing an HFS EFI partition instead of a FAT32 EFI partition like the rest of the world. Although that Mac model is fine with a FAT32 EFI partition after a firmware update, Rocky Linux doesn't know how to check the firmware version. It assumed it needed a Mac EFI partition and didn't have the tools to properly create it. The only solution was to manually install Rocky Linux. However, I didn't want to rebuild the ISO or do painful hacks, as that would require caring about the configuration.
I also wanted to use Ansible to manage things, as people had told me it had improved. However, I found it to be the same level of suffering, complete with YAML grammar issues and the Norway problem. And then there's the secret extra YAML problem, the Ontario problem. Do you know about the Ontario problem?
Bart: I was going to say, for people who aren't familiar, such as myself, can you tell me about the Ontario problem?
Xe: YAML has a bunch of values that can be interpreted as true and false, including true, false, yes, no, on, and off. NO is the ISO country code for Norway, and ON is the province code for Ontario. This actually caused me some problems when I tried to record the location of a machine as Ontario, where I live. The value got put into the configuration as the boolean true, and things failed because it wasn't a string, which is when I gave up. Additionally, using something like RHEL or Rocky Linux with Ansible requires me to SSH into the host to perform tasks, and I prefer not to use SSH.
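For anyone who hasn't been bitten by this, here is a minimal sketch of the Norway and Ontario problems as a YAML 1.1 parser (the kind Ansible's tooling has historically leaned on) sees them; the keys are invented for illustration:

```yaml
machine:
  country: NO        # YAML 1.1 reads this as the boolean false, not "Norway"
  province: ON       # ...and this as the boolean true, not "Ontario"
  # Quoting keeps both values as strings:
  country_code: "NO"
  province_code: "ON"
```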
Bart: So, it's clear that Rocky Linux wasn't good enough, so you moved on to Fedora CoreOS. Was the CoreOS part of it that led you to it? Perhaps some good memories? Tell me more about that.
Xe: My career started around the time when the first legacy CoreOS stuff was in public alpha. At the time, CoreOS was the first immutable Linux operating system that you could download and use. It was way ahead of its time. It had Docker as the only way to run programs on the machine, and it had this really innovative thing called Fleet. Fleet, for those who don't remember it, is distributed systemd. You give it a systemd unit and it runs it somewhere, and you can look at the logs and it will give you the logs from any node in the cluster. It was beautiful. That ultimately ended up dying because of Kubernetes, of course. But if I had infinite time and budget, I'd probably go back and remake Fleet or update it to the modern era because it's just so good.
Bart: Now, during your search for alternatives, you looked at several specialized distributions like RancherOS. What did you find in that process?
Xe: Way back in the CoreOS era, there was also an OS released by Rancher called RancherOS. It took the minimalism of CoreOS to an even deeper level, where the only thing running on the system was Docker. When you turned on the computer, it would do two things: initialize hardware to the point where you get a network stack up, and then start the system Docker to run things like the DHCP client, the user Docker instance, and the SSH daemon. Every time you SSHed into it, it would create a separate Docker container for you to do system administration tasks in. It was absolutely nuts and way ahead of its time. It ended up not really taking off because you couldn't apt-install things on it, and people weren't ready for an apt-less life.
Bart: After evaluating several options, you settled on testing Talos Linux. Why was it so compelling and how did it end up comparing to others?
Xe: Talos Linux initially came on my radar when I was speaking at the All Systems Go conference last year. The person after me was from Sidero Labs, talking about how Talos Linux is crazy minimal. That reminded me of another thing I use for a Raspberry Pi mounted inside one of my home lab nodes as a last-resort IPMI, called gokrazy. gokrazy isn't a Linux distribution, but rather a Linux implementation. It's closer to Android than desktop Linux. Everything except the kernel is written in Go, and it's a bunch of Go services that start on boot. The init is written in Go and listens on port 80 for HTTP updates. It's frankly wild, but not really production-worthy.
Talos Linux struck me as similar to gokrazy, but production-worthy. That's why I ended up picking it up. Talos Linux does two things on boot: it initializes the hardware enough to get the network stack working, and it launches Kubernetes. That's it. In its default configuration, it has only 11 binaries. It's impressive. I love it.
Bart: After installing the operating system, you often need passwords and other credentials to install the rest of the tooling. This is probably an uneventful task for most engineers, but you found an interesting twist to this story. Can you tell us what approach you ended up taking for your home lab?
Xe: I try to avoid passwords as much as possible. Passwords are still the industry standard, but they're starting to become obsolete because they have significant drawbacks. They're a credential that is easy to copy and that doesn't prove anything other than knowledge of that credential. Passwords can be exfiltrated and used for malicious purposes. Talos Linux, on the other hand, uses certificate-based authentication out of the box. It creates a cluster-scoped CA that's valid for 10 years, and your kubectl config uses a client certificate issued from that certificate authority, or possibly from a separate sub-certificate-authority. Either way, it works effectively, and I don't need to know the details of how it does it, which is great.
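As a rough sketch, the kubeconfig Talos hands back (for example via talosctl kubeconfig) has the standard client-certificate shape; the cluster name and control-plane endpoint below are placeholders:

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: homelab                          # placeholder cluster name
    cluster:
      server: https://10.0.0.10:6443       # placeholder control-plane endpoint
      certificate-authority-data: <base64-encoded cluster CA certificate>
users:
  - name: admin@homelab
    user:
      client-certificate-data: <base64-encoded client certificate signed by the cluster CA>
      client-key-data: <base64-encoded client key>
contexts:
  - name: admin@homelab
    context:
      cluster: homelab
      user: admin@homelab
current-context: admin@homelab
```

There is no password anywhere in that file; possession of the key and certificate is the credential.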
Bart: With the secrets out of the way, it's now time to install the rest of the tooling. Package management in Kubernetes is quite a hot topic. We recently had Brian Grant joining the show to discuss the history of the Kubernetes resource model. Other guests, Jacco and Alex, had their turn to discuss Helm, saying its design is fundamentally flawed. It sounds like you went through a similar experience working with Helm in this setup. Is that correct?
Xe: I had to use Helm for some things, like the 1Password operator to make managing secrets easier, or ingress-nginx, and Longhorn. I also use the WireGuard operator. I use Helm because there was no other option, and for that, I use a tool called Helmfile. Helmfile lets you list all the charts and repositories in a single YAML document. Then you run helmfile apply, and it does what you want. It's good enough that I don't have to think about it, and that's all I really care about. As you mentioned, my take is that Helm is the right tool with the wrong implementation.
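For context, a minimal helmfile.yaml along the lines Xe describes might look something like this; the release list is illustrative rather than Xe's actual one, and version pins and values files are omitted:

```yaml
# helmfile.yaml: every repository and chart in one document,
# applied with a single `helmfile apply`.
repositories:
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx
  - name: longhorn
    url: https://charts.longhorn.io

releases:
  - name: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx/ingress-nginx
  - name: longhorn
    namespace: longhorn-system
    chart: longhorn/longhorn
```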
I was looking this up on Stack Overflow, and I found the legendary "parsing HTML with regex" answer, where the person slowly descends into madness. One of the things that discussion points out is that XML needs at least a context-free (Chomsky Type-2) grammar, while regex can only handle regular (Type-3) languages. This strikes me as similar to Helm, because YAML describes objects, yet the current best practice is to do string templating of YAML, a whitespace-sensitive language. This works, but it's not elegant. In general, Helm is using a less expressive language for a more demanding task. You have to use the indent and nindent helpers or marshal things to JSON. Honestly, about half the reason I try to avoid Helm as much as possible is because, at my previous employer, we had machine-generated Helm templates that combined the pain of Go templates, YAML, and template expansion with none of the advantages. This nearly burned me out to the point where I considered quitting the industry entirely.
However, I think Helm is now mostly avoidable thanks to Kustomize. I kind of wish Kustomize had better support for replacing exact variables, but you can use patches for that, and patches are probably a superior model anyway. Machine-generated patches with the machine spitting out JSON work well, since all JSON documents are YAML documents.
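A sketch of that pattern, with a hypothetical deployment name and a machine-emitted JSON 6902 patch; because every JSON document is also a YAML document, the generator never has to think about indentation:

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
patches:
  - target:
      kind: Deployment
      name: my-app                  # hypothetical deployment name
    patch: |-
      [
        {"op": "replace",
         "path": "/spec/template/spec/containers/0/image",
         "value": "registry.example.com/my-app:v1.2.3"}
      ]
```

Running kubectl apply -k . in that folder applies the whole stack in one shot.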
Bart: And it's a known fact that persistent storage and databases are two of the fundamental challenges in Kubernetes. How did you approach this in your Homelab environment?
Xe: In my Homelab, I use several different container storage implementations. I have three tiers of storage. I have Longhorn for data that I want to be replicated between computer boxes and automatically backed up to the cloud. This is for things like my QIDI, virtual machine disk images, and a bunch of other stuff. One of my most used things is a SQLite database on top of Longhorn in a one-gigabyte persistent volume claim. One gigabyte is probably overkill, but it can't be sized down easily.
The second tier is NFS to my NAS via the NFS Subdirectory Provisioner. This is used for things that require bulk storage, are accessed infrequently, and where I would be okay if the data were to mysteriously vanish one day.
The third tier is using CSI S3 to Tigris. Disclosure: I am contracting with Tigris, but I am not a representative of them. I just really like what they do. I use CSI S3 with Tigris because it is great for workloads that have a lot of data that is accessed infrequently, and I don't want to keep it locally. For example, I run a Discord bot that archives DJ sets for an online radio station that I volunteer with, Ponyville FM. All of those radio sets are stored in Tigris with CSI S3, so that people can download them, and it will treat it like a normal download, going from Tigris to my server to the ingress-nginx server to them. This setup is not ideal, but it works well enough.
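Put together, the three tiers show up to workloads as nothing more than different storage class names on a PersistentVolumeClaim. This is a sketch assuming the default class names for each provisioner; the claim names and sizes are invented:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-sqlite
spec:
  storageClassName: longhorn       # tier 1: replicated, backed up to the cloud
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bulk-scratch
spec:
  storageClassName: nfs-client     # tier 2: bulk NFS to the NAS
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 200Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dj-set-archive
spec:
  storageClassName: csi-s3         # tier 3: object storage in Tigris via CSI S3
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi
```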
Bart: Off script, would you care to share the radio station, perhaps Ponyville FM?
Xe: The what?
Bart: You mentioned the Ponyville FM.
Xe: Ponyville FM is mostly electronic music. It covers a lot, but it's usually electronic music, dance music, trance, that sort of stuff.
Bart: Very cool. Good. Now that storage is sorted out, you needed to handle external access to your services. What solutions did you implement for Ingress and DNS?
Xe: When designing this kind of setup, everything is a trade-off. There are reasons why you'd want to do certain things over others, and there are limitations to keep in mind. As an online streamer, one of the golden rules of thumb is to never expose your home IP address in DNS, because people will find it and send a large amount of traffic to take down your entire home internet connection and ISP.
With that in mind, I needed a proxy. I chose Vultr arbitrarily and created a server in Toronto running a simple TCP proxy that forwards traffic to the right destination. It actually proxies over WireGuard into the Kubernetes network, to the cluster IP of the ingress-nginx service. When you access something hosted by my home lab, the traffic goes from the internet to that proxy, over WireGuard to ingress-nginx, which then directs it to the right service and pod. Responses go all the way back.
At some point, I'll probably upgrade to more than one node, but I haven't had to care about it yet. Ingressd is a simple TCP proxy I wrote for myself, consisting of about 120 lines of Go code. It's a basic solution, but it works, so I don't have to think about it.
Bart: You encountered some interesting DNS-related challenges. Could you possibly elaborate on the issues with cluster naming conventions?
Xe: A while ago, when I worked in SRE, one of my SRE mentors told me never to use fake domains for production stuff. This is because someone else is likely to publish it to public DNS, and then your entire system breaks. This advice started circulating around the time Google announced the .dev top-level domain. Many people using .dev for internal development were left high and dry.
When I set up my cluster, I used a subdomain of a domain I already own, something in public DNS, which allowed me to get Let's Encrypt certificates for those names. I didn't want to maintain my own CA, as it raises concerns with endpoint security software.
However, by doing this, I encountered several issues due to the ecosystem assuming that nobody changes the cluster DNS name. I eventually found a Stack Overflow answer suggesting that if you do this, you're probably doing it wrong, but to fix it, you can copy the config for your custom domain to cluster.local. I followed this advice, and it fixed the issue.
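The fix ends up being a small change to the CoreDNS Corefile: the kubernetes plugin can answer for more than one zone, so you list cluster.local alongside the custom domain. A sketch with a hypothetical domain:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        # Serve service records for both the custom cluster domain and
        # cluster.local, so tooling that hard-codes cluster.local still resolves.
        kubernetes alrest.example.com cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
```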
In general, I have several names set for clusters. My home lab is all "alrest." In the future, I'm going to have three clusters in the cloud, named Rhadamanthys, Minos, and Aiakos. These names reference a JRPG I like, where the planet was called Alrest and it had three space elevators with those names.
Bart: Now, it sounds like a lot of effort went into this setup. I assume there were a fair amount of lessons learned from all the challenges you faced and overcame. But the work never really ends. How do you think it could be improved?
Xe: The big important part is that I've gotten it to a point where I can just put a deployment, service, persistent volume claim, and ingress into the mix, and it will just work. I put those resources in, and it will create the right DNS things, create certificates, and route things to the public. In general, deploying to my home lab is easier for me than deploying to the cloud, even when I worked for a cloud provider, which is the part that still kind of messes with me.
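To give a sense of what "it will just work" means in practice, here is a hedged sketch of the ingress half, assuming cert-manager is what turns around those Let's Encrypt certificates; the hostname, issuer name, and backend service are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    # Assumes a cert-manager ClusterIssuer named letsencrypt-prod exists.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - my-app.alrest.example.com
      secretName: my-app-tls
  rules:
    - host: my-app.alrest.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```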
A lot of the stuff I'd like to improve is really about ergonomics, making it easier. When I did Kubernetes previously, I used a configuration language called Dhall. Think of it as JSON with functions, imports, and types. There was a Dhall package for Kubernetes, and I used that to make an app.dhall that had some basic configuration things that were exploded out into whatever Kubernetes wants. I would like to have something like that, but at the same time, I also kind of like having a giant folder full of Kubernetes stuff. All of it is managed with Kustomize, and I can just run kubectl apply -k . in that folder, and it will just redo everything in case something gets messed up. It is really great.
I need to have CI automate that deployment, but it's complicated, especially when you have a home lab cluster where the Kubernetes administration port is not exposed to the public. So you need to have some WireGuard set up, some GitHub Actions thing, and then you run into the issue where GitHub Actions runners in Kubernetes have weird permissions requirements that they don't document. A lot of the stuff I really want to improve is just ergonomics, making it simpler. If there was a web UI that didn't suck, I'd probably use that. I have been told to try Argo CD or Flux. My last experience with Argo CD was suffering as a service, so I'm kind of inclined to not use Argo CD.
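One way that CI step could be sketched, assuming the runner installs WireGuard itself and reads the tunnel config and kubeconfig from repository secrets; this is an illustration of the shape of the problem, not Xe's actual pipeline:

```yaml
# .github/workflows/deploy.yaml (hypothetical)
name: deploy-homelab
on:
  push:
    branches: [main]

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Bring up WireGuard to reach the cluster API
        run: |
          sudo apt-get update && sudo apt-get install -y wireguard
          echo "${{ secrets.WG_CONFIG }}" | sudo tee /etc/wireguard/wg0.conf > /dev/null
          sudo wg-quick up wg0

      - name: Apply the manifests with Kustomize
        run: |
          echo "${{ secrets.KUBECONFIG }}" > kubeconfig
          kubectl --kubeconfig kubeconfig apply -k .
```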
Bart: Do you think you'd give Flux a try?
Xe: I'm trying to avoid adding too much complexity because I like that it's simple enough that I don't have to think. Recently, someone on my Patreon Discord asked me, "You moved your home lab to Kubernetes a few months ago. What are your thoughts on the experience using it?" I realized that I haven't had to think about the individual nodes in a while. I just put stuff in there and it just works. That was really nice.
Bart: Now you're very generous with your knowledge and your prolific writing. We started out talking about writing about the things that don't work, putting that out there so people can be aware of it. Based on your experience building this home lab, what advice would you give to others starting their own Kubernetes journey?
Xe: The first bit is to plan for today, not tomorrow, which might exist. Plan for the today that you know exists. In my case, what I wanted to do with my home lab is get three things set up: compute, network, and storage. These three things are the fundamental building blocks of any deployment, of any workload, in anything.
When I started it up, initially, I got compute and network via WireGuard so that I could access the cluster-internal service names with my browser. That was the stopping point at first. Then I tried to figure out persistent replicated storage between the machines. After various challenges involving Talos Linux, pod security profiles, and instructions that didn't work because components had decided to rewrite themselves in Rust, I ended up settling on Longhorn. I was able to make that work, and it continued working after I rebooted all the nodes. The fact that it continued working after rebooting all the nodes is the important part, because that means you don't have to think about it until it becomes a problem.
In general, you're not going to reach absolute perfection. Know your endgame, find a stopping point you can live with, and if you reach a point where you can just do stuff, that's where you stop. That's where you stop modifying things, and that's where you start adding things on the side. When I got it set up initially, I just had Longhorn as storage. Later, I added CSI S3 and the NFS Subdirectory Provisioner, which were things I added on the side and are optional to the deployment. I also added KubeVirt into the mix because I wanted to run virtual machines on the cluster.
Now, I'm having difficulty spinning up new virtual machines on the cluster because Longhorn broke something in an update and didn't inform anyone about it. That's going to be something I have to figure out at some point. I wanted to do some content around using KubeVirt on my home lab and KubeVirt in general, and I'm going to have to figure out how to work around that or find an alternative. I might end up doing nested virtualization in Hyper-V or something, but I'm completely making this all up as I go along.
Bart: Well, the honesty is appreciated. We're getting towards the end. You recently gave a talk about not wanting a booth at conferences, but making real connections with the people who attended the event. In my personal experience, I've been organizing and participating in events for a long time, and I was fascinated to hear that take on things. Would you mind sharing the gist of this with our audience about making memories, not booths?
Xe: When you have a booth with a company at a conference, you get a badge with a Badge Shield tag on it. I don't have one readily available to show, but it typically says "sponsor" or what people might interpret as "paid shill." As a result, interactions become transactional because that's what people are used to. They're accustomed to holding up their badge and getting their badge scanned with a QR code, so marketing can follow up with them about adopting a particular product or service. I understand why these things exist; people need to track their metrics to survive in a capitalist system. However, I don't want to do that. Our strategy focuses on personal interaction, talking to people like humans, and having actual conversations with those who understand the topic. This approach sets us apart, especially since we don't give people items that will likely end up in a landfill. The talk "Guerilla Event Planning at Larger Conferences" will be linked in the show notes. I had to look up the pronunciation of "Guerilla" at least 12 times.
Bart: That's okay. That's a double learning experience. I guess, have you seen any events where you would say, this event gets it right, this is how you do it?
Xe: One of the local meetups in Ottawa is worth attending, in my opinion. Although Ottawa is a G7 capital and claims to be a tech hub, many meetups died out during the COVID-19 pandemic. The AI meetup is an exception, as they make an effort to ensure it's not just paid corporate presentations, but rather a space for actual human contact, networking, and sharing interesting ideas. At the last meetup, someone brought a desktop PC with an impressive number of GPUs, running Stable Diffusion live with a webcam. I was logistically impressed by the creativity involved in fitting three 4090s into a single mid-tower chassis, considering a 4090 is a four-slot card. That's the kind of thing I enjoy. I'm hoping the Montreal AI meetup I'm attending next week will be similar.
Bart: I recall you mentioning that many meetups died out due to COVID, and most meetup organizers would agree that they haven't fully recovered. Getting people to attend events in person is tough, especially with so much content available online. I think this lack of face-to-face interaction is a risk for many of us. On the other hand, larger scale events often focus on serving sponsors, resulting in very transactional interactions that happen quickly. Meetups, however, provide a more close-knit community feel where you can share something, big or small, and no question is too junior or too stupid. It's a very welcoming atmosphere. I lead the local Kubernetes meetup in the north of Spain, and it's wonderful. We don't try to attract 150 people; 15 or 20 attendees is great for us. I definitely agree with you on that. Now, you're very busy and active, so what's next for you? What can we expect next from Xe?
Xe: I've been doing four-hour streams on Fridays on Twitch. I'm going to attempt to dynamically spin up and down instances on a website called Vast.ai, which I've been referring to as "sketchy GPUs as a service" in my notes. It's a bidding-based compute marketplace where you can access GPUs at unrealistically low prices, making me think some of them are either former cryptocurrency people trying to get a return on investment or involved in money laundering. However, it's the only place where you can get access to an RTX 3090 for two cents an hour.
This week, I'll be writing a Kubernetes Operator to automatically provision, start a model, and run inference on an HTTP request. I'll be dealing with the Kubernetes Operator SDK, which is complicated, but I understand why it has to be that way. I have a background in writing idiomatic Go, and the way the operator SDK works is not ideal for me. Nevertheless, it's going to be a fun project. It will involve reverse engineering API calls with the browser inspector and using ChatGPT to generate Go code from curl commands. It's going to be a wild ride.
Bart: That sounds really cool. Great. If people want to follow you or get in touch with you, what's the best way to do it?
Xe: I have a contact page on my blog, Xe's Contact Page. That is at xeiaso.net.
Bart: Perfect. We'll also link in the show notes. Xe, it was wonderful having you as a podcast guest. Thank you for sharing your knowledge, experience, and sense of humor. It's very much appreciated. I hope our paths cross somewhere in the world. And if not, we'd love to have you back as a guest on the podcast. Thank you very much for joining us today.
Xe: Thank you. I'm going to submit a CFP to KubeCon EU. I would be going to KubeCon this year if it wasn't in a state where it's literally illegal for me to go to the bathroom. So, if I get accepted to KubeCon EU, maybe we'll meet up there. Otherwise, who knows? It's a small internet.
Bart: Agreed. Hope to see you in London. Take care.
Xe: See you.