Mastering Kubernetes: from troubleshooting to simplicity
This interview explores practical strategies for mastering Kubernetes, from learning approaches to production-ready deployments.
In this interview, Billy Thompson, Head of Global DevOps & Platform Engineering at Akamai Technologies, discusses:
Why breaking and fixing systems is the most effective learning approach — creating deliberate problems and solving them leads to deeper understanding than following step-by-step tutorials
The importance of developing a structured troubleshooting methodology that includes creating reproducible environments with tools like Terraform and maintaining detailed Architecture Decision Logs
How maintaining a minimal toolset in Kubernetes clusters (starting with Cert Manager, External DNS, and Prometheus Stack) leads to better stability and innovation than adopting every new trend
Transcription
Bart: First of all, can you tell us who you are, what you do, and where you work?
Billy: My name is Billy Thompson. I'm a DevOps and platform engineering specialist as part of the cloud CTO team. This means I'm involved when the conversation or customer engagement relates to my area of expertise. I work for Akamai in their cloud computing pillar, which came about through the acquisition of Linode.
Bart: Now, what are three emerging Kubernetes tools that you're keeping an eye on?
Billy: I remember when many people started using Crossplane in production and were very happy with it. It has really taken off. I've seen large enterprises rely heavily on it, and people working for universities and other large networks have told me they're interested in Kubernetes products, but only if those products ship Crossplane providers. I think Crossplane will continue to be a rising star.

Another emerging technology is Backstage. Although I'm not a huge fan, the way it has exploded in popularity and the amount of love it receives is interesting. The frustrations people encounter while trying to scale it are very real, yet the community of developers loves it and keeps finding ways to use it and improve it, which is an interesting story. It mirrors the story of Kubernetes itself, which was initially difficult for many people to adopt but ultimately filled a gap in the market.

I'm also curious about Flatcar Container Linux and the resurgence of container-focused operating systems, which reminds me of CoreOS in the early days, as well as the expansion of hardened Kubernetes distributions like Talos. Finally, I'm a fan of Longhorn, specifically because of the ease it brings to ReadWriteMany volumes. However, I haven't seen how it holds up under heavy load and at scale, so I'd like to explore its limits and potentially contribute to it.
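For readers unfamiliar with the ReadWriteMany convenience Billy mentions: a storage class backed by Longhorn can hand out volumes that several pods mount simultaneously. A minimal illustrative claim might look like this (the claim name and size are placeholders; it assumes Longhorn is installed and exposes a `longhorn` StorageClass):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # multiple pods can mount this volume at once
  storageClassName: longhorn   # assumes the default Longhorn StorageClass
  resources:
    requests:
      storage: 10Gi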
Bart: Speaking of which, that brings us to our first question from one of our podcast episodes. On the subject of learning by doing, our guest Matias suggested that the best way to learn Kubernetes is by getting your hands dirty; he even built his own bare-metal Kubernetes cluster in his spare time. What's your strategy for learning new Kubernetes tools and features?
Billy: Definitely, learning by doing is the best approach. For me personally, following a tutorial where everything works as long as you follow the steps just doesn't sink in; I'm not as receptive to that type of learning. Learning by doing, actually getting your hands on it, is more effective. However, I think an even better approach is learning by fixing: create a problem and then solve it. If it works, that's great, but break it and then fix it, because when you're problem-solving you're using more parts of your brain. You're using your problem-solving skills and your creativity, and you're activating your reward centers. Those are the lessons that you never forget, because of the steps you had to take to troubleshoot the problem, the solutions you had to find, the documentation you had to comb through, and the experimentation you had to piece together. The joy of getting to the other side is, for me, the best learning experience. It's a "jump in the shark tank to learn to swim" approach, but it's worth its weight in gold if you can embrace it. Troubleshooting is a blessing because we learn from it. Unfortunately, in production environments we often just want to get something working. But if you're learning something for the sake of learning it, to become more familiar with it, try breaking it, or have someone else break it, and then learn to fix it.
Bart: On the subject of troubleshooting tips and the learning path, our guest spent several weeks troubleshooting an issue with Kubernetes, which required the team to explore the kernel code. He stressed the importance of learning while troubleshooting. Is there any practical advice you've picked up over the years regarding debugging?
Billy: The best advice I can give is to consider troubleshooting as a blessing. One of the best skills you can develop is your troubleshooting methodology, which involves cutting a problem in half, narrowing it down, and filtering out things that can be ruled out to find the solution. Developing this skill puts you in a flexible situation, especially when working with teams. When my team is hiring, we look for this trait. We set up a lab with pre-existing problems and assess how candidates approach solving them. We're not expecting them to pass, but rather to demonstrate their thought process.
If a candidate's approach is too reliant on copying and pasting or simply looking for a quick fix, it may not be sufficient in fluid environments like the cloud or the CNCF ecosystem, where best practices are constantly changing. The ability to diagnose and solve problems quickly is crucial. My team operates on the principle that we don't know everything, and we never will. However, we're confident that we can figure out any problem that comes our way because we've developed a strong troubleshooting methodology.
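The "cut the problem in half" methodology Billy describes is essentially a git-bisect-style binary search: over an ordered history of changes, it finds the one that introduced a failure in O(log n) checks instead of testing every change. A minimal sketch in Python, where the `is_broken` predicate stands in for whatever reproduction test you actually run:

```python
def first_bad_change(changes, is_broken):
    """Binary-search an ordered list of changes for the first one
    that makes the system fail (git-bisect style).

    Assumes the history is monotonic: every change before the
    culprit passes, every change from the culprit onward fails.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is strictly after mid
    return changes[lo]

# Toy example: change #6 introduced the bug, found in ~4 checks.
history = list(range(10))
print(first_bad_change(history, lambda c: c >= 6))  # -> 6
```

The same halving discipline applies outside version history: disable half the components, test, and recurse into whichever half still reproduces the failure.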
Some things that I find helpful include keeping an Architecture Decision Log, where I record my architectural decisions and troubleshooting steps. This log can be as simple as a text file, and it helps me reevaluate my process and document my findings for future reference. I also try to write my documentation in a way that's easy for others to understand, as if I were teaching someone else. This approach helps me to create better, cleaner, and more readable documentation.
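An Architecture Decision Log really can be as simple as the text file Billy describes. As a hypothetical illustration (the file name and entry fields are my own, not a format Billy prescribes), a few lines of Python keep entries timestamped and consistent:

```python
from datetime import datetime, timezone

def log_decision(path, title, context, decision):
    """Append one timestamped entry to a plain-text decision log."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"## {stamp} - {title}\n"
        f"Context:  {context}\n"
        f"Decision: {decision}\n\n"
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)

# Example entry (names and wording are illustrative)
log_decision(
    "decisions.log",
    "Use External DNS",
    "Manual DNS records drifted from the cluster state.",
    "Install External DNS and manage records from Ingress annotations.",
)
```

Because the log is append-only, it doubles as the troubleshooting trail Billy mentions: you can reread it to reevaluate your own process.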
Another common step I take is to try to cut the problem in half and narrow it down as much as possible. I also consider how easily I can replicate the problem. Being able to replicate the problem makes a significant difference, especially when working with limited information. If I can provision an environment to test my theories and solutions, it's much easier to troubleshoot. I use tools like Terraform, Ansible, or Pulumi to quickly spin up and tear down environments, which helps me to test my theories without having to go through a lengthy process of walking backwards and trying to undo changes. If it takes me a few hours to set up this environment, it's worth it if it helps me solve the problem more efficiently in the long run.
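As a sketch of what such a disposable environment can look like in Terraform, here is a minimal configuration for a throwaway Linode Kubernetes Engine (LKE) cluster; the label, region, node type, and Kubernetes version are placeholders to adjust for your account:

```hcl
terraform {
  required_providers {
    linode = {
      source = "linode/linode"
    }
  }
}

# Throwaway cluster for reproducing an issue; destroy it when done.
resource "linode_lke_cluster" "debug" {
  label       = "debug-repro"    # hypothetical name
  k8s_version = "1.28"           # adjust to a currently supported version
  region      = "us-east"

  pool {
    type  = "g6-standard-2"
    count = 3
  }
}
```

`terraform apply` brings the environment up and `terraform destroy` removes it, which is exactly what makes testing a theory cheap compared with walking back changes in a long-lived cluster.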
Bart: On the subject of using boring, simple tech, our guest Billy Thompson prefers using fewer abstractions in his Kubernetes cluster to limit the cognitive load of his tooling. Despite that effort, he and his team hit a somewhat unusual incident with Pod Topology Spread Constraints. What is your approach to installing and maintaining tools in your Kubernetes cluster: a strict diet or all-you-can-eat?
Billy: My default approach is to follow a strict diet, especially with all the moving parts of Kubernetes. The all-you-can-eat approach is not the best one, in my opinion. A strict diet is essential because stability and the basics matter: if you can't master the basics, you can't master the hard stuff either. Most cloud-based workloads rely on basics such as VMs, compute instances, storage, and a networking overlay. Likewise, when provisioning a Kubernetes cluster, you need the basics: Cert Manager, External DNS, and a monitoring layer such as the Prometheus Stack.
I start with those essentials. Once I have Kube State Metrics and monitoring in place and my ingress is working, I know the cluster is healthy, and I can proceed to add something like Argo CD and work towards a GitOps strategy. Keeping it simple and mastering the basics is key to stability.
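To make the "essentials first" idea concrete, the first manifest that typically follows a Cert Manager install is a ClusterIssuer pointing at Let's Encrypt. This is a generic sketch, not Billy's own configuration; the email address and ingress class are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Let's Encrypt production ACME endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # secret for the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx                    # placeholder ingress class
```

With this in place, annotating an Ingress is enough to get certificates issued automatically, which is the kind of boring, stable base Billy advocates building on.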
Some of the most successful companies I've worked with have built incredibly innovative things, and they stick to the basics as much as possible. This approach reduces technical debt and the amount of complexity to keep track of, enabling them to foster more innovation. With a stable base, you can iterate, try new things, and add on as needed.
A common pitfall is companies taking on too much too quickly because something is a hot trend. For example, multi-cloud is a complex topic, and jumping into it without mastering the basics can be challenging. Many companies have been talking about doing it for years and still haven't had a successful landing with it. The key is to start small and move forward incrementally.
It's no different from decomposing a monolithic application into microservices: start simple, chop it into a few components, and expand from there. This approach lets you iterate and explore in a digestible way. Going overly complex almost always hurts. The same is true of platform engineering: companies can spend years trying to get it right and still struggle, versus starting with basic golden-path templates and growing from there.
I've been exploring Pulumi lately, and I've become a fan of it because it offers flexibility and the ability to use your programming language of choice. I also like that it allows for test-driven development methodology for infrastructure. Another area I want to explore is distributed state, particularly in Kubernetes environments and multi-region clusters. I'm interested in finding the simplest path forward using pure open-source cloud infrastructure primitives to maintain consistent state in a geographically distributed environment. There are many possibilities, including using NATS or Cassandra, and I think this is a topic worth building out and producing content on in the next year.
Bart: If people want to get in touch with you, what's the best way to do that?
Billy: Typically, I'm only on LinkedIn for social media. I guess LinkedIn qualifies as social media, and it's the only place I tend to find myself. I never had a Twitter or Facebook, because I had a MySpace way back when it was popular. When MySpace stopped being popular, I wasn't really in front of computers at that point in my life. So, social media hasn't really taken over for me, except for LinkedIn.