eBPF, Kubernetes operators, and upgrading clusters

Guest:

Lili Cosic

Discover the evolving landscape of Kubernetes through the reflections of Lili Cosic, a seasoned distributed systems engineer at Replicate.

In this interview, Lili will discuss:

The evolution of Kubernetes operators and the strategic shift towards creating them only when necessary to minimize complexity.
The challenges of managing high cardinality in Prometheus metrics.
The critical importance of regularly checking release notes and conducting test upgrades on smaller clusters to smoothly manage Kubernetes workloads and navigate deprecated features.

Transcription

Bart: Who are you? What's your role? And who do you work for?

Lili: Hi, I'm Lili. I'm a distributed systems engineer. I've been in the cloud-native Kubernetes space for over eight years now. The main thing I've been working on is observability, but I've also had to be on call for managing Kubernetes clusters. And most recently, I've been doing AI infrastructure things.

Bart: What are three emerging Kubernetes tools that you are keeping an eye on?

Lili: This KubeCon has really been about AI. I think that is the future, and I think Kubernetes itself needs to do a lot more to natively support running AI workloads. They run differently from a normal workload: we use GPUs and those kinds of things, and I'm painfully aware that the support is not there yet. So that will definitely be one of them. eBPF is another: a couple of years ago, people hadn't even heard of eBPF, and now we've gotten to a point where everyone knows about it and a lot of people talk about it. I think those kinds of technologies, the ones that let you see everything from the Linux system all the way upwards, will definitely be part of the future. The last one is making Kubernetes even more boring, because I think that is the right way: making it easy to operate, easy to manage, and easy to upgrade as well.

Bart: One of our guests, Steven, shared some simple but effective advice on building Kubernetes operators: Keep it simple and use multiple CRDs. Do you have any advice when it comes to operators?

Lili: I've been in and around Kubernetes operators since roughly the beginning, I would say, and I'm really happy with how we got from "Let's build an operator for every single thing" to "Let's only build an operator when it makes sense." So keeping it simple makes perfect sense, because we don't want to add operational overhead to an already complex application. I completely agree.

Bart: Another guest, Matt, shared that most teams store logs and metrics in Kubernetes without considering the implications of the data they collect. Consequently, they end up paying a hefty price for data that is not actually used. What's your advice on ingesting, storing, and querying metrics in Kubernetes?

Lili: I'm most familiar with Prometheus, and there cardinality can get really high. I'm to blame for some of that as well, since I was adding kube-state-metrics and various Kubernetes metrics. But a lot of observability tooling has features that let you omit metrics and drop them even before they get ingested.
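As a minimal sketch of what dropping metrics before ingestion means, the Python below filters scraped samples by metric name before they would be stored. In Prometheus itself this is normally expressed as a `metric_relabel_configs` rule with `action: drop` rather than as code; the sample tuples, the `should_drop` and `filter_samples` helpers, and the choice of patterns are all assumptions made for illustration. Prometheus anchors relabel regexes, which `fullmatch` imitates here.

```python
import re

# Hypothetical drop rules: metric-name patterns nobody queries.
# The specific names are illustrative choices, not recommendations.
DROP_PATTERNS = [
    re.compile(r"kube_pod_labels"),
    re.compile(r"example_debug_.*"),
]


def should_drop(metric_name: str) -> bool:
    """Return True when the whole name matches any drop pattern."""
    return any(p.fullmatch(metric_name) for p in DROP_PATTERNS)


def filter_samples(samples):
    """Keep only samples whose metric name is not matched by a drop rule.

    Each sample is a (metric_name, labels, value) tuple.
    """
    return [s for s in samples if not should_drop(s[0])]


scraped = [
    ("kube_pod_info", {"pod": "api-0"}, 1.0),
    ("kube_pod_labels", {"pod": "api-0", "label_app": "api"}, 1.0),
    ("example_debug_queue_depth", {"queue": "a"}, 7.0),
]

kept = filter_samples(scraped)
print([name for name, _, _ in kept])  # prints ['kube_pod_info']
```

Because the filter runs before storage, the dropped series never count towards cardinality or storage cost, which is the point being made above.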

Lili: I would highly recommend doing that; Prometheus, for example, supports it out of the box.

Bart: Another one of our guests, Pierre, discussed how upgrading a Kubernetes cluster isn't just about the control plane and nodes but also the tools installed in it, such as the Ingress controller, Prometheus, and more. What tools do you use and recommend for upgrading a cluster?

Lili: I think that's a really good point. That has always been a pain point. Even if you have a managed solution for Kubernetes clusters, you still end up with your own workloads and the workloads that you install as well. I tend to rely on the release notes of each project to catch any heads-up they give about deprecated things before upgrading, and to check whether those versions are actually supported. I also make sure that I do a test upgrade on a dev cluster or an integration cluster, start with smaller clusters, and then take it from there.
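One way to turn the release-notes check into something repeatable is to scan your manifests for API versions that the target Kubernetes release removes. The sketch below is only an illustration under stated assumptions: the `REMOVED_IN` table is a small hand-picked subset that should be verified against the release notes for your target version, and the flattened manifest dictionaries and the `removed_apis` helper are hypothetical.

```python
# Illustrative subset of API removals, keyed by (apiVersion, kind).
# Always confirm against the release notes of the version you upgrade to.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
}


def removed_apis(manifests, target_version):
    """Return the manifests whose apiVersion/kind is gone at target_version.

    target_version is a (major, minor) tuple such as (1, 25).
    """
    flagged = []
    for manifest in manifests:
        removed = REMOVED_IN.get((manifest["apiVersion"], manifest["kind"]))
        if removed is not None and target_version >= removed:
            flagged.append(manifest)
    return flagged


# Hypothetical, simplified manifests (real ones keep the name under metadata).
manifests = [
    {"apiVersion": "apps/v1", "kind": "Deployment", "name": "web"},
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "name": "nightly"},
]

for m in removed_apis(manifests, (1, 25)):
    print(f"{m['kind']} {m['name']} uses removed API {m['apiVersion']}")
```

A check like this complements, but does not replace, the test upgrade on a dev or integration cluster described above.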

Bart: Kubernetes is turning 10 years old this year. What should we expect in the next 10 years to come?

Lili: Like I said, making things as boring as possible, making things even easier to operate, and keeping up the level of technical excellence that Kubernetes has had is what I hope to see in the next 10 years.

Bart: How can people get in touch with you?

Lili: I'm on Twitter, or X, whatever it is, as Lili Cosic, so C-O-S-I-C. I'm also on LinkedIn, and I'm trying to get into writing more blog posts, so I'm happy to be encouraged on that as well.

Podcast episodes mentioned in this interview

Upgrading hundreds of Kubernetes clusters
with Pierre Mavro
Foolproof Kubernetes with GKE
with Mathew Duggan
Moving cloud operations to a Kubernetes operator
with Steven Sklar