Kubernetes needs a Long Term Support (LTS) release plan


Host:

  • Bart Farrell

Guest:

  • Mathew Duggan

This episode is sponsored by Learnk8s — expert Kubernetes training for your team

With the rapid pace of the cloud-native ecosystem, staying current with Kubernetes updates and managing upgrades becomes a daunting task for many organizations.

In this KubeFM episode, Mat discusses the necessity of long-term support for Kubernetes and explores the intricacies of managing Kubernetes upgrades in a fast-evolving landscape.

You will learn:

  • The importance of long-term support (LTS) for Kubernetes and how it can alleviate the challenges associated with the platform's rapid release cycles.

  • Strategies for managing Kubernetes upgrades, including insights into the release cycle and the potential pitfalls of the upgrading process.

  • The role of managed services and semi-automatic upgrades in simplifying Kubernetes maintenance for organizations, especially in cost optimization and resource constraints.

  • The implications of charging for support of older Kubernetes versions and the potential for a community-based approach to navigating the complexities of Kubernetes upgrades.

Transcription

Bart: Sometimes it's not enough to have a guest for just one episode; you have to bring them back for a second. In this episode, we're going to be with Mat Duggan for the second time, and we'll be speaking about long-term support for Kubernetes. After all, Kubernetes is turning 10 years old this year, so people have been working with it for quite some time. Think about all the things that go into upgrading to new versions, and some of the challenges that can arise around pricing for older versions. Nobody wants to make mistakes with all the new stuff that's coming out all the time with the very fast release cycles of Kubernetes, so it's really important to keep these things in mind when it comes to long-term support. There are also elements related to managed services and upgrade processes, all the kinds of things that Mat will be walking you through in this episode. There's also a content warning, because we did spend some time talking about conspiracy theories, and I may or may not have put on a tinfoil hat. This episode is also sponsored by Learnk8s. Learnk8s is a training organization that helps engineers all over the world level up their Kubernetes skills. From beginners to advanced, there are courses online and in person, and students who take the courses have access to the materials for the rest of their lives. Let's take a look at the episode. The cloud-native ecosystem moves really, really quickly. How do you stay on top of it? How do you keep up? Is it blogs? Is it podcasts? What do you do?

Mat: I'm a big fan of a couple of different things. Obviously, I like Lobsters, which is more programming-centric, as opposed to Hacker News. I love KubeWeekly. I think it's an amazing resource for discovering this stuff. And then, yeah, I mean, it's a lot of just RSS feeds. A lot of the cloud providers are really good about listing the new advancements. I also try to go to all the conferences online that I can. I find that the conferences are probably my number one way of staying plugged in and figuring out what's going on.

Bart: And if you could go back in time and give one career tip to your younger self, what would it be?

Mat: I think I would go with being more active in the open source community. I was very apprehensive about that sort of thing when I was a younger programmer, feeling like I wasn't really up to snuff for it, and I think that was a huge mistake. It would have been super beneficial to younger me to see how software is made at a larger scale and to get an idea of how those things work, especially team management. I think open source software is such a unique learning environment for younger technology enthusiasts and engineers.

Bart: That's a really good point. A lot of times, I find that people can get a little bit overwhelmed by how immense the open source community is. It might feel like they don't have something to offer necessarily. What would you say to that for people that aren't sure about what's the best first step to take?

Mat: Sure. So I actually find there's a lot of resources. I think there's a website called My First Issue, which sort of outlines good first steps. But the first thing I always try to do, if I'm looking at a new open source project, is start with documentation. There's almost always room for improvement there. It's usually low-hanging fruit: explaining something new, going into more detail, or producing a few diagrams. But it gets you plugged into how that team works, who the core maintainers are, how they expect things to be presented to them, and what their decision-making process looks like. I also really love, for people who are a little bit overwhelmed, any sort of Linux packaging. It sounds really complicated, but really what you're doing is taking someone else's code, putting it into a package, versioning it, signing it with a GPG key, and pushing it up. It's process-intensive and very important, but people can guide you through the steps one by one, and you're not ultimately producing the software yourself. So I think it's a good place to jump in if you just want to get into the ecosystem but aren't necessarily ready to start writing a whole application.

Bart: In a previous episode, we talked about your experience at GKE. In this episode of KubeFM, we want to focus on an article that you wrote about why Kubernetes needs an LTS. To go a little bit further on that. So in the previous episode of KubeFM, you mentioned that most startups you've seen and talked to have an existing cluster and are afraid to touch it out of fear of breaking it. How did we get to this point? And is Kubernetes really that fragile?

Mat: Yeah, so it's a great question. So I think the place where I'd like to start is this idea of Kubernetes expertise still being a rare commodity. It's still hard to find people who are good at operating in this ecosystem. And if we're honest with ourselves as an industry, it's because there's a lot of layers of abstraction. You have the physical layer, you have the virtual machines, you have the cloud provider, you have Kubernetes, and then you have network overlays and service meshes. So these things become really complicated and hard to diagnose. And so I don't think that the underlying cause here is that Kubernetes is fragile. In fact, I think the project has made great strides in making it more reliable to upgrade than ever. So I think the core maintainers have really tried to get ahead of this problem. But it's in part because of its own success, right, that we just have this massive ecosystem of third-party integrations and applications that run on Kubernetes now. And it's tricky for teams, especially teams that are maybe not as plugged into the ecosystem, to figure out like, is it safe to upgrade? How can I upgrade? What are my dependencies and what are the checklists that I need to go through?

Bart: I suppose with that too, in terms of the ecosystem, the speed at which it moves, that like you said, some teams aren't necessarily familiar with that. Perhaps their experience working with open source in general might be limited, and with Kubernetes even more so, as you said, it's a scarce commodity. One thing that some people might not be familiar with are release cycles in Kubernetes. It's something that's happening nonstop. Can you explain, can you just walk us through really quickly, what is a Kubernetes release cycle?

Mat: Sure. So Kubernetes as a project has adopted a really aggressive cycle of releasing software. They're on a roughly four-month release cadence: effectively, Kubernetes ships a new minor binary release, a 1.2-something, a few times a year, and each version is on a 14-month cycle of 12 months of maintenance plus two months for upgrades. Kubernetes is coming out at a really fast pace with new software, especially when compared to traditional Linux distributions. This is a really fast pace of upgrading, and you do want to upgrade, because in the ecosystem, third-party maintainers, Istio being a great example, are only maintaining compatibility with the last three versions. This is becoming pretty standard. But that's the aggressive release cycle. What cloud providers have done with the release cycle, as I discussed in the article regarding LTS, is extend this window: some have added two months, some have added 12 months, and they've all adopted their own methodology, I think to get around this problem of people needing to upgrade all the time.
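The arithmetic Mat describes can be sketched in a few lines. This is a minimal illustration assuming the 14-month window he mentions (12 months of maintenance plus two months to upgrade); the real dates are published per release by the Kubernetes release team, so treat the constants as assumptions.

```python
from datetime import date

# Assumed support model, per the 14-month cycle described in the episode:
MAINTENANCE_MONTHS = 12
UPGRADE_MONTHS = 2

def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months (simplified: returns the 1st of the month)."""
    month_index = d.month - 1 + months
    return date(d.year + month_index // 12, month_index % 12 + 1, 1)

def support_window(release: date) -> tuple[date, date]:
    """Return (end of maintenance, end of the extra upgrade window) for a release date."""
    end_of_maintenance = add_months(release, MAINTENANCE_MONTHS)
    end_of_life = add_months(release, MAINTENANCE_MONTHS + UPGRADE_MONTHS)
    return end_of_maintenance, end_of_life
```

For a release cut in April 2024, this model puts end of maintenance in April 2025 and the end of the upgrade window in June 2025, which matches the "14 months total" framing above.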

Bart: And is there anything in particular about the upgrades that might be more problematic? Perhaps later we'll look at some of the remedies and workarounds, but why is the upgrading process so problematic compared to other types of software out there?

Mat: Sure. So Kubernetes is great about versioning its APIs and declaring those APIs so that you can check them for your own internal applications. But it requires either the use of a third-party utility or a pretty good understanding of what your third-party installations, your CNI, your storage, your networking, everything else that you're running inside of it, are expecting to run against. Cloud providers such as GCP have stepped into this space and said, if we see part of your cluster making a call to an outdated API, we won't allow you to upgrade. And there are third-party utilities, such as Pluto, that do the same thing. But you sort of need to know about these utilities in order to get their value. And then the second part is that to do a Kubernetes upgrade safely, you do have to be plugged into the ecosystem. You have to be reading those release notes. You have to understand what your cluster is doing, what software you have installed, what it has as dependencies, and where you might run into problems. I also think that the growth in using cron jobs and stateful deployments has made this much more complicated. When people started moving their databases into Kubernetes, which is fine and supported, it made it even more nerve-wracking for people to think about persistent volumes and detaching and reattaching disks. All these things added up to be pretty scary for people. And again, for a lot of these smaller teams that I talk to, there's just not a great understanding of even what is happening behind the scenes with Kubernetes. What does it mean to upgrade etcd? What does it mean to upgrade the control plane? All those sorts of things.
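The kind of check that Pluto and the cloud providers' pre-upgrade guards perform can be sketched as a toy scanner over parsed manifests. This is not Pluto's actual implementation, just an illustration of the idea; the deprecation map holds only a few well-known removals (the beta Ingress APIs were removed in 1.22, PodSecurityPolicy in 1.25).

```python
# Illustrative subset of apiVersion/kind pairs and the release that removed them.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
}

def _minor(version: str) -> int:
    """'1.22' -> 22."""
    return int(version.split(".")[1])

def find_blockers(manifests: list[dict], target: str) -> list[tuple[str, str]]:
    """Return (apiVersion, kind) pairs that no longer exist at the target version."""
    blockers = []
    for m in manifests:
        key = (m.get("apiVersion"), m.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and _minor(target) >= _minor(removed):
            blockers.append(key)
    return blockers
```

A real tool would additionally walk live cluster objects and Helm releases, but the core logic is this lookup: is anything you run calling an API that the target version no longer serves?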

Bart: Great point. And I think, once again, for teams that could be smaller or also depending on their level of knowledge and expertise with Kubernetes, it can be more or less challenging. Are they really expected to do these things all by themselves when it comes to the upgrades? Or is it something that can become a community-based solution?

Mat: Well, I don't know. I mean, upgrading has been something where it's hard to come up with an open source solution that really fits all use cases. What we see is some automation around teams that allows them to more easily approach these upgrades. But fundamentally speaking, we are still talking about the same process we've been discussing for years. Teams have to decide whether they are going to do an in-place upgrade, account for the version skew allowed between the control plane and the kubelet, and say, okay, we're going to upgrade the node worker groups and handle that as an in-place process. Or are they going to do a blue-green deployment, where there are two clusters running different binary versions and traffic is switched between them? I think you see teams approaching it from all different perspectives, depending on their risk appetite, effectively. But especially in this era of layoffs and cost sensitivity for businesses, it's tricky, because a blue-green deployment means two clusters, and you're paying for two clusters. And if you run into a problem, you'll pay for two clusters in perpetuity until you solve the problem. And then the in-place upgrades are scary because rollbacks are scary, and in Kubernetes, it's not easy to roll back. You can have backup solutions like Velero, and you can go back to a known good state, kind of, but it's not a perfect solution. And anyone who's ever sat there with a bad Kubernetes deployment can say: that was really nerve-wracking, it was really hard to figure out what was going on, I got really spooked. So I think that's sort of like... We have a way forward, but the community hasn't really stepped up with anything that's completely bulletproof. You have technology like Kubespray that is trying to make this simpler and more idiot-proof.
And you also have technologies like Flatcar Linux that say, we'll even remove the node operating system component of it. So you have Kubespray taking care of the whole Kubernetes element, and technology like Flatcar taking care of the node OS. So there is progress happening, but it's still fundamentally the responsibility of that team.
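The skew rule Mat refers to can be captured in a small helper. This is an illustrative sketch: recent Kubernetes releases allow the kubelet to trail the API server by up to three minor versions (older release lines allowed two), and the kubelet must never be newer than the API server, so treat the constant as an assumption to adjust for your release line.

```python
# Assumed maximum kubelet lag behind the API server (three minors as of
# recent Kubernetes releases; older lines allowed two).
MAX_KUBELET_SKEW = 3

def parse_minor(version: str) -> int:
    """'1.29' or 'v1.29.3' -> 29."""
    return int(version.lstrip("v").split(".")[1])

def kubelet_skew_ok(api_server: str, kubelet: str) -> bool:
    """True if the kubelet is no newer than, and not too far behind, the API server."""
    api, kub = parse_minor(api_server), parse_minor(kubelet)
    return kub <= api and api - kub <= MAX_KUBELET_SKEW
```

A check like this is what makes staged in-place upgrades viable: the control plane goes first, and node groups can lag within the allowed window while they are rolled.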

Bart: I like the contextualization that you put out there too: right now, with layoffs and cost optimization or sensitivity, those are environmental factors that are going to affect the current situation for teams. Because it was different six months ago, and it was different two years ago. But if we want to dive deeper into managed services and semi-automatic upgrades, can you tell me more about that?

Mat: Sure. So I think all of the cloud providers have effectively adopted an upgrade system. With EKS, it used to be a little rocky: they were taking care of the control plane, and you were taking care of the nodes. They've simplified and improved that, and Karpenter is one technology that's assisted with this whole process. Azure, similarly, has made upgrading these versions of Kubernetes easier. And then GCP has a whole suite of tools, including Autopilot, that makes it completely hands-off; they handle the whole upgrade process for you. In fact, GCP says, we will upgrade it for you by default unless you specifically tell us not to. They have that level of confidence in that API-catching technology. So the cloud providers have really stepped up to try to make this simpler, and I think for most users, most of the time, that covers a lot of the concerns. But the stories that I hear mostly from people who are running into problems are, again: I have a data center, I'm trying to bridge it with the cloud, and I'm using Kubernetes and a service mesh as that linking technology. But obviously, I have to keep the data center in lockstep, and I don't have unlimited excess capacity in that spot to spin up a whole new cluster and all that jazz. Price sensitivity. And then also just concerns like, before we do an upgrade, we want to upgrade everything we're relying on, we want to upgrade our CI/CD process. And the thing that I hear from startups is effectively: why? It's a huge amount of work. It's effectively one person's part-time job inside the org to keep these things plugged in, to say we're on the latest version, we're upgrading, we're moving forward. And the feedback that I get from the smaller groups and studios is, the cloud provider kind of helps me, but I'm still ultimately assuming the liability: turning on backups, not turning on backups, that sort of thing.
So the cloud providers have done good work in making this easier, and different ones are taking different approaches. I think that we're in a better position than we ever had been before, but you are still ultimately assuming the liability for saying, this thing is going to work and my individual use case is not going to be impacted. Thankfully, I think we're in a place where people are doing less creative stuff inside Kubernetes, which means it's easier, which is great. But a lot of times, whenever you move away from the most common use case, like a stateless web app running in a cluster, and start moving into more complicated, sometimes GPU-based operations for a lot of the machine learning or LLM stuff, the complexity story starts to increase quite quickly.

Bart: Hmm. And with that in mind, with some companies and organizations still running older versions: since you posted the article, support for older versions of EKS is now being charged at a premium. So, do you think it's fair to charge to keep old versions of Kubernetes running?

Mat: So this has been a debate in the Linux community for as long as I've been around. The basic argument here is that backporting security fixes is work. The work has to be done by someone, it has to be tested, and you have to apply these updates. The way that distributions have done it in the past: obviously, Red Hat charges a license, and Debian has the LTS group, which companies effectively fund to keep older versions and distributions around. And that's all great. But the reason why I get a little eyebrow-raising about the price is that this is largely a one-time cost. You're running Amazon Linux, you're running... There are a set number of distributions; they know all the configurations, they probably know what CNI you're using, they know the storage mechanisms, they know your ingress controllers. The reason why I get eyebrow-raising about it is because it's work that they would almost certainly have to do for large customers anyway, in order to book them as large customers. They only have to do the work once, and then they get to charge every cluster, effectively per hour. So it's a great revenue generator, but I don't know if it's completely fair. I think that it does, in some respects, take advantage of the fact that these cloud providers know that some of these organizations simply don't have the expertise in-house, have already made the migration to Kubernetes, and are somewhat captive audiences at this point.

Bart: And in the case of non-managed services, how can the team keep up with the release cycle? It sounds like the process could be automated to a certain extent.

Mat: You can definitely automate. I mean, a lot of the reaction to what I wrote on Hacker News and other sources was this dichotomy between two camps. A bunch of people were saying, "I have automated this to death. Whenever I need a new cluster, I type in one command and my CI/CD system spins up a new cluster. I've never had a problem with this ever. I think it's great, and Kubernetes should change nothing." And then you hear from other people who are like, "My organization or my organizational structure makes it really hard to automate this problem away." Banks, FinOps, air-gapped Kubernetes clusters. I heard a lot from people in the defense sphere or the government contracting sphere, where Kubernetes is starting to expand as a solution, but where they're like, "Our change management process simply doesn't allow for that; we have to plan these things far in advance. They involve a large amount of work, and there's a whole bunch of checklists to go through. So the idea of pulling down and making a substantial change every four months is simply impossible for us." And so it was interesting to see the groupings there. But the short answer is yes. The secret here is you cannot go off script too much. The closer you keep your cluster to stock, the easier this is going to be, because Kubernetes has made the upgrades to the control plane and these primary processes safer than ever. They roll back failures. You can try them again. So if you're in that sort of place, I think that you can safely say, "We adopt upgrades. We have separate node groups. We keep the version skew as minimal as possible. And we're continuously rolling out new node worker groups to test the versions of kubelet on those devices." So I definitely think that organizations have done the automation, but as far as I can tell in talking to people, there is no industry-wide standard for how to write this automation or what controls it.
And I think that's the primary gap right now: you can do it, but everyone writes their own. I don't know if that's a great idea, but that's the place we're at. Every single time I talk to a medium-sized organization that has done this, they've come up with a different way of doing it.
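One invariant any such home-grown automation has to respect is that a control plane steps through minor versions one at a time; it can't jump from 1.26 straight to 1.29. A minimal sketch of planning that path:

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Minor-by-minor upgrade plan, e.g. 1.26 -> 1.29 yields ['1.27', '1.28', '1.29'].

    Control planes cannot skip minor versions, so each intermediate
    version must be applied (and verified) in turn.
    """
    major = current.split(".")[0]
    cur, tgt = int(current.split(".")[1]), int(target.split(".")[1])
    if tgt < cur:
        raise ValueError("Kubernetes control planes cannot be downgraded in place")
    return [f"{major}.{m}" for m in range(cur + 1, tgt + 1)]
```

Real automation would wrap each step with the API-deprecation and skew checks discussed earlier, plus health gates between steps, and this is exactly where every organization currently writes its own glue.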

Bart: Has anyone ever proposed the idea of stopping all Kubernetes releases, like the idea of no upgrades, no problems?

Mat: I mean, I'm sure someone has. I think what I heard a lot was more along the lines of: what if we made it read-only? And I think that's where the conversation was sort of transitioning. Obviously, Kubernetes has state, and that's sort of part of its magic or its process. And this is something that I've heard forever: could we transform Kubernetes into effectively a read-only system, get rid of the etcd concept of it, and store these things more statically? The reality is I don't think that you can. So I don't think that you'll ever be able to create a new version that's the forever version. And also, I don't think you'd want to, because the team is still coming out with such great ideas. So I think it's great that we're still getting those versions and those improvements. But would there be a demand for a forever Kubernetes? Yeah, I think so. I think if you said, yeah, this is a statically compiled, install-once, run-forever, don't-let-it-on-the-internet version, people would be like, yeah, I'm super interested, pitch me more. But I don't know if anyone's really talking about pausing the state of development. Though my personal theory, and I'll put on a tinfoil hat for a moment, is that some of this is impacted by the Google development culture, which is a great development culture, but they move fast and they don't really believe in static versions of things. Chrome was a browser that continuously upgraded. Chrome OS's whole sales pitch was that you're always getting updates. When they ran into upgrade problems with Android, they decoupled the two systems. And I think you see a lot of that development mentality inside of Kubernetes as well, where they say: obviously, perpetual change is the inevitable conclusion of all systems. And I think it's reflected in all of their design as well.

Bart: Okay. Hold that thought. I'm going to go get some tinfoil. Ha ha ha ha ha!

Mat: Perfect. I love it. It looks great.

Bart: Yeah. It's been a while since I wore a tinfoil hat.

Mat: Ah, yeah.

Bart: Okay, good. So now that I've got my tinfoil hat on... you can do that at a later date. But I do think, like you said, a couple of things. Those sensitive use cases that you mentioned earlier, like governments, FinOps, and highly regulated industries: I think some of those might be wishing for a forever Kubernetes, and not having to worry about these upgrades and the security factors that come along with them. But now, if we're looking at Kubernetes, a lot of it will relate to governance. And so if we're talking about LTS, an LTS Kubernetes would probably have to be agreed upon in some sort of SIG. For folks who aren't familiar, a SIG is a special interest group, and there are many of them in the Kubernetes ecosystem. So what do you think about that?

Mat: I think there is an existing working group, and I've followed its progress; I talked briefly about that working group. My personal theory is that there wouldn't be a tremendous amount of pushback about the concept of an LTS release. I don't think the maintainer organization at large would totally block this. The part that would be tricky, and the part that we would all have to accept if we were to move seriously on the concept of an LTS, this idea of taking a binary version of Kubernetes and saying this binary version will receive security patches long past the original window, two years, three years maybe, is that we would have to internalize the idea that there was no way off of that version. To say: at the end of this LTS version, there's no upgrade path out. You have to create a new cluster and migrate your stuff off of it. Now, I think that the community would step in. I think that we would have an opportunity for third-party consultancies and other organizations to create that path off. But I think that would be perhaps one of the largest sticking points in any sort of working group inside of Kubernetes: to say, we've gotten users onto this version, we've committed to backporting security fixes, they've reached the end of life of this version, what do we do with them? So it would have to be a leadership-level agreement to say, that's it. Game over. Make a new cluster. Migrate your stuff. And then treat it in some respects as we've treated Linux distributions for a long time, which is: you can technically upgrade in place, but best practice for a long time has been to say, okay, just spin up new virtual machines, migrate the stuff over, and cut over, as with a version of Red Hat or whatever.

Bart: And the working group, what happened to it, why was it dismantled, and do you think it might be coming back at some point?

Mat: Yeah, so with the working group, I think there was a belief that, due to some of the stability around upgrades, the working group was no longer required. The working group has sort of come back with a very interesting proposal, and I think really kind of a genius idea. It's not exactly what I was envisioning, but they're smart people, so maybe it's an even better idea. Basically, what they're talking about is decoupling the concept of the binary version and the compatibility version. So you would say, okay, I have installed 1.29, but I want to set my compatibility to 1.27. It would allow you a lot more flexibility, in terms of: yes, I want the latest and greatest for security patches on the control plane and the webhook layer and all those things, but at my node level, I want to keep my versioning in the past and slowly tick these off. So that would really assist with troubleshooting. Now, the concern that I have with this approach is that while it's amazing for cloud providers, and amazing for large-scale organizations that have that in-depth understanding of exactly which routes they are hitting, what changed between those versions, what they were depending on, and what that behavior looks like, it sort of goes back to the original point of the article, which is that small and medium-sized teams simply don't have that level of fine-grained understanding of their clusters, necessarily, or of what an upgrade entails for them. And when they encounter a problem, the cause of that problem: is it a Linux-related problem? Is it a container-related problem? Is it a virtualization problem? Is it a monitoring problem? Often these organizations are just like, we got it set up once, it's running right now, that required sort of 100% of the infrastructure team's effort for some period of time, and we just don't want to break it. So I think it's a really good idea.
I think that if we could take that concept and couple it with better tooling, we could get there. But it's not really an LTS version, and it's important that people understand that the proposal being discussed right now is mostly about this idea that you can lock in your compatibility version, which is sort of an LTS thing, but you're still running up against a global clock. And you would have to have a high degree of confidence in this feature-flag concept to say: I have turned off the compatibility versions above, I am on the older ones, and I know that nothing is going to change as a result of that. And it's a ton of work. So it's a big proposal. It's a relatively substantial change to the way that Kubernetes works and functions.
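The decoupling Mat describes can be modeled in a few lines. This is a purely illustrative sketch of the idea, not the actual proposal's mechanics: the feature names below are invented for the example, and the point is simply that behavior is resolved against an emulated (compatibility) version rather than the binary version.

```python
from typing import Optional

# Hypothetical feature table: feature name -> minor version that introduced it.
# These are invented names for illustration, not real Kubernetes feature gates.
FEATURES = {
    "ExampleStableFeature": 27,
    "ExampleNewerFeature": 29,
}

class ApiServer:
    """Toy model: binary version and emulated (compatibility) version are decoupled."""

    def __init__(self, binary_minor: int, emulated_minor: Optional[int] = None):
        if emulated_minor is not None and emulated_minor > binary_minor:
            raise ValueError("cannot emulate a version newer than the binary")
        self.binary_minor = binary_minor
        # Default: behave like the binary version, as today.
        self.emulated_minor = emulated_minor if emulated_minor is not None else binary_minor

    def feature_enabled(self, name: str) -> bool:
        # Behavior follows the emulated version, not the binary version.
        return FEATURES[name] <= self.emulated_minor
```

So a binary at 1.29 emulating 1.27 gets 1.29's security patches while exposing 1.27's behavior, which is exactly the "latest binary, older compatibility" split described above.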

Bart: So in the next release, should we expect some kind of flat earth lizard kingdom to be coming out?

Mat: I mean, again, I think it's a genius idea. I think that the folks involved in the WG LTS (Working Group Long-Term Support) are really trying to solve this problem in a scalable and sustainable way. My suspicion, or my hope, would be that we come up with a simplified package, so we can say: there is this version, and it is called this. Because again, I think this is very amazing if you're very plugged into the project, but if you're an outsider looking in, it doesn't mean anything. Controlling the feature flags of specific API routes is going to be hard for people to conceptualize. But I also think that if this moves forward, it's going to be an amazing opportunity for cloud providers to say: don't think about upgrades anymore. Just tell us what version of Kubernetes you want to stay on, and we're going to manage the binary upgrades completely ourselves. We're always going to be rolling out the latest and greatest, and you just stay on that one version, conceivably for as long as you want. And so I do think that at a high level, especially for managed solutions, this could be an amazing long-term solution. But I don't think that for teams running their own clusters, it's going to be as applicable without investing a lot of time in internal tooling.

Bart: Okay. And to finish up, backtracking to a previous point about one Kubernetes for life. You mentioned the read-only aspect. If Kubernetes were to release an LTS release, then all the other releases might lose their relevance, or teams would just stick to LTS and not bother with the rest. Is that a concern?

Mat: I mean, it's a legitimate concern, but I don't think that most teams would do that, because if you are a Kubernetes-first organization and a cloud-first organization, these upgrades are not that challenging. A lot of organizations have adopted a multi-account, multi-cluster system, where you're talking about hundreds or thousands of clusters, and in those kinds of situations, there's really very little incentive to go in this direction. Those organizations have adopted Kubernetes; they're in great shape. But, as a small example, I found out that the grocery store chain near my house is using Kubernetes, and they have a cluster in the store. And to me, that is the perfect example: you have a bunch of retail locations, and you have hardware on site at these locations. I've heard similar stories for Chick-fil-A. I've heard similar stories for IKEA. And you say, okay, we want Kubernetes to grow, we want them to go in that direction, but we can't constantly be changing these variables on them. It would help if they had more time to sort of sit and bake. But I think that for most internet-centric companies, the upgrade process is not getting harder. Most people are getting more proficient at it, or they're simply saying, I'm just going to wait until my cloud provider forces me to upgrade, and then I will deal with it. I'm seeing both of those strategies in the online community. But I think the place that LTS would fall into originally, and probably for a longer period of time, would be to say: I want to run the same version across my data center as in my cloud. Perhaps I don't intend to stay in my data center forever, but I need a version where I can do a lot of work in the beginning, get everything set up in Kubernetes, and start migrating things over to the cloud, or move my workload primarily to the cloud.
And I don't have to have a series of smart hands in there all the time, dealing with that sort of configuration management and hardware maintenance and things like that.
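To make the upgrade pressure Mat describes concrete: upstream Kubernetes only ships patch releases for the three most recent minor versions (since 1.19, each minor gets roughly 14 months of patches), so a cluster left alone falls out of the support window in about a year. A minimal Python sketch of that window check, purely illustrative, with version strings chosen for the example:

```python
# Illustrative check of the upstream Kubernetes patch-support window.
# Upstream patches only the three most recent minor releases, so a
# cluster more than two minors behind no longer receives fixes.

SUPPORTED_MINORS = 3  # the latest three minor releases get patches


def parse_minor(version: str) -> int:
    """Extract the minor number from a 'major.minor' version string."""
    major, minor = version.split(".")[:2]
    assert major == "1", "Kubernetes has only shipped 1.x releases so far"
    return int(minor)


def minors_behind(cluster: str, latest: str) -> int:
    """How many minor releases the cluster trails the newest one."""
    return parse_minor(latest) - parse_minor(cluster)


def in_support_window(cluster: str, latest: str) -> bool:
    """True if `cluster` still receives upstream patch releases."""
    gap = minors_behind(cluster, latest)
    return 0 <= gap < SUPPORTED_MINORS
```

For example, with 1.29 as the newest release, a 1.27 cluster is still in the window, while a 1.26 cluster has just aged out; an LTS channel would widen exactly this window.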

Bart: I really like that. It goes back to the basic question: what are your objectives, and what are your needs? In the case of Chick-fil-A, I actually got to meet one of their chief architects at an event last year in Holland, and he was speaking about their 3,000 restaurant locations, where Kubernetes is right at the edge. For an organization like that, constantly keeping up with these upgrades could become a bit of a nightmare. However, as you rightfully pointed out, internet-based tech companies are much more accustomed, or getting accustomed, to these release cycles and the upgrade process. With that in mind, though, is upgrading a cluster more of a mental challenge, a technical challenge, or a combination of the two?

Mat: I think it's a combination of things. First and foremost, it's an incentives challenge. If you're plugged into the Kubernetes ecosystem, if you love Kubernetes and you read the release notes, a new release is an opportunity for you. You're saying, "There's a new thing that I can take advantage of," or there's new stability. When I read about Gateway API, I was like, "This is amazing. We're decoupling the concept of HTTP routes and load balancers. What an incredible advancement." I couldn't wait to get on board. But for a lot of the teams that I've worked with in the past and continue to talk to, that's not where the incentive structure lies. They are a small team trying to deliver features to their users and ship software, so upgrades are only perceived as risk, not as benefit. We see this a lot with programming language end-of-life, and with supply chain management on the application side. How many times have you talked to a small company and asked, "What version of Python are you using?" and they say, "We can't upgrade because we have this package dependency and no one's ever gotten around to solving it, so we're just going to keep using the old version forever"? And you say, "That's pretty rough. You should probably get on top of that." That same mentality comes into the cluster space too: "Well, it's running and it seems to be fine. I'm pretty happy with it." So I think it's an incentive structure, but I also think it's a complexity question. A common item on an upgrade checklist is "come up with a disaster recovery plan." This is great advice. You should have a disaster recovery plan. You should say, "I'm going to start upgrading the cluster. If it goes completely sideways, what am I going to do?" And for a lot of these teams, that recovery is an inconceivable situation. Effectively, what we would be saying to them is: run your infrastructure as code, make a new cluster, point your deployments towards the new cluster, cut over your DNS entries, and hope for the best. For a lot of folks, that would effectively be their recovery system. Or you have a backup solution: if you don't have complete confidence in your infrastructure as code, run your backup, restore your backup, and hope for the best. But that again is a huge risk. All applications are relatively fragile. All it takes is one missing environment variable, or one certificate that someone inserted by hand that didn't get caught by the backup solution, and your application isn't coming up, and you might not know why. When I was doing load testing, I even hit a strange issue with a service mesh, and I thought to myself, "I know a lot about these things, and I'm lucky to have a team that supports me." But if you were just sitting there and Linkerd started to die, and the proxy on every single pod stopped accepting traffic, that's a catastrophe, and you may not even know where to start debugging. I think that's where a lot of people are. Kubernetes is so ingrained in their system, it's handling so much, it's so important, that they just can't conceive of a way to recover from a failure. So they get scared and they walk off. It's a combination of incentives, technical expertise, and psychological factors. Kubernetes has also developed a reputation for being complicated, and I think people repeat that perhaps too much, and it psychs people out.
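One constraint that makes the fear Mat describes worse: the Kubernetes control plane only supports upgrading one minor version at a time, so a cluster that has fallen several versions behind must step through every intermediate release. A small Python sketch of what that plan looks like (an illustrative helper, not a real tool):

```python
# Illustrative planner for a control-plane upgrade path. Kubernetes
# supports moving the control plane only one minor version per upgrade,
# so a cluster that has fallen behind steps through each intermediate
# minor release in order.

def upgrade_path(current: str, target: str) -> list[str]:
    """Return the ordered minor versions to pass through.

    e.g. 1.25 -> 1.28 yields ['1.26', '1.27', '1.28'].
    """
    cur = int(current.split(".")[1])
    tgt = int(target.split(".")[1])
    if tgt < cur:
        raise ValueError("downgrades are not supported")
    return [f"1.{m}" for m in range(cur + 1, tgt + 1)]
```

A team three minors behind is therefore not facing one risky upgrade but three in sequence, each with its own deprecations to check, which is exactly why "it's running and seems fine" is such a tempting position.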

Bart: I think it's a very fair point. Going back to the beginning, when we talked about the cloud native community being immense, for some people it feels overwhelming because of that immensity. Kubernetes being the second biggest open source project in history after Linux, among others, contributes to a large ecosystem. As someone with a fair amount of experience: do you think that risk aversion, fear of liability, and all these concerns are something that develops over time in someone's career, or can they be static?

Mat: I'll speak from personal experience: I think there's sort of a weird curve to it. When you start out in this field, you're extremely risk averse because you're really afraid of breaking something, which makes perfect sense. You come into a job, you have imposter syndrome, you say, "I don't know what I'm doing, these things are so complicated, I don't want to break anything." So when you're in a leadership position with people who are just coming on board, you have to push them a little bit and say, "We need you to change some things." There's the famous story of Facebook's "commit to prod on your first day," and a lot of organizations have adopted things like that, I think, to get you into the pool. But there's also this unfortunate period where some people get into trouble. They say, "I've now been cloud native for a while, I feel like I have a pretty good understanding of how these things work," they get a lot more cavalier with their upgrades and their changes, and then they get burned hard. That's where you develop a much more conservative approach to some of these changes. It's hard to come up with a balancing act, because everyone has that heart-stopping moment in their career when they're like, "Oh my God, I've broken something critical and it's not coming back up, everyone's freaking out, and I'm not exactly sure how to fix it." I can think of five off the top of my head for myself, and I'm sure people with more experience than me can come up with 20. So you do end up in a position where you say to yourself, "I'm afraid to make those changes because I ultimately don't want to be the person responsible for letting down my team." There's that psychological element too. But I think your position in your career changes how risk averse you are.
And some of the younger folks, to their credit, are less obsessed with these concepts of stability. They have much more confidence in the cloud providers, they have grown up and matured in this ecosystem, and they are much more likely to embrace continuous change. When I started, making a software release meant runbooks: we had a physical book that we would go through, follow the steps, and push things out to the servers.

Bart: And now that concept is like describing the Stone Age to someone: "I deploy 500 times a day and I've never seen a server." Good points there. It's interesting that you mentioned imposter syndrome affecting younger folks, and also the Facebook example of committing to prod at the very beginning of someone's time there. I don't remember where I saw this, but imposter syndrome is something that's discussed quite frequently nowadays, or certainly more than when I was younger. I heard somewhere, and I really should double-check this, then again, I'm wearing a tinfoil hat, so my sources don't really matter, right? That the higher up someone goes in terms of seniority, the more imposter syndrome affects them, in the sense that you're supposed to be an expert and have to be right about absolutely everything. So there's even more pressure than on, let's say, a junior developer or a junior SRE. Is that something you would agree with?

Mat: Oh, yeah. I definitely remember the first time I came into a job at a startup where I was hired as a senior infrastructure type person, implementing good practices. In my first week, I realized: oh my God, I'm not just a senior person, I'm the most senior. I went into the meeting room and had a little bit of a panic attack, because this was the first time in my life that I was that person. I'd always been a junior sysadmin or a line engineer, in the middle of the organization. I always had people above me, always had someone to turn to and say, "I need help," or "I'm in over my head," or "Can you check my work?" And all of a sudden, I was the most senior person. There is a huge psychological effect, because when you come up in this industry, especially if you're lucky enough to work with the kind of people I've been lucky enough to work with, you are in awe of your peers. These people are so smart, so passionate, so talented, and the idea that I am now the most senior person in this group is difficult for me to fathom. I don't really feel justified. And you have all these people walking up to your desk saying, "Oh my God, we broke something. What should we do?" And you're like, wow, I am now the person I used to rely on for help. So I definitely think there is that element. And I think it's gotten better: more than ever before, people are allowed to say "I don't know" or "I'm not sure." That's more common among DevOps and infrastructure teams than it was 10 years ago. But it's still nerve-wracking. I say to people all the time, the first time you're a tech lead is really scary, and the first time you're a tech lead and something goes wrong is a really eye-opening experience. You're like, oh my God, I'm in the hot seat, I have to come up with a solution.
And I tell people: you'll get better at it, but it never really gets less stressful.

Bart: Yeah, all that sounds fair. And at the same time, I do like the point you mentioned about younger generations having confidence in cloud providers and knowing that there are resources out there; it's a question of troubleshooting and asking the right questions. Still, it's something that's stressful, something people need to be aware of industry-wide. You're someone who's very active. What's next for you? Can we expect another excuse for me to put on a tinfoil hat? What's the next project you've got going on?

Mat: So, I'm working on a few things. I fell down a bit of a rabbit hole trying to come up with something. A common thing presented to people is, "Kubernetes is too complicated," and I wanted to know what the minimum viable product would be. So I fell down this rabbit hole of asking: you have a load balancer, you have virtual machines, you have containers running on those virtual machines; what is a software stack that you could scale from one person? My initial goal was one person to 20 people, and we don't want to grow any more than 20 people, so we're talking about a fixed maximum number of servers. We're really simplifying the problem space. The idea is to come up with a CLI tool that could kick you off. This was inspired by meeting with a whole bunch of startups who said, "We like ECS, we like all these managed services, but we would love to be able to move them from platform to platform." What I started to experiment with, and what we're in the testing phase of now, is: can we make an open source CLI that just does that for all of the small, medium, and large cloud providers? You have a load balancer, an IP address, an SSL certificate, machines, and containers, pointing at some cloud provider. It's very simple and very straightforward. That's my obsession right now. We'll see if I can get it reliable enough to put out into the ecosystem.
But that's the dream right now: to build something where, if you're a small business, you can just kickstart and get started today, not be locked in anywhere, and make those hard decisions down the line. That's the vision and the design goal.

Bart: Wow, sounds great. It sounds like a question a lot of people have brought up over time: how can this really be simplified? Does it have to be so complicated? I'm a small organization with a small team; I don't have the resources that some of these larger companies have. It sounds like a very positive thing to have going on, and we'll definitely be in touch about how that progresses. What's the best way for people to get in touch with you, Matt?

Mat: Absolutely. So my website is mattduggan.com, M-A-T-T-D-U-G-G-A-N.com. And I'm also on Mastodon at MattDevDoug, M-A-T-D-E-V-D-O-U-G. I post there all the time. It's often not of high quality, but if you want to listen to me complain about my commute, feel free to join in.

Bart: Yeah, good. Well, it was really nice having you back on the podcast. The first episode was a hit, and I'm sure this one will be too. We've definitely now established a very high precedent with the tinfoil hat; I think we may have to make that a permanent fixture, and you will get credit. This will be the Matt Duggan moment when the tinfoil hat comes out. But yeah, keep doing what you're doing. I hope to see you at KubeCon Paris. That would be the dream. I will let you know. All right. Very good. Well, that being said, Matt, thanks very much, and we'll be talking to you soon.

Mat: All right. Thank you.