Pod topology spread constraints might not be the best solution

Host:

Bart Farrell

Guest:

Martin Humlund Clausen

This episode is sponsored by Learnk8s — become an expert in Kubernetes

Pod Topology Spread Constraints is a convenient feature to control how pods are spread across your cluster among failure domains such as regions, zones, nodes, etc.

You can also choose the pod distribution (skew), what happens when the constraint is unfulfillable (schedule anyway vs don't) and the interaction with pod affinity and taints.

It's a great and straightforward feature, so what could possibly go wrong?

In this episode of KubeFM, you will follow Martin and his team's journey in discovering and fixing a production incident (on a Friday afternoon) due to a misconfiguration.

You will also learn:

What are Pod Topology Spread Constraints, and how to use them?
How unfulfillable scheduling requirements could lead to un-schedulable pods.
How to detect and alert on unscheduled pods.
How to manage your team during an incident to keep them calm and focused.

Transcription

Bart: Topology constraints in Kubernetes can be tricky and require particular configuration, especially for organizations moving from VM infrastructure. There are things they will have to learn when it comes to infrastructure as code and moving over to a Kubernetes environment, in this case AKS. In this podcast, we'll be speaking to Martin, who is a platform team lead at Umbraco in Denmark. Martin shares his experience working as a team leader, still maintaining access and working on the technical side, but also sharing his experience as a manager and what it's like to be building a team that's going to be tackling these challenges: the things that they faced when migrating over to AKS, as well as the successes and the benefits that they've been getting from that change. This podcast is sponsored by Learnk8s. Learnk8s is a training organization that offers classes taught either publicly or as private instruction. 60% practical, 40% theoretical, very much focusing on the hands-on side of things. Like I said, it can be done either online or in person. You have access to the material for the rest of your life, so you can get all that juicy Kubernetes knowledge that you want to have so much. That being said, let's take a look at the episode. All right, Martin, welcome to the KubeFM podcast. Very nice to have you with us today.

Martin: Thank you very much for having me, Bart.

Bart: It's a pleasure. So if you have a new Kubernetes cluster, which three tools would you install?

Martin: That is a good question. I like to run my setups kind of, what's it called, no-nonsense. So usually we put in, what's it called, Calico, just for some very basic network policies. Nothing fancy in the setup. And then, at least where we work, we just use normal Helm charts, no fancy technologies. For the third thing, it might not be a Kubernetes technology, but I'm a huge fan of Terraform and managing my infrastructure that way around, because it just fits seamlessly together with all the other bits of infrastructure that we have in our setup.

Bart: Good to know. Just out of curiosity with Terraform, have you tried OpenTofu yet?

Martin: Nope. I know that it exists. I wanted to give it a look, but I'm kind of busy in my everyday life, and we have a lot already invested in Terraform. So it's definitely on my radar, but I just haven't had the chance to look at it yet.

Bart: It's all good. Not a problem. In terms of getting to know you a little bit better, can you tell our audience who you are and where you work and what you're doing?

Martin: Yeah. So my name is Martin Clausen. I'm 37 years old and I live in Denmark. And I work for this crazy company called Umbraco. And what is Umbraco? It's one of the most friendly companies out there. We make a CMS, which is actually open source. We have a huge community that helps us build an, I would say, awesome editor experience. And it also very much aligns with how web applications are built on the .NET framework in general. So what is it that I'm doing? Somebody's going to kick my ass, but I'm actually working in the department that tries to make money, which is the cloud department. So we take this box of joy and we spin up CMS instances on the fly with one-click deployments. We do a lot of things on cloud that are not possible natively, or at least not without a lot of work involved. Yeah, so that is kind of what I do. I am a team lead for two teams: something we call the Cloud Core, which is basically core functionality, and then also a platform team, which is how we host and run cloud applications within our teams.

Bart: So it's a fair amount of cloud. How did you get into cloud native? Was it through your work? Was it just a side hobby? How did that happen?

Martin: Oh, that is a good question. It's a rite of passage, I think. So unbeknownst to me, when I first started out my career in software development, I was super fortunate to get introduced to event-driven architecture. And that just kind of stuck with me. And then throughout my years as a software developer, I have just been taking along those, what's it called, tools of the trade. And I think it was back in 2016, I happened to get a gig at a startup company that had some incredibly inspiring people, and that was actually my first introduction to the cloud. Back then, Microsoft had a technology called Service Fabric, which is also a cluster-like technology, but mainly focused on the .NET environment. And from there, once you take a step into that world, you find out that there are just a lot of useful tools online that you don't need to host yourself. And one step leads to the next, and then suddenly you are there.

Bart: Now, the Kubernetes and cloud-native ecosystem moves very quickly. How do you stay up-to-date? How do you stay informed? What are the best resources that work for you?

Martin: That's a good question. Interesting question. I try not to get too worked up about the ups and downs of what is happening out there. The whole Kubernetes movement is way too fast for me. It feels like the front-end frameworks, where something new just pops up every day. So what I try to do is that I'm hanging around on Day 2, where I just kind of get a sense of what is actually moving in the community. And if there's something that I find interesting, then I basically just do a Google search, and I find blog posts, KubeFM, and stuff like that, and just try to get an idea that way around.

Bart: And if you could go back in time and give your, you know, when you were starting as an engineer, if you could go back and give, you know, one piece of advice, career advice, what would it be?

Martin: To get to learn people, or how people work, maybe a bit earlier in my career. It's not always, at least I didn't find it natural in my younger years, but I always kind of seem to figure out that that is actually the key to getting things working. People are amazing. They are frustrating. To be able to just listen to people more, not trying to be so stubborn, and to be a bit more pragmatic. I think that would be my take on that.

Bart: I think it's a great point. And, you know, when we think about the amount of work that goes into teaching people, or how all of us learn, you know, how to speak and how to communicate, a really important part of that is how to listen. And it's not something that's necessarily emphasized. So I think it's a really good point. Good. All right. Now diving into, you know, what we do for our monthly content discovery, we've got an article that you wrote called AKS from the trenches, why zone topology might not be the best solution. So in your job, you know, with managing different teams, do you spend more time writing code or looking after the infrastructure in your day-to-day work?

Martin: Nowadays, it's more writing code. A couple of years back, we actually changed all our infrastructure into infrastructure as code. At that point in time, we already had an established cloud platform, but everything was just point and click. Our business wanted to expand into multiple regions, and that was just not sustainable the way that we were doing it. So we took a couple of months out of the book and redid everything in infrastructure as code. And nowadays, it's surprising how little infrastructure we actually touch during our daily work.

Bart: And before you started using AKS, what kind of setup did you have and what was the cause of the migration?

Martin: So, at the very beginning, from when I started here, we were running like a big monolith, just hosted in Azure, more or less, and a couple of VMs that we, what's it called, manually just made updates to. And back then there was something called a Windows Service, which is basically just a daemon that is running on a normal Windows computer. I mean, that was back then. And then we moved into a more cloud-native setup over a couple of rounds. In the beginning, we wanted to do the whole Kubernetes thing a couple of years ago, or at least when we started out on this journey. But back then, there was also some bureaucracy that needed to be upheld. So we had to first do all our container applications in Windows containers and then host them on a VM. And then I got the team lead role, and suddenly we had some really capable hands who actually knew something about Kubernetes and had tried working with it before. And then it just seemed like a natural step to jump from one platform to the other, like a series of steps in order to get there. The only thing I regret is that we didn't do it faster.

Bart: And in order to do it faster, what would have needed to happen? And also, you talked in the beginning about the importance of people here, getting stakeholders from upper management involved. What are the things that are necessary? Because a lot of the engineers that are listening to our podcast are facing these challenges of: I know this is the right technical choice, but I've got to convince a lot of other people that this is the right one to make. And, as you said, at an appropriate time in order to get things done as quickly as possible.

Martin: Yeah, so I don't think that's the easy part here. Either you have people who would like to work with you or you don't. For me, it was a matter of, what's it called, the normal equation of time over hands plus priority, meaning that, at least in the cloud team, we have never really been that many people. When we chose to do the regional setup via infrastructure as code, that just seemed like a perfect time to introduce Kubernetes as well. Luckily, at that time, we didn't have too many naysayers or too much, what's it called, friction to actually get it done. And what happened was that in the end, people actually didn't care, which is super weird when you just get things right the first time. So it's not only my team; we also have other teams who are relying on us running a couple of their services. And at that point in time, we just had the time to kind of, I guess, hand-hold people, delivering the tools and the continuous deployment pipelines to them so that they actually didn't have to care about the underlying hosting platform on which their applications were running. So removing that, let's say, friction of them having to know the technology, so that we can just take care of it, was actually a huge help for us. They didn't have to do anything to change their normal daily routine, and it's all for the better. And if you ask business people, would you like to have redundancy and resilience, then there's only one right answer, and that is, yes, of course, we would like to have that. So it was a matter of, I guess, timing and as little friction for other teams as possible.

Bart: Right. Now, speaking of friction, is sailing on AKS always smooth?

Martin: Oh, that was leading. Not always. I mean, mostly. We had an incident in June that took us a bit by surprise. But oh boy, if it wasn't the consequence of our own choices back then.

Bart: But I mean, you know, of course, you only plan outages or it's great to have incidents and, you know, in summer at 4 p.m. on a Friday, right? So what exactly happened?

Martin: So maybe I need to tee this up a little. What we allow other teams to do is basically just spin up new services and then push them directly into our cluster. And that means that memory and CPU usage is actually very much a concern to us, especially knowing when people are pushing new software into the cluster. So, let's say, the big issue was that at some point a new service was pushed into our cluster, and little did we know, because we weren't really, what's it called, aware of it, that we were strained on memory already. We saw some intermittent failures in production.

Bart: And just for the sake of context as well, what are pod topology spread constraints and why would you need them? What's their purpose?

Martin: Good question. So a little bit about the topology constraints. A data center is not just one brick that you host your cluster in. More often than not, data centers have actually been divided into multiple zones. So in order to have high resilience, you would like to apply a zone topology to ensure that your applications are running in multiple zones, in multiple instances. And the thing about these zones is that they usually come with their own power supply or network backbone. So in case of a natural catastrophe, or if there's a data center outage, you will actually be saved by this zone topology, or this spread topology, in most cases.
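
To make the "skew" idea behind zone spreading concrete, here is a minimal Python sketch. It is not Kubernetes code: the zone names and pod placements are made up for illustration, and skew is simplified to the difference between the most and least populated zones.

```python
def pods_per_zone(pod_zones, all_zones):
    """Count pods in every zone, including zones with no pods at all."""
    counts = {zone: 0 for zone in all_zones}
    for zone in pod_zones:
        counts[zone] += 1
    return counts


def skew(counts):
    """Simplified skew: pods in the fullest zone minus pods in the emptiest."""
    return max(counts.values()) - min(counts.values())


# Hypothetical zones and placements, for illustration only.
zones = ["zone-1", "zone-2", "zone-3"]
balanced = pods_per_zone(["zone-1", "zone-2", "zone-3"], zones)
lopsided = pods_per_zone(["zone-1", "zone-1", "zone-3"], zones)

print(balanced, "skew:", skew(balanced))
print(lopsided, "skew:", skew(lopsided))
```

A spread constraint with a maximum skew of 1 would accept the first placement and reject the second, which leaves one zone empty while another holds two pods.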

Bart: And how can pod topology spread constraints make a node run out of memory? And if so, how do you fix it?

Martin: Yeah, so these were the issues that we had. We didn't know that that was a problem. So a little about our setup: we had basically told Kubernetes, please run these three API instances in three different zones and, no matter what, don't balance pods into other zones than the ones we have actually set. So at some point, a new service came into our cluster. It's not only APIs that we host, but in this case, we were hosting a worker service, let's say in zone two. And then, the way that we found out that we actually had an issue: do you know that feeling when the hair starts rising on the back of your neck, but you're not entirely sure what it is? So during the day, our team had been, not complaining, but highlighting, not to us but internally within our cloud group, that, hey, something seems a bit off. And what had happened was that some team had pushed in a new service, and that just happened to make a node in our cluster in one zone run out of memory. And because of the way that we had defined our topology constraints, Kubernetes would continuously try to just reschedule new pods onto that node. And to be fair, Kubernetes just did what we had told it, which was basically to keep trying until something was afloat. And then I guess it was a little late on the Friday, you know, that sudden realization that, hey, maybe this is something that is wrong with our hosting platform. And lo and behold, it was. So we, not kind of screwed up, but we were a bit baffled about why this was the behavior that we were seeing, and we weren't really able to correlate what it was.
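
The behavior described here can be sketched with a toy scheduler in Python. This is a simplified model, not the real Kubernetes scheduler: memory is tracked per zone rather than per node, all numbers are invented, and only the `maxSkew` and `whenUnsatisfiable` settings (`DoNotSchedule` or `ScheduleAnyway`) are taken from the actual feature.

```python
def zones_within_skew(pods_per_zone, max_skew=1):
    """Zones where adding one pod keeps (zone count - lowest count) <= max_skew."""
    lowest = min(pods_per_zone.values())
    return [z for z, n in pods_per_zone.items() if n + 1 - lowest <= max_skew]


def schedule(pods_per_zone, free_mib, pod_mib, when_unsatisfiable="DoNotSchedule"):
    """Return the chosen zone, or None if the pod would stay Pending."""
    fits = [z for z in pods_per_zone if free_mib[z] >= pod_mib]
    allowed = [z for z in zones_within_skew(pods_per_zone) if z in fits]
    if allowed:
        return allowed[0]
    if when_unsatisfiable == "ScheduleAnyway" and fits:
        # Toy preference: the least crowded zone that still has room.
        return min(fits, key=lambda z: pods_per_zone[z])
    return None


# Hypothetical state: the pod in zone-2 is gone and zone-2 is out of memory.
pods = {"zone-1": 1, "zone-2": 0, "zone-3": 1}
free = {"zone-1": 2048, "zone-2": 128, "zone-3": 2048}

strict = schedule(pods, free, pod_mib=512)
relaxed = schedule(pods, free, pod_mib=512, when_unsatisfiable="ScheduleAnyway")
print("DoNotSchedule ->", strict)
print("ScheduleAnyway ->", relaxed)
```

In this model the strict setting leaves the pod unscheduled even though two zones have plenty of free memory, because only the exhausted zone satisfies the skew rule. The relaxed setting trades an even spread for a running pod.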

Bart: Yeah, and at any point, was increasing the number of nodes, was that considered?

Martin: Yeah, so actually, that was our first thought: well, how hard can this be? We're just going to increase our node count by one. And what happened was that when we increased the node count by one, we didn't see the pods being balanced out across the cluster, which was our expectation in the beginning. The new node was created in zone number one, but it was actually node two, which was in zone two, that was out of memory. So the topology constraint that we had applied here basically just didn't work, more or less. Because I really just wanted people to go on their weekend, we bumped the number of nodes to six, which made sufficient room in the cluster for applications to run smoothly again. But it was kind of a journey to get there.
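
Why one extra node did not help can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the `Node` class, the node names and the memory figures are all hypothetical, and the skew rule is reduced to a maximum skew of 1 across zones.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    zone: str
    free_mib: int


def can_place(nodes, pods_per_zone, pod_mib, max_skew=1):
    """Return the first node that satisfies both the skew rule and the memory request."""
    lowest = min(pods_per_zone.values())
    for node in nodes:
        within_skew = pods_per_zone[node.zone] + 1 - lowest <= max_skew
        if within_skew and node.free_mib >= pod_mib:
            return node.name
    return None


# Hypothetical cluster: the only node in zone-2 has run out of memory.
nodes = [
    Node("node-1", "zone-1", 1024),
    Node("node-2", "zone-2", 64),
    Node("node-3", "zone-3", 1024),
]
pods_per_zone = {"zone-1": 1, "zone-2": 0, "zone-3": 1}

before = can_place(nodes, pods_per_zone, pod_mib=512)
extra_in_zone_1 = can_place(nodes + [Node("node-4", "zone-1", 4096)], pods_per_zone, pod_mib=512)
extra_in_zone_2 = can_place(nodes + [Node("node-5", "zone-2", 4096)], pods_per_zone, pod_mib=512)

print(before, extra_in_zone_1, extra_in_zone_2)
```

In this model a new node only unblocks the pod if it lands in the starved zone. Adding several nodes at once, as the team did, makes it far more likely that one of them does.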

Bart: Still, everyone was able to leave on Friday in June, which I think is quite important for everybody. Once again, the people factor. Is this something that you think you could have predicted? And what would you have done differently in hindsight?

Martin: To be honest, I don't think we could have predicted this. This is just something you won't see unless you have experience with these topology constraints. You can sit around and talk for ages about the choices that you make, but unless you investigate each of your choices and try to follow each to its logical conclusion, then as a team that is just jumping onto Kubernetes, I don't think that this is something that you would be aware of. At least not without rigorous testing and multiple, let's say, third-party consultants and stuff like that.

Bart: And based on that, do you think that pod topology spread constraints should be used sparingly or just not at all?

Martin: Well, it depends. In the beginning, if you're just getting onboarded to Kubernetes, I would probably suggest just keeping it very simple. Don't jump on the topology constraints in the beginning; get out and ship a working cluster with working software on it. And once you are a bit more confident, I would easily recommend going into these topologies, just to make sure that you have some sort of resilience in place in case of data center breakdowns or something like that. So start out easy and then go into it. That is probably my recommendation.

Bart: And in terms of avoiding situations like this in the future, do you have any plans or best practices in order for those things not to happen?

Martin: Yeah, so two times a week, our teams are running something that is called the DevOps stand-up, where we basically just look at dashboards and application logs and try to see if something is amiss. Back then, we had been running Kubernetes for a year without any challenges. So at that point in time, we decided to include Kubernetes in this DevOps stand-up, just to keep an eye on what it actually did and whether or not the cluster was healthy. Currently, we are looking into setting up some alerts on metrics, so that we are alerted if we see high restart counts on pods or other, let's say, intermittent failures.
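
The kind of alert mentioned here can be sketched as a simple check in Python. The shape of the pod records, the pod names and the thresholds are all assumptions made for illustration; in practice this data would come from whatever monitoring stack the cluster already uses.

```python
def find_alerts(pods, restart_threshold=5, pending_seconds_threshold=300):
    """Flag pods that restart a lot or have been stuck in Pending for too long."""
    alerts = []
    for pod in pods:
        if pod["phase"] == "Pending" and pod["age_seconds"] > pending_seconds_threshold:
            alerts.append(f"{pod['name']}: pending for {pod['age_seconds']}s")
        if pod["restarts"] >= restart_threshold:
            alerts.append(f"{pod['name']}: {pod['restarts']} restarts")
    return alerts


# Hypothetical snapshot of pod state.
snapshot = [
    {"name": "api-0", "phase": "Running", "restarts": 0, "age_seconds": 86400},
    {"name": "api-1", "phase": "Pending", "restarts": 0, "age_seconds": 900},
    {"name": "worker-0", "phase": "Running", "restarts": 12, "age_seconds": 3600},
]

for alert in find_alerts(snapshot):
    print(alert)
```

The two conditions map to the two symptoms from the incident: pods that cannot be scheduled and pods that keep restarting.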

Bart: You know, it's on a Friday. Getting close to 4pm, people want to go home, do fun things in the summer. You know, it stays very light in Denmark until late. So that's something to enjoy. What was the atmosphere like during the incident?

Martin: It was pretty good. Like, I really enjoy my team. And they also, like, do what they do. So I think it was okay. Of course, people are a bit, you know, hesitant to go home. It wasn't like I kept everybody on board. I think we were just two or three people trying to hack our way out of it. And I think that after an hour, maybe one and a half, we were actually done.

Bart: You seem like quite a calm person. Not everybody has that sort of demeanor, way of looking at things. How do you keep calm in those situations? And how do you keep your team calm? Because for some people, it can be very stressful.

Martin: One thing that I really try not to do is panic, because I would rather it be a more creative process. When we are creative and we are curious, we have a lot more headroom for problem solving. And the way that I try to communicate that to my team is that if we see that things are getting, you know, tense, it's okay for us to just step outside for a couple of minutes, take a deep breath, and say, hey, what are our options in this situation? And the more that I can contribute to a calm environment, the better. Like, it rubs off on people. And also, the people that I have on my team, I completely trust them. I know that they are super skilled, and I know that they are doing the very best that they can. So I, I mean, we try to keep things as light as possible throughout the incidents. And of course, there are people externally who are worried about, like, when is it going to be fixed and stuff like that. But we just say that we are trying our best, and then make sure that the creative processes for problem solving are highlighted in due time. And then we usually do a retrospective on it. And if some people have some, let's say, concerns about what if it happens again and stuff like that, we take action on what we can do in order not to get into this situation again. But, you know, sometimes you're just out of luck.

Bart: Absolutely. I think we can all agree. With the element, though, of, like you said, giving creativity and space for that. And then also the other things in there too, including doing a retro. One thing that I find that organizations struggle a lot with is that as much as the energy level might be really high, there may not be a common understanding of what the objective is. How do you go about creating common goals, making sure that everyone's really aligned and on the same page so that they're all working towards the same thing and that there's not one person or in some cases, groups of people that might be going in one direction and others in another?

Martin: Oh, wow. That is a very, very important question. For me, it's about creating the environment. If the environment is right, and we have some basic rules that everybody follows, which is that we listen and we are attentive to everybody's thoughts. As a leader, it's my job to kind of set the destination where we are going, but not necessarily how we are getting there. I can set the boundaries for what parameters we are working within, but if I start yelling at people or telling them exactly how to write that code, the first thing I know is that people will just leave my team. So for me, it's all about, of course, setting clear goals for what it is that we want to achieve, and then setting the environment. And then I'll just try to be supportive of whatever and wherever the team is going to go. And of course, I'm also a software developer. I also have opinions and I also have ideas. But in the end, I really try to take my own ego out of it and let the best ideas win. And it's okay to disagree sometimes. That is actually encouraged. But as long as we are bringing forth all the ideas that work towards that common goal and we are achieving that goal, then I'm a happy camper.

Bart: And another thing to be a happy camper about, I think for a lot of folks out there, and you know, this podcast will be released in early 2024, so it's a time when people might have a lot of goals and say, this year I want to have more work-life balance, or this year, you know, I want to exercise more. In your case, one of the things that you seem to be interested in is precisely that, right? Exercise and having this kind of balance. Can you talk about that? And also the nature of hard work when doing physical activities and how that might relate to work-related activities, as you can't really achieve results just kind of sitting and trying to float by. So tell me about your experience in that regard.

Martin: That is kind of my life philosophy. Hopefully, I could spend hours talking about it. In general, I feel like in order to have a perfectly healthy mind, you also need a healthy body, and those two are just connected, in my opinion. You can't live life in your head. Sometimes you need to get out and do physical activities. Trust me, we are trying to reach the mental barriers each and every day, but that also needs to be true for, I guess, physical activity. Work-life balance is a thing. You're not a workhorse. People are not workhorses. If you truly do what you enjoy in life, like being together with friends and family, that will give you the, I guess, energy and motivation to also excel in your work. And that's just how I kind of think about it. So I try to stop as soon as it's four o'clock. I have had my fair share of working-from-home hours and additional work hours. But at least as I'm getting a bit older, I'm starting to appreciate other things in life as well. And if it's reading or exercising, then I'm just all for it.

Bart: Fantastic. I think it's very important for, it's something that we hear a lot, but I don't think you can hear it too many times and from too many different angles. And so to hear it directly and that it's not to say that everyone has to do ultra marathons or do this or that, but find something that works for you. It could be going for a walk. It could be swimming. It could be a million different things, but something that gets you away from screens, that gives you time to yourself and to be able to disconnect from work-related things or reconnect. Sports can also be a social activity. So I really, really agree with that. So what's next for you? Can we expect an article in the future about EKS from the trenches or what are you going to be working on next?

Martin: Well, we haven't had many issues with Kubernetes for a long time, so I'm not sure that I can do more AKS from the trenches. Hopefully not. So for me, I would actually like to go out and do more live presentations. We have some local communities where I really want to go out and speak to other people and be more involved. And technology-wise, I think there are a lot of things going on on Azure, which is where I have had my home for quite a while, with abstractions on top of Kubernetes, where you don't even need to know that Kubernetes is running down there. So all sorts of new hosting platforms or container services or something like that. So still in the game and still just trying to figure out where this crazy journey ends.

Bart: And if people want to get in touch with you, what's the best way to do it?

Martin: I'm not a social media kind of guy, but I do run a little side blog on dev.to, where you can find me at Martin H.C. I'm also on LinkedIn, where you can find me as well.

Bart: Very good. Well, Martin, thank you so much for your time today. Really enjoyed the conversation, both from the technical perspective and also on the human side. I think you really bring a lot to the game there. So keep up the amazing work and I hope our paths cross soon.

Martin: Thank you, Bart.

Bart: Cheers.

Martin: Cheers.
