Reducing compute capacity by 40% on EKS with Bottlerocket and Karpenter

Host:

  • Bart Farrell

Guest:

  • Gazal Gafoor

Follow Gazal's journey as he shares the lessons learned in adopting, rolling out and scaling EKS clusters at Target Australia over seven years.

You will learn:

  1. What is Bottlerocket OS.

  2. How Bottlerocket helps with securing your workloads.

  3. Karpenter as an alternative to the Cluster Autoscaler.

  4. How Karpenter can efficiently provision and deprovision compute for your workloads.

Gazal hinted at a 40% reduction in compute capacity when combining Bottlerocket OS and Karpenter (and 30% lower response times).

Transcription

Bart: Welcome to KubeFM, the podcast where cloud-native folks come to share their knowledge about Kubernetes, how they've leveled up, and what you should be learning next. In today's episode, I'm joined by Gazal Gafoor, who worked for seven years at Target Australia, helping their e-commerce team as they adopted Kubernetes, specifically EKS. In this episode, we'll talk about the benefits and the trade-offs of using technologies like Karpenter and Bottlerocket OS. Let's take a look at what Gazal had to say about his experience. So, Gazal, welcome to KubeFM. Just to get started, a controversial question, and I'm interested in hearing your answer: if you had to take three tools to install on a brand new Kubernetes cluster, which ones would they be and why?

Gazal: All right, thank you, Bart, for having me on your podcast. So yeah, I think it would depend very much on which cloud vendor's managed Kubernetes cluster it is. On EKS, at least, if I had to pick just three, I think it would be the metrics server, which of course is a prerequisite for scaling our applications; a compute provisioner, and I suppose we all used to go with Cluster Autoscaler, but now I swear by Karpenter, which we'll get into; and the third would be something for observability, preferably something based on OpenTelemetry.
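
As an editorial aside: the metrics server matters here because the Horizontal Pod Autoscaler reads CPU and memory metrics from it. Below is a minimal HPA sketch that assumes the metrics server is installed; the target Deployment name and thresholds are hypothetical.

```yaml
# Minimal HorizontalPodAutoscaler sketch; it relies on the metrics server
# being installed to supply CPU metrics. The target Deployment is hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out when average CPU exceeds 70%
```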

Bart: And like you said, we'll dig into Karpenter a little bit later on. Now taking it to a little bit more about your background, what do you do and who do you work for at the moment?

Gazal: So I am a lead developer with REA Group, specifically in their privacy team. And I only started recently. And before that I had a long seven-year stint at Target Australia.

Bart: Fantastic. Seven years is no short period of time. I think we'll dig into that a little bit further down the road. In terms of your experience getting into Cloud Native, how did that start?

Gazal: Yeah, I think it started while I was working for an ad tech slash big data company in India called Flytxt, who I believe are now more into AI and ML. We were looking at ways to better manage our microservices. I think at the time, around 2015 or so, the term microservices wasn't even all that popular. We were looking at solutions like OSGi, but sort of stumbled upon containerization and Docker. For container orchestration, we considered both Apache Mesos and Kubernetes. Kubernetes was super new then, and I believe we went for Apache Mesos at the time, because it had better support for Apache Spark workloads, which we needed to run. And then later, when I started working for Target Australia, we had the opportunity to revisit Kubernetes. And this was, of course, after the container orchestration wars were over.

Bart: Very good. Like you said, it was a contentious time, and there were things influencing those decisions, because I was at a company as well in 2017, 2018, where they were deciding which one to go with. I remember a coworker went to a conference in Amsterdam about Mesos, and there was a lot of passion, a lot of interest. And at that time, like you said, the support being offered for each one was slightly different. In terms of your entryway into Cloud Native, were there any things that were challenging? How did you go about learning the subject of container orchestration, whether it was Kubernetes or Apache Mesos? What was the learning process like there?

Gazal: Most of what I've learned, I think I've learned on the job. Kubernetes was just the right solution to an interesting and challenging technology problem at work. We were looking at both modernizing our applications by means of containerization and also uplifting our CI/CD workflow. And we happened to come across this platform called Jenkins X, which, by the way, does not really include Jenkins, the VM-based pipeline scheduler that we all know, but combines tools like Tekton, other systems inspired by Prow, a GitOps operator, features like preview environments, and all of that. So it provided an opinionated take on how to build, package and deploy applications on Kubernetes. It was something of a crash course, like a shortcut into learning more of the Kubernetes ecosystem. So for anyone trying to learn Kubernetes, I would recommend looking at an opinionated system like that, or, of course, the Kubernetes Slack workspace.

Bart: With that in mind, make community part of the solution, right? If you're coming face to face with these technical challenges, like you said, in a sort of crash course environment, what resources are going to be best? We'll take a look later on at the blog that you wrote in terms of channeling that knowledge and sharing it with others. But in your experience, getting into the Slack and asking questions, that was beneficial?

Gazal: Absolutely. And that's something I guess we can all be a little apprehensive about, sort of doubting yourself and things like that. I would just say, try to lose those inhibitions and be open with the community, and you will be able to reap the rewards, I believe.

Bart: Yeah. With your experience now, some years later on, if you could go back and give any advice to your previous self when you were starting out with Kubernetes, what tips might you share?

Gazal: That's super relevant to what we were just talking about. I think as a younger technologist, like I mentioned, I was more apprehensive about sharing with the wider tech community, and I think perhaps most of us were. We would try to find solutions on Stack Overflow, but not necessarily contribute back or share our experiences in stories or blogs. But over the years, I think I've come to realize the value in sharing our experiences with the community. So yeah, that's the thing that I would definitely encourage my younger self to start doing a little earlier.

Bart: I think it's a great point. And also because, as someone who's been involved in communities and managing communities, trying to create a welcoming and open environment where people feel like there's no such thing as a stupid question, or that it's okay to put things out there, I think about a lot of engineers here. We talk about the rule that most people in communities will just be lurkers, and that it's a minority who are actually driving the action. Do you think that it's imposter syndrome that prevents people from asking, or that they're worried that a co-worker or a boss might be nearby and would be surprised by a question they might put forward? What are the things that you think are blockers or obstacles to getting those questions more out in the open, or, like you said, sharing experiences?

Gazal: I think there's definitely a lot of imposter syndrome. There's a fear that what we might be doing isn't special enough. That's definitely the thing that I've seen the most. So yeah, you've just got to lose it, get on there, and start collaborating with the community.

Bart: Fantastic. And like you said, sharing those experiences is a gift to others, so they can say, oh, I'm not alone, or this person also had a similar problem. It's a really good lesson, and people can decide how they want to do it. But sharing is caring, as the saying goes: be an active part of the community and help drive those things forward. Now to move on to the topic of today, which is a blog that you wrote about the state of EKS clusters from your experience at Target Australia. So you were there for seven years, all right? That must've been a really good time.

Gazal: Yes, it was a really good time, and I learned a lot during my time at Target. The online team was just two agile squads when I started, and it was a great group of very motivated and talented people. We were doing agile really well, you know, tight standups, effective discovery of upcoming initiatives sliced in really nicely. We also had a very mature test/behavior-driven development practice. I'd started getting used to that while contributing to open source repositories a little before, and had tried to introduce that practice in a previous organization. Then I joined Target and they already had very good rigor around it and a very rational emphasis on quality metrics like code coverage. So that was all a really great start. And then from around 2018 or so, I got involved with a lot of the cloud adoption and the related innovation journey at Target. I contributed to Target's way of leveraging the serverless pattern for microservices with AWS Lambda, API Gateway, and associated services. And later, when considering how we could modernize long-running application workloads, I was able to contribute to the adoption of Kubernetes, of course. And after the modernization and migration of the whole platform to AWS, we refined it quite a lot, so we were able to improve the security, observability, and stability of the overall application platform. I have written a few blogs about some of it. And yeah, it was a long but fulfilling tenure.

Bart: Sounds like it. And regarding the blogs, we'll be sharing those in the show notes so folks can take a look afterwards. Comparing 2017 to now, if we're talking about adoption of Kubernetes in any form or flavor, what was it like over the years working with EKS? Things in 2017 were probably not as smooth as they are now. Can you walk us through that in terms of gaining confidence in the tooling that was provided? And also, you were not alone in this endeavor, bringing a team along with these best practices that you mentioned previously. What was that experience like?

Gazal: I think our adoption of Kubernetes itself preceded the actual EKS adoption. I can't recollect the exact timeframe, but I remember that our first Kubernetes clusters on AWS were set up using a tool called kops. At the time, the primary goal was to uplift our CI/CD capability using containerized build pipelines, and our platform of choice was Jenkins X, which I mentioned before. As I said earlier, it's something of a misnomer considering what people think of when they hear Jenkins; it's not something Jenkins X has anymore, so I try to just say JX. JX had tooling, and I think it still does, to help with even the cluster setup. It used to be very CLI-oriented, but now they have Terraform modules. And when EKS became available in the Sydney region, we immediately switched from kops to EKS. We loved that we did not have to worry about availability and scalability of control plane components, or durability of etcd, and all of the pain of just self-hosting the Kubernetes control plane. Obviously, the EKS team have been excellent at refining the offering over the years. I remember things like workload identity, associating Kubernetes service accounts with AWS IAM roles, came pretty soon after. Later they made progress with CSI drivers and improvements to their out-of-the-box CNI plugin. Recently, they even started using eBPF in the CNI agent to implement network policies. So really good progress. And there are controllers, or operators, for other AWS services; they've got a whole project on that called ACK, AWS Controllers for Kubernetes. Their ingress controller is also pretty spectacular now. So yeah, we had quite a journey even after the initial EKS adoption: shifting from ingress-nginx to the AWS Load Balancer Controller, then uplifting our observability solution. Then around 2022, which is the topic of the blog that you found, is when we started using Bottlerocket OS and we also started using Karpenter.

Bart: Fantastic. So with that in mind, there's one major transition as an umbrella, and then beneath that, other transitions taking place. Let's focus on Bottlerocket OS. What was uncomfortable or problematic about Amazon Linux that influenced the decision to move to Bottlerocket? Tell me more about that.

Gazal: An AMI based on Amazon Linux 2, called the EKS-optimized Amazon Linux AMI, was initially the only option for the host OS on EKS worker nodes. I think there was also, even initially, somewhat unofficial support for Ubuntu-based worker nodes. But a general-purpose OS for hosting container workloads did not make much sense to me at all. I remember looking at our vulnerability management solution and thinking, how do we deal with all these vulnerabilities associated with the version of Python or GCC on the host OS? We don't even need Python there; if our applications need Python, they'll just bring that in their container images. So because we were starting from a not-so-great starting point in terms of security, we had to bake our own AMI to address CIS hardening. And then some of those customizations had to be scaled back a bit just to let our worker nodes join the cluster. In technology, it's all about finding the right solutions to problems, and a general-purpose OS was probably not the right fit; Bottlerocket was definitely the right solution there.

Bart: You mentioned vulnerabilities, but apart from that, if we're talking about security and then also issues of performance and cost, what did you find with Bottlerocket OS that you hadn't found previously?

Gazal: Yeah, it was a bit of a compounding thing. With Bottlerocket, what we found, to start off with, is that it's very lean in nature, which inherently makes it more secure than general-purpose OSes. It doesn't have a package manager, instead opting for an update mechanism involving partition flips. There is no SSH server; instead, Bottlerocket hosts have a control container that runs AWS Session Manager. It does not even have interactive shells like the Bourne shell, Bash, or Zsh. Instead, we can interact with the Bottlerocket API from the control container to manage it. So it is quite a paradigm shift. The attack surface is simply much smaller, and that's just the start. It also has an immutable root filesystem backed by dm-verity, a memory-backed filesystem for /etc, so all the config, executables built with hardening flags, and SELinux enabled in enforcing mode. So it's clearly designed with security as a top priority. After we switched over to Bottlerocket, vulnerabilities on the host dropped significantly. We did expect some performance improvements due to how lean Bottlerocket OS is, and some operational efficiencies as a result of that, but the observed improvements definitely exceeded our expectations. We saw as much as a 41% reduction in response times for some customer-facing endpoints. Along with the Bottlerocket change, we also changed the EDR security agent that we had, and that might have contributed to this outcome. And the improvement in performance and the reduction in overheads also resulted in a 40% reduction in compute capacity requirements. So, money in the bank.
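
As an editorial aside for readers who want to try Bottlerocket on EKS: the sketch below shows one common way to opt a node group into the Bottlerocket AMI family with eksctl. The cluster name, region, and instance sizing are illustrative assumptions, and this is not necessarily how Target rolled it out; later in the conversation, it is Karpenter that launches the Bottlerocket nodes.

```yaml
# Minimal eksctl sketch for a Bottlerocket-based managed node group.
# Cluster name, region, and sizing are illustrative assumptions.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster        # hypothetical cluster name
  region: ap-southeast-2    # hypothetical region
managedNodeGroups:
  - name: bottlerocket-nodes
    amiFamily: Bottlerocket # Bottlerocket AMI instead of EKS-optimized Amazon Linux
    instanceType: m5.large
    minSize: 3
    maxSize: 6
    desiredCapacity: 3
```

With no SSH server on the hosts, day-to-day access would go through the control container via AWS Systems Manager Session Manager, as Gazal describes.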

Bart: Absolutely. Money saved is money earned. Another thing is that you obviously got to know Bottlerocket quite well. Is this something that you would recommend to all EKS users, or are there trade-offs, or things that wouldn't necessarily fit every use case?

Gazal: That's a really interesting question. Looking at the state of things on GCP, I believe Google's Container-Optimized OS is the default node, or host, OS in GKE. So I'm hoping that AWS makes Bottlerocket OS the default for hosts in EKS. The challenges I've seen, from our own experience and from what I've heard others in the community mention, have somewhat paradoxically been around security. Some security tooling, like EDR agents and such, requires third-party kernel modules, which are not easy to support on Bottlerocket OS. For Container-Optimized OS, I think Google's official docs just say no, can't do at all. For Bottlerocket, I think I've seen discussions around how you might be able to build your own Bottlerocket OS with some third-party kernel modules, but it's not a path that's easy to pursue. So that could hinder adoption. But thankfully, leaders in the cybersecurity space have started developing tooling that leverages eBPF, which of course enables security and network observability use cases from a process running in user space, without the need for these custom kernel modules.

Bart: Taking the subject of trade-offs further, another trade-off you were looking at was not getting the results that you wanted from Cluster Autoscaler, which led to moving over to Karpenter. Can you tell us briefly what Karpenter is and what some of the factors were that influenced that decision?

Gazal: I think I'll address the point of autoscaling itself. The way that I used to think about it is that we have horizontal and vertical pod autoscalers to scale our containerized workloads themselves, and then we have Cluster Autoscaler to scale the compute infrastructure that powers all of it. With EKS, we had node groups, which are essentially ASGs, auto scaling groups of EC2 instances, in AWS. We would start at some minimum capacity, and Cluster Autoscaler would interact with these ASGs to scale them as the containers required. But Cluster Autoscaler has little impact on provisioning decisions. It's more of just an autoscaler and not really a provisioner. So decisions like which instance type, or which availability zone new nodes should be in, the ephemeral storage they have, the pricing strategy, all of those are pretty much ASG config, as opposed to something Cluster Autoscaler can determine. Karpenter, on the other hand, is a true node provisioning system. It observes resource requirements within the cluster and provisions, and even deprovisions, compute as necessary. It even does interruption handling now for spot capacity. We were using something called AWS Node Termination Handler for interruption handling, but Karpenter has made that obsolete.
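
As an editorial aside, to make that contrast concrete: the sketch below shows roughly what a Karpenter Provisioner looks like, using the v1alpha5 API Gazal refers to (newer Karpenter releases renamed it NodePool and reshuffled some fields). Instance types, zones, and capacity types become flexible requirements on the provisioner rather than fixed ASG configuration. All values here are illustrative assumptions, not Target's setup.

```yaml
# Karpenter Provisioner sketch (v1alpha5 API, later renamed NodePool).
# All values are illustrative assumptions.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Allow spot with on-demand as a fallback, instead of a fixed ASG purchase option
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    # Give Karpenter a menu of instance types and zones to choose from per workload
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge", "c5.xlarge"]
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"]
  limits:
    resources:
      cpu: "200"    # cap the total CPU Karpenter may provision
  providerRef:
    name: default   # points at an AWSNodeTemplate (sketched further below)
```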

Bart: And in a similar way, looking at things like cost, flexibility, and availability, what were some of the things that you found in Karpenter? And is there anything you'd like to mention as well about Fargate?

Gazal: Yeah. So unlike Cluster Autoscaler, Karpenter interacts with EC2 directly, so it can make smart provisioning choices based on the workloads that need the compute. It can identify any AZ affinity that workloads may have. We can have a more deterministic spread of the applications across AZs, as opposed to hoping that spread of the host compute infrastructure translates to spread of the applications. If you have application workloads that need on-demand capacity, we can set that node selection criteria specifically on those workloads, as opposed to having some minimum on-demand base capacity within a node group. We also leveraged the consolidation feature in Karpenter for more effective bin packing. All of these choices resulted in significantly reducing the amount of on-demand capacity we were using, driving down costs. And aside from the availability, flexibility, and cost savings, there was also an uplift in security, to say the least. With ASGs, we used to have to routinely update the whole node group with a newer host OS. With Karpenter, its Provisioner CRD allows for a time-to-live setting. So let's say we set something like 30 days: any node that has been around for 30 days will be gracefully terminated, and any newly created instances just get the latest version of the AMI family that we've chosen. The AMI family we chose was Bottlerocket, so it always gets the latest version of Bottlerocket available. That's the way we do it. Bottlerocket does have an update operator to update Bottlerocket hosts, but we think the Karpenter approach with a time-to-live is much neater. With the update operator, the ASG still spins up Bottlerocket instances as per the version that's already in the ASG spec and then does an in-place update. And oh yeah, you did mention Fargate, thank you for that. For running Karpenter itself, we could use a node group, but we preferred going with a Fargate profile. There's actually even a feature request on the AWS container roadmap to run Karpenter in the control plane. Fingers crossed.
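
Extending the earlier sketch, this is roughly how the behaviors described above could be expressed: consolidation plus a roughly 30-day expiry on the Provisioner, the Bottlerocket AMI family on the AWSNodeTemplate it references, and an on-demand node selector on a workload that should never land on spot. The names, discovery tags, and the example Deployment are all hypothetical.

```yaml
# Sketch only; names, tags, and the workload below are assumptions.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true                  # bin-pack and remove under-utilized nodes
  ttlSecondsUntilExpired: 2592000  # ~30 days; expired nodes come back on the latest AMI
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  amiFamily: Bottlerocket          # new nodes always launch from the latest Bottlerocket AMI
  subnetSelector:
    karpenter.sh/discovery: demo-cluster   # hypothetical discovery tag
  securityGroupSelector:
    karpenter.sh/discovery: demo-cluster
---
# A workload that must stay on on-demand capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                   # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0.0  # hypothetical image
```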

Bart: We'll stay tuned to see how that develops. You obviously got to know Karpenter quite well. Is that something you would recommend to anyone that's using EKS? Would you use it again?

Gazal: Absolutely. Unlike with Bottlerocket, I can't think of anything that hinders adoption of Karpenter. Now, whenever I see someone inquire on the EKS channel about how to solve a problem they encountered with Cluster Autoscaler, like topology awareness, I just suggest checking out Karpenter. Good to keep in mind.

Bart: With all these transitions going on, as much as we're talking about it from a technical perspective, the trade-offs and the advantages and disadvantages of each one, leading a team through a transition like the one you had at Target is, I imagine, no easy task. People are accustomed to using certain technologies, some things might be a little outdated, and you have to get people up to speed. What was the process of upskilling there, in terms of getting everyone on the same page and at the same technical level? How did you go about rolling out the implementation and adoption of these technologies? What advice would you give to other folks in a similar position?

Gazal: We conducted a lot of knowledge sharing within the organization. Some of us attended KubeCon in Sydney pre-pandemic and shared our experience with the rest of the team. We also partnered with our AWS account team to get access to relevant training, and we organized some immersion sessions and such.

Bart: Like I said, it's just always important to keep in mind, as much as we're talking about the technical details, as you said in the very beginning, creating a culture where people are comfortable sharing, like you said, going to the Kubernetes Slack is a great resource. Inside each organization, it's a little bit different depending on the culture, depending on the people who work there and their backgrounds. Getting people to willingly admit, I don't have experience with that. What are the best resources for me to go to? As you mentioned, some of those initiatives are quite helpful. Is there one that worked particularly well for you or your team?

Gazal: Like I said, the Kubernetes Slack workspace itself was an excellent resource.

Bart: I guess I'm also just thinking, for folks that are working at a multinational, about getting people to go out there and talk about a technical challenge that they're facing, not wanting to reveal too much, but at the same time getting the kind of guidance that's going to help them make better decisions. For people approaching those sorts of issues, both internally within their teams as well as externally with the broader community, were there any observations you had along the way, of things that perhaps you didn't expect?

Gazal: Yeah, what I did see was, and I think this is where the whole idea of DevOps versus platform engineering comes into play a little bit, there's a certain sense of, you develop it, you own it, but there are also parts of it that a lot of developers would like to see abstracted. So there were quite a few bits that seemed very infrastructure-y to a lot of developers, which they only wanted to see through an abstract lens. And I think we had the right kind of tooling for it, so that helped in that journey.

Bart: Now, in terms of next steps, what can we expect to see from you next?

Gazal: So as I mentioned earlier, I recently started as a lead developer at REA Group. I'm still quite new here and I'm in awe of the technology practices here. So lots to learn and lots to contribute.

Bart: Good. In terms of the technological practices there, you have a lot of experience with AWS. Are there any plans to incorporate other clouds and their offerings, such as GKE or AKS, along the way?

Gazal: Me personally, and I don't think I can speak for the organization, I don't have any particular bias for any cloud vendor. AWS has been the preferred hyperscaler at the organizations I've been a contributor at. I would like to see more cloud services across most cloud vendors having shared APIs like Kubernetes. That's something to hope for.

Bart: And just to double down on that, what would be the primary benefits of that becoming more open?

Gazal: Yeah, honestly, I feel like the early days of cloud are perhaps similar to the early days of database adoption. All database vendors had their own APIs, you could say, and then the community as a whole had to figure out open standards like SQL, what was that, '97? So yeah, I think Kubernetes is definitely one of those shared, open-standard APIs that tries to bridge some elemental cloud services like compute and storage and networking and things like that. But there's obviously a lot more; I'm hoping for more standardization.

Bart: Very, very good. I think it's a nice way of offering a challenge to the community: how can the best minds get together and provide a framework through which this can become more open, in a similar way to what you said about the database space in the late 90s? Ah, the late 90s. What a great time to be alive. Good. Now, you worked with quite a few different people at Target in Australia getting this off the ground. Is there anyone you'd like to give a shout-out to who you worked with?

Gazal: Okay, to anyone listening who contributed, if I missed mentioning your name, I apologize in advance. Some of the names that come to mind are Adam Cartu, our head of customer technology, who got the ball rolling and always had our back; Renny Samuel, our engineering manager; Paul Thomas, who was my predecessor at Target in that role; Paul Cherry, project manager for that initiative, the primary migration initiative to AWS; and Adil Kumar, our DevOps and cloud engineer.

Bart: Fantastic. Sounds like a great group of people. And like you said, if anybody else out there is listening who wasn't mentioned, just know that you're very much being taken into consideration, along with the knowledge that you shared in your blog. And with that in mind, what's the best way for folks to get in touch with you if they have any questions about what you talked about today?

Gazal: Sure, it should be easy. I have a somewhat unique name. Last I checked, I was the only person with the name Gazal Gafoor based in Melbourne, if you search on LinkedIn. And I am on the Kubernetes Slack workspace, and I just go by my first name, Gazal, there. That's me.

Bart: The beauty of having a unique name is that I'm the only Bart Farrell I know of in Spain so far, so I'm also very easy to find in that regard. Thank you very much for your time today, Gazal. Really enjoyed this. Like we said, we'll be leaving links in the show notes to the blogs that you've written. And like you said, you're very easy to find on the Kubernetes Slack. So thank you very much for your time today.

Gazal: And thank you, Bart, for having me.

Bart: Pleasure. Cheers.