Configuring requests & limits with the HPA at scale
This episode is sponsored by VictoriaMetrics - request a free trial of VictoriaMetrics Enterprise today.
Alexandre Souza, a senior platform engineer at Getir, shares his expertise in managing large-scale environments and configuring requests, limits, and autoscaling.
He explores the challenges of over-provisioning and under-provisioning and discusses strategies for optimizing resource allocation using tools like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA).
You will learn:
How to set appropriate resource requests and limits to balance application performance and cost-efficiency in large-scale Kubernetes environments.
Strategies for implementing and configuring Horizontal Pod Autoscaler (HPA), including scaling policies and behavior management.
The differences between CPU and memory management in Kubernetes and their impact on workload performance.
Techniques for leveraging tools like KubeCost and StormForge to automate resource optimization.
Transcription
Bart: In this episode of KubeFM, we're joined by Alexandre Souza, who will dive deep into the challenge of configuring resource requests and limits in large-scale clusters, offering expert advice on avoiding over-provisioning and under-provisioning. He'll explore how tools like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) can help scale resources efficiently while discussing the differences in managing CPU and memory in Kubernetes. We'll also cover GitOps tools like ArgoCD and FluxCD, and the role of automation in simplifying Kubernetes management. Whether you're scaling clusters or optimizing performance, this episode is definitely for you, so check it out. This episode of KubeFM was sponsored by VictoriaMetrics. Try VictoriaMetrics Enterprise for free today. With our Kubernetes operator, deploying a simple, scalable, open-source monitoring system is effortless. Experience advanced enterprise features such as downsampling, enhanced security, and automated backups. Alexandre, can you tell me about the three emerging Kubernetes tools you're keeping an eye on?
Alex: For me, one of the most important tools I've been keeping an eye on and getting up to date with is Karpenter. Karpenter manages the lifecycle of the nodes in your cluster, helping you scale up and down as your workload demands, and it can also help you save money on infrastructure.
Another tool is Krew. Krew is a plugin manager for kubectl that allows you to install other helpful tools. There are over 260 different plugins, some of which overlap, but some nice ones to keep in mind are kubectx and kubens, which help you switch contexts and namespaces while debugging or troubleshooting, and resource-capacity, which gives you insight into the capacity within the cluster. For example, if you don't have a dedicated monitoring tool installed, resource-capacity can be very helpful.
Two other important tools are related to the same topic: GitOps. One is called FluxCD and the other is ArgoCD. They basically do the same thing, although there are some different features. They're really nice when you want to bring GitOps to your organization.
Additionally, kubent is a Kubernetes troubleshooting tool that inspects the API versions used in your cluster and flags what is going to become deprecated or removed when you migrate Kubernetes from one version to another. It's not one of the kubectl plugins, but you can install it separately.
Bart: Out of the two, do you have a preference, FluxCD versus ArgoCD?
Alex: I prefer FluxCD because it has a Terraform operator, so I don't need to use another tool to deal with Terraform. I know that ArgoCD and FluxCD were initially created for Kubernetes deployment and management of Kubernetes objects, but FluxCD comes with Terraform integration, which allows provisioning of any other infrastructure on a cloud provider. ArgoCD, I'm not sure, I heard they were working on something similar, but so far, it's not available. Currently, I use ArgoCD at my company, Getir, but we have to use another tool to deal with Terraform.
Bart: Could you mention your company? Can you tell me more about what you do and who you work for?
Alex: I work for a company called Getir. Getir is a Turkish company that provides a quick delivery service in Europe, delivering groceries within 20 minutes. In Turkey, they are a larger company, often referred to as a super app. In Turkey, you can rent a car, apply for jobs, do quick grocery shopping, or opt for longer-term deliveries that take a day or two. They also offer Getir Food, a service similar to Just Eat or Uber Eats.
Bart: And what's your role inside the company?
Alex: I'm a senior platform engineer. As platform engineers, we look after the whole infrastructure, managing Kubernetes and the services within Kubernetes. We provide support for more than 1,000 developers and bring automation to them, implementing Terraform or other types of automation to remove manual work from the process. We also do some SRE work, monitoring all the services and providing automation for TeamGarden, onboarding them into our monitoring systems and so on.
Bart: Fantastic. How did you get into Cloud Native?
Alex: Yes, nice question. I got into Cloud Native when I started my journey with Kubernetes. I focused on Kubernetes, studying and preparing for the Kubernetes application developer certification, which is focused on application development. This led me to explore the world of Kubernetes and bring it to the company I was working for at the time. I started getting involved and using other tools to support the company's needs, such as cert-manager to acquire TLS certificates with an A+ grade for free, and Ingress controllers to deal with ingresses and expose our services to the public internet.
Bart: And what were you before cloud native?
Alex: Before Cloud Native, I was mostly a software engineer, focusing on core development, specifically .NET and C#. I was building applications and product-development software for the companies I worked for at the time.
Bart: Now, the Kubernetes ecosystem moves very quickly. What are your go-to resources - blogs, YouTube videos, podcasts? What do you prefer?
Alex: I prefer a bit of everything. I try to get things from Pluralsight, for example, or other platforms where I can find video courses. I also go through the Linux Foundation, taking their certifications or at least their training courses, and I read a lot of articles.
Bart: If you could go back in time and give your previous self any kind of career advice, what would it be?
Alex: I think I would have taught myself to learn a different language at the time. Learning and improving in other languages can open your mind to adapt and change in the current IT landscape we live in. I would go back and tell myself to grasp different languages, get very good at them, and go from there.
Bart: Now, as part of our monthly content discovery, we found this article, "The Challenges of Configuring Kubernetes Resources, Requests and Limits, and HPAs at Scale." The following questions will help us dive into this topic a little bit deeper. Before we get into the crux of today's episode, can you tell us a little bit about the cluster configuration at Getir?
Alex: Yes. In Getir, we have two main environments. We have a development environment that contains about 146 namespaces, which can vary. It's not a very polished environment because teams can add different namespaces, so we don't have much governance there. The workload also varies. Right now, I would say we have about 2,200 plus workloads. Our second biggest environment is production, with about 46 namespaces. Those are the most important. The workloads there are lower, of course, because it's a production environment, so there's not much research or trial and error being done there. We have around 797 workloads, and with replication, that can go to more than 8-9k pods at a specific moment in time.
Bart: And if we're thinking about infrastructure, what are the biggest challenges for infrastructure of this size?
Alex: One of the biggest challenges is to keep the cluster always on to ensure it is fully operational and avoid outages. When I say fully operational, I mean that it has to be with all the services that are running there, not the product development applications, because this is the responsibility of the tribes and squads that tackle specific needs of the business. Instead, it's about ensuring the cluster is there for them to use all the time, and also managing the stakeholders' expectations regarding cost, utilization, and rollout of new features.
Bart: And how do you keep over-provisioning under control?
Alex: Over-provisioning is a very interesting topic, and that's the reason I actually wrote the article. As part of my work at Getir, I had to focus on a cost-reduction initiative. One way to achieve this was to examine our Kubernetes cluster and ensure we make the best use of our resources, mostly memory and CPU. To do this, we have to be conservative and aware of how we consume them and how we request them.
A quick way to achieve this is to ensure your resource requests and limits have conservative values in terms of the actual needs of your application. To learn this, we can use Kubernetes monitoring tools to identify how much our application actually uses, and we can try to improve on that. When I refer to requests, I'm mostly talking about the request itself, because the request is what Kubernetes uses to schedule a pod on a specific node. The request is also what effectively drives your bill with your cloud provider, whether it's GCP, Azure, AWS, or another provider. Therefore, we have to be very conservative in how we choose those values.
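For reference, a minimal sketch of what those requests and limits look like in a manifest - the names and numbers below are illustrative placeholders, not recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api          # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
      - name: app
        image: registry.example.com/example-api:1.0.0
        resources:
          requests:
            cpu: "250m"      # what the scheduler uses to place the pod, and what effectively drives node cost
            memory: "256Mi"
          limits:
            cpu: "500m"      # enforced via CFS throttling
            memory: "512Mi"  # exceeding this gets the container OOM-killed
```

Sizing those numbers from observed usage, as Alexandre suggests, is what keeps the gap between the request and real consumption small.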
Bart: And in terms of advice that you would provide, how does someone set resource requests and limits to avoid over-provisioning? Should they underestimate the CPU and memory they need for their workloads? What are the consequences of doing so?
Alex: That's a good question, Bart, but it's a very difficult one, to be honest. It all depends on many variables. In a small organization with a specific number of machines, a known amount of CPU, and a known amount of memory, it would be a little easier to deal with than in a highly distributed, ever-scaling environment. For example, at Getir, we have thousands of pods running concurrently, and on a good day we can have about 500 nodes - 500 VMs - running to support the workload. Managing resources in that environment is much harder than in a smaller, more static environment in terms of node scalability.
Kubernetes schedules your pods based on the resource requests you set. In a small environment, setting lower requests is okay because, as long as you don't set a limit or set a generous limit for that specific resource, your application will work nicely and you'll get the resources you need without over-provisioning. In a large cluster, being very conservative with requests can be painful because you don't control the number of workloads or pods that will be deployed. Other teams are also deploying at the same time, and requesting very little can eventually get your pods evicted because the node a pod was initially scheduled on runs out of memory or CPU, consumed by workloads from other teams.
Under-provisioning also brings consequences, such as degraded application performance and increased downtime. Resource contention can occur, where other workloads fight for the same resource and create a bottleneck, and it can create scaling challenges for your workload. It's difficult to say what the best approach is, but there are ways to overcome these challenges, which we can discuss further.
Bart: And if someone doesn't want to go over or under the provision, let's imagine they find the right value. Isn't that outdated as soon as there is a code change?
Alex: It depends. Not necessarily. Code changes don't necessarily influence how an application performs or how many resources it consumes, but they might. The best example is a bug deployed with a change, causing the system to consume too much memory. This could be due to the runtime's garbage collection not cleaning up long-lived objects, which can happen depending on the specifics of the language. For example, in C#, I could have this problem where long-lived objects get promoted from one generation of memory to another, consuming more memory.
The best approach is to configure a sensible amount of resource requests and limits, being conservative and taking into account the criticality of the application being deployed to the cluster. Then you can apply other things on top of that to help scale resources as needed. One option is a Horizontal Pod Autoscaler (HPA) to scale out. Each pod keeps the same resource allocation, but with a second pod you double the amount of work that can be processed.
Another option is the Vertical Pod Autoscaler (VPA), which adds resources vertically. VPA is not part of core Kubernetes but an external controller you can install in your cluster. It increases the amount of resources, such as memory or CPU, as needed. Additionally, if you're using Kubernetes 1.27 or later, there's a feature called in-place resizing, with a per-container policy that tells Kubernetes how to handle changes in resource requests.
Prior to this feature, any change in resource requests would cause the pod to be recreated, and a new one would start in its place with the new value. With in-place resizing, you can let Kubernetes know that you don't want to restart the pod. The VPA can make the change, for example from 300 megabytes to 400 megabytes, and the pod stays where it is without causing any downtime or slowness due to replacement.
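As a rough sketch of the in-place resize feature he mentions (Kubernetes 1.27+, behind the InPlacePodVerticalScaling feature gate), the per-container resizePolicy looks roughly like this - names and values are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo
spec:
  containers:
  - name: app
    image: registry.example.com/example-api:1.0.0
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired       # CPU can be changed in place, no restart
    - resourceName: memory
      restartPolicy: RestartContainer  # memory changes restart only this container
    resources:
      requests:
        cpu: "250m"
        memory: "300Mi"
      limits:
        memory: "400Mi"
```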
Of course, this takes into consideration how you handle replication, and you can put policies in place to only allow a specific number of pods to go down while being replaced by a new one. It's very tricky, so we have to keep in mind what's best for each specific case and try to understand what works best.
Bart: Is there any way that people can be more proactive with their requests? Would your advice be to change and reevaluate them frequently?
Alex: There are many ways to manage resource allocation, but each has drawbacks. You can do this manually, using tools like KubeCost or other resource-allocation tools to evaluate how much you're actually consuming. Based on that, you can make changes manually, or automatically using an autoscaler such as the Vertical Pod Autoscaler (VPA) with in-place resizing. Alternatively, you can use tools to automate the whole process, so you don't have to worry about doing it yourself or forcing every team to create Horizontal Pod Autoscalers (HPAs) or VPAs.
This is especially important in large organizations, where asking development teams to create HPAs or VPAs might be difficult, and they might not respect or follow the process. Having an automated mechanism is better than trying to enforce a process. Tools like KubeCost Enterprise Edition or StormForge can help with this. They can analyze your loads and make changes on the fly, automatically.
However, there are some drawbacks. For example, if you have ArgoCD in your environment, which is looking after deployment of your applications using the GitOps model, it can cause issues. ArgoCD does reconciliation, so when an automated tool makes changes in the actual cluster, it can cause a drift from the actual GitOps principle, where your source code is the source of truth. This can lead to counterproductive results, where you're trying to solve one problem but creating another.
To mitigate this, you can disable ArgoCD's reconciliation for specific workloads or critical applications, and let an automated mechanism handle the rest.
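One way to do that in ArgoCD is to tell the Application to ignore the resource fields an external right-sizer manages. A hedged sketch, with placeholder names and repository URL:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git
    path: apps/example-api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: example
  # Ignore container resources so an external optimizer can change them
  # without ArgoCD reporting drift or reverting them.
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jqPathExpressions:
    - .spec.template.spec.containers[].resources
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
    - RespectIgnoreDifferences=true  # newer ArgoCD versions; otherwise disable selfHeal for these apps
```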
Bart: Do these tools only help with resource requests and limits?
Alex: KubeCost isn't only focused on that. KubeCost also looks at other things, like disk utilization and GPU utilization, and helps you adjust those. StormForge, at the time I was in contact with them, was mostly focused on CPU and memory. I'm not sure how much it has progressed since then, but there are other tools on the market that can help with the same thing, and they may have different focuses.
Bart: So, with that in mind, are CPU requests and limits different from memory? Are there any key considerations people should keep in mind related to this particular aspect?
Alex: It differs a bit in how Kubernetes deals with requests and limits. CPU requests and memory requests work basically the same way and are used to decide where a specific pod is going to go, that is, which node has enough resources to host it. In terms of limits, however, Kubernetes uses the Completely Fair Scheduler (CFS) to enforce CPU limits, which can lead to CPU throttling. What this means is that if you have a service that spins up 10 threads to process work concurrently, those 10 threads share the allocated CPU. For example, with a 1 CPU limit and 10 busy threads, each thread effectively gets about 0.1 CPU; once the container has used up its CPU quota for the period, all threads are throttled until the next period starts. This can result in performance degradation, affecting user experience, increasing latency, and reducing efficiency.
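To make the throttling arithmetic concrete, here is the limit from that example with the CFS behaviour spelled out in comments (a sketch; the period length is the kernel default):

```yaml
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "1"   # CFS quota: 100ms of CPU time per 100ms period for the whole container;
               # 10 busy threads burn that in roughly 10ms and are then throttled for
               # the remaining ~90ms of the period, which shows up as latency
```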
Bart: So, could someone use KubeCost and StormForge to estimate the right value and amend their deployments? What's not clear perhaps to some folks is just how quickly you can do that. Is there a better way to manage those requests?
Alex: Yes, there are better ways to do it. We touched on them a little at the beginning when we were talking about pod autoscalers, because there is never just one way of doing things. StormForge, for example, is really nice for automating things, but it might not have an ideal response time for your needs. StormForge uses machine learning to analyze metrics and processes them every 15 seconds or so, so it might take a little time to apply a change. Implementing a Horizontal Pod Autoscaler (HPA) in your cluster, or even a Vertical Pod Autoscaler (VPA), can help. The HPA controls how your application scales horizontally, and you can configure it in several different flavors. The basic ones are resource-based, such as memory and CPU. You need to find what works best for you. You can also use custom metrics to scale, but we can deep dive into custom metrics later.
Bart: With that in mind, how do you set the Horizontal Pod Autoscaler (HPA)? What metrics should it track to scale the number of replicas?
Alex: Here we go. You can set up your Horizontal Pod Autoscaler (HPA) in many different ways, as I mentioned previously. When we're talking about a resource-type metric, we point the HPA at a specific resource, such as memory or CPU, and give it a target that tells it when to scale up and down. There are three target types: Utilization, which works on average utilization as a percentage of the request; AverageValue, which is an absolute amount averaged across pods; and a more static measure, Value, which is a single absolute value. You can say, for example, "I want my HPA to scale every time the pod reaches 300 megabytes or 400 megabytes," or whatever works best for your load. You specify that value, and the pods will start scaling accordingly.
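A minimal sketch of an HPA using the resource targets he describes - the workload name and numbers are placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization        # percentage of the pods' memory request
        averageUtilization: 105  # targets above 100% of the request are allowed
  # For an absolute per-pod amount instead, use:
  #   type: AverageValue
  #   averageValue: 400Mi
```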
Of course, there are other things to consider, such as how the pods scale. The HPA has a scaling behavior section, so you can tell it how you want that workload to scale. For example, let's say we have a workload whose memory request is 200 megabytes. You create an HPA for that workload and set the target at 105% average utilization. Once the pod's runtime memory utilization reaches 105% of the request - in our case, 210 megabytes - the workload scales. You'll have a second pod, and the load will be distributed between those two pods. If the average across the pods keeps reaching that 105%, your application will keep scaling.
You can implement scaling behavior policies to help manage that behavior. There are two sections you can configure: how you want to scale up and how you want to scale down - for example, whether you want to allow scale-down at all or prevent it. Coming back to the amount of memory requested, the percentage you ask the HPA to scale at, and how this influences over-provisioning, I think the best approach is to always do some profiling before you deploy. Do some memory and CPU profiling, maybe in a development or test environment, where you can learn how much your application actually consumes. Based on that, you can initially configure the resource requests and limits of the application to a more conservative amount. You can also tell the HPA to scale later: if you're talking about average utilization, the target doesn't have to be below 100% - it can be 100% or more. You can let your application scale a little later, depending on the criticality and quality of the application, and configure HPA scaling behavior to help as much as possible with avoiding over-provisioning and overpaying for resources.
Bart: And how does the Horizontal Pod Autoscaler (HPA) help with overprovisioned replicated workloads?
Alex: If you set a conservative initial request, the Horizontal Pod Autoscaler (HPA) can reduce the amount of overprovisioned resources you request. As I mentioned earlier, you can configure it to scale above 100 percent when using average utilization, or use an average value instead. This can help you land on a better amount.
For example, earlier I used an example with an average utilization target, but let's change this to average value. Say your pods initially consume approximately 300 megabytes when they're running, so it's fine at the start and you're not over-provisioned. You can then configure this average value to be a little higher. Let's say you don't set a limit and your application sits at around 300 megabytes; you can tell the Horizontal Pod Autoscaler (HPA) to scale the application only when it reaches maybe 400 or 500 megabytes.
This way, you're a little bit closer to the actual needs of the application, and you don't over-provision that much. The average value also takes you away from a hard percentage threshold: it's a value per pod, so every time the average consumption per pod reaches that specific amount - say 400 megabytes, for pods that start around 300 megabytes - a new pod is spun up, as long as it respects the maximum replica count of the HPA.
This gives you a little bit more control over when you want the HPA to trigger, and how long you want to let the application grow and work through the memory it's consuming before a new pod is spun up to take part of the load. Of course, you can again configure HPA scaling behavior and scaling policies to fine-tune how you scale up, how you scale down, and how things keep going once new pods are running.
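The AverageValue variant of that example might look like this - again, illustrative figures for a workload that requests around 300 megabytes:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api-by-value
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 400Mi   # add replicas once average memory use per pod passes ~400Mi
```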
Bart: But isn't scaling on memory and CPU always a bit fiddly? Let's consider the following scenario. You have one pod with a memory request of one gigabyte and an average utilization of 70%. So a new pod is created when the actual memory usage is more than 700 megabytes. Imagine you drive some traffic for the pod. The memory reaches 800 megabytes and the Horizontal Pod Autoscaler (HPA) triggers the auto-scaling and adds one more pod. The traffic is now split into two pods instead of one. The actual memory consumption is likely to drop as well. If it drops below 700 megabytes, the second pod will be removed. This will make the single pod receive all the traffic, increase the memory, and trigger the auto-scaler again. How do you solve this?
Alex: I think the best approach for that is to set a scaling behavior policy, because you can tell the Horizontal Pod Autoscaler (HPA) not to scale down right away. There is a stabilization window you can set - up to one hour - so you can tell the HPA not to scale down during that period, or you can give it a percentage of how much it can scale down at a time. In this simple example, where one pod scales to a second one, increasing the amount of time the second one stays alive helps you avoid scaling it down immediately. If you keep adding pods, you can let the HPA know that you only want a percentage of them to be scaled down at a time. Even if the average utilization is well below the target - say you've reached four pods but the utilization is only enough to justify two - those two extra pods will eventually be removed, but with the HPA you can control how quickly that happens by specifying a percentage of pods that can be removed at a time, for example only 10% or 20%. This approach also works for scaling up. So scaling policies in the HPA, specifically the HPA scaling behavior, can help you solve this issue.
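A sketch of the scaling behavior he describes, combining a scale-down stabilization window with percentage policies (values are placeholders, not recommendations):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 1800   # only scale down after 30 minutes of consistently low usage
      policies:
      - type: Percent
        value: 20                        # remove at most 20% of current replicas per minute
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100                       # allow doubling per minute when scaling up
        periodSeconds: 60
```

With a window like this, the second pod in Bart's scenario sticks around long enough for the traffic split to settle instead of flapping.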
Bart: Until now, we've discussed the technicalities of setting resource requests and limits and the Horizontal Pod Autoscaler (HPA). But I imagine you've also faced another significant challenge: adoption. How did you convince the rest of the developers to inspect and use the right requests and limits? What are the incentives, if any?
Alex: Adoption is always a complicated matter. It depends on the sense of urgency created by the teams, stakeholders, and organization, as well as the financial challenges. For example, I faced this when I had to drive a cost-reduction initiative and engage with teams to reduce costs.
I think the best approach is to start small, create a coalition or find advocates who share the same idea as you. They likely face the same problem and want to be more sustainable in how they spend their budgets on cloud resources. Finding these individuals or teams is very important at the beginning. Creating a sense of purpose to tackle this challenge and face the problem as soon as possible is also crucial. Bringing well-documented pages and processes can help teams adopt the change more quickly.
I always start with one team, one tribe, and then expand from there. This approach is more manual, as it involves encouraging developers to use the Horizontal Pod Autoscaler (HPA) or Vertical Pod Autoscaler (VPA) better, or to figure out resource requests and limits. However, if you want to automate this process, the same steps still apply. You need to get the automated tool approved, and some tools come with costs, so you also need to manage the budget. For example, StormForge charges based on the amount of resources it brings down, per CPU, and even KubeCost's Enterprise version has costs. If you want to implement something to address this challenge, you'll face the same adoption and implementation challenges. I think using Kotter's 8-step change model can help a lot with this kind of initiative.
Bart: If you were to do it all over again, what would you do differently?
Alex: I think I would spend more time on automation. We spend a lot of time trying to get people on board with improving and educating them on what Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) are, and how to control them in Kubernetes. Documenting this process is important, but automating it from the beginning might be the best option. You can even create your own controller to do this. If you have enough time, you could install it in your Kubernetes cluster and let it manage things based on your needs. All the cool metric services are there for you to consume, and you can know about your cluster and the amount of resources being consumed. Automation, I think, is the best way to go. I would focus more on developing an automated tool for this.
Bart: Just to wrap up with our last few questions, you're the founder of TechMeOut. What is it, and how did you come up with the idea?
Alex: I'm one of the founders. TechMeOut.io is basically two things, but always with the same goal: to improve the recruitment process. We have a free version that allows users to build their CV, provides highlights on how to improve their CVs, and gives scores on how good their CV is. We also provide AI tools to improve writing. The ultimate outcome is having a rich CV that can be presented to organizations when trying to find a job. Our initial focus was on tech professionals, but we're expanding that.
The second part of TechMeOut is the B2B part, which is for recruiters. Recruiters can manage the whole hiring process, invite candidates, collaborate with them to improve their CVs using our AI tools, and put the best candidates forward to hiring managers in companies. The main goal is to give candidates the best chance to get the job.
This idea was initially conceived by my friend Gabriele, who experienced many issues while hiring people at eBay, such as non-standardized CVs and poor writing quality. We came up with the idea to create tools that can standardize the look and feel of CVs, as well as the quality of writing, to give candidates a better chance.
Bart: What's next for you?
Alex: Well, I guess now it's a focus on Kubernetes for sure, but not only that - focus on continuous delivery, DevOps practices, and trying to bring better ideas on how to improve the development lifecycle process in companies I've been working for, and improve organizational performance as well with those tools.
Bart: And if people want to get in touch with you, what's the best way to do it?
Alex: You can get in touch with me through TechMeOut.io, which is a very simple way to contact me. Just mention that you heard me on this podcast and I'll get back to you. Alternatively, you can reach out through my bio site, bio.site/alex. I use it daily, and you can find my LinkedIn profile and X account there as well.
Bart: Fantastic. Alex, thank you for your time and for sharing your knowledge today. I look forward to crossing paths with you in the future.
Alex: Thank you very much. I hope I answered your questions.
Bart: Very good job. You did a very good job on lots of details. I'm sure our listeners will love it. Thank you.
Alex: Thank you very much.