Faster EKS Node and Pod Startup
Feb 17, 2026
This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
Kubernetes nodes on EKS can take over a minute to become ready, and pods often wait even longer — but most teams never look into why.
Jan Ludvik, Senior Staff Reliability Engineer at Outreach, shares how he cut node startup from 65 to 45 seconds and reduced P90 pod startup by 30 seconds across ~1,000 nodes — by tackling overlooked defaults and EBS bottlenecks.
In this episode:
Why Kubelet's serial image pull default quietly blocks pod startup, and how parallel pulls fix it
How EBS lazy loading can silently negate image caching in AMIs — and the critical path workaround
A Lambda-based automation that temporarily boosts EBS throughput during startup, then reverts to save cost
The kubelet metrics and logs that expose pod and node startup latency, which most teams never monitor
Every second saved translates to faster scaling, lower AWS bills, and better end-user experience.
Relevant links
Transcription
Bart Farrell: Why do Kubernetes nodes and pods sometimes take far longer to start than they should? Today on KubeFM, we're joined by Jan. In this episode, Jan walks us through a real optimization effort focused on node and pod startup performance across large EKS clusters running at scale. We look at what actually drives startup latency, image pulling behavior, EBS throughput limits, and why defaults like serial image pulls quietly slow clusters down. Jan explains the changes that made a measurable difference, including parallel image pulls, selective image caching and automated throughput tuning to balance performance and cost. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped Kubernetes engineers from all over the world level up through Kubernetes courses. Courses are instructor-led and are 60% practical and 40% theoretical. Students have access to course materials for the rest of their lives. They are given in-person and online to groups as well as individuals. For more information about how you can level up, go to learnkube.com. Now, let's get into the episode with Jan. Welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?
Jan Ludvik: One of them is sidecarless service meshes. I think those are great technologies. The other one is Bottlerocket, which we recently implemented. And I really like the concept and how it works. Like having only what you need for running containers. And the third one is Karpenter. I always liked the auto-scaling and the concepts behind it. So we also started using that recently and that would be the third one.
Bart Farrell: Okay, great. And for people who don't know you, can you tell us about what you do and where you work?
Jan Ludvik: So I work at Outreach. We are a revenue workflow platform aimed at sellers. I've been a site reliability engineer here for four and a half years. And we use mostly AWS, Kubernetes, Terraform, and an observability stack.
Bart Farrell: And how did you get into Cloud Native?
Jan Ludvik: Okay, so I started as a network engineer, and then I moved into consulting for a bit, infrastructure consulting, working in on-premise data centers. I really liked AWS, so I switched to some AWS projects while I was still in consulting, and after that I switched career paths into cloud native engineering.
Bart Farrell: Okay. And you mentioned this a little bit, but what were you before cloud native?
Jan Ludvik: Yep. So I have a degree in telecommunications and I was first a Cisco network engineer, so I worked with all the switches and routers. Then I moved to a very different set of technologies in on-premise data centers, but I didn't stay there long because I really liked the cloud. I chose AWS from all the main competitors and moved to an engineering career path. I worked with AWS first, so it was easier for me to get into, and I started picking up more and more of the additional technologies you need for current infrastructure, with Kubernetes being the main one, plus Terraform and the rest of the usual stack.
Bart Farrell: Okay. And how do you keep up to date with the Kubernetes and Cloud Native ecosystem? Things move very quickly. What resources work best for you?
Jan Ludvik: Well, recently I started using AI or LLMs because I think they work well as better search engines. So I use those trying to find what other people are using. And I'm hoping it'll go through all the resources that I don't have time to go through. But I also try to read information in various Slack channels like Kubernetes Slack, Istio Slack and a couple more. And also, when I'm working on something, I'm trying to read, let's say GitHub issues, or some relevant GitHub issues on some projects that I'm currently using. And sometimes I'm just finding some things that other people are using, and I'm reading on those and I go from there.
Bart Farrell: And if you had to go back in time and give yourself one piece of career advice, what would it be?
Jan Ludvik: Yeah, probably that consulting is not for me. I think it's a very specific world and I found myself much better in engineering.
Bart Farrell: Now, as part of our monthly content discovery, we found an article that you wrote titled Optimizing Node and Pod Startup Performance. So we want to dig into this topic a little bit further. Let's start with some context about Outreach and your infrastructure setup. Can you tell us about the platform and the scale you're operating at?
Jan Ludvik: So we mostly run on AWS. We have around 10 production clusters running somewhere near 1,000 fairly large nodes. We use EKS; we're on 1.32 at the moment. We recently moved to Bottlerocket, as I mentioned, and we use Karpenter as our autoscaling tool. So that's the high level.
Bart Farrell: And many teams struggle with slow Kubernetes node and pod startup times. What specific pain points were you seeing that prompted this optimization effort?
Jan Ludvik: So this was more like my own project; it wasn't driven from the top, from the business. But my thinking was that if our customers need to perform some tasks in our platform, in our product, every second we can shave off our node or pod startup times is beneficial for them, because they will see better performance. That was the main theoretical idea. I wasn't able to find any internal metrics for our platform latency, but this was the main thing. My other reason was that AWS bills these resources per second. So I was thinking: every second we don't need to have these nodes available or online, we don't have to pay for them, so we can save money. And that was the other reason.
Bart Farrell: Now, before diving into improvements, measuring performance is crucial. What metrics and tools did you use to understand your baseline?
Jan Ludvik: I realized I really didn't have any good sources of information about how long our nodes or pods take to start. After some research, because it's not that well documented, I found that the kubelet exposes metrics for pod startups, node startups, and image pulls. Karpenter also has a similar metric for pod startups. So I used the kubelet's metrics port to get the metrics.
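For reference, a quick way to eyeball these metrics on a single node is to go through the API server's node proxy. A minimal sketch; the node name is a placeholder, and the exact metric names available depend on your Kubernetes version:

```bash
# Dump kubelet metrics for one node via the API server proxy and
# keep only the startup-related series
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" \
  | grep -E "kubelet_pod_start|kubelet_node_startup"
```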
Bart Farrell: Kubelet logs seem to provide particularly detailed information. What insights can teams gain from analyzing these logs that they might miss from metrics alone?
Jan Ludvik: The kubelet has information on every pod: it records when the pod started, when it started pulling the image, and when it was fully running. So you can find very detailed information on each pod, but this isn't easily available in metrics because the cardinality would be very high. If you are able to get all the logs out, parse them somehow, and build metrics from them, that's a valuable source of information. There's also information in the logs called pod start SLO duration and pod start end-to-end duration. The end-to-end duration is the full pod startup, and the SLO duration is the same but doesn't include the image pull. That can be useful for developers who can't really influence how long the image pull takes but want to optimize their pod startup time from a runtime perspective.
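If these kubelet histograms are already scraped into Prometheus, the split Jan describes maps roughly onto two metrics in recent Kubernetes releases: kubelet_pod_start_duration_seconds (includes image pulling) and kubelet_pod_start_sli_duration_seconds (excludes it). A sketch of P90 queries, assuming those names and a standard scrape setup; check the kubelet docs for your version, since the exact semantics have shifted over releases:

```promql
# P90 pod startup, including image pull
histogram_quantile(0.90,
  sum(rate(kubelet_pod_start_duration_seconds_bucket[15m])) by (le))

# P90 pod startup, excluding image pull (the SLI/SLO-style view)
histogram_quantile(0.90,
  sum(rate(kubelet_pod_start_sli_duration_seconds_bucket[15m])) by (le))
```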
Bart Farrell: Got it. So, image caching in AMIs is a common optimization strategy. You discovered an interesting problem with EBS lazy loading. Can you explain what happened with your 900-megabyte image?
Jan Ludvik: Yeah, so my first thought was that we would just pull the biggest image that we use and save 20 to 25 seconds from the pod startup, because we wouldn't have to pull it. But when I did this, I often found in the logs I mentioned before that it actually didn't help and the pod was still pulling the image as before. So I did some research and found that sometimes the EBS volume might not have the data when you need it, because it's lazily loaded from the snapshot in the background and only starts fetching data when you actually ask for it. So I realized this would not be the solution for getting those 25 seconds off the pod startup.
Bart Farrell: After the EBS lazy loading issue, you pivoted to a critical path strategy. What images did you focus on and why did this approach work better?
Jan Ludvik: Yep, so since I couldn't get the biggest image cached successfully, I wanted to try pulling and caching smaller images: images for daemonsets like kube-proxy, vpc-cni, and the node-local-dns that we're using. So I cached those into the image, and that worked much better; we saw on average a 10-second reduction in pod startup times.
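For teams building their own AMIs on containerd-based nodes, one generic way to bake images in is to pre-pull them into containerd's k8s.io namespace before the AMI or data-volume snapshot is taken. This is a hedged sketch, not necessarily how Outreach prepares its Bottlerocket snapshots, and the image references are placeholders:

```bash
# Run during the AMI/snapshot build so the kubelet finds these images already cached at boot
ctr --namespace k8s.io images pull <your-registry>/kube-proxy:<tag>
ctr --namespace k8s.io images pull <your-registry>/amazon-k8s-cni:<tag>
ctr --namespace k8s.io images pull <your-registry>/k8s-dns-node-cache:<tag>
```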
Bart Farrell: Kubernetes pulls container images serially by default, which can create bottlenecks. How did you configure parallel image pulls and what should teams watch out for?
Jan Ludvik: One of my biggest surprises was that the kubelet by default pulls images serially; the default setting is to not pull them in parallel. There's a flag called serializeImagePulls, which defaults to true, and you have to set it to false. There's also a setting to limit the maximum number of images being pulled in parallel, and you can set that to whatever makes sure the nodes aren't overloaded by too many images being pulled at the same time. This setting is useful if, for example, you have one huge image: the kubelet will start pulling it first, and ten other tiny images might sit in the queue waiting behind it. For that situation this setting helps. It also helped us reduce node startup times. I didn't look into this in detail, but I can imagine there was a big image that reached the kubelet before some smaller images that were needed to finish the node boot, and it slowed down the node startup.
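For reference, both knobs live in the kubelet configuration. A minimal sketch of the relevant KubeletConfiguration fields; how you deliver this depends on your setup (node user data, a custom AMI, or the equivalent Bottlerocket settings), and the cap of 4 is only illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Default is true, i.e. images are pulled one at a time
serializeImagePulls: false
# Optional cap so a burst of pulls doesn't saturate disk or network (illustrative value)
maxParallelImagePulls: 4
```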
Bart Farrell: EBS volume throughput limits can significantly impact startup performance. Can you walk us through your analysis and the throughput levels you tested?
Jan Ludvik: So by default we use gp3 EBS volumes, and they have 125 megabytes per second of throughput. I tried setting this higher, and it had a good impact on node boot times and pod startup times. It also affects some other things, like extracting the images that are being pulled. At first I tried setting it to 300 megabytes per second, and that helped a lot, so I wondered how far I could go. I tried going as high as 1,000, I think, but such a high value didn't help as much, so I went back down a little. In the end I settled at 600 megabytes per second. Another reason for that was the instances and their throughput limits: each instance type has a limit it can handle, and 600 megabytes per second seemed like the sweet spot for most of the instance types we use. It might be a higher number for someone using the biggest instances, or a lower number for someone using smaller instances.
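With Karpenter, this throughput is typically set through the EC2NodeClass block device mappings. A sketch assuming the karpenter.k8s.aws/v1 API and Bottlerocket's data volume; it is abbreviated (a real EC2NodeClass also needs role, subnet, and security group selectors), and device names, sizes, and the throughput value of 600 should be adapted to your instance mix:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: throughput-tuned
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
  blockDeviceMappings:
    - deviceName: /dev/xvdb        # Bottlerocket's data volume, where container images live
      ebs:
        volumeType: gp3
        volumeSize: 100Gi
        iops: 3000
        throughput: 600            # raised for boot; reverted later to the gp3 baseline of 125
```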
Bart Farrell: Increasing EBS throughput adds cost, but you only need it during startup. Can you explain the automated solution you built to optimize this?
Jan Ludvik: Each time an instance is started, there's an event in CloudWatch Events, which is now called EventBridge, recording that the instance was started. So I built a small solution triggered by that event: it goes through an SQS queue and then to a Lambda, and the Lambda changes the EBS throughput back to its original value, basically the free baseline, the same one that's in the default configuration for the volume.
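For reference, the event in question is the standard EC2 instance state-change notification. A sketch of an EventBridge rule event pattern that matches instances entering the running state (whether to match running or pending is a design choice; the rule's target would be the SQS delay queue described below):

```json
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["running"]
  }
}
```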
Bart Farrell: The automation system using CloudWatch, SQS, and Lambda is elegant. What were the key implementation details that made this work reliably?
Jan Ludvik: Yep. One of them is the SQS delay queue. I was trying to find out how to delay the message for several minutes, because I wanted to give the instance some time to stabilize and let all the pods start. I settled on 10 minutes; SQS can delay messages by up to 15 minutes, and we're at 10 right now. So the message is delayed, and after that it's sent to the Lambda as a trigger. The Lambda uses some tags to filter down, basically to check that the instance that emitted this start event is one that should be modified. If all the tags are correct, it goes to the volume and modifies the EBS throughput, which can be done without any interruption to the volume or the instance. At least that's true in our case; for someone with really latency-sensitive volumes it might not hold completely, but I can't really speak to that, because in our environment it's fine.
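A minimal sketch of what such a Lambda handler could look like with boto3. The tag filter, the baseline value of 125, and the event handling are assumptions for illustration, not Outreach's actual implementation:

```python
import json
import boto3

ec2 = boto3.client("ec2")

BASELINE_THROUGHPUT = 125  # gp3 default throughput in MiB/s (assumed revert target)
REQUIRED_TAG = ("karpenter.sh/nodepool", "default")  # hypothetical tag filter

def handler(event, context):
    # SQS delivers the delayed EventBridge notification in each record body
    for record in event["Records"]:
        detail = json.loads(record["body"]).get("detail", {})
        instance_id = detail.get("instance-id")
        if not instance_id:
            continue

        # Confirm the instance carries the expected tag before touching anything
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        if not reservations:
            continue
        instance = reservations[0]["Instances"][0]
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        if tags.get(REQUIRED_TAG[0]) != REQUIRED_TAG[1]:
            continue

        # Revert throughput on the instance's attached gp3 volumes; modify_volume
        # is an online operation, so the instance keeps running
        volumes = ec2.describe_volumes(
            Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
        )["Volumes"]
        for vol in volumes:
            if vol.get("VolumeType") == "gp3":
                ec2.modify_volume(VolumeId=vol["VolumeId"],
                                  Throughput=BASELINE_THROUGHPUT)
```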
Bart Farrell: Okay and let's talk about the actual results. What improvements did you see across your various metrics after implementing all these optimizations?
Jan Ludvik: On average, node startup improved by 20 seconds, so we went from 65 to 45 seconds. It also helped with the P90 for pods: the P90 was reduced by roughly 30 seconds, from 80 seconds to 50 seconds. The average pod startup time also improved by 10 to 15 seconds. The biggest effect was on the kubelet_node_startup_duration_seconds metric, which is the full node startup. And we also saw improvements in image pulls. That's all.
Bart Farrell: Okay. Performance improvements often come with trade-offs. What limitations or considerations should teams be aware of when implementing these optimizations?
Jan Ludvik: Yep. As I said, one of them would be possible negative effects on the EBS volume, which I can't really quantify; as I said, they're not really a concern in our case, but for someone else there might be visible performance degradation while the volume modification is in flight. There's also the added cost of the increased throughput, which, with the Lambda we talked about, is not that high; in our case it's acceptable for the result it delivers. And actually the biggest issue right now, well, not a problem, but added overhead for our team, is that the contents of the image snapshot need to match the images we deploy on the cluster. We don't currently have an automated solution for getting the same images inside the AMI or the snapshot, so that's a bit of operational overhead for us right now.
Bart Farrell: And for teams out there that are running similar EKS setups with Karpenter, which optimization would you recommend starting with for the biggest impact?
Jan Ludvik: It seems like enabling parallel pulling is the best way to go, since it's just one flag, and then if you want you can tune the maximum number of images pulled in parallel. It's the easiest way to start pods faster. It won't help with node startup that much, but I think in the end we care more about the service and application pods than about the nodes.
Bart Farrell: And looking back at this optimization journey, what surprised you most, or what lessons learned would you share with other infrastructure teams?
Jan Ludvik: What surprised me the most was that images are not pulled in parallel; they are pulled serially, that's the default setting in the kubelet, and you actually have to go and change it to parallel. The other thing was that the kubelet will not prioritize pods in any way, as far as I found. It seems like basically the only way to control what the kubelet pulls and starts is the order in which pods reach the kubelet. Those two things were the biggest surprises.
Bart Farrell: What's next for you?
Jan Ludvik: Right now we're working on Istio ambient mode. That's a really difficult topic, I would say. It's very complicated, but at the same time an amazing technology; I'm still surprised by how great it is, but also terrified by how complicated it is. I'm also looking into adding Argo Workflows to our tool stack, because it sounds really interesting for things like running one-off scripts and doing tasks that need to be done without running them from our own laptops, while on the other hand being easier than writing arbitrary pods for each task. It just seems like a very consistent way to do things.
Bart Farrell: Okay. And if people want to get in touch with you, what's the best way to do that?
Jan Ludvik: I think the best way is through LinkedIn. I'm usually there reading all the updates.
Bart Farrell: Great. Well, thanks so much for sharing your experience with us today. Look forward to interacting with you in the future. Take care.
Jan Ludvik: Thank you.
