The Hidden Cost of Slow Autoscaling

May 19, 2026

Host:

Bart Farrell

Guest:

John Ford

This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

Forced platform migrations are usually treated as something to survive. At Scout24, a mandatory OS migration became an opportunity to rethink Kubernetes autoscaling, node provisioning, and infrastructure efficiency.

John Ford explains how Scout24 moved its EKS-based Infinity platform from a polling autoscaler and over-provisioned capacity to Karpenter and Bottlerocket. The result was faster node startup, a safer migration path, and about a 30% infrastructure reduction without major downtime.

In this interview:

Why two-minute node provisioning forced a 25% capacity buffer
How Karpenter made the Bottlerocket migration safer
What broke around EC2 metadata, AWS SDKs, and cgroups
How the new foundation enables Spot, ARM, and GPU workloads

Transcription

Bart Farrell: Forced operating system migration is usually the kind of thing you just try to survive. But at Scout24, it turned into something very different. In this episode of KubeFM, we're digging into how a Kubernetes platform running hundreds of services used that constraint to rethink autoscaling, node provisioning, and cluster efficiency, moving from a polling-based model with over-provisioned capacity to an event-driven setup with faster node startup and better bin packing. We're talking clusters at real scale, where two minutes to provision a node is already too slow and idle capacity quietly eats your budget. The result? About a 30% reduction in infrastructure without breaking production. John Ford joins us to break down what changed, what broke, and what actually made it work, from Karpenter to Bottlerocket to the trade-offs you can only see once you hit scale. This episode of KubeFM is brought to you by LearnKube. Since 2017, LearnKube has been helping engineers all over the world level up their Kubernetes skills through training courses. They are instructor-led, given online and in-person, to individuals as well as to groups. Courses are 60% practical and 40% theoretical, and students have access to the course material for the rest of their lives. For more information about how you can level up, go to learnkube.com. Now, let's get into the episode. Hi, John. Welcome to KubeFM. What's new for you? What are some technologies that you've been checking out lately?

John Ford: Thanks for having me. Right now we're working on moving our Kubernetes cluster from on-demand to a spot-based instance setup, with many different instance types across different availability zones. It's the main thing we're doing right now.

Bart Farrell: All right, great. And for people who don't know you, can you tell us a little bit more about what you do and where you work?

John Ford: Sure. I'm a staff platform engineer at Scout24. We have a few different companies that we own. One of them is ImmobilienScout24. And if you're in Germany, you've definitely heard of this. It's the primary portal for finding places to buy, to rent. We also are expanding into more valuation. And we recently acquired a portal in Spain. I work with the platform engineering team on developer productivity and cloud infrastructure and reliability topics.

Bart Farrell: Very good. And how did you get into cloud native?

John Ford: So before I moved to Germany, I was in the Valley and I was working at a client software company. And we were doing a lot of release engineering topics and mobile operating systems. So I started working on a continuous integration platform, kind of a competitor to Jenkins. And in order to actually have the capacity we needed for that, we needed to start working with cloud vendors because we couldn't purchase the capital necessary for that. So I started working with a lot of the AWS APIs from a more application point of view. And then as I did that, I learned more and more about the cloud native side of things, infrastructure as code. And then when I switched to Scout24, I started working full time on cloud native topics.

Bart Farrell: Very good. Now, cloud native topics, the ecosystem, it moves very quickly. How do you keep up to date? Is it blogs, tutorials, videos? What works best for you?

John Ford: I'd say it's a combination of news sites. So I try to avoid a lot of push type notifications. So I like to go to places like Hacker News, Reddit, different blogs I follow. Also, in particular, a lot of the hyperscaler blogs, I think have a lot of really good information. The open source community has really good places to go as well, between different blogs, primarily there.

Bart Farrell: Excellent. And John, if you could go back in time and share one career tip with your younger self, what would it be?

John Ford: So I think being able to tell a story about what you're working on is a really important thing for your career development. I think that it's something that I struggled with. I often would focus on just completing the task, getting the technical information done, but then sometimes not necessarily finishing everything up at the end and not really having a story to tell. And so what ends up, I think, happening sometimes is that it's great that you do the technical work, but if you haven't benefited the team, the company, it doesn't necessarily matter. And I think being able to keep that in mind is really important for development.

Bart Farrell: And here we are today with a story that was covered by some of your colleagues titled Infinity Transformation, how we turned a forced OS migration into a 30% infrastructure reduction. So we want to get into this topic a little bit more deeply with some questions. And starting out with the fact that Scout24 is one of the biggest digital platforms in Germany. And before getting too much into the migration, can you give us a picture of what the compute platform looks like? And what are you actually responsible for as the infrastructure team?

John Ford: Sure. So we call it Infinity. It's based on Amazon's EKS. It's our second version. We used to use ECS, but we switched to Kubernetes a couple years ago. We have multiple clusters. I think we're up to maybe seven clusters now. Amongst those, we have about 700 services and over 4,000 containers at peak. That is the majority of our stateless container environment. We do have other compute platforms, but they are not what we recommend for our product developers. The platform team at Scout is responsible for basically making infrastructure easy to use for our product developers, as well as being reliable and observable and have insights into how people are using the platform. Infinity is really key to that because it gives us a standardized way to deploy, to scale, and also to offload the cognitive load of maintaining a complex cluster.

Bart Farrell: Hundreds of teams deploy to this platform every day without worrying about what's underneath. That works until something forces you to change the foundation. And your team was recently in a situation where Amazon gave them a deadline to move off the operating system your nodes were running on. What were the options?

John Ford: We had a few different options. One was to potentially stay on Amazon Linux 2 as our host operating system, which would be insecure and not a good idea, or potentially to look at alternative support options. We looked into maintaining our current architecture with Amazon Linux 2023. There was brief consideration of other operating systems. And we worked with the security team to look into using Bottlerocket as a potential alternative to Amazon Linux 2. But we had a really hard time doing the migration with the architecture that we had at the time.

Bart Farrell: Combining migrations sounds like it could go either way. It could simplify things or it could blow up. How did you think about the risk and what convinced you that this was the right call?

John Ford: I tend to agree. I think it's generally bad practice to do too many things in one go. The reason we wanted to do things together is that the end goal was to move to Bottlerocket, but we found that creating a separate cluster autoscaler was actually going to be a very complex task. We would have to duplicate certain infrastructure, we'd be making major changes, and it would take us down a pathway we didn't necessarily want to go. So we decided with the security team that Bottlerocket was definitely a pathway we wanted to go down. And we saw that Karpenter would let us make that transition much more quickly, easily, and safely. So we broke it down into two separate projects that were part of the same overall deliverable. We first moved to Karpenter, and once we had finished that, we did the Bottlerocket transition. The Karpenter transition fully unlocked our ability to do that migration much more safely.

Bart Farrell: To understand why that sequencing matters, let's go back to how scaling worked before. When a service needs more capacity on your cluster, something has to provision new nodes. How did that work in your previous setup and where did it struggle?

John Ford: In the old model, the cluster autoscaler was polling. It would check periodically to see if there was pending work and then spin up new nodes as necessary. The real problem for us was that it would take up to about two minutes. And when we have as many services as we do, traffic spikes would cause an issue where, in that two-minute period, the service would become overwhelmed before the new nodes came up. We also weren't able to consider using spot instances because the time to spin up a new node was so long.

Bart Farrell: Two minutes, as you said, is a long time when traffic is spiking and pods are sitting there waiting. Most teams find some kind of a workaround for that. What was yours?

John Ford: We basically created a pool of over-provisioned capacity, around about 25% of our pool at any given moment. So these were nodes that we would be paying for 100% of the price that were sitting idle, waiting for load to come. What ended up happening is we were spending about 25% more than we really should have on nodes that were just sitting around doing nothing. But for us, this was an absolute requirement. And it was still cheaper than just buying our peak amount of nodes 24-7.
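The trade-off described here (an always-on buffer that still beats buying peak capacity around the clock) can be put into rough numbers. The sketch below is illustrative only: the hourly demand profile is made up and is not Scout24 data, and it assumes the 25% figure is a share of the whole pool. Only the 25% value itself comes from the interview.

```python
from typing import Sequence


def pool_with_buffer(needed_nodes: int, buffer_percent: int) -> int:
    """Pool size when `buffer_percent` of the whole pool is idle headroom."""
    if not 0 <= buffer_percent < 100:
        raise ValueError("buffer_percent must be in [0, 100)")
    busy_percent = 100 - buffer_percent
    # Integer ceiling division: round up so the buffer is never undersized.
    return -(-needed_nodes * 100 // busy_percent)


def node_hours(hourly_demand: Sequence[int], buffer_percent: int) -> int:
    """Node-hours paid for when the pool follows demand plus the buffer."""
    return sum(pool_with_buffer(n, buffer_percent) for n in hourly_demand)


def static_peak_node_hours(hourly_demand: Sequence[int]) -> int:
    """Node-hours paid for when peak capacity runs for every hour."""
    return max(hourly_demand) * len(hourly_demand)


# Hypothetical 24-hour demand profile (busy nodes needed per hour).
HOURLY_DEMAND = [30] * 8 + [60] * 8 + [100] * 4 + [60] * 4

if __name__ == "__main__":
    print("no buffer:   ", node_hours(HOURLY_DEMAND, 0))
    print("25% buffer:  ", node_hours(HOURLY_DEMAND, 25))
    print("static peak: ", static_peak_node_hours(HOURLY_DEMAND))
```

With this invented profile, the buffered pool costs more node-hours than a pool with no buffer but fewer than running peak capacity all day, which is the shape of the argument in the interview.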

Bart Farrell: So you're paying for that buffer around the clock, regardless of whether you need it. When you moved to Karpenter, did that change?

John Ford: It did, substantially. We were no longer using a polling model. Everything became event-driven, so we didn't need to do this over-provisioning. We started with Karpenter still having the over-provisioning, and then slowly brought it down to reduce the risk of causing issues with scaling. Basically, we were able to scale nodes up very quickly. We also found that with the switch to Bottlerocket, which we did in parallel with this work, the machines would boot up significantly faster. And so this two-minute period came down to around 30 seconds. With that amount of time, we were able to safely eliminate the over-provisioning.

Bart Farrell: For the operating system itself, you had a choice between Amazon's general purpose Linux and Bottlerocket. What made you pick one over the other?

John Ford: At the lower level, it starts up faster. It has a much reduced attack surface. Basically, there's less there. There's less to configure. There's no ability to do modifications to the operating system itself. So everything is done in a more immutable way. And as a result, the nodes were able to start up faster, a combination of the Docker image being much smaller, as well as the startup process being much quicker.

Bart Farrell: You mentioned sequencing was critical, doing the autoscaler change before the OS change. Can you walk us through how that actually played out? And what did you do first? And how did the second part build on it?

John Ford: Sure. The Karpenter migration took us about three weeks to do. The reason we did Karpenter before the operating system is that, as I mentioned earlier, we previously had a very static node concept. It was an auto scaling group with just a bunch of EC2 machines. What we would have had to do without Karpenter is essentially duplicate all of that logic, which would have meant making a lot of changes, not just duplication. We'd have to put in different key values so that we could duplicate the infrastructure. And we felt that at that point, we'd be doing a lot of really custom work that's completely self-maintained. Karpenter was something we wanted to move to anyway, since it brought a lot of benefits. So doing the Karpenter work first would give us an easier transition.

Bart Farrell: Now, that is a lot of moving parts at the end of the day. New autoscaler, new operating system, workload shifting between node configurations. With 700 services running on this platform, I imagine not everything went smoothly. So what was the first thing that broke?

John Ford: Probably the metadata service for the EC2 instances. We were previously relying on the metadata service that's provided by EC2 for credentials, IAM roles, those kinds of things. But Karpenter actually disabled that. So what we were finding is that applications that had not been updated to work with the newer mechanisms for getting credentials started to get a bunch of errors. We did re-enable the endpoint for a while to figure out how to move forward, but over time we have disabled it again. One of the main causes is that we had a bunch of services using old SDK versions in particular, and they didn't work. And because we have a shared responsibility model for Infinity, we don't necessarily have a huge view into the software that's running on the cluster. So it was hard for us to know whether there would be services that would have this complexity.
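One way a service team could check whether a workload depends on the instance metadata endpoint is to look at which credential hints are present in the container's environment. The sketch below is a rough diagnostic, not something described in the interview: the function name and the returned labels are made up for illustration, and the interview does not say which newer credential mechanism Scout24 uses. The environment variable names are standard ones read by AWS SDKs, but the check does not reproduce any SDK's full provider chain (for example, it ignores shared credential files).

```python
import os
from typing import Mapping


def likely_credential_source(env: Mapping[str, str]) -> str:
    """Guess where an AWS SDK in this environment would find credentials."""
    # Static keys injected directly into the environment.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static-keys"
    # Web identity token projected into the pod (IAM roles for service accounts).
    if env.get("AWS_WEB_IDENTITY_TOKEN_FILE") and env.get("AWS_ROLE_ARN"):
        return "web-identity"
    # Container credentials endpoint (for example, EKS Pod Identity).
    if env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI"):
        return "container-credentials"
    # Nothing else found: the SDK would probably fall back to the
    # instance metadata service, which breaks if that endpoint is blocked.
    return "instance-metadata-fallback"


if __name__ == "__main__":
    print(likely_credential_source(os.environ))
```

A workload that reports the fallback label is a candidate for the kind of breakage described above once pod access to the metadata endpoint is turned off.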

Bart Farrell: So that's a security default in the new autoscaler catching old assumptions in your application code. Once you worked around that, was that the end of it? Or did the new operating system have its own surprises?

John Ford: The main difference between Bottlerocket and Amazon Linux 2 is that there's a new version of cgroups. As that's exposed through the file system, we were having issues where different file-like objects that were expected on Amazon Linux 2 were not available on Bottlerocket, and vice versa. Basically, we would check both paths: if we saw the cgroups v2 path, we would use that, and if we saw the cgroups v1 path, we would use that.
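The "check both paths" approach can be sketched as follows for one common case, reading a container's memory limit. This is a minimal illustration, not Scout24's code: the function names are made up, and the interview does not say which cgroup files were affected. The paths are the conventional locations as seen from inside a container (`memory.max` under cgroups v2, `memory/memory.limit_in_bytes` under v1).

```python
from pathlib import Path
from typing import Optional

DEFAULT_ROOT = Path("/sys/fs/cgroup")


def cgroup_version(root: Path = DEFAULT_ROOT) -> int:
    """Return 2 if the unified (v2) hierarchy is mounted at `root`, else 1."""
    # `cgroup.controllers` only exists at the top of a cgroups v2 hierarchy.
    return 2 if (root / "cgroup.controllers").is_file() else 1


def memory_limit_bytes(root: Path = DEFAULT_ROOT) -> Optional[int]:
    """Read the memory limit, trying the v2 path first and then the v1 path.

    Returns None when no limit file is found or when v2 reports "max"
    (no limit). A v1 kernel reports "no limit" as a very large number,
    which is returned unchanged here.
    """
    v2_file = root / "memory.max"
    v1_file = root / "memory" / "memory.limit_in_bytes"
    if v2_file.is_file():
        raw = v2_file.read_text().strip()
        return None if raw == "max" else int(raw)
    if v1_file.is_file():
        return int(v1_file.read_text().strip())
    return None


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
    print("memory limit:", memory_limit_bytes())
```

Taking the root directory as a parameter keeps the lookup testable against a temporary directory instead of the real `/sys/fs/cgroup`.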

Bart Farrell: Two different issues, two different root causes. One from the autoscaler and one from the operating system. Both of those sound like the kind of thing you'd expect to catch before production. Did your staging environments flag either of them?

John Ford: They didn't, unfortunately. So as I alluded to before, we have the shared responsibility model. So we present a contract to our users and we don't really question them as to what is going to be going on to the service. So when we make a change to that contract, we have to basically find and fix those issues. One thing that made this a little bit more complicated for us and why we didn't catch them at the beginning is that our development cluster for our Infinity product is only testing the development of Infinity itself. So in general, we have an approach where our development environments for platform tools are for developing the platform, not for running the company's development environment. So basically what happens is our customers, they put their development and staging systems on our production environment along with their production services. So it was a lack of visibility into these particular types of applications. And it's also very difficult to deploy the entire set of production services into our development cluster. So I think what we really want to do in the future is rethink how we do the development and staging environment and potentially move more towards a model where our staging environment is actually running the staging environment of our products.

Bart Farrell: After all of this, a new autoscaler, new operating system, two production issues resolved, what did the before and after look like?

John Ford: Node provisioning, as I mentioned, went from two minutes to about 30 seconds. We had a 95th percentile pod scheduling time of about a second and a 99th percentile, so pretty much the worst case scenario of about 27 seconds. We went down in the number of nodes by 30% because we no longer had to deal with over-provisioning. And during the whole migration, we actually didn't have any major downtime.
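For readers less familiar with percentile figures like these, the sketch below shows one common way to compute them, the nearest-rank method. The sample data is invented for illustration and is not Scout24's measurement data; the interview does not say how the percentiles were calculated.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical pod scheduling times in seconds: most pods land on an
# existing node quickly, a few wait for a new node to come up.
SCHEDULING_SECONDS = [0.5] * 90 + [1.0] * 5 + [10.0] * 4 + [27.0]

if __name__ == "__main__":
    for pct in (50, 95, 99, 100):
        print(f"p{pct}: {percentile(SCHEDULING_SECONDS, pct)} s")
```

The gap between a low p95 and a much higher p99 is typical when only a small share of pods has to wait for fresh capacity.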

Bart Farrell: That 30% reduction is striking. Is that purely from removing the buffer or is there something else compacting things further?

John Ford: The bulk was definitely from removing the over-provisioning. So that was about the 30%, sorry, 25% of the reduction. Karpenter's consolidation also helped us by more efficiently packing the containers onto the nodes. And because we have everything stateless in Infinity, we were able to start and stop services without a lot of warning. That basically lets us do the consolidation, which gives us better utilization. So that's the extra 5% roughly that we saw.
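The idea behind consolidation, packing the same workloads onto fewer nodes, can be illustrated with a classic bin-packing heuristic. The sketch below uses first-fit decreasing on CPU requests only. It is not Karpenter's actual algorithm, which weighs more factors, and the pod sizes and node capacity are invented for the example.

```python
from typing import List


def first_fit_decreasing(requests_millicores: List[int],
                         node_capacity_millicores: int) -> List[List[int]]:
    """Place pods on nodes, largest first, opening a new node only when
    no existing node has room. Returns the pods placed on each node."""
    nodes: List[List[int]] = []   # pods placed on each node
    free: List[int] = []          # remaining capacity of each node
    for request in sorted(requests_millicores, reverse=True):
        if request > node_capacity_millicores:
            raise ValueError(f"pod of {request}m does not fit on any node")
        for index, remaining in enumerate(free):
            if request <= remaining:
                nodes[index].append(request)
                free[index] -= request
                break
        else:
            # No existing node had room, so open a new one.
            nodes.append([request])
            free.append(node_capacity_millicores - request)
    return nodes


# Hypothetical CPU requests (millicores) and a 4-vCPU node.
POD_REQUESTS = [2000, 500, 1500, 1000, 3000, 500, 2500, 1000]
NODE_CAPACITY = 4000

if __name__ == "__main__":
    for number, pods in enumerate(first_fit_decreasing(POD_REQUESTS, NODE_CAPACITY), 1):
        print(f"node {number}: {pods} ({sum(pods)}m used)")
```

Repacking like this only works when pods can be moved freely, which matches the point above about Infinity workloads being stateless.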

Bart Farrell: From the developer side, hundreds of teams deploying every day. Did anything change for them?

John Ford: No. The whole point of this was that they didn't have to do anything, provided they were using reasonably recent versions of AWS SDKs and all those kinds of tools. And the over-provisioning change was completely invisible to our developers. We've not had any issues with that. It didn't cause any downtime. For us, a big measure of that is whether we broke the website, and we didn't break the website. So it's a big success there.

Bart Farrell: Congrats on that. One of the interesting things about this project is that it started first as a compliance deadline. You had to migrate, but it ended up being a modernization effort. So what advice would you give to other platform teams out there who are staring at a similar forced migration?

John Ford: We worked with our internal security team, who does a lot of the compliance work. We took a step back and looked at it from a different angle, not just what is the quickest way to close the security ticket. We looked at what we want to do with Infinity and whether we could improve security, improve compliance, and also get ahead of future compliance issues. For us, Karpenter itself does none of those things. What Karpenter did is enable us to do things like move to Bottlerocket. Our security team and our team had been in discussion a little bit about Bottlerocket, and they like the reduced attack surface. So we decided in concert with our security team that even if it might be a little bit more complicated, this was definitely the right way for us to go.

Bart Farrell: And now that you have this new foundation in place, what does it let you do that you couldn't do before?

John Ford: A lot of things. We have different types of nodes that we'd like to use. Our data team would like to be able to do machine learning topics, and that requires GPUs. But as I'm sure everyone's aware, GPU nodes are quite expensive. So we can basically have a pool of GPU nodes for any workload that needs GPUs. We're also looking into using ARM instances for the various benefits. Some workloads are faster on ARM, some workloads are cheaper on ARM. And this lets us have many different types of instances at the same time. With Karpenter, it's really easy for us to do that. It also made it a lot easier to use spot, which is a direction that we would really like to go. We're in the process of tuning our spot usage. But at this point, we've actually switched almost, I think, not quite 100% of our workload in Infinity to spot instances, but a large majority. I think it's above 90%.

Bart Farrell: With that in mind, what's next for you?

John Ford: I think spot is the main thing that we're pushing forward right now. We are thinking about how we want to do purchasing for our EC2 instances. We have considered Fargate in the past. It's something that we're always re-evaluating as things change with Fargate. But basically, Spot is our next main thing that we're working on in this space.

Bart Farrell: And if people want to get in touch with you, what's the best way to do that?

John Ford: I'd say LinkedIn and GitHub are probably the best ways to get in contact with me personally. I'm not a big social media person, but we actually do have some openings for this team. So you can get in contact with us through our jobs portal if that's something that's interesting to you.

Bart Farrell: For folks out there that may be looking for a new role or a new challenge, seems like a good place, very dynamic with lots of interest in new technologies. John, thanks so much for joining us and sharing your knowledge today. Really appreciate it and look forward to hearing about what's going on in Scout24 in the future. Take care.

John Ford: Thanks for having me.
