Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling
Host:
- Bart Farrell
This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io
Jorrick Stempher shares how his team of eight students built a complete predictive scaling system for Kubernetes clusters using machine learning.
Rather than waiting for nodes to become overloaded, their system uses the Prophet forecasting model to anticipate load patterns and scale infrastructure ahead of demand, giving them the 8-9 minutes needed to provision new nodes on Vultr.
You will learn:
How to implement predictive scaling using the Prophet ML model, Prometheus metrics, and custom APIs to forecast Kubernetes workload patterns
The Node Ranking Index (NRI) - a unified metric that combines CPU, RAM, and request data into a single comparable number for efficient scaling decisions
Real-world implementation challenges, including data validation, node startup timing constraints, load testing strategies, and the importance of proper research before building complex scaling solutions
Relevant links
Transcription
Bart: In this episode of KubeFM, I'm joined by Jorrick Stempher from Infotopics. We dig into his article about predictive node management for Kubernetes and what it took for a team of eight students to build a scaling service that anticipates load instead of reacting to it. Jorrick explains how his team combined Prometheus metrics, the Prophet forecasting model, and a custom Node Ranking Index to decide when to add or remove Vultr nodes, and what they learned along the way about node startup times, data validation, and load testing. It's a practical look at building predictive scaling from scratch without prior Kubernetes experience. He would also like to give a shout out to some of his colleagues, but I'll let him do that because my Dutch is terrible. Dank u wel and much respect to all the folks out there in Holland.
Special thanks to Testkube for sponsoring today's episode. Are flaky tests slowing your Kubernetes deployments? Testkube powers continuous cloud native testing directly inside your Kubernetes clusters. Seamlessly integrate your existing testing tools, scale effortlessly, and release software faster and safer than ever. Stop chasing bugs in production. Start preventing them with Testkube. Check it out at testkube.io.
Now, let's get into the episode. Jorrick, welcome to KubeFM. For people who don't know you, can you just give us a quick overview about what you do and who you work for?
Jorrick: I'm Jorrick Stempher. I'm currently 22 years old and a student at Windesheim in Zwolle, Netherlands. I'll graduate next September or July/August, depending on my progress. I'm currently a junior software engineer at Infotopics, which is based in Heidelberg. I write software for Power BI and Tableau extensions, which is really interesting.
Bart: And how did you get into Cloud Native?
Jorrick: This project is what we're talking about today. It's actually the only time I've worked with anything Cloud Native, and it's only for school.
Bart: And how's that experience been? How did you get into it? What was the learning process like?
Jorrick: It was quite tough because this project was the first time I ever did anything with Kubernetes. I had some Docker experience, but not much. So it was actually really exciting because we had to run the project and learn Kubernetes at the same time. That was really fun.
Bart: And in terms of learning Kubernetes, were there any resources, blogs, tutorials, or anything like that you've checked out that have been helpful for you to learn?
Jorrick: The project manager provided us with a course that explained Kubernetes tools. I tried to look it up but couldn't find it because it was on my old school email. However, the course was really helpful for the entire project team, as we worked with seven other students.
Bart: Okay. And if you could go back in time and share one career tip, technical tip with your younger self, although I understand you're quite young, what would it be and why?
Jorrick: Getting out of the comfort zone when it comes to projects and programming is totally cool, and I should do it more often.
Bart: Very good. Now, as part of our monthly content discovery, we found an article you wrote titled "Scaling Smarter: Predictive Node Management with Prometheus, Next.js and Prophet". Your team just completed a three-part research series on Kubernetes scaling optimization. Can you give us an overview of what problem you are trying to solve and how this final piece fits into the bigger picture?
Jorrick: In the first week of the project, which was a 10-week school assignment, we received a task from our supervisor Ernst Bolt. He wanted us to create a scalable Kubernetes cluster hosted on Vultr. The challenge was to develop a cluster that could scale predictively, which is different from reactive scaling. Reactive scaling starts when nodes are already incredibly busy, whereas predictive scaling anticipates and prepares for load before it occurs.
Our team of eight software engineers, including myself, had minimal cloud native experience. Some of us, myself included, had limited Docker knowledge but no Kubernetes expertise. While working on the project, we simultaneously learned how Kubernetes works.
We conducted two research streams: cloud compute and scaling optimization. The cloud compute research focused on selecting the correct virtual machine hardware configurations for our frontend and backend. The scaling optimization research centered on choosing a predictive model and investigating node startup times.
By the project's end, we published three blog posts: two about our research and one about the scaling service.
Bart: From your previous research, you discovered that a 1 to 1 vCPU to instance ratio was optimal for resource allocation. How did this finding influence the design of your scaling service?
Jorrick: The finding directly shaped the design because it was a default scaling rule. During the research phase, we tested virtual machine hardware and found that running multiple instances of a Next.js client as a front-end on the same virtual CPU led to inefficient utilization. Basically, we had limited load capacity and random performance drops because Next.js doesn't really support multi-threading on the client-side. When running two front-end instances on one CPU, we had a really low throughput, almost the same rate as one instance. So we decided that for one virtual CPU, we would run one front-end instance. This was crucial to understand before starting work on the scaling service.
Bart: You also researched node startup times in part two and found they averaged around eight to nine minutes. How did this timing constraint affect your prediction model's design?
Jorrick: We needed to predict far enough into the future to have time to call the Vultr API and start the virtual machine, which took about eight and a half minutes because it was the first boot: the machine had to boot, install Debian, and be added to the cluster. The whole process took about nine minutes. So we had to predict far enough ahead to start the node, add it to the cluster, get the front-end and back-end running on it, and complete everything in time.
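To make that timing constraint concrete, here is a minimal sketch (not the team's code) of a lead-time check: a prediction is only actionable if it looks further ahead than the roughly nine minutes of provisioning plus some buffer for rolling out the pods. The buffer value, function names, and forecast format are illustrative assumptions.

```python
# Minimal sketch of the lead-time constraint described above; the buffer and
# the forecast format are assumptions, not the team's implementation.

PROVISION_MINUTES = 9        # first boot: create the VM, install Debian, join the cluster
DEPLOY_BUFFER_MINUTES = 3    # assumed extra time to roll out front-end/back-end pods

def lead_time_minutes() -> int:
    """How far ahead a prediction must look before it becomes actionable."""
    return PROVISION_MINUTES + DEPLOY_BUFFER_MINUTES

def should_provision_now(forecast: dict[int, float], capacity: float, now: int) -> bool:
    """Trigger a scale-up if predicted load exceeds capacity within the lead-time window.

    `forecast` maps a minute offset to predicted load; `capacity` is the load
    the current nodes can absorb.
    """
    horizon = now + lead_time_minutes()
    return any(load > capacity
               for minute, load in forecast.items()
               if now <= minute <= horizon)
```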
Bart: Now let's dive into the architecture. You built a five-component system to handle predictive scaling. Can you walk us through how these components work together?
Jorrick: We created five components, combined as the scaling service. These components were a direct result of the two research streams and the information we discovered during the project.
The first component is Prometheus, which collects metrics from the entire cluster; we collected primarily CPU metrics. This data then goes to the next component, an adapter: a microservice built in NestJS that calls the Prometheus API to retrieve the data. We clean the data and send it to a Postgres database.
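The real adapter is a NestJS microservice, but as a rough Python sketch of the same flow (the Prometheus URL, PromQL query, and table schema here are assumptions, not the team's actual setup):

```python
# Rough sketch: pull a CPU metric from Prometheus' HTTP API and store it in
# Postgres. The query, URL, and table are illustrative assumptions.
import requests
import psycopg2

PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"
QUERY = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[1m]))'

def scrape_and_store() -> None:
    resp = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]

    with psycopg2.connect("dbname=metrics user=scaler") as conn:
        with conn.cursor() as cur:
            for sample in results:
                timestamp, value = sample["value"]   # instant vector: [unix_ts, "0.42"]
                cur.execute(
                    "INSERT INTO cpu_metrics (ts, value) VALUES (to_timestamp(%s), %s)",
                    (float(timestamp), float(value)),
                )
```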
The Prophet model runs at set intervals (10 minutes, an hour, two hours, two days) and queries the saved data from the Postgres database. It creates a prediction one hour into the future, generating 60 CPU load predictions (one for each minute).
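A minimal sketch of that forecasting step, using Prophet's actual API but an assumed Postgres schema and query:

```python
# Fit Prophet on stored per-minute CPU samples and forecast the next hour.
# The table and column names are assumptions; Prophet expects columns ds/y.
import pandas as pd
import psycopg2
from prophet import Prophet

def forecast_next_hour() -> pd.DataFrame:
    with psycopg2.connect("dbname=metrics user=scaler") as conn:
        df = pd.read_sql("SELECT ts AS ds, value AS y FROM cpu_metrics ORDER BY ts", conn)

    # Daily and weekly seasonality matter because the cluster load follows
    # predictable patterns through the day and week.
    model = Prophet(daily_seasonality=True, weekly_seasonality=True)
    model.fit(df)

    # One prediction per minute for the next hour (60 rows).
    future = model.make_future_dataframe(periods=60, freq="min")
    forecast = model.predict(future)
    return forecast[["ds", "yhat"]].tail(60)
```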
The results are saved in Postgres and posted using a custom API to the fourth component, the Index Calculation Component (ICC). The ICC receives a request from the prediction model and calculates an abstract index metric to determine whether to scale up, down, or maintain the current state.
The fifth component is Grafana, which queries Postgres to display all the information on a web page. We can compare predictions to real-time data, showing prediction-only views, MAPE, CPU load, and other metrics to demonstrate the system's functionality.
Bart: The Node Ranking Index, or NRI, seems to be a key innovation here. What exactly is this metric for people who may not be familiar with it, and why did you need to create it?
Jorrick: We found out in week five, halfway through the research, that we needed a unified metric to combine CPU, RAM, and request data. This was really essential because it allowed us to directly compare the resources. That wouldn't be possible otherwise because CPU and RAM are completely different types of measurements. Without this ranking index—just a number like one, two and a half, or three—it would have been really difficult to scale nodes up and down for cost and performance efficiently.
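The article defines the team's exact formula; purely to illustrate the idea, a unified index could be computed along these lines, where the baseline capacities and weights are made up for the example:

```python
# Illustrative sketch of a unified Node Ranking Index: express CPU, RAM, and
# request demand in units of one baseline node's capacity and fold them into a
# single number, so an index of 2.5 roughly means "two and a half baseline
# nodes' worth of load". Baseline values and weights are assumptions.
BASELINE = {"cpu_cores": 1.0, "ram_gb": 2.0, "req_per_sec": 1000.0}  # assumed 1-vCPU plan
WEIGHTS = {"cpu_cores": 0.5, "ram_gb": 0.3, "req_per_sec": 0.2}      # assumed weighting

def node_ranking_index(cpu_cores: float, ram_gb: float, req_per_sec: float) -> float:
    """Combine three otherwise incomparable measurements into one number."""
    demand = {"cpu_cores": cpu_cores, "ram_gb": ram_gb, "req_per_sec": req_per_sec}
    return sum(WEIGHTS[k] * demand[k] / BASELINE[k] for k in demand)

# e.g. node_ranking_index(2.0, 4.0, 2000) == 2.0: two baseline nodes' worth of load
```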
Bart: You chose Prophet over other prediction models like ARIMA, SARIMA, and NeuralProphet. What made Prophet the best choice for Kubernetes workload prediction?
Jorrick: We needed a good balance between accuracy and training speed. Another key reason was the built-in seasonality, which was very important since the cluster load we were working with changes in predictable patterns throughout the day, week, and month. Because Prophet had all of this built in, it was really easy to create reliable predictions for scaling decisions.
Bart: The NestJS Adapter does more than just pass data along. It validates metrics before storage. What kind of validation is needed, and why is this important for prediction accuracy?
Jorrick: When the adapter microservice calls the Prometheus API, it checks all the metrics before saving them to Postgres. Grafana showed us how important this was, because sometimes Prometheus gave us invalid data, like a CPU metric that was zero or null. This could potentially mess up Prophet's predictions, not because one random number matters significantly, but more because it looked ugly on Grafana: a nice flowing line suddenly interrupted by a random spike going up or down. So we decided that if a metric is really weird, we simply don't save it and move on to the next one.
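As a hedged sketch of that kind of validation (the real checks live in the NestJS adapter, and the thresholds here are assumptions):

```python
# Drop samples that are missing, non-numeric, implausibly zero, or far outside
# the recent range; everything else is written to Postgres as usual.
def is_valid_cpu_sample(value, recent_values, spike_factor: float = 10.0) -> bool:
    if value is None:
        return False
    try:
        value = float(value)
    except (TypeError, ValueError):
        return False
    if value <= 0.0:          # a cluster under any load never reports exactly zero
        return False
    if recent_values:
        avg = sum(recent_values) / len(recent_values)
        if avg > 0 and value > spike_factor * avg:
            return False      # reject sudden spikes far outside the recent average
    return True

# Samples that fail the check are simply skipped rather than stored.
```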
Bart: You deployed everything on a dedicated manager node with taints and tolerations to control pod placement and ensure specific workload management on the node.
Jorrick: We discussed it with the product owner, Ernst, and he wanted the entire scaling service to run on the same cluster that we scale. It needed to keep running while we added and removed nodes. But what happens if we remove the wrong node, the one where the entire service is located? Then nothing works anymore. So we used taints and tolerations to make sure the service could never accidentally remove itself.
Bart: Your Grafana dashboard shows both real-time and predicted NRI values. How accurate were the predictions and what was the typical MAPE you achieved?
Jorrick: It really depended on how much data the adapter saved in the database, because more data would lead to better predictions. Since we always tested in a testing environment with our own load generating script, it's hard to say how accurate the prediction would be in a real-world scenario. We didn't have the time or resources to create truly accurate forecasts, as it takes considerable effort.
Our load generating script fluctuated day to day and weekly—on weekends we had less traffic, while Monday had high traffic and Tuesday had less, and this pattern would switch each week. However, this was not nearly as accurate as real-time traffic. The typical Mean Absolute Percentage Error (MAPE) we achieved was below 9%, which was quite good, and the lowest we ever got was 0.5%. Unfortunately, I only have screenshots left, which is a shame, as I wanted to have the actual data. I believe our error rate was between 5% and 9%.
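For reference, the Mean Absolute Percentage Error compares each forecast $F_t$ with the observed value $A_t$ over $n$ samples:

$$\text{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$

so an error below 9% means the predicted NRI was, on average, within roughly 9% of the measured value.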
Bart: Every few minutes, the Prophet model gets retrained on new data. How did you determine this retraining frequency, and what happens during the retraining period?
Jorrick: We only had 10 weeks, which included building a front end, back end, doing research, and building the service. We didn't have much time to train Prophet. We rushed through the first few weeks and then trained Prophet during the two-week Christmas vacation, which was not part of the 10-week project.
We created a script that essentially acts like a small DDoS, sending many packets to generate enough of a pattern for training. We retrained the model every 15 minutes, looking back at the previous week's data and comparing it to the current data. Most of the time, it was quite accurate.
We ended up deleting the entire training data from the Christmas vacation because we made a mistake. However, we trained it quite accurately in two or three days, which also depended on our script's index calculation component.
Bart: Let's dig into the Index Calculation Component. Once Prophet produces a prediction, how does the ICC decide whether to scale up, scale down, or keep the cluster as it is?
Jorrick: The ICC was the most complicated component because the process itself is quite complex. When Prophet provides a prediction, it retrieves all running nodes on the cluster to understand its current capabilities. It creates a node ranking index as a single number, such as 3.
The next step calculates the necessary index by using the Prophet prediction and saving the highest number projected in the coming 15 minutes. Using that value, it calculates a new index and compares the current and future indices. Then it uses a recursive calculation to find the best combination of node plans needed to reach the required capacity.
This approach ensures no duplications or unusual combinations are created. It also prioritizes removing nodes with the earliest billing cycle, since virtual machines are billed hourly. After completing this complex process, it calls the Vultr API to add a node with four virtual CPUs and remove another with two virtual CPUs.
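As a rough illustration of that recursive search (not the team's implementation: plan names, index capacities, and prices are made up, and the real ICC also weighs billing cycles when removing nodes):

```python
# Find the cheapest combination of node plans whose combined index covers the
# required capacity. Exponential without memoization, which is fine for a
# handful of plans.
PLANS = [
    {"name": "vc2-1c-2gb", "index": 1.0, "price": 0.018},   # assumed values
    {"name": "vc2-2c-4gb", "index": 2.0, "price": 0.036},
    {"name": "vc2-4c-8gb", "index": 4.0, "price": 0.071},
]

def cheapest_combination(required_index: float, plans=PLANS):
    """Return (total_price, [plan names]) covering the required index."""
    if required_index <= 0:
        return (0.0, [])
    best = None
    for plan in plans:
        rest = cheapest_combination(required_index - plan["index"], plans)
        candidate = (rest[0] + plan["price"], rest[1] + [plan["name"]])
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best

# e.g. cheapest_combination(3.0) picks three 1-vCPU plans at roughly $0.054/hour
```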
Bart: And then during testing, you used a load testing script to simulate realistic server usage. How did you ensure your simulated load represented real-world patterns?
Jorrick: We used virtual users to generate loads. Each virtual user makes a certain number of requests depending on the number of endpoints. The API we created had, for example, five endpoints, and each virtual user sent one request to each endpoint every second.
We had to keep the startup times in mind, because when we started sending load to the client or API services, we immediately fell below the threshold we had set: 95% successful requests within a specific time limit. This happened because we went from zero load to, let's say, 5,000 requests per second, which the CPU just couldn't handle. It failed the requests, and the test ended immediately since it was automated.
To address this, we added a ramp-up time of about five seconds. The duration doesn't really matter as long as it remains consistent during each test. This helped us test the nodes properly because many crashed instantly but worked fine with the ramp-up. It's also unrealistic to go from zero to 10,000 or 15,000 requests per second instantly.
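A simplified load-generator sketch in that spirit, with assumed endpoints, user count, and ramp-up values rather than the team's actual script:

```python
# Virtual users each hit every endpoint roughly once per second, and their
# start times are staggered over a short ramp-up so the target is not hit
# with full load from second zero. All constants are illustrative.
import asyncio
import aiohttp

ENDPOINTS = [f"http://cluster.example/api/endpoint{i}" for i in range(1, 6)]
VIRTUAL_USERS = 200
RAMP_UP_SECONDS = 5
DURATION_SECONDS = 60

async def virtual_user(session: aiohttp.ClientSession, start_delay: float) -> None:
    await asyncio.sleep(start_delay)      # staggered start implements the ramp-up
    for _ in range(DURATION_SECONDS):
        for url in ENDPOINTS:
            try:
                async with session.get(url) as resp:
                    await resp.read()
            except aiohttp.ClientError:
                pass                       # a real test would count failures here
        await asyncio.sleep(1)

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        delays = [i * RAMP_UP_SECONDS / VIRTUAL_USERS for i in range(VIRTUAL_USERS)]
        await asyncio.gather(*(virtual_user(session, d) for d in delays))

asyncio.run(main())
```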
The use case was a school-related project for a client API running on the cluster. We calculated usage patterns, anticipating that during weekends, fewer students would use the platform. We expected the platform to be busiest on Monday and Thursday at Windesheim.
An interesting challenge was load generation. The entire project was run in a classroom with multiple monitors, and one laptop couldn't generate enough load to hit the threshold. We tested eight to twelve different virtual machine hardware combinations. The school Wi-Fi limited the number of packets we could send per device, so we used three or four laptops running the script simultaneously to test our virtual machine.
We had pre-checked this approach with the school and Vultr, ensuring we weren't attempting to disrupt school systems—though it might have been fun.
Bart: Normally, that is what people are trying to do. I'm glad you chose the path less traveled. Looking at your results, you achieved accurate predictions, efficient resource usage, and dynamic scaling. Were there any unexpected findings or challenges you encountered during implementation?
Jorrick: We were all software engineers who didn't know about Cloud Native before starting this project. This made it difficult to understand what Ernst, our product owner, wanted. We had moments where we thought Vultr already offered this system, but it was reactive, not predictive.
We only had 10 weeks to complete everything. We had to learn Kubernetes and build a content management system with a client and API, essentially a small YouTube studio-like system. We created this to have something to test. We were also balancing it with school, working on the project only four days a week.
Fortunately, we had a great team. We all knew each other well, having worked together as teaching assistants, which made the process enjoyable. The additional tools like Helm, Prometheus, and Grafana were completely new to us. We had a tight timeline with stressful moments, but also many good ones.
The best moment was when we figured out the NRI and had a working calculation component. We celebrated with cake or ice cream.
Bart: So for teams looking to implement similar predictive scaling solutions, what advice would you give based on your experience? What would you do differently if you had to start over?
Jorrick: I would select a team full of Kubernetes masters. People who actually know what Kubernetes is capable of because we figured it out in a few weeks, and I think we did quite a good job. It would make the entire process much easier, and you could probably do more in 10 weeks.
Also, do your research. We worked on ours for two to three weeks, which was really a lot for a 10-week project. If we had skipped over some parts of the research or cut some corners, I don't think the scaling service would ever have become as accurate and robust as it did in the end.
To be honest, I can't really say what I would do if I were to do the project over again. As team lead, I was really happy with the entire process we had. The final project was really good. I think we got like an 8.5, which was quite high. We were a little disappointed it wasn't even higher, but it was a good grade, and we're really happy. So I don't think I would do anything different. But that's difficult to say, because now I know how it works.
Bart: Fair enough. This was clearly a team effort with eight developers working together. Can you tell us about the collaboration and give a shout out to your colleagues and mentors who made this project possible?
Jorrick: The project ran during the second period of the Quality in Software Development semester at the HBO ICT Bachelor program at Windesheim in Zwolle, which is a great semester for anyone who wants to do programming. We worked on it with a team of eight developers, including myself. I really want to mention their names because they are awesome. The names are (look them up on LinkedIn): Jasper van Willigen, Martijn Schuurman, Ties Schreve, Teun van der Kleij, Jeroen Terpstra, Bas Plat, and Stefan Lont. All Dutch names, really difficult.
Our project manager, Ernst Bolt, came up with the idea of predictive scaling and oversaw the entire project. He paid for Vultr, and we really had a great time with him. We also had a lot of support from Windesheim. Our school mentor, Sjoerd Brouwers, helped us manage the entire process of building something cool and great.
I really had a great time doing the project. I think the combination of the project and all of the awesome people I worked with was probably the most epic thing I've ever done in coding.
Bart: Very cool. Well, what's next for you? What's your next challenge or project?
Jorrick: The next semester I'm doing an internship at Infotopics in Apeldoorn. After that, I'm graduating and will see where life takes me. I don't know—we'll see.
Bart: If people want to get in touch with you, what is the best way to do so?
Jorrick: You can contact me on LinkedIn. I think there's a link somewhere after the podcast is uploaded. Contact me using my full name. Not a lot of people have my name.
Bart: That's true. You're the first Jorrick we've had on KubeFM. Possibly the only one. It's a unique name. It was great chatting with you and learning from your experience. I look forward to seeing what you're doing next, embracing the challenges, teamwork, and collaborative aspect. It's really nice to see people using Kubernetes for the first time and not dying in the process. Congratulations on all your hard work, and I look forward to speaking to you soon.
Jorrick: Thank you. It was great being here.