The Data Engineer's guide to optimizing Kubernetes

Oct 14, 2025

Host:

  • Bart Farrell

Guest:

  • Niels Claeys

This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

Niels Claeys shares how his team at DataMinded built Conveyor, a data platform processing up to 1.5 million core hours monthly. He explains the specific optimizations they discovered through production experience, from scheduler changes that immediately reduce costs by 10-15% to achieving 97% spot instance usage without reliability issues.

You will learn:

  • Why the default Kubernetes scheduler wastes money on batch workloads and how switching from "least allocated" to "most allocated" scheduling enables faster scale-down and better resource utilization

  • How to achieve 97% spot instance adoption through strategic instance type diversification, region selection, and Spark-specific techniques

  • Node pool design principles that balance Kubernetes overhead with workload efficiency

  • Platform-specific gotchas like AWS cross-AZ data transfer costs that can spike bills unexpectedly

Relevant links
Transcription

Bart: Running batch workloads on Kubernetes isn't just about scale, it's about efficiency. Imagine handling up to 1.5 million core hours a month, but cutting your cloud bill by 70 to 90% using spot instances. In this episode of KubeFM, Niels Claeys, lead engineer at Dataminded, explains how his team built Conveyor, a data product workbench on Kubernetes, and the techniques they use to optimize clusters, from switching scheduling strategies to bin packing to taming daemon set overhead. If you're scaling batch workloads and want to understand the real trade-offs between scheduling strategies, spot instances, and node design, you'll want to hear this.

Special thanks to Testkube for sponsoring today's episode. Need to run tests in air-gapped environments? Testkube works completely offline with your private registries and restricted infrastructure. Whether you're in government, healthcare, or finance, you can orchestrate all your testing tools, performance, API, and browser tests without any external dependencies. Certificate-based auth, private NPM registries, enterprise OAuth—it's all supported. Your compliance requirements are finally met. Learn more at testkube.io.

Now, let's get into the episode with Niels. So Niels, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?

Niels: The first thing that comes to mind is Karpenter support for Azure. We've done some preliminary tests, and I'm really looking forward to it. Our testing has progressed since the 1.0 release about three months ago, and I think now is a good time to get started.

Secondly, there's more general GPU support within Kubernetes. We use Sysbox heavily for running IDEs on top of Kubernetes, and I'm particularly interested in its GPU support.

A third area I've recently become interested in is k6, the load testing framework from Grafana Labs. I want to use it to benchmark and understand how much load we can handle within the components of our Kubernetes cluster.

Bart: So, for people who don't know you, could you tell us a bit more about yourself, where you work, and what you do?

Niels: My name is Niels Claeys, and I work for DataMinded. DataMinded is a data consultancy organization based in Belgium. We primarily work in Belgium and Germany. We're a team of about 60 experts. Within DataMinded, my role is lead engineer. I work on our product Conveyor, which we will probably discuss more later. Additionally, I'm in charge of technical screening for new team members, which is something I really enjoy alongside my technical work.

Bart: Maybe you could talk about how the technical screening works at DataMinded. What are the things you look for? I ask this because we also have a website called Kube Careers for people looking for jobs. It's always interesting to see what different organizations value during the application process.

Niels: That's a very interesting question. I'm mainly looking for people with an analytical mind who can reason about problems, come up with different solutions, and then make trade-offs based on the situation. When presented with three potential pathways, I want to understand why they chose a specific approach.

This analytical mindset ensures that someone will become a good engineer or continue to improve. We can discuss why they selected a particular solution.

Since we're a consultancy organization, I also heavily focus on communication skills. Can they translate technical skills into conversations that business people understand? It's crucial to explain technical issues in a way that highlights their impact on operations. If you simply tell a manager that "the Kubernetes cluster is in trouble," they might not understand the significance or why they should allocate resources to address it.

These are the two key skills I look for during interviews.

Bart: Fantastic. It's something we hear from many engineers: understanding how Kubernetes works under the hood is one thing, but how do you translate that into business value? In an episode I was recording just yesterday, I was explaining it as if telling a five-year-old. For people who struggle with that, do you have any tips or resources that would be useful?

Niels: Not necessarily resources specifically. If you want to know more about communication, one typical resource is Simon Sinek; I like the way he talks, though that's not technical. What you can practice, and what I find very valuable, is to do whiteboard sessions with a few team members. Discuss and explain why you made a certain decision, whether it's an architectural call or a purely technical choice like a programming language. Just reason about it. It helps you think about how you would structure an explanation for other people, and it can provide some helpful alternative perspectives.

Bart: Fantastic. So Niels, tell me more about your journey. How did you get involved in Cloud Native? Dataminded seems to be an interesting part of your background.

Niels: I started 12 years ago as a Java engineer within an organization when the DevOps mindset was not always prevalent. I mainly concerned myself with building applications, Java, typical packaging, and basic stuff. Over time, I wanted to have more control end-to-end and reduce dependencies on the operational team. So I learned more about the CI-CD process, Docker, and Terraform to manage infrastructure on AWS. Gradually, I got into the cloud-native stack because I liked it. It gives me more options to control the end-to-end lifecycle of what I built and ensure it runs and operates well.

Bart: And before you got into the cloud-native world, what did your stack look like back then?

Niels: I was a Java developer. I built applications with what was then a Spring Boot stack. It still exists. I don't know how it has evolved because it's been a long time since I've written Java code. I would consider myself a pure backend developer back then.

Bart: And in terms of the Kubernetes ecosystem, we know that it moves quickly and it can be challenging for people to stay up to date. How do you do it? Is it through blogs, videos, tutorials? What works best for you?

Niels: I read a lot of blogs and follow publications on Medium or Substack from people I find interesting. I have a couple of newsletters that I follow. LinkedIn also works well for me because you can tailor who you follow and get the right content.

As a last tip, sometimes going to a conference can provide a very different perspective. What I often find in my feeds and blogs is content that reinforces ideas I already know well. To get a curveball and something outside my typical space, I like to go to conferences and discover something new that makes me think, "I need to check this out" or "I need to try this."

Bart: And Niels, if you could go back in time and give yourself a piece of career advice, what would it be?

Niels: I think what I now value most is keeping things simple. When I started, I was an engineer who thought he was smarter than others and liked to build very complex systems to show his skills. Now, I truly believe that writing code as simply as possible, or creating an architecture that looks almost too simple, is a better way of demonstrating skill than building a complex piece of code that nobody understands.

Bart: Fair enough. As part of our monthly content discovery, we're always looking at different things out there. We came across your article, "The Data Engineer's Guide to Optimizing Kubernetes". We want to take a look at this in more detail. But before we dive into today's topic, can you tell us about Conveyor and what kind of data platform you built on Kubernetes?

Niels: Conveyor is what we call a data product workbench. The core goal of Conveyor is to make users more efficient in building and scheduling day-to-day pipelines. We take away the complexity of the non-functional aspects of building these pipelines: how they get scheduled, where they run, how to make them cost-effective, how to increase performance, and how to secure them. This allows clients to focus on their core task—having input data sets and performing transformations with Spark or dbt to get the desired output result. We handle the underlying plumbing.

Bart: Could you share some numbers about your workload volumes and what makes batch processing on Kubernetes particularly challenging?

Niels: We have customers running at various scales. For large-scale customers, if you count core hours, they range between 500,000 and 1.5 million core hours. To put this into perspective, that translates to 200 or 300 standard nodes running continuously for a full month. However, since this is batch processing, none of these nodes are actually running continuously. We observe significant fluctuations in the number of nodes running in our cluster throughout the day, depending on when workloads need to be scheduled.

Bart: Just out of curiosity, I ran the Data on Kubernetes Community for three years and heard a lot about folks doing stream processing. Do you have any experience with technologies like Kafka or Redpanda, or the Strimzi operator for Kubernetes?

Niels: We natively integrate well with Spark Structured Streaming, since we already have Spark support in our workbench. We have support for Spark streaming and alpha or beta support for Kafka Streams. However, our customers don't have much demand for Kafka Streams, and we see many difficulties with it: Kafka Streams represents a very different paradigm for our customers. While switching from Spark to Spark streaming happens quite often, the switch to Kafka Streams is not as common. Consequently, we haven't invested much time into it.

Bart: Now, you mentioned that Kubernetes nodes have significant overhead that reduces available resources. What exactly makes up this overhead and how did you approach optimizing it?

Niels: That's a good question we always need to explain to our users. We manage Kubernetes clusters, and every node has a certain available memory, CPUs, and disk. Take a typical standard node with 16 gigabytes of RAM and four vCPUs. A job running on this node doesn't have full resources available because of Kubernetes overhead.

This overhead is composed of two parts. First, there's a node overhead—resources reserved for the operating system and eviction thresholds. Second, there's the daemon set overhead, which Kubernetes administrators can control by selecting which daemon sets run in the cluster. Since these daemon sets run on almost all nodes, you want their resources to be as small as possible to maximize job resources and minimize the number of nodes needed.

For a standard node with 16GB RAM and 4 vCPUs, a job actually has about 14GB of RAM available. CPU is handled differently: the reservations are enforced by throttling rather than as a hard limit. The operating system takes some CPU overhead, but applications can still use up to the full four vCPUs before being throttled.
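
For readers who want to see where that node overhead is declared: on clusters where you control the kubelet, the reservations Niels describes live in the kubelet configuration. A minimal sketch with illustrative values, not Conveyor's actual settings:

```yaml
# KubeletConfiguration sketch: where per-node overhead comes from.
# The numbers are illustrative, not the values discussed in the episode.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:              # reserved for Kubernetes daemons (kubelet, container runtime)
  cpu: 100m
  memory: 1Gi
systemReserved:            # reserved for the operating system
  cpu: 100m
  memory: 500Mi
evictionHard:              # safety margin before the kubelet starts evicting pods
  memory.available: "200Mi"
```

Together with the daemon sets running on each node, this is why roughly 14GB of a 16GB node ends up allocatable for jobs.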

Bart: Now, the default Kubernetes scheduler seems poorly suited for batch workloads. What specific problems did you encounter, and how did you solve them?

Niels: Maybe as a small remark, it's not necessarily that the standard Kubernetes scheduler is poorly suited for batch processing. Kubernetes itself has support for it, but the cloud-supported Kubernetes clusters like EKS, AKS, and GKE have very specific default settings for launching applications. What we see is that these cloud providers only support the least allocated strategy when deciding where to launch new pods on nodes.

The problem is that with a highly fluctuating number of pods and nodes in a cluster, always distributing workloads to the least allocated nodes means spreading them out as much as possible. This prevents quick scale-down and reduces efficiency, ultimately causing customers to pay more.

This is one of the core problems we noticed over the past three to four years, which we have tackled over the last two years by changing the strategy from least allocated to most allocated. Instead of spreading pods out as much as possible, we want to consolidate all pods on a node until it's full, and only then move to the next node.
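
For readers who want to see what that change looks like: the bin-packing behaviour Niels describes corresponds to the MostAllocated scoring strategy of the NodeResourcesFit scheduler plugin. Since managed control planes typically don't expose this setting, it usually means running an additional scheduler. A minimal sketch, with an arbitrary scheduler name:

```yaml
# kube-scheduler profile sketch: score nodes by "most allocated" so pods pack
# onto already-busy nodes and empty nodes can be scaled down sooner.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler   # example name; pods opt in via spec.schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated            # managed defaults use LeastAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Batch pods then select it by setting `schedulerName: bin-packing-scheduler` in their pod spec.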

Bart: We noticed that your customers achieve 97% Spot Instance usage, which is quite remarkable given many organizations avoid Spot entirely. What makes Spot Instances so compelling for batch workloads despite their unreliability?

Niels: The main driver here is cost. As I mentioned earlier, our customers run many jobs. The cost reduction of a spot instance is between 70% and 90% compared to an on-demand one. So for almost all workloads, customers prefer spot over on-demand. There is only a small portion of workloads that they actually want to run on-demand because reliability is crucial. And of course, within our platform, we also work hard to reduce the impact of spot interruptions and reliability problems.

Bart: It seems like managing spot interruptions is critical to your success. What specific techniques do you use to minimize both the likelihood and impact of spot terminations?

Niels: The first step is to minimize the likelihood of a spot interruption happening. There are several factors that influence the probability of a spot interruption. If you work on AWS, they have a Spot Placement API where you can request information about the likelihood of a spot interruption for specific resources. You can add constraints to decide, for example, in which availability zone to run a certain job and request resources in the zone least likely to be spot-interrupted.

We've observed that shorter jobs are less likely to be interrupted. On AWS, the number of EC2 instance types you support is crucial. If you only have a specific instance type (like the general purpose M7XL in AMD or Intel version), you might run into issues. However, if you can support multiple types with similar resource requirements, this can significantly reduce the likelihood of a spot interruption.

The region also has a huge impact. Most of our customers in Europe find that data centers in Ireland (EU-West-1) are the most mature and have the most available capacity. In contrast, regions like Paris have fewer supported instance types and more frequent spot interruptions. Being flexible about region selection can be beneficial.

The second aspect we focus on is minimizing the impact when a spot interruption occurs. For Apache Spark workloads, this is relatively manageable. If you have one driver and between one and 50 executors, a spot interruption of an executor isn't a major problem. Spark can handle this by launching a new executor and potentially redoing part of the operation.

We can run the driver on-demand and executors on spot instances to ensure the Spark job will succeed. Since Spark 3.2, the decommissioning support helps further by transferring shuffle data to another running node, preventing the need to redo entire calculations.
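
As a rough illustration of that split, here is how it might look with the Spark operator and Karpenter's capacity-type labels. This is a sketch under those assumptions, not Conveyor's actual configuration; images, paths, and sizes are placeholders:

```yaml
# SparkApplication sketch: driver pinned to on-demand capacity, executors on spot,
# with Spark 3.2+ decommissioning enabled to migrate shuffle data on interruption.
# Illustrative only; images, paths, and sizes are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-batch-job
spec:
  type: Scala
  mode: cluster
  sparkVersion: "3.5.0"
  image: example.registry/spark:3.5.0           # hypothetical image
  mainClass: com.example.Job                    # hypothetical entry point
  mainApplicationFile: local:///opt/app/job.jar # hypothetical artifact
  sparkConf:
    spark.decommission.enabled: "true"
    spark.storage.decommission.enabled: "true"
    spark.storage.decommission.shuffleBlocks.enabled: "true"
  driver:
    cores: 1
    memory: 4g
    nodeSelector:
      karpenter.sh/capacity-type: on-demand     # keep the driver off spot
  executor:
    instances: 20
    cores: 2
    memory: 8g
    nodeSelector:
      karpenter.sh/capacity-type: spot          # executors tolerate interruption
```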

Bart: Batch workloads typically have dramatic usage patterns with peaks and valleys. How do you handle scaling when there's a 10x difference between minimum and peak utilization?

Niels: So that's a problem we have been tackling for a while. If we look at a typical customer, they might have a baseline of 50 nodes during daytime because there's not much running, but at night that goes up to 400 or 500 nodes for the same customer. What you want to do is quickly scale up and be able to quickly scale down when certain resources are not needed.

For the components, the open source options are Karpenter and the Cluster Autoscaler. In our product, Karpenter is the default on AWS, and on Azure, we're still using the Cluster Autoscaler. I would like to test Karpenter on Azure to see if we can migrate, because the Cluster Autoscaler is much slower in scaling up and down compared to Karpenter.

Karpenter has a very nice feature that provides more flexibility in deciding node resources and supported instance types. You can give a broad configuration for node creation, whereas with the Cluster Autoscaler and node pools, it's much more tedious to define the available options.

Bart: Karpenter definitely seems to be a game-changer for AWS users. What specific advantages does it offer over the traditional Cluster Autoscaler?

Niels: Karpenter is way faster at scaling up for several reasons. Where the Cluster Autoscaler depends on autoscaling groups, with their inherent delays in creating nodes, Karpenter uses a different API to launch EC2 instances immediately.

It also supports flexible configuration, including a price-capacity-optimized allocation strategy that is the default in Karpenter. For those unfamiliar: when you request a given amount of resources from AWS, it examines all available instance types that can meet your needs and, from those options, selects the instance with the lowest likelihood of a spot interruption and the best price. This combination is particularly useful when running batch workloads, and it's a handy feature that doesn't exist in the Cluster Autoscaler.
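
To make the earlier point about instance-type diversification concrete, a Karpenter NodePool can express it with a few broad requirements. An illustrative sketch, not the exact pools behind Conveyor:

```yaml
# Karpenter NodePool sketch: broad, spot-friendly requirements so Karpenter can
# choose among many instance types (it uses price-capacity-optimized allocation
# for spot by default). Instance families and the EC2NodeClass name are examples.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6i", "m6a", "m7i", "m7a", "r6i"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["4", "8", "16"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # release empty or underused nodes quickly
```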

Bart: Cross-availability zone data transfer can be surprisingly expensive. How did you discover this issue and what was your solution?

Niels: This is indeed one of those gotchas that you only notice when running in production. A customer comes and says, "We have a problem. We see that our data transfer costs, instead of the regular €1,000, have spiked up to €15,000. What happened here?"

Basically, this boiled down to our Apache Spark deployment. When running Spark with a driver and up to 50 or 100 executors, if you let them launch freely across different availability zones, Spark does a lot of shuffling between its steps. This might result in shuffling significant data across AWS availability zones, which incurs data transfer costs. On AWS, you need to pay for cross-AZ data transfer, whereas on Azure, that's not the case.

The solution is fairly simple: If we run in three different availability zones, we now decide which availability zone a job will launch in. We pin the availability zone so that both the driver and all executors run in the same zone. Our Kubernetes operator handles this before launching the actual Spark application.
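
The underlying mechanism is simply a zone constraint on every pod of the job. A minimal sketch of the idea, with an example zone; in practice the same selector is injected into both the driver and executor pod templates:

```yaml
# Pin a Spark pod to one availability zone so shuffle traffic stays within it.
# The zone value is an example; it is chosen per job at launch time.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example
spec:
  nodeSelector:
    topology.kubernetes.io/zone: eu-west-1a   # same selector on driver and all executors
  containers:
    - name: spark
      image: example.registry/spark:3.5.0     # hypothetical image
```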

Bart: You mentioned that AKS and EKS have different characteristics and limitations. What are the key platform-specific considerations that teams should know about?

Niels: That's always a tricky problem because we want to support both platforms, but they evolve at different speeds and don't give the same freedom to sysadmins to manage clusters. For example, on Azure, AKS gets a lot of development, but they add many features as preview features, and it takes a long time before they become generally available or are supported in the Terraform provider. There are many moving parts, but the vision is not always clear. Dealing with these preview features—whether we want to adopt them—is something we often struggle with.

Additionally, the fact that you cannot customize the image for AKS nodes on Azure is challenging because we want to bake in tooling that is immediately available instead of being pulled when the node starts. It's all about startup time. On AWS, you can create a custom AMI and package everything together. We build it once and can immediately use it. But on Azure, this is not supported.

When building an ecosystem around your cluster and using open source components, not all deal equally well with both AWS and Azure. I mentioned Karpenter—now it's available on Azure, but before it was not. You need to support basically two autoscalers, which splits your expertise.

On the other hand, Azure has advantages like not charging for cross-availability zone transfer, which AWS does. It's not that one cloud is better than the other. But if you are experienced, you know which Kubernetes configurations to adjust.

Most people would prefer—and I speak for myself—EKS over AKS because AKS is more shielded and provides less freedom to configure resource reservations or kubelet resources. On AKS, they prohibit configurations much more strictly. EKS allows more freedom to change and test configurations, which I prefer.

Bart: Self-managing Kubernetes components seems to be a recurring theme. When should teams consider self-managing versus using cloud provider add-ons?

Niels: I think you should always start with cloud provider add-ons because they get you up and running quickly and provide something to test against. The only reason to switch to self-managed is if you encounter problems or hiccups with cloud-managed add-ons.

These hiccups often arise because, while these are open-source components with many configuration options, the interface exposed by EKS or AKS is much more limited. If you hit an issue where you want to configure something differently and it's not supported by the cloud provider, that's often when we decide to manage it ourselves to have full configuration freedom.

An example is the Cluster Autoscaler on Azure. We couldn't define startup taints, which we use heavily with Cilium: Cilium needs to be installed on a node before anything else launches on it. The Cluster Autoscaler couldn't handle these startup taints, causing issues with scaling up nodes. Since we needed this specific setting, we decided to manage it ourselves.
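
For context, the same pattern is what Karpenter exposes as startupTaints: the node carries Cilium's documented agent-not-ready taint until the agent is up, and the autoscaler has to understand that the taint is temporary. A sketch of the idea, not their self-managed Azure Cluster Autoscaler setup:

```yaml
# Karpenter NodePool fragment: declare Cilium's startup taint so regular pods are
# kept off new nodes until the Cilium agent is ready. Sketch of the pattern only.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      startupTaints:
        - key: node.cilium.io/agent-not-ready
          value: "true"
          effect: NoExecute
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```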

Another example is the OMS agent for Kubernetes logging and metrics. The default component required resources that were too heavy, and we couldn't tweak it to our needs. So we picked an open-source alternative, customized it, and deployed it on Kubernetes.

The approach is to first use the cloud provider's add-on, then assess if you're running into issues. If so, you can decide that self-management might be easier. The downside is increased operational burden, so it's always a trade-off. You shouldn't blindly manage everything yourself, as it can significantly increase operational overhead.

Bart: Node pool design seems crucial for efficiency. How do you determine the right mix of node sizes and types?

Niels: That's a very good question and something I get asked a lot. The short and not helpful answer is: it depends. It depends heavily on your workload.

To provide more guidance, most people using Kubernetes clusters end up with multiple types of node pools. We have a default node pool with 16 gigabytes of RAM and 4 vCPUs. This is the default where we launch jobs with no specific large resource requirements—essentially the heavy workhorse with the most nodes.

Our customers can decide the amount of resources for every job using a t-shirt sizing approach: micro, mini, medium, large, xlarge, 2xlarge, 4xlarge. When jobs require resources beyond an xlarge, we configure different node pools to provide dedicated nodes for these large workloads.

The reason for this configuration is balance. If nodes are too small—lower than 16 gigabytes of memory and 4 vCPUs—the Kubernetes overhead becomes too high relative to the workload. Remember, nodes have overhead from daemon sets and OS reservations.

Conversely, if you use very large nodes—say 256 gigabytes of RAM and 100 CPUs—you can run many jobs, but in batch workloads, you cannot easily terminate a specific job. Once a job runs on that node, the entire node must continue running, potentially wasting significant resources.

That's why we take a mixed approach: a standard small node pool with the ability to escalate to dedicated large nodes when needed.
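
The t-shirt sizes ultimately translate into pod resource requests. A hypothetical mapping for illustration only; the size names come from the episode, but the numbers are made up:

```yaml
# Hypothetical t-shirt-size mapping; values are illustrative, not Conveyor's.
sizes:
  micro:   { cpu: "0.25", memory: 1Gi }
  mini:    { cpu: "0.5",  memory: 2Gi }
  medium:  { cpu: "1",    memory: 4Gi }
  large:   { cpu: "2",    memory: 8Gi }
  xlarge:  { cpu: "4",    memory: 14Gi }   # roughly fills one default 16Gi/4vCPU node after overhead
  2xlarge: { cpu: "8",    memory: 28Gi }   # scheduled onto the dedicated large-node pool
```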

Bart: Daemon sets can significantly impact node efficiency. What strategies do you use to minimize their overhead while maintaining necessary functionality?

Niels: It's crucial for the administrator of the Kubernetes cluster to think about which functionality they really want to provide by default on all nodes or on specific node pools. Of course, you can have different node pools for different workload types, and use taints and tolerations to control which daemon sets run on specific nodes.

The key is to decide which daemon sets are actually necessary because they consume resources that could be used by running jobs. We need to carefully consider the functionality that is required and cannot be dispensed with—typical things like logs, metrics, and node termination handlers.

The first aspect is deciding which components to run on a node. The second is critically examining the resources each component requires. As an illustration, we started running Kubernetes clusters five years ago and used Fluentd as a log aggregator to move logs from pods to CloudWatch or Log Analytics Workspace because it was easy to configure and the default.

The problem was that, being written in Ruby, Fluentd was quite inefficient resource-wise and required significant memory. That's why we switched to Fluentbit, which is far more efficient. While its functionality is somewhat more limited, it requires far fewer resources, allowing our customers to have more resources available for their jobs—providing more value for the same price.

As a system administrator, look critically at which daemon sets you have and whether you truly need them for all use cases. Then, carefully examine the resources these daemon sets require and optimize them where possible.
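
As an example of what trimming daemon set resources can look like, a lightweight log shipper might reserve something on this order per node. The numbers and image tag are illustrative, not their production values:

```yaml
# DaemonSet sketch: a Fluent Bit log shipper with small, explicit resource
# requests so per-node overhead stays low. Values are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1   # example tag
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
```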

Bart: For teams just starting with batch workloads on Kubernetes, what would be your recommended optimization priority order?

Niels: I come from a development background, and I really like the idea that Uncle Bob always uses: first make it work, then make it right, and then make it fast. I think this is the same if you want to optimize the efficiency of your Kubernetes cluster. First, make sure that you have something running. If you have something running, measure its performance and be able to detect where certain bottlenecks could be. Based on that, start optimizing.

If you start optimizing, it will probably depend on the organization, but key aspects to look for are: first, try to optimize for cost, with the major cost improvement being spot instances instead of running regular instances. The second optimization is looking at scheduler changes, seeing that instead of spreading workloads, you want to bin-pack them as closely together as possible.

These are two very tangible ways of optimizing your cluster. However, this is not a finished product—a Kubernetes cluster is an evolving, continuous process where we always look at where we can save money or improve performance by testing and seeing what works well.

Bart: Looking at all these optimizations, what were the most impactful changes, and what lessons would you share with others embarking on this journey?

Niels: Using spot instances is a huge game-changer. We see that when customers adopt our tool, they begin using spot instances much more. Customers coming from other tools are amazed by the amount of savings without a reduction in reliability—or with almost no reduction.

The bin packing feature is particularly cool because it's something users don't see at all. As a platform team, you can implement this and immediately give customers a 10 to 15 percent saving without any impact on their end. If your cloud bill is 20 to 50K, and you can reduce your EKS cluster bill by 10% just by switching this, customers are super happy.

Looking at optimizations, I'm always fearful of premature optimization—thinking we need to do something because it will perform better, when in the end, it turns out to be unnecessary or even counterproductive. Be careful not to optimize too early.

One way to avoid premature optimization is to measure first: assess the performance or delay, then look at all the components involved and identify where you're losing the most time. Find your bottleneck and optimize only that. If you have a request that takes 10 seconds, trying to optimize a five-millisecond part won't make the biggest difference. Always look for the major time-saving opportunities and focus on where you lose the most time—that will have the greatest impact.

Bart: Niels, what's next for you?

Niels: What's next? That's a good question. I love what I do. I love looking at new technologies and incorporating new tools. One concrete thing I'm really looking forward to is speaking at the European Big Data Conference in a couple of months. It will be my first talk to a big audience, and I'm excited about the discussions I'll have there.

Bart: And for folks that might be there, what day and place is this going to be?

Niels: I'm going to be at the Big Data Europe in Vilnius from the 18th to 21st of November. My talk will not be specifically on Kubernetes. It will focus more on using DuckDB instead of Databricks for batch data pipelines, emphasizing the data aspect rather than the Kubernetes aspect.

Bart: Niels, this is interesting and best of luck to you with your talk. And if people want to get in touch with you, what's the best way to do that?

Niels: The best way to get in touch with me is through LinkedIn, where I'm very active. I also write a lot on Medium and Substack, so you can reach out to me on those platforms as well.

Bart: Fantastic, Niels. It was great talking to you today. I look forward to hearing about how things went at the conference in November. I'm sure we'll be seeing you on stage in many parts of the world. Keep up the great work. I hope our paths cross soon. Take care.

Niels: Thank you for having me. Goodbye.

Bart: Goodbye.