Scaling CI horizontally with Buildkite, Kubernetes, and multiple pipelines

Host:

  • Bart Farrell

Guest:

  • Ben Poland

This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io

Ben Poland walks through Faire's complete CI transformation, from a single Jenkins instance struggling with thousands of lines of Groovy to a distributed Buildkite system running across multiple Kubernetes clusters.

He details the technical challenges of running CI workloads at scale, including API rate limiting, etcd pressure points, and the trade-offs of splitting monolithic pipelines into service-scoped ones.

You will learn:

  • How to architect CI systems that match team ownership and eliminate shared failure points across services

  • Kubernetes scaling patterns for CI workloads, including multi-cluster strategies, predictive node provisioning, and handling API throttling

  • Performance optimization techniques like Git mirroring, node-level caching, and spot instance management for variable CI demands

  • Migration strategies and lessons learned from moving away from monolithic CI, including proof-of-concept approaches and avoiding the sunk cost fallacy

Transcription

Bart: If you've ever pushed a large engineering org's CI/CD system to the breaking point, you'll know the pain. Massive pipelines, controller bottlenecks, and jobs that seem to fight your Kubernetes cluster instead of finishing builds. Today's guest on KubeFM is Ben Poland, Senior Staff Platform Engineer at Faire, and he's lived that problem firsthand. At Faire, their original Jenkins monolith, with its thousands of lines of Groovy, simply couldn't keep up. Ben and his team rebuilt the system around Buildkite on Kubernetes, and in the process uncovered the trade-offs of running CI workloads on Kubernetes at scale.

In this episode, we'll get into how they split a monolithic pipeline into service-scoped pipelines that actually match team ownership, why just running it in one cluster doesn't work once you hit scale, and how they built a multi-cluster CI model. We'll explore the gnarly details of job-to-pod mapping, throttling, and the etcd pressure points you only see once you're deep into it, and tricks like predictive node provisioning and node-level caches that shave minutes off builds. And at the end of the day, what this has all delivered: faster PR cycles, more reliable pipelines, and happier developers.

If you're an engineer wrestling with CI on Kubernetes or curious about how to design a system that scales beyond the basics, this episode is packed with lessons you can take back to your own team. Special thanks to Testkube for sponsoring today's episode. Need to run tests in air-gapped environments? Testkube works completely offline with your private registries and restricted infrastructure. Whether you're in government, healthcare, or finance, you can orchestrate all your testing tools, performance, API, and browser tests without any external dependencies. Certificate-based auth, private NPM registries, enterprise OAuth—it's all supported. Your compliance requirements are finally met. Learn more at testkube.io.

Now, let's get into the episode with Ben. Welcome to KubeFM, Ben. What are three emerging Kubernetes tools that you're keeping an eye on?

Ben: This is an interesting discussion. I'm not super up to date with all the latest stuff, so I'll give a spicy take on a Kubernetes tool. Helm is something folks commonly use to package Kubernetes applications, but I'm actually not a huge fan.

The main issue I have with Helm is its YOLO nature. Sure, you can dump out the manifests it's about to apply to your cluster, but you don't necessarily know what changes that will result in. I've personally experienced outages caused by this: we applied a Helm chart on day one, then by day three, when trying to apply a Helm change, what we thought would happen wasn't what actually occurred, and that caused trouble.

In contrast, tools like Terraform or OpenTofu show you a diff of what's currently in the cluster and what you're about to do. While there is a Helm diff plugin, it only shows a diff between Helm's last application and the next, without truly understanding what exists in the cluster. For a Kubernetes native tool, it would be nice if Helm had more knowledge about the cluster and could transparently show that information.
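
To make the difference concrete, here is a minimal sketch of approximating the cluster-aware diff Ben describes by rendering a chart and handing the result to `kubectl diff`. The release name, chart path, and namespace are hypothetical, and this ignores chart values and hooks for brevity.

```python
# Sketch: approximate a cluster-aware diff for a Helm release by rendering the
# chart's manifests and asking the API server to diff them against the live
# objects. Release, chart, and namespace are hypothetical placeholders.
import subprocess

def helm_cluster_diff(release: str, chart: str, namespace: str = "default") -> int:
    # Render the manifests Helm would apply (this step does not consult the cluster).
    rendered = subprocess.run(
        ["helm", "template", release, chart, "--namespace", namespace],
        check=True, capture_output=True, text=True,
    ).stdout
    # `kubectl diff` compares the rendered manifests with what is actually running.
    result = subprocess.run(
        ["kubectl", "diff", "-n", namespace, "-f", "-"],
        input=rendered, text=True,
    )
    # Exit code 0 means no changes, 1 means differences were found, >1 is an error.
    return result.returncode

if __name__ == "__main__":
    print(helm_cluster_diff("my-release", "./charts/my-app"))
```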

Bart: Good, spicy start. We like that. We've had podcast guests lament some of Helm's shortcomings too, some even arguing the design was fundamentally flawed. So I think you'll find that you're not alone there. But it'll be interesting to see how folks in the community react to that. Ben, for people who don't know you, can you tell us quickly about what you do and where you work?

Ben: I am a senior staff platform engineer at Faire, F-A-I-R-E. We're a wholesale marketplace connecting brands and retailers. If you think about your local gift shop or other independent retail store, the question is: Where are they finding the products that they stock? The answer is Faire. We're a storefront where retailers can search for products they want to stock, buy whole caseloads, and have them shipped to their store to sell to end users.

I've been at Faire for about five and a half years—almost six years, which I'll reach in January. I've seen the entire platform team grow from a handful of people to now having dozens of platform engineers. My day job is currently a tech lead on our developer productivity team, focusing on CI, CD, and internal tools.

Bart: How did you get into Cloud Native?

Ben: My career started at the University of Waterloo in Canada. I worked at BlackBerry, whose headquarters is in Waterloo. After university, I was at BlackBerry for eight and a half years. When I started, there was essentially no cloud—just vSphere for those familiar with it. We were more focused on building and shipping software that customers would install on their machines.

I was in the enterprise software group, specifically the device management side of the business. Over time, customers began requesting that we host the software and charge a monthly fee, which was a less common model then, especially for larger businesses. We started exploring how to reuse our existing software by hosting it ourselves.

That's how I got into cloud computing. We initially looked at AWS and ultimately settled on Azure. It was a whole new world that seemed super interesting and quite new at the time. I transitioned into the DevOps side, focusing on how to deploy, manage, and update our solution. This was before Kubernetes—just cloud and virtual infrastructure.

Bart: And what about Kubernetes? When did you first get involved with that?

Ben: They were using Kubernetes, and that was another real aha moment for me. I loved the building block, extensible nature. I played with lots of Lego when I was a kid, and it felt similar—like assembling different pieces together and being able to reuse those pieces for different uses. I really latched onto it; it just clicked and made sense in my brain. Kubernetes is something that I've definitely made sure to pursue in subsequent jobs because it's a lot of fun.

Bart: In the Kubernetes and cloud-native ecosystem, which moves very quickly, what resources work best for you to stay up to date?

Ben: I'm not necessarily best at keeping up with the latest technologies. When I encounter a problem or hit a challenge, I often look for solutions at that point, ideally finding something relatively polished. One of the great things about Kubernetes is the ability to find and integrate cool solutions. At Faire, we've found plenty of open source and paid solutions that help us. Kubernetes is great that way.

Bart: If you could go back in time and share one career tip with your younger self, what would it be?

Ben: I was at BlackBerry for eight and a half years. There were times when I felt a little stagnant, looking back and realizing I was plateauing. If I could go back, I would say: don't settle or sit around waiting for things to happen in your career. If you find something you're interested in and passionate about, it's okay to shape your career at your current company or look elsewhere.

I was in a DevOps-related role and ended up in a unique spot doing something different from everyone else on my team, simply because that's what the company needed. During performance reviews, I always felt like an outsider who didn't quite fit where I was supposed to be. Now at Faire, I definitely don't feel that way—I'm doing exactly what I'm supposed to be doing. Looking back, I wish I had pushed myself more to find the right place and team.

Bart: As part of our monthly content discovery, we found an article you wrote titled "Scaling Faire's CI horizontally with Buildkite, Kubernetes, and multiple pipelines". Faire operates a wholesale marketplace connecting retailers with brands. Can you tell us about their infrastructure setup, particularly the Kotlin monorepo, and what CI challenges emerged as your platform scaled?

Ben: We're a wholesale marketplace connecting retailers and brands. Looking back five to seven years ago, we started with a monolithic application backend, which is very common. As we scaled and added more code, we needed to split out more services from that monolith.

Today, we still have a monolith, which is also common. But we also have several dozen other services, such as search, carts, and product catalog. As a developer productivity engineer, I'm pleased that these services are architected using a common framework and language. This makes my job much easier, especially compared to organizations using multiple programming languages like Kotlin and PHP.

From a CI perspective, it's straightforward to build and test these services similarly. While the monolith has some legacy special cases, the other services are consistent. However, as we added more services, our existing pipelines and processes began to break down. We noticed negative trends in time to merge, time to deploy, and developer sentiment, which ultimately led us to pursue significant changes.

Bart: You mentioned starting with a single Jenkins instance that eventually grew to three. What were the specific limitations you hit with Jenkins as you scaled?

Ben: In 2020, we had a single Jenkins controller. By October 2021, we had hit its limits faster than anticipated. We knew the challenges were coming but weren't fully prepared. Fortunately, we had a plan to deploy more Jenkins controllers, which we had to accelerate quickly because developers couldn't push code for a day or two.

When CI goes down, the longer it remains offline, the more code developers accumulate. When it comes back up, everyone starts pushing code and re-triggering builds, creating a thundering herd issue. We deployed a second Jenkins controller, then a third, and by late last year and early this year, we had over 20 Jenkins controllers to handle our CI load.

We discovered the Jenkins Kubernetes plugin struggled when reaching 1500-2000 pods. Adding more CPU and memory didn't help; it would bog down. There seemed to be a progressive slowdown with a critical tipping point around that pod count where the system could no longer function effectively.

Managing 20 Jenkins controllers was challenging, especially since Jenkins is not highly available. Whenever we needed security updates, which are frequent, we had to take controllers down, resulting in after-hours work for my team.

Bart: And beyond the infrastructure issues, you had what you called a monster monolithic pipeline. (Sounds like that should be a t-shirt or a sticker.) What made this single pipeline approach so problematic for your developers?

Ben: We had about four or five thousand lines of Groovy. That's hard for anyone to understand. We also had Groovy pipeline libraries being pulled in. It wasn't a great experience for anyone who had to read or make changes to it. As a result, only a few people on my team would make changes. It wasn't self-serve. People would come to our team and say, "I want to do this. Please help me because I have no idea what I'm doing."

For a platform team, you want to empower anyone to make the changes they need, rather than centralizing everything. The development experience was challenging: you make changes, push them up, wait for minutes, encounter a weird error, try to decipher the stack trace, then make another change and push again. It's not a fun pipeline development experience.

From a usage perspective, this pipeline had numerous parallel steps. It was difficult for developers to understand what was happening. When something failed, they couldn't easily determine what went wrong or how to fix it. Developers would often blindly retry without understanding the issue.

As we added more services, a build failure on the main branch would block deployments for every service, since we required CI to pass before deploying code. Once again, the system wasn't scaling. We knew we needed to do something different.

Bart: And to take this a little further, we're talking about naming and references. Kubernetes grew out of Google's internal system Borg, a Star Trek reference, and you kicked off Project Farpoint to find a Jenkins replacement. It's nice to see that Star Trek reference. Can you walk us through your evaluation process and the key requirements you identified?

Ben: We wanted to pilot the next generation of CI at Faire. "Encounter at Farpoint" is the first pilot episode of Star Trek: The Next Generation, and that's where the name came from.

We knew we were having issues and needed to find a solution. I developed a comprehensive list of requirements. Kubernetes support was a significant part of that, along with security, task execution, and supporting Mac instances. We have iOS builds that need to be created.

I focused on ensuring a good user experience, developer experience for updating pipelines, observability, and security. I gathered input from other Faire team members to create a robust set of evaluation criteria for different solutions.

When we started exploring options, we considered building versus buying. We quickly realized that CI is a common need for many organizations, so the likelihood of finding a good off-the-shelf solution was high.

After a detailed proof of concept, we settled on Buildkite. I wrote a comparative document evaluating Buildkite against other options using our established criteria. Buildkite emerged as the clear winner.

Bart: To take that a little further, you brought up how Jenkins, with its Kubernetes plugin, and Buildkite's agent-stack-k8s differ in their Kubernetes integration.

Ben: Buildkite is not necessarily Kubernetes native, as they support other ways of running CI. However, the Buildkite team was very open to feedback and iterated based on our input.

With Jenkins, the controller serves the UI, parses the pipelines, and distributes work while also handling creation and management of pods through the Kubernetes plugin. In contrast, the Buildkite architecture, with them hosting the control plane, allows them to scale as needed. We run the agent stack Kubernetes component in our own Kubernetes infrastructure.

The combination of Buildkite and our "poly-CI" approach—where each service has its own pipeline—allowed us to horizontally scale Kubernetes pod management. Compared to Jenkins, which has a 1500 to 2000 pod limitation per controller, our Buildkite and poly-CI architecture means each pod management controller handles a smaller amount of work. This enables us to parallelize and horizontally scale that component much more effectively, essentially allowing us to scale almost infinitely without encountering the issues we experienced with Jenkins.

Bart: Your trigger pipeline architecture determines affected services and orchestrates child pipelines. How does this poly-CI approach impact your Kubernetes resource usage compared to the monolithic pipeline?

Ben: With the poly-CI approach, instead of one huge build pod compiling the entire repo, we have multiple smaller build pods compiling each individual service. Rather than having a massive fleet of test executors running tests for the entire repo, we're able to scale more horizontally and focus on each service individually.

We ended up with many more pods in our clusters compared to Jenkins, but these pods are quite a bit smaller. This did have some impact, and we ran into scaling issues that we'll discuss later. We rolled out the migration carefully, starting with a detailed proof of concept with Buildkite focused on a single service. Some scaling challenges and other issues weren't apparent until we rolled it out more widely.

We've ended up using a similar amount of CPU and memory, perhaps a little more due to overhead. The sheer number of pods has increased, which did end up causing some challenges we had to adjust for.
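
As a rough illustration of the trigger-pipeline idea, here is a minimal sketch of a dynamic step generator that maps changed paths to service-scoped pipelines and emits Buildkite trigger steps. The directory layout and pipeline slugs are hypothetical; in a real setup the JSON output would be piped to `buildkite-agent pipeline upload`, and the commit and branch would come from the triggering build's environment.

```python
# Sketch of a "trigger pipeline" step: work out which services a change touches
# and emit one Buildkite trigger step per affected service-scoped pipeline.
# Path prefixes and pipeline slugs are hypothetical.
import json
import subprocess

SERVICE_PIPELINES = {
    "services/search/": "search-ci",
    "services/carts/": "carts-ci",
    "services/catalog/": "catalog-ci",
    "monolith/": "monolith-ci",
}

def changed_files(base: str = "origin/main") -> list[str]:
    # Files changed on this branch relative to the mainline.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()

def affected_pipelines(files: list[str]) -> set[str]:
    return {
        slug
        for prefix, slug in SERVICE_PIPELINES.items()
        if any(f.startswith(prefix) for f in files)
    }

if __name__ == "__main__":
    steps = [
        {"trigger": slug, "build": {"commit": "HEAD", "branch": "main"}}
        for slug in sorted(affected_pipelines(changed_files()))
    ]
    # Usage: python trigger.py | buildkite-agent pipeline upload
    print(json.dumps({"steps": steps}))
```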

Bart: You hit some interesting Kubernetes scaling limits as you rolled out more widely. What were the specific issues you encountered with API rate limiting and the etcd database?

Ben: We have more pods that are smaller. We ended up hitting API issues. To get into the weeds of how the Buildkite agent stack in Kubernetes works: when a job or CI step gets created in the Buildkite API control plane, the agent stack Kubernetes controller running in our infrastructure sees that work needs to be done, pulls it down, and creates a Kubernetes job to run that particular step.

We found that portion was fine. We had hundreds, thousands of these Kubernetes jobs getting created. But internally, the Kubernetes job controller looks at those jobs and says it needs to run a pod to execute that job. That part was actually getting throttled internally by Kubernetes. We would end up with jobs sitting there for one, two, five minutes without the corresponding pod starting. It was surprising that Kubernetes would throttle itself, but it makes sense when you think about it.

We had to make adjustments to work around that. It's still something we hit sometimes during peak load, though it's not necessarily fatal. The Buildkite jobs just end up waiting longer than needed.
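
To make the symptom concrete, here is a minimal sketch (using the Kubernetes Python client) that flags Jobs which have sat for a while without any pod being started for them, which is what the throttling looks like from the outside. The namespace and threshold are hypothetical.

```python
# Sketch: list Kubernetes Jobs that exist but have no active, succeeded, or
# failed pods after a grace period, i.e. Jobs waiting on the job controller.
# Namespace and threshold are hypothetical.
from datetime import datetime, timezone
from kubernetes import client, config

def stalled_jobs(namespace: str = "buildkite", threshold_s: int = 120):
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    batch = client.BatchV1Api()
    now = datetime.now(timezone.utc)
    stalled = []
    for job in batch.list_namespaced_job(namespace).items:
        age = (now - job.metadata.creation_timestamp).total_seconds()
        # No pods have been started for this Job yet.
        if not (job.status.active or job.status.succeeded or job.status.failed):
            if age > threshold_s:
                stalled.append((job.metadata.name, int(age)))
    return stalled

if __name__ == "__main__":
    for name, age in stalled_jobs():
        print(f"{name}: waiting {age}s with no pod")
```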

Regarding the etcd space issue, it resulted from the sheer number of concurrent pods. When using a managed Kubernetes offering, you're never 100% sure. We saw a message about etcd space being exceeded, which I assume was because we had too many pods running simultaneously, though they never confirmed it definitively.

To solve both issues, we've ended up splitting our load across several Kubernetes clusters. In a perfect world, I would love to run everything in one cluster, but it's not the end of the world. It's also nice for redundancy and resiliency to have a couple of clusters we can switch between.

Bart: Spreading the load across multiple Kubernetes clusters became necessary. How do you manage CI workloads across multiple clusters, and what operational complexity does this add? And since you use a managed cloud Kubernetes offering, what specific limitations did you discover when pushing it to its limits with CI workloads?

Ben: We were talking about etcd issues and API throttling, which we often have to troubleshoot and discover on our own. When we encountered an issue like that urgent etcd one, we were able to quickly migrate workloads to a different Kubernetes cluster. However, we wanted to know how close our other clusters were to these internal limits, since we don't have the metrics or data ourselves.

CI is not as demanding a workload in terms of uptime compared to production. If we experience some issues, our developers can handle a half-hour disruption. In contrast, half an hour of downtime on a production site is unacceptable. Our CI workload pushes the cluster differently, with many small pods constantly coming up and down while running builds.

Regarding preemptible or spot instances in AWS, we use these for CI workloads because we are somewhat fault-tolerant and can perform retries. When spot instances have issues, we consider moving back to on-demand non-spot instances, especially if the underlying data center is running out of capacity. We're more flexible in interpreting cloud provider data center conditions and can adapt machine types to improve our chances of success—a luxury we don't have with production workloads that require specific machine types.

Bart: Looking at node scaling patterns, you mentioned scaling down to zero overnight. How do you optimize Kubernetes autoscaling for the variable demands of CI workloads?

Ben: Our usage is very biased towards the day in North America. We end up getting down to pretty much zero Kubernetes nodes overnight. Unless it's a holiday, during every weekday around 9-10 a.m., developers have their cup of coffee, have written some code, and pushed it up. They're going to be kicking off CI jobs.

Every day we can see a huge spike in usage around that time. Then we see a little lull at lunchtime, followed by another increase in the afternoon, which tails off at the end of the day. We use this pattern to our advantage.

We've added node over-provisioning. We over-provision nodes in the early morning timeframe because if we didn't, developers would wait longer for pods to be created. Pods would go pending, and the cluster autoscaler would need to add more nodes, which takes a few minutes.

By knowing the load will spike, we deploy super low-priority pods at 9 a.m. and scale them back down around lunchtime. This kicks the autoscaler into adding more nodes ahead of demand, which has helped quite a bit.

Additionally, we use spot instances and different SKUs. Our cluster autoscaler manages multiple node pools and can struggle to find capacity or match specific instance types. To address this, we scale every node pool up to a minimum of one node at the start of the day. This allows the cluster autoscaler to better understand the available machine types and more intelligently scale to match demand.
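
A minimal sketch of the over-provisioning trick, assuming a deployment of low-priority placeholder ("balloon") pods already exists: a small script, which could run as a CronJob, scales it up ahead of the morning spike and back down around lunch so the cluster autoscaler adds nodes before real jobs arrive. The deployment name, namespace, schedule, and replica count are all hypothetical.

```python
# Sketch of predictive node over-provisioning: scale a deployment of
# low-priority pause pods up before the morning rush and back down at lunch.
# Names, namespace, and numbers are hypothetical.
from datetime import datetime
from kubernetes import client, config

DEPLOYMENT = "ci-overprovisioner"   # pause pods with a very low PriorityClass
NAMESPACE = "buildkite"

def desired_replicas(now: datetime) -> int:
    # Weekdays only; pre-warm from 08:30 until noon local time.
    if now.weekday() >= 5:
        return 0
    return 30 if (8, 30) <= (now.hour, now.minute) < (12, 0) else 0

def main() -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    replicas = desired_replicas(datetime.now())
    # Patching the scale subresource nudges the cluster autoscaler to add nodes
    # ahead of demand; real CI pods then preempt the balloon pods immediately.
    apps.patch_namespaced_deployment_scale(
        DEPLOYMENT, NAMESPACE, {"spec": {"replicas": replicas}}
    )
    print(f"scaled {DEPLOYMENT} to {replicas} replicas")

if __name__ == "__main__":
    main()
```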

Bart: The Git mirror functionality with object storage preseeding is a creative solution. How does this integrate with Kubernetes, node autoscaling, and lifecycle?

Ben: We don't only do this for Git. We also do it for Gradle cache and other things where we want to cache whatever we possibly can. Our monorepo is one thing, and Gradle dependencies that don't change often are another. We leverage node local storage as a cache for pods running on that node, which can mount the local storage and use it.

For the Git mirroring in particular, we download and extract a tarball with a mostly up-to-date copy of our repo on node startup. Buildkite has native Git mirroring functionality that allows avoiding cloning the entire repo on every job by using this local copy. We preseed that, which helps quite a bit. It means our nodes take slightly longer to start up, but once they start, we know they'll need the Git repo. Doing this at startup makes a lot of sense.

Similarly, with Gradle dependencies, we need them on pretty much every build. Having those available locally before jobs run makes sense. Since our nodes scale back down to zero overnight, any updates to Gradle dependencies or Git mirror can happen during that time. The next day when nodes scale up, they'll get the latest copy to keep the cache fresh.
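
Here is a minimal sketch of the pre-seeding step, assuming a nightly tarball of the repo is published to object storage and the node's bootstrap runs something like this before agents start. The bucket, key, and mirror path are hypothetical.

```python
# Sketch: pre-seed a Git mirror from object storage at node startup so
# Buildkite's git-mirrors feature can reuse a mostly up-to-date copy instead
# of cloning the whole monorepo per job. Bucket, key, and paths are hypothetical.
import tarfile
from pathlib import Path

import boto3

BUCKET = "ci-cache"                                   # hypothetical S3 bucket
KEY = "git-mirrors/monorepo.tar.gz"                   # nightly snapshot of the repo
MIRROR_DIR = Path("/var/lib/buildkite/git-mirrors")   # agent's git-mirrors path

def preseed_git_mirror() -> None:
    MIRROR_DIR.mkdir(parents=True, exist_ok=True)
    archive = Path("/tmp/monorepo.tar.gz")
    boto3.client("s3").download_file(BUCKET, KEY, str(archive))
    with tarfile.open(archive) as tar:
        tar.extractall(MIRROR_DIR)
    # Jobs on this node now start from the snapshot and only fetch new commits.

if __name__ == "__main__":
    preseed_git_mirror()
```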

Bart: And in terms of results, you mentioned a 50% reduction in PR wait times. What other metrics improved with the Buildkite and poly-CI combination?

Ben: The biggest thing is stability. We have a metric called infrastructure failure rate. When a build fails, if it's not because a test failed, not because of linting, or not caused by the code that was pushed, we classify it as an infrastructure failure.

With Jenkins, we were seeing between five and 15%, sometimes even higher on bad days, of builds failing due to infrastructure issues. With Buildkite, as we've rolled it out, we've been consistently below 1% and often close to a tenth of a percent of builds failing due to these infrastructure issues. That's been a huge improvement to our developer experience.
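
For illustration, a toy sketch of how an infrastructure failure rate like the one Ben describes could be computed from build records; the build fields and failure-reason labels are hypothetical.

```python
# Sketch: share of builds that failed for reasons other than tests, linting,
# or the pushed code itself. Field names and reason labels are hypothetical.
from dataclasses import dataclass
from typing import Optional

USER_CAUSES = {"test_failure", "lint_failure", "compile_error"}

@dataclass
class Build:
    passed: bool
    failure_reason: Optional[str] = None  # e.g. "test_failure", "agent_lost"

def infra_failure_rate(builds: list[Build]) -> float:
    infra = sum(
        1 for b in builds
        if not b.passed and b.failure_reason not in USER_CAUSES
    )
    return infra / len(builds) if builds else 0.0

if __name__ == "__main__":
    sample = [Build(True), Build(False, "test_failure"), Build(False, "agent_lost")]
    print(f"{infra_failure_rate(sample):.1%}")  # 33.3% in this toy sample
```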

On the other side, manual build retries have changed. With Jenkins, people would just retry a build once to see what happens. With Buildkite, people are less likely to do that because they can clearly see what the issue is and think, "Yes, that's my fault. I can fix it." We've seen the number of manually retried builds drop from 20 to 50 or more per day with Jenkins to fewer than five per day with Buildkite.

In our developer surveys conducted every three months, we recently had a customer satisfaction (CSAT) score of 91% for Buildkite, which is incredible. We've been without Jenkins for a few months now, and it's been great to focus on other things instead of dealing with Jenkins' flakiness.

Bart: Now, looking at the bigger picture, you're also migrating front-end and mobile CI to the same approach. How are you applying these learnings across different technology stacks?

Ben: I mentioned before Terraform for Kubernetes clusters. We've also leveraged Terraform quite heavily to manage both the Buildkite agent stack in Kubernetes, as well as our Buildkite pipelines. With the poly-CI architecture and their trigger pipeline, we've been able to reuse Terraform for other repositories. It's nice to have that consistency and be able to share learnings, tools, and processes across teams. At this point, we've finished all migrations, gotten rid of Jenkins, and we're not looking back.

Bart: And you shared some great lessons learned. What advice would you give to other teams that might be facing similar CI scaling challenges with monorepos?

Ben: The biggest challenge was our monolithic pipeline. When I first considered splitting that big pipeline into smaller pipelines, I thought we'd be redoing tons of work and it wouldn't be efficient. Looking back, it's clear we needed to split apart our CI and get rid of shared fate to speed things up.

A monorepo doesn't mean your CI has to be monolithic. It's important not to fall for the sunk cost fallacy and assume you can't change your existing CI because it seems too difficult. CI is a huge part of developers' lives, and if they struggle with it, it affects their morale and ability to ship code. It becomes a significant bottleneck for the entire organization.

Our proof of concept was valuable because we went through the end-to-end flow and figured out the unknown unknowns with a small chunk of the repo. Taking time to get it right set us up for a smooth migration.

Lastly, CI evolves slowly over time. It's easy to bolt on different pieces and end up with a complex monster. Taking a step back periodically to ask if there are better ways to do things or if major changes are needed can be incredibly valuable.

Bart: Ben, what's next for you?

Ben: We talked about front end and mobile CI moving to Buildkite. Our data science organization at Faire is not on Buildkite right now, so we're potentially looking at ways to leverage Buildkite, poly-CI, and other improvements for the data science organization. Additionally, we're looking at Bazel, as we currently use Gradle for our Kotlin monorepo, which will be another significant project.

Bart: Okay. And if people want to get in touch with you, what's the best way to do that?

Ben: My LinkedIn is bpoland. If you have any questions or want to chat, reach out to me there. Also, a mandatory note: Faire is hiring. If this sounds cool and you want to join us, visit faire.com/careers.

Bart: Perfect. Ben, thank you very much for joining us and sharing all of this. Take care.

Ben: Thanks.