From 0 to 10k builds a week with self-hosted Jenkins on Kubernetes

Host:

Bart Farrell

Guest:

Stéphane Goetz

This episode is sponsored by CloudBees — learn how to use Kubernetes pods as Jenkins agents

In this KubeFM episode, Stéphane shares his journey of migrating, optimizing and scaling Jenkins in Kubernetes.

He discusses the technical challenges, solutions, and strategies employed.

You will learn:

How Jenkins on Kubernetes was scaled to handle 10,000 weekly builds.
How they started their journey in 2015 and how the cluster has evolved in the past nine years.
The challenges of managing builds in Jenkins: Docker in Docker, Docker out of Docker and KubeVirt.
The lessons learned in creating ephemeral environments.

Transcription

Bart: What happens when you have Jenkins on Jenkins on Jenkins on Jenkins? Today, we'll be looking at Stéphane's groundbreaking migration journey in 2023, when a company called Swissquote operated 50 automated Jenkins instances in Kubernetes, handling around 10,000 builds weekly. We'll explore in more depth how Stéphane seamlessly transitioned from an old, unmaintained cluster to a new, state-of-the-art Kubernetes setup. We'll uncover the meticulous steps taken to migrate SonarQube and Jenkins controllers, address in-node issues, and optimize their builds. How did he navigate these complex migrations with minimal downtime while enhancing the CI/CD infrastructure for the future? You'll find out in this episode. Before we get to that, today's sponsor is CloudBees. Kubernetes users benefit from adopting CloudBees CI for streamlined integration, optimized workflows, scalability, and flexibility. With seamless compatibility, taking advantage of Kubernetes scalability features, CloudBees CI enhances development processes, creating efficient pipelines, and at the same time offering advanced features like Pipeline Explorer, high availability, horizontal scaling, and workspace caching. This integration boosts developer productivity, improves team collaboration, and accelerates software delivery cycles, making CloudBees CI a valuable addition to the Kubernetes ecosystem. Check out how to use Kubernetes pods as Jenkins agents at the link in the description. First things first: you've got a brand-new Kubernetes cluster. Which three tools are you going to install?

Stéphane: That's a tough question. I had to think about it for a while. I would say I'm a big fan of ArgoCD. I've heard that answer many times on your podcast, and I just have to agree with it. It's a great tool. The second one is maybe a bit less known. As an ingress controller, we use Traefik. We used the basic NGINX ingress in the past and weren't too happy about it, and we use Traefik in our homemade cluster that we'll talk about a bit today. We're super happy with it. The third one is not exactly a tool that I would install on the cluster for the cluster itself, but on the side, I would use Backstage. It's a really intriguing ecosystem: you have all the information about your tools and about what is running on the cluster, plus links to the different observability platforms. It really puts everything together. It's great.

Bart: Very good. With that in mind, regarding observability, a very popular topic that's built out over time. Do you feel that the observability space, when it comes to tooling, can be overwhelming? Do you feel like it's going through consolidation? Where do you think that's at?

Stéphane: I always feel that there is something missing in observability because there are many different pillars to it. There is the pure metrics part, which works really well: there is Prometheus around that, and Grafana, which works great. Then there is aggregating logs, for example with an ELK stack. That is also fine. Where things start to get a bit messy is alerting and tracking errors, for example an error that shows up in the logs of your application. It's interesting to have tools like Sentry for that, which we've experimented with in the past and are super happy with. Then there are the more APM-like stacks that give you full traces of your application. This is also great. But when it comes to putting everything together, I really feel that today some tools have most of it or half of it, and I don't know yet if it's a fully finished ecosystem.

Bart: Very good. Nope. Fair points. Now moving on to get to know you a little bit better. Can you tell our audience what you do and where you work?

Stéphane: So I'm Stéphane Goetz. I started my career as a web developer, focusing on PHP, HTML, and CSS. I am still a web developer today. I do a lot of work with React and TypeScript. My work has evolved a bit; I also do some Java. I started many years ago at Swissquote Bank, where I still work today. It's an online bank in Switzerland, but we are also available in many countries in Europe. My team and I work on many different projects, but we have focused a lot on developer tooling in the past. This is now handled by a sister team, which we'll talk about. We work a lot on architecture and providing libraries that enable developers at our company.

Bart: With your web developer background, how did you take your first steps into cloud native?

Stéphane: Many years ago, I had a very small company where I had to be a sysadmin, so I had some experience administering a Linux server. One day, the team and I decided to have a Jenkins cluster. We decided to install it ourselves, and that's how we jumped into cloud native. It was our first experience with Kubernetes and with containerizing our workload, and that's how we got into it.

Bart: Very good. With that in mind, for someone not necessarily from that background: you mentioned React and TypeScript, which have very large global communities, and then getting into the cloud native space. What would you say is the difference between before and after, now that you're a part of both ecosystems?

Stéphane: The thing that I really like about having deployed applications on more classic environments and on Kubernetes now is the configuration files. Whether you like YAML or not, the ability to describe your environment and having Argo CD to handle the deployment really makes everything smooth and streamlined. You can then just scale the number of pods you have, and that works great. You can deploy a new application, deploy it ten times, and have templates that work super great. Not having that before was really a different world.

Bart: And it's an ecosystem that moves very, very quickly.

Stéphane: Well, to be fair, I'm not that much into the bleeding edge of the Kubernetes ecosystem. My personal ecosystem, where I'm a bit too much on the bleeding edge, is JavaScript developer tooling. I know way too much about that ecosystem. On Kubernetes, I mainly listen to my colleagues, because our SRE team, who implements our production clusters, is very aware of everything that is going on. They will be at KubeCon soon as well; they have already been in the past. They know what they are going to do. We often talk about what they are going to implement in our clusters. We just think about ideas and the more architectural aspects of that. But I'm not sure.

Bart: Like you said, you've got your background in other areas where that's worked. And then, in this case, learning from coworkers who work on this in their day to day is a great way to find out. If you could go back and give your previous self any career advice, what would it be?

Stéphane: I would say: don't hesitate to say yes to opportunities, even if they sound crazy.

Bart: And the worst thing that can happen is you say, all right, now I know that I don't need to do that again. And then you got it out of your system. I like that. All right. So now to jump into the crux of the matter, the heart of this conversation. We found an article that you wrote titled "From Zero to 10,000 Jenkins Builds a Week." How did this journey start? Walk me through it.

Stéphane: Well, this story starts around 2015, so that's quite a while ago. That's when we had zero builds on Jenkins. It's around when I started at Swissquote. I started a bit before, but we really noticed that teams had their own Jenkins instances that they were running on old developer machines that they had under their desks. Most of the time they were broken, or the team did not know exactly how to update them, or the build was just red because it was missing a configuration file somewhere. Also, the machines were not powerful enough, because they were old dev machines by definition: they were not good enough anymore to be used for development. Teams still did their releases on their own machines. That's how we noticed that there might be something to do there.

Bart: Good. All right. So you identified the pain point. What were the next steps you took to fix it?

Stéphane: That's around the time when our team was created to really enable teams. Before that, it was a group of people that met regularly. But that's when we really became a team. One of our first steps was to say, "Hey, we can streamline that process. We can create something that will help people have their builds always green and have that as an automated process." The first thing we tried was to get a bigger and more powerful machine from IT, install Jenkins on it, and have all the builds running on this one. That failed spectacularly because we had most of the same issues that we had before. Not all builds were there. Configuration of each build was missing. We needed the configuration files on disk for every single project we had in the company. Luckily, it was way smaller at the time. But that did not work. A colleague in the team said, "Hey, there is this new thing called Kubernetes. Have you heard about it? Because that could be cool. We can put things in a cluster. A cluster, by definition, should work to scale things." We were curious about it. We looked into it more to see how companies were starting to adopt it. It was 2015, so not a lot of companies were adopting it at the time. We decided to create a small MVP and have a small cluster. We started with four machines. Each team would get a Jenkins controller that they could control to add their own builds. Each build would run in its own Docker container. That was something we already had in our development environment. We already used Docker for sandboxes. We'll talk about that a bit later. We also wanted the configuration to be automated. If we have, let's say, a thousand different builds, we want them to automatically be set up with at least a default configuration that works fine in most cases.

Bart: Very good. Now, in terms of the architecture of the first MVP, can you walk me through that a little bit more? How does a single Mercurial change travel through the entire pipeline?

Stéphane: Sure. Indeed, as you mentioned, we were using Mercurial at the time, not GitHub, which we adopted later in the process. The flow was that you push a single commit, similar to Git in that sense. You push your change to the central server, and we had a daemon that would listen to those changes. The payload, with which repository and which commit ID had changed, was sent to a central server. This daemon would then look at which maintainer and team this code push belongs to and contact the right Jenkins controller. If needed, it would create the job, and if the job already existed, it would update it or just trigger a build. At the end of the build, it would send an email to notify whether the build was successful or not.

Bart: Was the MVP successful?

Stéphane: Yes, it was a huge success. The biggest success we had was the full automation. Making the build creation automated was the biggest advantage, because we used metadata on all our repositories to know which team they belonged to. We used mostly Maven, so pom.xml files. As soon as we detected which team a repository belonged to and found a pom.xml, we created the job for it. People did not even have to think about it. They just had their CI. People were happy about it, and we tweaked a few configurations here and there. Most of our teams were on board instantly. We had around 20 teams back then. All of them, in the end, had their own Jenkins instance.
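
As a rough illustration of the push-to-build flow described in this answer, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Controller` class is an in-memory stand-in for a team's Jenkins controller, and `route_push`, `owners`, and the payload keys are hypothetical names, not the daemon Swissquote actually runs.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """In-memory stand-in for one team's Jenkins controller (hypothetical)."""
    team: str
    jobs: dict = field(default_factory=dict)  # job name -> list of built commit IDs

    def ensure_job(self, name: str) -> bool:
        """Create the job if it is missing; return True when it was newly created."""
        if name in self.jobs:
            return False
        self.jobs[name] = []
        return True

    def trigger(self, name: str, commit: str) -> None:
        """Record a build of the given commit for the job."""
        self.jobs[name].append(commit)


def route_push(payload: dict, owners: dict, controllers: dict) -> str:
    """Send one push event to the controller of the team that owns the repository."""
    repo, commit = payload["repository"], payload["commit"]
    team = owners[repo]              # repository metadata: which team owns it
    controller = controllers[team]   # one controller per team
    created = controller.ensure_job(repo)
    controller.trigger(repo, commit)
    return "created+triggered" if created else "triggered"
```

The first push to a repository creates the job and builds it; later pushes only trigger a build, which mirrors the "create if needed, otherwise trigger" behaviour mentioned above.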

Bart: Were there any challenges that came about through the MVP?

Stéphane: Something that was a challenge from the very first day and is still a challenge to this day is the assets. When you have Docker images, npm packages, Maven libraries, or anything similar, downloading them is quite a challenge. What you can try to do is have caches, but they come with their own challenges. For example, writing the same library from two separate builds at the same time can end pretty badly. Docker images can be quite big, and downloading them can also take a while. We did implement caches, but sometimes we had problems and had to set up scripts to do the cleanup automatically. The other part was the configuration. With Jenkins 1, which is what we had at the time, all the configuration was done in the UI. It was small blocks that you had to move around, specifying the order of steps, such as collecting the reports for unit tests and then publishing that HTML file to be accessible there. This was very complicated because it involved a lot of clicking around. Copying this from one build to another was also quite a challenge. Right after we started, Jenkins 2.0 came along with a new configuration system. It uses Groovy as a scripting language where you specify what you wish to do, and it will execute it for you: for example, starting a Node Docker image or building the Maven project. This allowed us to have a very basic script at the beginning to check out the repository, build it, and send an email if it failed. This made it much easier to get started.
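
One common way to soften the "two builds write the same library at the same time" problem mentioned above is to write to a temporary file and rename it into place. The Python sketch below shows that general technique; it is not taken from the interview, `store_in_cache` is a hypothetical name, and it assumes the cache lives on a local filesystem where a rename is atomic.

```python
import os
import tempfile
from pathlib import Path


def store_in_cache(cache_dir: Path, name: str, data: bytes) -> Path:
    """Write an artifact so that readers never observe a half-written file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    # Write to a private temporary file in the same directory (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in one step: the last writer wins,
        # but the cached file is always one complete version.
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target
```

With this approach two concurrent builds may still both download the same library, but neither can leave a corrupted file behind for the other to read.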

Bart: All right. So all this is being done. What happens next? Do you just scale the MVP to 10K builds per week? What was the next step?

Stéphane: So mostly yes, we did scale that particular architecture to 10,000 builds a week, but we changed a few things here and there along the way. One thing we did pretty early on was to add SonarQube as a companion to Jenkins. SonarQube is a code quality solution. We spawn one SonarQube instance per Jenkins controller. Each team also had the rights to configure their SonarQube, including what quality gates they wanted to put in place. Every single time they built something in Jenkins, it would also run the SonarQube checks. It was really easy because, with this new configuration script and language from Jenkins, we also had the ability to have some shared functions, and we had a function to just run the build. We added the SonarQube steps in there, and most of the teams got it for free as soon as we rolled it out. That was a great addition. People really liked it, and we still use it today. There is also a new team that was created as a sister to our own team, called Productivity. They are really more in charge of the pipeline and developer tooling. They standardized the pipeline that we initially built and brought it to the next level. They were able to add a lot of automated reports like Allure, and tools to make sure that it works with all the specific environments from every team. Today, we even have support for mobile application builds. We have some Python as well. We really started with only Java because we are mostly a Java shop. Around 2019, as I briefly mentioned, we migrated to GitHub Enterprise. That was quite a challenge because our entire ecosystem relied on Mercurial to check out repositories, manage releases, and gather information. We had to make sure it worked with both ecosystems initially. Once the migration was done, we shut down the old ecosystem and only worked with GitHub Enterprise. We also scaled the number of teams. At that time, we had about 30 teams. Today, we have 50 instances, I think, precisely. It became really challenging to keep all these instances up to date. Teams had the rights to add their own plugins, set their own configurations, and sometimes even configure something that broke their own instance. That led to a few fun investigations as well.

Bart: With things like that breaking, what did the team do to fix it?

Stéphane: That was probably the biggest project we had after the creation of the cluster itself. It was everything at the time. I want to stress that we are not sysadmins in the team. We are engineers, and maintaining a cluster was really not our first calling. We started by having bare-bones deployments where we just did a kubectl apply to update things. Doing that for around 20 to 30 Jenkins controllers became very challenging. That's when we discovered Helm charts. We decided to move everything to a Helm chart first. That's when we also decided to make the Jenkins configuration partially immutable. What's partially immutable? Jenkins loves Groovy, apparently, because it has a configuration system where, when Jenkins starts, you can configure it by using Groovy scripts. We use that to configure everything that is absolutely required to run a build, for example the Kubernetes configuration to start a pod, or the UI configuration to authenticate users. We made sure that these pieces were immutable in the sense that if you restart the container, it just resets them. For everything else, where we wanted people to have access and tweak things, we made sure that we don't touch those configurations. That leaves some room for experiments and improvements, where teams can test something and then come back to us and say, "Hey, we tested this and we could add it to the default pipeline. It would be cool for everybody." Another challenge is that Jenkins' configuration is all on disk, not in a database. That is also a pretty big challenge because when the Docker image starts, it just copies its own plugins to the disk and then starts Jenkins itself. If you have an old configuration, an old set of plugins, and you start a new version of the image, it will just add the plugins on top, and sometimes, once the jars are unzipped, they will conflict with each other. We made sure that the Docker image itself would bring the latest version of all the plugins we use. We defined a set of plugins that we saw everybody adopted. We made sure that when Jenkins starts, it just has the plugins that it should have and not some leftovers from a previous version.
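
The "no leftovers from a previous version" idea above can be pictured as a small reconciliation step run before Jenkins starts. This Python sketch is only an illustration under stated assumptions: `sync_plugins` is a hypothetical helper, the directory layout is invented for the example, and it assumes plugins ship as `.jpi` files that Jenkins unpacks again on startup. The team's actual startup script is not described in the interview.

```python
import shutil
from pathlib import Path


def sync_plugins(image_plugins: Path, home_plugins: Path) -> list[str]:
    """Make home_plugins contain exactly the plugin files shipped in image_plugins.

    Returns the sorted names of leftover entries that were removed.
    """
    home_plugins.mkdir(parents=True, exist_ok=True)
    wanted = {p.name for p in image_plugins.glob("*.jpi")}
    removed = []
    for entry in home_plugins.iterdir():
        if entry.name not in wanted:
            # Leftover from an older image: an old plugin file or its unpacked directory.
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(entry.name)
    # Overwrite whatever is on disk with the versions pinned in the image.
    for name in wanted:
        shutil.copy2(image_plugins / name, home_plugins / name)
    return sorted(removed)
```

The design choice is to treat the image as the single source of truth for plugins, so a restart always converges to a known set instead of layering new plugins on top of old ones.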

Bart: So at this point, it sounds like we have Jenkins on Jenkins? Sounds like only ArgoCD would be missing from the mix?

Stéphane: So, we're building Jenkins on Jenkins at this stage. We automated all that, but we did not have ArgoCD yet. Our sister team, SRE, was created around that time or maybe a bit before. They are in charge of all the clusters for production. So not our cluster, which was built with scraps. They said, "Hey guys, we built this cluster and we installed ArgoCD to configure it, but we can connect it to your cluster. Do you want it?" We looked at it and said, "Sure." On the very first day when we had the ability to do that, we set up the Helm charts we had to be deployed from ArgoCD. From that day, we were able, at the end of the build, to automatically update the ArgoCD definition to deploy the latest Jenkins version.

Bart: At this point, you have several teams, Jenkins that deploy Jenkins, ArgoCD, and on top of that, even more. The scale at which you operate is so unique that you must have encountered issues that teams don't usually see.

Stéphane: At this stage, the biggest issue we have is resources in general. Not people; that was fine. Every single resource in a cluster is finite. We discovered that they have limits at some point, whether it is network, CPU, I/O, disk space, or memory. We hit all of those limits at some point and had to figure out a way to get out of it. Most of the time, we were pretty naive and thought, "We'll never reach that limit; that will be fine." But we did. For some of it, it was pretty easy. For example, a Jenkins log could reach up to a gigabyte for a single build. We limited that to 10 megabytes and told people, "If you go over that, you're probably logging a bit too much in your build." Kubernetes is also great at managing CPU and memory out of the box. But there is one thing it doesn't do: when you start a pod, it just stays up. For a build, that's maybe not what we wanted. So we created the first Sentinel at the time. We're very creative with names. It would shut down the containers after 90 minutes. Your build should be one hour and 30 minutes maximum. If you go over that, something has timed out, is waiting on something, or you're just doing something wrong. The biggest thing that we had and scratched our heads over for years is integration test environments.
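
The 90-minute rule of that first Sentinel boils down to a simple age check. The Python sketch below is a guess at the core logic only: `expired` is a hypothetical function, and how the real Sentinel listed and stopped containers is not covered in the interview.

```python
from datetime import datetime, timedelta

MAX_BUILD_AGE = timedelta(minutes=90)  # the limit quoted in the interview


def expired(containers: dict, now: datetime, max_age: timedelta = MAX_BUILD_AGE) -> list:
    """Return the names of containers that have been running longer than max_age.

    `containers` maps a container name to the datetime at which it started.
    """
    return sorted(name for name, started in containers.items() if now - started > max_age)
```

A periodic job could call this with the current time and stop whatever it returns.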

Bart: And regarding the integration test, did you run them against a test or staging environment?

Stéphane: So for that, we got really creative. For development, we created a system that we call the sandbox. Again, we are super creative with names. It is a tool that takes your Maven dependencies. If you have a dependency that looks like a sandbox, which for us is a development environment, it will add it to a Docker Compose YAML file and start it. Each sandbox is tied to a Docker image, and each Docker image then gets a name in the Docker Compose file, so you can address it. For example, your own application will have a link to a database, so we'll start this database. If your application has a link to another application, we'll also start that one. Where it really started to bite us is that it works recursively. If you have an app that starts an app that starts an app, and it goes all the way down to 60 applications, it will start the 60 applications. This was a challenge on developer machines because they struggle to handle environments that large. But it's still very easy to add one dependency and start the application. If you have enough CPU and RAM, it just works, even though we don't recommend it. The challenge with that is it brings a lot of load on the cluster.
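
To make the recursive behaviour concrete, here is a small Python sketch of how a dependency graph could be expanded into a Docker Compose style structure. It is an assumption-laden illustration: `resolve_sandboxes`, `to_compose`, and the `depends_on` and `image_for` mappings are hypothetical, and the real tool reads Maven dependencies rather than a Python dictionary.

```python
def resolve_sandboxes(app: str, depends_on: dict) -> list:
    """Return app and every sandbox it transitively depends on, each listed once."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen:  # already scheduled; this also guards against cycles
            return
        seen.add(name)
        for dep in depends_on.get(name, []):
            visit(dep)
        ordered.append(name)  # dependencies first, then the dependent

    visit(app)
    return ordered


def to_compose(services: list, depends_on: dict, image_for: dict) -> dict:
    """Build a Docker Compose style mapping with one service per sandbox."""
    compose = {"services": {}}
    for name in services:
        service = {"image": image_for[name]}
        if depends_on.get(name):
            service["depends_on"] = list(depends_on[name])
        compose["services"][name] = service
    return compose
```

Adding a single dependency at the top can pull in the whole chain below it, which is how one application ends up starting 60 containers.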

Bart: But with using Kubernetes, distributing this load across the cluster is something that you were able to achieve, or did it become even more complicated?

Stéphane: Yes and no. The thing is that Kubernetes is very well aware of everything that is running on it, if you started it through Kubernetes. But because we were using Docker Compose, our pods were mounting the Docker socket, and we were starting the containers for the sandbox outside of the main container. We had some magic tricks to bind the networks together so that it worked. But what it meant is that all the containers for these integration environments were not known to Kubernetes. So we could get into a situation where Kubernetes would say everything is fine while the node was really struggling, because there were 200 containers running on it and hogging all the memory. This was a significant problem. There is also the fact that when you run, for example, Docker builds, they use the Docker socket to start a container, and this is also unknown to Kubernetes. So for all these things, we created a new version of the Sentinel that would try to find, in a smart way, which containers were started by which build. It could then either kill them if they were using way too much memory for the quota that we allowed them, or kill them after 90 minutes, because it can happen that we stop a build and the containers for the integration tests of that specific build are still running. We had containers that would sometimes stay around for days for nothing. That Sentinel helped us stabilize the situation a lot and get it in check, so that our users would get their builds running again.
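
The second Sentinel's decision can be sketched as grouping containers by the build that started them and then applying two rules. This Python example is hypothetical throughout: `ContainerInfo`, the `build_id` label, and `containers_to_kill` are invented for illustration, and the interview does not say how containers were actually matched to builds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerInfo:
    name: str
    build_id: str   # hypothetical label linking a container to the build that started it
    memory_mb: int


def containers_to_kill(containers, active_builds, quota_mb):
    """Pick containers whose build is gone or whose build exceeds its memory quota."""
    by_build = defaultdict(list)
    for c in containers:
        by_build[c.build_id].append(c)

    doomed = set()
    for build_id, group in by_build.items():
        orphaned = build_id not in active_builds           # build finished or was aborted
        over_quota = sum(c.memory_mb for c in group) > quota_mb
        if orphaned or over_quota:
            doomed.update(c.name for c in group)
    return sorted(doomed)
```

The age-based rule from the first Sentinel would be applied alongside these two checks.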

Bart: Were there any other takeaways that you got from this situation?

Stéphane: Yes. That's when we noticed we had made a big error in the design of this cluster. The problems with Docker outside of Docker, as it's called, were not something we anticipated, because we thought it would be okay. Again, naive, not-sysadmin thinking there. We thought, what if we try to run Docker in Docker? That's also a technique that exists. It is technically possible. We tried it, but we encountered other challenges. Docker stores images locally on the file system, and that storage is designed to be used by one daemon. When you start a new Docker daemon, you need to start from either a fresh cache or something not used by another daemon. This means you need to download all your images before Docker can run them, and that can take a lot of time. For example, an application with 60 sandboxes will have to download 60 images. The first time we tried it, we chose one of the biggest ones on purpose. It took 45 minutes just to download the images in the Docker container. That's another limit we discovered: the network card was undersized on that machine. We realized at that moment that this could not scale. We then tried to have a runtime that, instead of starting the sandbox locally with Docker out of Docker, used Kubernetes: it created a short-lived Kubernetes namespace and started every single pod there. But that came with a big cost, because all the configurations were created for Docker Compose, for developers to develop locally, and for ease of use we wanted to make that transparent to them in the build environment. With this second runtime being slightly different, we never managed to carry over all the things that are possible with Docker Compose, such as mounting files locally. Config maps, for example, have a limit of one megabyte, and in some cases teams needed configuration files bigger than that. This environment never gained big traction. We liked the principle and the idea because it leveraged everything we already knew, but it did not work for us.

Bart: At the beginning of the podcast, we started with the date; we were talking about 2015. At this point in time, how big is the team that's looking after the CI/CD pipeline?

Stéphane: At this stage of the story, we're around the beginning of 2023. Our team managing the Jenkins infrastructure had just got its fourth member. The Productivity team next to us had four members. The SRE team, who would help us a bit, also had four members. The most important thing about that year is that we had a plan to move out of the cluster we had built ourselves, because it had many limitations. We were not able to update it anymore because it was brittle. We decided it would be more suitable for our sister team, the Productivity team, to take over the management of the Jenkins controllers. They already managed the content of the builds, helped the teams apply best practices, and managed the pipeline itself. It made sense for them to have the whole ecosystem, and they also managed SonarQube at the time. They still do.

Bart: With that in mind, the better part of a decade has gone into transferring the code and the knowledge from one team to another. How did you go about doing that?

Stéphane: I don't know how we did that. It was a huge challenge. The first thing is that if we wanted to go to a proper cluster managed by SRE, there were a few things that were just impossible to run on their cluster. For example, Docker out of Docker, because it's a production-grade cluster for a finance company. We can't just have Docker containers running around. So we discussed our challenges with them, because we still needed Docker, for example just to run docker build or to start a test container with a small database. They came back with a solution and said, "For your needs, we think we're going with KubeVirt." It's a virtualization system that allows starting virtual machines on Kubernetes using the same primitives. We thought, great, that's cool. We had some challenges similar to what we had with Docker in Docker, around the fact that Docker needs to download all of its images. They found a neat trick: an image containing a set of pre-populated Docker images. At the start of the container, they would mount this image by copying it, and we could start from that and run our builds pretty fast. The second issue was that we could no longer use local storage. Jenkins uses local storage to store its configuration, and that was no longer possible because we needed to be flexible: if a node goes down, we need to reschedule on another one. We also exposed that problem to SRE, and they came back with Ceph. We're super happy with that. It was quite a challenge to set up, if I recall correctly, but now it runs fine. Those were the two main blockers we had to plan for on the technical side. Once we had that, we could make a plan to move 50 instances of Jenkins from one cluster to another without disrupting our users. To share the knowledge, we decided to onboard Productivity and plan the migration with them, so that the knowledge of everything we were setting up was theirs as well. This way, we could drop the knowledge of all the legacy things, or things we had forgotten about in the old cluster, as we were going to shut it down anyway. We made sure that what they learned was about the new cluster: what runs today, what works today, and what they need to know.

Bart: As a team, you've been working with Jenkins quite extensively for a fair amount of time. How does the new Jenkins on KubeVirt differ from the previous generation you were working with before?

Stéphane: From a user's perspective, it was mostly the same. There are a few things that changed, but users still commit their code, get their builds, and that works perfectly fine. The big improvement that both we and our users saw is that there is proper build isolation, because a KubeVirt VM is a proper VM and it properly enforces the limits that you set. Even if you have a Docker Compose that's quite big, it will just respect the limits that you set, because Kubernetes knows how to do that really well. And if you have one build that fails, you're sure at least that it did not fail because another build next to it used all the resources and killed it. So that is one big advantage we saw with this new solution. Memory and CPU are handled, and for all the other resources we made sure that we have enough observability and enough information about how far we are from using them all. We were even able to shut down our Sentinel because everything was now properly handled by Kubernetes. We had no need for a second check to be sure that there are no dangling resources. To control all this, we also made sure that each Jenkins controller was in its own namespace. That means we can control the resources per namespace, giving a bit more to some or constraining others if we see that they start to have a lot of big builds. We kept our technique for the npm and Maven caches, which worked really well, and even improved it a bit. Among the things that were more challenging: the Jenkins Kubernetes plugin has served us perfectly well for so many years, but its primitive is starting pods, and a KubeVirt VM is actually not a pod. So what we had to do is trick it a bit and start a pod that runs a bash script to start a VM and then shuts it down once the pod is stopped. This works fine and we're happy with that. The big trade-off that we saw with this solution, though it's very manageable, is startup time. Pods start instantly: given that you have the Docker image on the cluster, your build will start very fast. A VM, on the other hand, takes some time to boot. We did some optimizations and we're happy that it's under 30 seconds, and that works well. We had some comments, but in the end we feel that it's not a big deal. If your build is fast, under 5 minutes or so, adding 30 seconds to it is not too much. And if your build already takes 30 minutes, well, 30 seconds is also not that much.
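
The "pod that starts a VM and shuts it down when the pod stops" trick is, at its core, a start-then-always-clean-up pattern. The interview mentions a bash script; the sketch below shows the same idea in Python only for illustration. `build_vm`, `start_vm`, and `stop_vm` are hypothetical placeholders, and no real KubeVirt call is made.

```python
from contextlib import contextmanager


@contextmanager
def build_vm(start_vm, stop_vm):
    """Start a VM for the duration of a build and always stop it afterwards.

    `start_vm` returns a handle for the new VM; `stop_vm` receives that handle.
    Both are supplied by the caller and are placeholders in this sketch.
    """
    vm = start_vm()
    try:
        yield vm
    finally:
        stop_vm(vm)  # runs on success, on build failure, and on interruption
```

Tying the VM's lifetime to the wrapper is what removes the need for a separate clean-up process: when the wrapper exits for any reason, the VM goes with it.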

Bart: And in this case, why not use a microVM?

Stéphane: We love the technology. We looked a bit into it. The thing was that at the time, apparently there were some kernel calls that did not work with Kata, for example, so that did not do the trick. We see that VMs starting in two seconds is very interesting, since boot time is the one challenge we have with the big VMs. We will definitely look into it again in the future.

Bart: Did the new cluster and Jenkins setup provide further room for optimization?

Stéphane: One of the big advantages of having a properly managed cluster with all the bells and whistles is that we also get Prometheus out of it for free. We were able to create quite a lot of dashboards from the raw information we get from KubeVirt and the cluster itself. We have a dashboard for the general cluster health from the KubeVirt perspective, showing how many VMs we have running at the time. We also have a dashboard per team, where we can see how many resources they use historically or in real time. At the end of a build, we add a link to a dashboard specific to that build. This is very useful because we can see for a single build how much network, IO, memory, and CPU it uses. We even have some basic calculations that indicate when you are using only a small part of your resources, so you could request fewer. We created a set of profiles: small, medium, large, and extra-large. If you reach the limit of a smaller profile, you can use a bigger one. But if you use one that is too big, you can switch to a smaller one. This is very interesting for us because we believe that if you give resources to a team, they should be able to control how much they use. They could either have a few very large builds or a lot of very small builds. It is up to them to optimize that. We are really able to provide that to them out of the box.
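The "basic calculations" on profiles could look roughly like the sketch below. The episode only names the four profiles; the CPU and memory sizes, the 20% headroom, and the function name `recommend_profile` are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Profile:
    name: str
    cpu_cores: float
    memory_gib: float


# Hypothetical sizes, ordered from smallest to largest.
PROFILES = (
    Profile("small", 2, 4),
    Profile("medium", 4, 8),
    Profile("large", 8, 16),
    Profile("extra-large", 16, 32),
)


def recommend_profile(peak_cpu: float, peak_memory_gib: float,
                      headroom: float = 0.2,
                      profiles: Sequence[Profile] = PROFILES) -> Profile:
    """Return the smallest profile that fits the observed peaks plus headroom.

    Falls back to the largest profile when nothing fits.
    """
    needed_cpu = peak_cpu * (1 + headroom)
    needed_memory = peak_memory_gib * (1 + headroom)
    for profile in profiles:
        if needed_cpu <= profile.cpu_cores and needed_memory <= profile.memory_gib:
            return profile
    return profiles[-1]
```

Fed with the peak CPU and memory of a build, as shown on a per-build dashboard, such a function would suggest moving to a smaller profile when a build uses far less than it requests, or to a bigger one when it hits its limits.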

Bart: Now, this was a pretty significant change, whether it's in a large organization or a small one. How did you keep teams invested in this? Because one thing can be the technical aspects, but people really have to believe in it: there has to be buy-in, there has to be commitment. How do you engage and persuade the business to respect this on the one hand, and then to further invest in it?

Stéphane: On one aspect, you mentioned business. It was not an issue for us to get them on board because we have a budget inside the department to take on these kinds of technical projects. It was an easy sell to say, well, we are going to tackle this project and we will make it as transparent as possible. This is the other aspect of how we get teams engaged. We really try to make sure that it is for developers. We try to make their lives easier, not ours. We wanted to be very transparent about what they were going to gain from this, how we were going to do it, and how we were going to accompany them in the process. We advertised the improvements that they would get out of it. We took the time to explain the different steps that we were going to take and when they were going to start. If a team had a problem with that, a deadline or something, and absolutely could not risk their Jenkins being down for an hour or so while we migrated the data, they could come to us and say, well, maybe just do that at another date. We took the time to really do that with each team.

Bart: Did you get any pushback? Was there any resistance? Did you have to work around any of these? If there was any pushback, how did you handle that?

Stéphane: Honestly, not a lot. On one aspect, our management was very supportive of this because we explained to them that we have this cluster that has been running for almost 10 years. Machines could break any day. That is a possibility. If we had to take this machine, rebuild the cluster, and make it work again, it could take up to a week. We're not experts in that. It can happen. The other option is that we bite the bullet now and do the migration to a proper cluster, with the guarantees that it's going to be up, that we know how to rebuild it in case of a huge failure, and that we are going to have some guarantees on its uptime. So they were really supportive of doing that migration. They supported us when we had to discuss with the teams' management to say we have to do it. It's not against you; it's for you. The one thing that was challenging was moving the builds from Docker out of Docker to KubeVirt. All the other steps in this migration, like moving the actual controllers to the other cluster, were fairly transparent. Moving the data to Ceph was really transparent too; some well-placed rsyncs really helped there. But for moving the builds, we made a lot of tests first on our side. We onboarded some pioneering teams to ensure it also worked for their cases. We were able to do it build by build first. We added a second cluster in the Jenkins configuration and specified for each build on which cluster it had to be built. Teams were able to try the new cluster. If it worked, they could stay there. If it didn't, they could roll it back. It's a one-line configuration change. For the ones that really did not build, we took the time. It was less than 5% of our builds out of about 2,000 active repositories. It was not a lot of builds that failed. Our productivity team did a tremendous job. They monitored all the builds that succeeded before but failed after the migration and checked one by one what needed to be fixed, whether in the common libraries we were using for tests or specific to their build.
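The monitoring described above, finding builds that succeeded before the migration but failed after it, can be sketched as a simple comparison. The data shape (a mapping from repository name to the last build result on each cluster) and the function name are assumptions for illustration; the episode does not describe how the productivity team actually collected this.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return repositories whose build succeeded before and failed after.

    `before` and `after` map a repository name to True (build succeeded)
    or False (build failed). Repositories that have not been built on the
    new cluster yet are ignored.
    """
    return sorted(
        repo
        for repo, succeeded in before.items()
        if succeeded and after.get(repo) is False
    )
```

Each repository in the returned list is a candidate for a one-by-one check, and for the one-line rollback to the old cluster while the fix is in progress.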

Bart: Is there anything that you could have done differently?

Stéphane: Resource management. The one big mistake we made was thinking that it would just work out because it's Kubernetes. We really needed to think about every single resource and how it could fail. That is something we did not do at the beginning because we did not think it was a thing. The blast radius that we could get from one build exploding was huge. Today, it's not the case anymore. But we should have thought of that in the very first step. So that's what I would do differently. The other thing is managing our own cluster. It's fun and all, but it's really not the thing we do best. Having alerts in the morning saying, well, the node reboot did not go well: we were not the happiest about that, but now it's okay. It's handled in another cluster, and we know that if there is a failure, the people who run it know exactly how to get it back up.

Bart: After doing all this, what's next for you?

Stéphane: A lot of things. On my side, not a lot of Kubernetes because our team will shut down its own cluster and benefit from the one that has been set up for us. My team is doing some architecture work. We are starting a project about the quality and maturity of architecture, for the teams to assess their own architecture. We are putting the finishing touches on a project to update all our internal libraries to support Jakarta and be ready for Java 21. That's what's on our plates right now. On the SRE side, they're experimenting with some fun stuff. They are trying to use eBPF, and they are making progress: using eBPF for automatic mock generation, being able to use the service mesh to listen on that and maybe replace some of our sandboxes with it. They are also experimenting with other eBPF solutions, such as agentless observability. On the productivity side, they are looking at how to maximize Docker image cache hits.

Bart: What's the best way for people to get in touch with you?

Stéphane: I regularly check my LinkedIn and Twitter, or X, depending on the name you use for that. Also, my email is fine. I do answer that sometimes.

Bart: Cool. Stéphane, I really want to thank you for your time today. I appreciate you going deep and explaining all the things that went into this process. Your background is someone that doesn't come from this ecosystem, yet all the things that you've done show a substantial amount of knowledge and experience. I really want to thank you for sharing with us.

Stéphane: Thanks for having me.

Bart: Absolute pleasure.
