Transparently providing ARM nodes to 4000 engineers

Host:

  • Bart Farrell

Guests:

  • Miguel Bernabeu Diaz
  • Thibault Jamet

This episode is sponsored by Learnk8s — become an expert in Kubernetes

On average, Kubernetes nodes running on ARM instances are 20% cheaper than their AMD counterparts.

Optimising your cloud bill is tempting, but how do you seamlessly migrate existing workloads to a different architecture?

And how do you do it at scale, with more than 4000 engineers and 30 clusters in 4 regions?

In this episode of KubeFM, Thibault and Miguel explain how Adevinta built an internal platform on Kubernetes for mixed AMD and ARM workloads.

You will learn:

  • The challenges they faced validating containers for mixed architectures with a mutating webhook, and the open source solution they came up with: Noe.

  • Why building an internal platform requires careful planning and simple interfaces that are backwards compatible.

  • How not to DDoS your container registries.

  • How to onboard users to an internal platform and evangelise it.

Relevant links
Transcription

Bart: In this episode of KubeFM, you'll get a chance to hear from both Miguel and Thibault, who work at a company called Adevinta. They were involved in the project of building an internal developer platform called SCHIP and, on top of that, switching from AMD to ARM. Not necessarily the easiest task: it involves a fair amount of planning, making sure that stakeholders' needs are being met, and, as always, balancing the technical with the human side. Oftentimes we may be speaking about technologies, but at the same time a lot of it's about change management and making sure that everyone's interests are being taken into consideration to make clear and balanced decisions. That being said, before we get into the episode, we'd like to say thanks to our sponsor, Learnk8s. For those of you out there who are trying to level up in your Kubernetes journey, in your career, knowledge, and expertise, you can check out the courses on Learnk8s.io, where you will see different things that you can learn about Kubernetes in an environment that's 60% practical, 40% theoretical, with instructor-led courses available in groups as well as private forms of instruction. You'll have access to all the material for the rest of your lives, so you can squeeze out all that wonderful knowledge and improve your skills. So, like I said, check out Learnk8s for more information about the online and in-person courses that are offered there. Now, let's get to the episode. So, as always, our first question; it'll be interesting to see the comparing and contrasting between the two of you. If you had a brand new Kubernetes cluster and had to install three tools, which three would they be? Let's start with you, Miguel.

Miguel: So my answer, I think, is a bit boring. It's the typical things. One of the first things is cert-manager, because it's a great way to get certificates in your cluster. Another one is the Prometheus operator. I'm very partial to Prometheus most of the time. I want to try new things, but so far Prometheus has been tried and tested. And KEDA, for autoscaling based on several metrics. Those would be my go-to first things in a new cluster.

Bart: Very good to hear that. We had the KEDA maintainer, Jorge, on one of our previous podcasts. He'll be very happy to hear that, for Kubernetes event-driven autoscaling. Thibault, what about you? Which three tools would you install?

Thibault: Okay, so for me, the first one would be Argo CD, because I do everything GitOps. Then I would go for Cilium, because eBPF. And I would go for Traefik. But obviously we're super complementary, right? Because everything that Miguel said, I would also install.

Bart: Okay, good. All right, but nice to hear different perspectives. Thibault, I just want to ask really quickly, since you mentioned Argo, have you tried Flux? And if so, what are the things that you find in Argo that give you an extra edge that you wouldn't have with Flux?

Thibault: I didn't try Flux per se, so I don't know exactly what the experience with Flux is like, but Argo CD is the one that we're experienced with and, you know, that I feel comfortable with.

Bart: Makes sense. Just like what Miguel said regarding Prometheus. Like, you know, we say if it ain't broke, don't fix it. Good. Now, quick introduction for the two of you. Can you just tell us a little bit about what you do and where you work? Miguel, we'll start with you.

Miguel: Yeah, so we both work together, on the same team, at Adevinta. Adevinta is a group of classifieds marketplaces, mostly based in Europe. And what we do is work in the Common Platform runtime team, which is tasked with providing a Kubernetes-as-a-service platform for all the workloads, or most of the workloads, in the company. And Thibault, anything you want to add in terms of what your role is?

Thibault: Yeah, so basically I'm supposedly the product owner of the team, right? So that means understanding what the customer needs are, our users' needs. Our users are basically the developers of the company, but also some platform teams that specialize part of the integration for their developers. So we have different kinds of users. It's understanding what their needs are, what their pain points are, what we can do to improve their daily life, and trying to translate that into what we can achieve and goals that we set for ourselves.

Bart: Fantastic. With that in mind, the two of you are at this point in your career, what were you doing before Cloud Native? What was the process of getting into these technologies?

Miguel: So for me, it was... I changed companies, actually. I was working before in high-performance computing, with data centers and the whole thing. And then I moved to a company that had everything in AWS. It was not really cloud native, but everything was already in AWS. And we had needs to deploy faster, to get more reliability, to get better scalability. We iterated time and again and ended up with a cloud-native approach.

Thibault: Oh, I have a very different background. I used to be an embedded C developer, right? So the big word, C. And because of C, I was offered a move to data centers, joining Adevinta through Le Bon Coin, which is the French subsidiary. And from there, I was a backend developer at Le Bon Coin. Overall, bit by bit, I got interested in the infrastructure, how it was working, and so on. I started to debug Jenkins with Docker, and we virtualized and isolated the environments, and this is how it started. And it ended up, basically, in the runtime team doing Kubernetes.

Bart: Okay. With that in mind, too, we often find ourselves in these situations for different reasons. In terms of your experience, your learning journey with Kubernetes, what are the resources that have been most helpful? Blogs, videos, documentation? Thibault, what works best for you?

Thibault: Yeah, so basically, usually I follow several people and threads on the social networks, right? So basically LinkedIn and X, or previously Twitter. And Learnk8s is one of them, obviously. And basically this helps me stick with and keep up to date with what we have in the landscape. Also filtering and searching on Google and GitHub for very specific needs that we have at a given point in time. This helps me basically understand the landscape and stay up to date. So a bit of passive, where I filter and get some of the newest information, but also some active, based on needs, right? So actual things that we need to do.

Bart: Fantastic. Miguel, what's your approach?

Miguel: It's quite similar, but I don't use X, for example. So I tend to look for blogs. I also look at the KubeCon rosters, the lists of talks, and pick up titles that seem interesting, or people whose names I know from the community and who are probably going to tell me something very interesting. I follow the Last Week in Kubernetes Development newsletter as well, to stay up to date on the most recent things. And if I have to search for specific things, documentation and GitHub issues, mostly, to find places to go.

Bart: All of those sound like a good combination of best practices. Knowing all these things, is there any advice you could give your previous selves, career advice about things that would help you level up, or things that would have been better to focus your time on? Thibault, in your experience previously as a C developer, what are the things where you would say, you know what, if I could go back, I might have done this differently?

Thibault: Well, take it easy first, right? So life is long and you don't have to sprint on everything that you do. And it's not all about tech, right? There are many things apart from tech that you need to be in. And if you take it easy, you will realize it and you will basically enjoy it more.

Bart: Really good advice. I love that. Miguel, what about you?

Miguel: Well, I agree, but for me there are so many more things than tech in it that it doesn't matter that you are not in tech, because I was not trained in tech; I'm a chemical engineer by training. And also: don't undersell yourself. You can learn these things, you can get into these things. Take it easy and look at the foundations. Good foundations and, especially, network. Find someone that is already in, that has much more experience than you, and that can help you fill your gaps.

Bart: I love that. That is really, really good advice. I must say, we ask this question of all of our guests, and this is by far the best response we've ever gotten. So, like I said, for recording purposes, this will be used as a separate clip. We could do an entire episode just about that. Also, coming in from different backgrounds: my background is completely non-technical. And so, yeah, I very much agree with the networking, you know, finding people that are more advanced than you, taking it easy. It's a marathon, not a sprint. Some of these things cannot be learned in two weeks, no matter what people try to sell you. So it's very, very good advice. Cool. Now we're going to get into the main topic. In terms of our content discovery efforts, we found this article about transparently providing ARM nodes to 4,000 engineers. So at your company, you're part of the team that's maintaining SCHIP, S-C-H-I-P, an internal developer platform. Can you tell us about it?

Thibault: Yeah, sure. So SCHIP is basically part of a bigger platform, the actual internal developer platform that we name the Common Platform, which gives the name to the team, right? Common Platform Runtime, which Miguel mentioned. And basically the goal of the Common Platform is to provide a fully integrated experience to our developers, so it's easy to develop and ship code to production. So they can focus on the business goals that they have, and they don't have to worry about the internals of Kubernetes per se, for example. So they speed up their experience there. About this, our former colleague Galo wrote a blog post that is super interesting: how to build a PaaS for 1,500 engineers. This was three or four years ago already, and we've scaled a bit beyond that now. But it still holds, and the roots of that blog post are still inside the DNA of SCHIP. So with the platform we focus on gluing things together, and we provide a golden path so that it's easy to do things. But also what we call the escape hatches, where basically you can go your own way if the provided, fully abstracted golden path doesn't work for you. So think about it like a highway. I'm currently based in Barcelona and I often go to Paris. I can take the highway from Barcelona to Paris; it's super easy. But if I need to make a pit stop, I can get off the highway, and that's not a problem; I can get back on it a bit further along. Yet I benefit from most of the highway, right? So this is a bit what we are focusing on.

Bart: Yeah, no, that's great. Just to take it a little bit further, though, for companies out there trying to understand the steps that have to be taken when investing in a project like this, how has that process been? Getting participation, stakeholder alignment, what's that been like? What's your experience been there?

Thibault: It's been a long, long journey, right? So it all started in 2016, so that's almost nine years ago. And basically, by then, we had a team that was maintaining a platform for machine learning jobs that was running on top of Mesos. And they saw that there was a rise in the need for people using Kubernetes, using Kubernetes for production. And the same thing happens, right? Okay, why are people running their own Kubernetes on their side? It's duplicated effort. On one side it's good, because they can do exactly what they need, and this was the focus of the company at that time. But at the same time, the effort is duplicated, and this is something that we aimed at solving back then. We started the SCHIP project to address those needs. And it's only two years later that we actually started the Common Platform, with the idea of gluing together things that already existed. So we had Kubernetes, so SCHIP, we had other pieces like CI/CD, but nothing was really integrated. They were pieces of infrastructure, of software, tools that you could use on their own, but nothing really coherent across them. We started gluing things together with one idea, which was the Common Platform idea. And from there, we had one specific marketplace that we wanted to onboard and that was willing to onboard onto something mutualized, and that was Italy. So our colleagues in the Italian marketplace started to onboard, and this was the start of the journey towards the modern SCHIP, what SCHIP is today. It took us about a year to completely onboard that marketplace. We had a lot of rough edges, things that were not as polished as we would like, but we made it in one year, one or two years. And then around 2018, 2019, I would say, we onboarded another marketplace, the Spanish one. It's a bigger marketplace than the Italian one, with more brands as well, so we had to have more sites, more team members. We onboarded them in about a year; in about a year we migrated most of their workloads. At the same time, we had to do some upgrades, right? At the beginning, EKS was not a thing, or not usable for us; this was 2016. So we went for kube-aws, which was the best way for us to deploy Kubernetes on Amazon at that point in time. But with it, in-place upgrades were not an option. So we had migration projects, and with five customers it took us quarters to migrate, with a lot of synchronization and so on, just to upgrade to a newer Kubernetes version. So at the same time that we were scaling, we started this process of being able to upgrade all our customers. And at the same time, kube-aws got deprecated and its development actually stopped completely. So we looked for a solution, and that solution was EKS. We migrated to EKS as a new upgrade, with probably five times the number of customers we had before, and it took us not even a quarter, maybe a bit more, right? But it took us roughly the same time, with more people onboarded, to migrate completely to EKS. And with less noise as well. And that's pretty much it. Right now we are running around 30 clusters in four regions. So from where we started, one cluster, one region, one Kubernetes version, no upgrades, we are now running around 30 clusters across four regions, and we provide in-place upgrades, which also requires a certain amount of tooling, but that could probably be the topic of a future podcast.

Bart: Okay, good. It would be wonderful to continue the conversation. Now, you mentioned cost savings as being one of the reasons why SCHIP is such a successful project. From things that we've seen talking to other folks in the ecosystem, a lot of the time cost is associated with computing workloads like EC2. Is that also the case for Adevinta? Miguel, do you want to take that?

Miguel: Yeah. So in our case, for our platform, as Thibault was saying, we have about 30 clusters, and that's thousands of nodes. That means we have a lot of machines running, and they tend to be beefy machines because we pack a lot of workloads into the same machine. And that makes for a quite hefty bill. So we are always looking into ways to optimize and reduce costs, beyond just having a single platform instead of repeated platforms for every team. And one of the things that we noticed, and we think it's also a trend in the industry right now, is the move to ARM-based instances, because these instances are cheaper. So one of the projects we took on to save cost was to make it seamless for our users to start using ARM nodes, so we can reduce the cost of the platform without requiring developer involvement in that migration.

Bart: With that in mind, is it as simple as flipping a switch or pressing a button and all the instances go from AMD to ARM and you just save tons of money, right?

Miguel: Well, it's slightly more complicated. That's our end goal, that developers feel it's like that. But before that, there's a lot of groundwork, steps we have to prepare beforehand. One of the problems is that most workloads are compiled or built somehow for a specific CPU architecture. ARM and AMD64 are completely different CPU architectures, so the binaries are not compatible. If you have a Java virtual machine built for AMD64, it doesn't work on ARM. So the first thing we have to do is adapt our build pipelines, where we containerize the workloads, so that they are multi-architecture and are built that way all the time, so all workloads get built for both AMD and ARM. Once we have that, we can, at the moment of scheduling, select which version of the same Docker image we want, the ARM one or the AMD one, depending on what hardware we have available at the moment. So we can start having hybrid clusters where some of the nodes are AMD and some of the nodes are ARM. And as more teams onboard onto these multi-architecture pipelines, the shift between both architectures is completely transparent to the user, and it only happens in the runtime phase of the execution.
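
To make that concrete, here is a minimal sketch (the image name and registry are placeholders) of how a multi-architecture image is typically built and published with Docker buildx: a single tag whose manifest list contains both an amd64 and an arm64 variant, so the container runtime on each node pulls the matching one.

```bash
# One-off: create and select a buildx builder that can target multiple platforms.
docker buildx create --use

# Build and push a single tag containing both architecture variants.
# registry.example.com/team/my-service is a placeholder image name.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/team/my-service:1.2.3 \
  --push .

# Inspect the resulting manifest list: both variants share the same tag.
docker buildx imagetools inspect registry.example.com/team/my-service:1.2.3
```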

Bart: From what you said previously, in the context of the different migrations you've been through, we went as far back as talking about Mesos, which is a bit of a throwback; we realized, oh wow, that was a long time ago. But in terms of planning a migration of this scale and complexity, how does that look? For people who might be thinking about something similar, or be in a similar process, what was that like in your case?

Thibault: So actually we optimized for reducing planning, right? And this is the whole strategy that we applied, and this is what we also wanted to explain in the blog post that we mentioned before. We wanted to reduce planning because, with that number of developers, planning would mean synchronizing with every one of them, and that would have a super high cost. At the same time, we wanted to provide autonomy, to give developers the opportunity to actually optimize their costs, and to split the work into smaller pieces. If we had gone for a full migration, then yes, the first thing we would need is to have all the images compatible with both architectures. We didn't go down that path. We said, okay, we enable it: we will enable people to use ARM images so that it's easy for them and they can do it pretty transparently. One of the things we observed is that in standard, plain Kubernetes, when you want to go for ARM, you need to repeat yourself. You need to say, okay, I need to build this image for AMD and ARM, and then I need to tell my pod, please go to AMD or to ARM. That's what you actually do. So what we went for is a clean, small interface: don't repeat yourself. If you already say it in the image, then we will take it into account; it's super simple. We also wanted to be conservative, because we had a lot of people running on top of our clusters. And from experience in another part of the company, we knew that sometimes, for example in Go, on some versions of the compiler and some versions of ARM, specific functions like cryptographic functions can be slower on ARM than on AMD. We had this feedback from other parts of the business that were not on the platform at that point in time, that had gone down this path and discovered this pain point. So we took this into consideration and said: we will be conservative; if your image supports both and you didn't say anything, we will go for the legacy behavior. And to be even more conservative, we provide what we call good defaults. That means that if the image is not a multi-architecture image, if you didn't build it with a Docker manifest list, we will select the legacy behavior, meaning AMD. So all together, we provide something that is backward compatible, with a small interface, but we still offer the possibility for people to force their architecture with a super simple interface, which is a label. Why a label? Because most of our users deploy through abstraction layers. And again, if we want to optimize for autonomy and for getting this feature into production pretty quickly, changing the abstraction layers to include it would mean changing several Helm charts, some in-house developed or shared, and in this case an abstraction layer named FIAAS that is shared with other companies. We would need to agree with everyone, and that slows down the whole process and so on. We wanted to keep the autonomy, keep it simple, ship it, and then observe the results. That was the approach: the least planning possible, but the safest possible. And then, as an extra step of care, what we did before actually going to production was to run several dry runs. Because Kubernetes is declarative in nature, we can list everything that is being used in the clusters and understand how what we are about to deploy, what we are going to release, is going to behave. And we can say, okay, for everything that is in the platform we would select AMD, so we can go for it. If we had cases where we were not sure, where there would be errors, we would go through this process again and ensure that, with what is currently in the platform, we select what is actually running.
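
As a rough illustration of that interface, a team that wants to force an architecture only adds a label to its pod template, and omitting the label keeps the conservative default described above. The label key below is invented for illustration; the real key is defined by Noe, so check its documentation.

```yaml
# Hypothetical example: opting a Deployment into ARM via a single pod label.
# The label key is a placeholder, not necessarily the one Noe actually uses.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
        arch.example.com/preferred: arm64   # force arm64; omit to keep the default behaviour
    spec:
      containers:
        - name: my-service
          image: registry.example.com/team/my-service:1.2.3
```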

Bart: A lot of planning went into this; I feel like the two of you could easily write a book to help other organizations through these problems, because they can be quite overwhelming. You mentioned an element of risk and also being conservative. Asking every team to choose a node selector to run their workloads sounds risky, as they might forget to do it. As much as we talk about technologies, there's also human error, and people might forget. What's your experience been like there?

Miguel: So yeah, it's very risky. It's risky in many ways, because when teams are trying to migrate, they may change the build of the image but not change the node selector, or they may find a bug and revert, and in that revert they revert the build but not the node selector. And now the pods don't start; they fail because the image is for the wrong architecture. And then it takes a while to figure out where this new error comes from, and this delays the release of features and stops the whole pipeline. So what we did is create a mutating webhook. This mutating webhook, called Noe, is installed in our clusters, and whenever a pod comes in, it checks whether a node selector is defined. If it's defined, you know what you're doing, or you should, so we will honor it. But if you don't specify a node selector, then we inspect the image that you have provided, and depending on what the manifest of the image tells us, we inject node selectors and node affinities according to the architectures that you support. We check all the containers running in your pod, including the init containers, because maybe your init container only supports one architecture. Even if you build your main container for both, you may have sidecars, like a Datadog sidecar or some other observability sidecars, and those may not be ready to run on ARM, or may not be ready to run on AMD for some reason; they were misbuilt. So we check that everything has a common architecture, and then we inject the adequate node selectors and node affinities so your pod can run, because our most important guarantee is that your workload, as a developer, will run in the cluster. And then, as Thibault was saying, to reduce the risk of the migration, for the initial phases we select AMD by default. Our goal is to eventually have so many teams that have selected that they prefer, and that's important, that they prefer to use ARM instead of AMD, that they have overridden this setting, so we can change the default behavior and stop being so conservative.
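
To picture the mutation, the injected constraint is roughly a node affinity (or node selector) on the well-known kubernetes.io/arch node label, restricted to the architectures that every container in the pod supports. This is a sketch of the idea, not Noe's literal output.

```yaml
# Sketch of the scheduling constraint a webhook like Noe might inject after
# inspecting the image manifests of all containers, init containers and sidecars.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch        # node label set by the kubelet
                operator: In
                values: ["amd64", "arm64"]     # only the architectures every container supports
```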

Bart: Wow. One thing just to clarify: you mentioned Noe. Is it open source?

Miguel: I think it was linked already in the blog post.

Bart: Good, good, good. Yes, you're right. What's wonderful about these conversations is that it's not just hearing from somebody; folks can go out, use it directly, and get that experience. Speaking about the webhook, though: the webhook was great for migrating regular deployments, but what about DaemonSets, which run on every node in the cluster? They can't run on ARM nodes if they're built for AMD nodes. So what about that?

Thibault: Yeah, that's a good point. Here we're lucky, because we are the only ones managing DaemonSets. Our users don't manage DaemonSets, because of the nature of our platform; it doesn't really make sense for them to run DaemonSets. So we are the ones running DaemonSets, and we had to actually do this work. Fortunately, after some trials, we realized that most of the community did go for multi-architecture builds, and actually for way more than just ARM and AMD; many projects build for fancier architectures, for their own data centers, than anything the cloud is giving us. That's one. And the second thing is that we had some DaemonSets, actually two, that were not multi-architecture; I just checked the numbers, and I thought it was more than that. For those, what we did was to build and repackage them ourselves. How do we repackage them ourselves? We didn't fork, because with a fork you need to maintain it, you need to update it, you need to do all of this, and then you need to fetch and push; it's a complex process, you need to track branches. What we did was to have patches. So we have a patch that enables a multi-architecture build inside the Dockerfile of those projects. Those two DaemonSets, namely kiam and a Kubernetes component in the version that we're using, are not compatible with multi-architecture builds, and for those we apply a patch that just mutates the Dockerfile so that it works in a multi-architecture build. What we also discovered is that the plain multi-architecture build from Docker, with emulation, is not as performant as what Go cross-compiling provides; the Go cross-compile is orders of magnitude faster than an emulated Docker build. So what we did was go for cross-compiling at the Go level. We have a multi-stage build, basically, and we used the TARGETPLATFORM and BUILDPLATFORM arguments from Docker, if I remember correctly. That allows us to do the cross-compiled build efficiently, and then in the last stage we just take the output, the binary. Fortunately, those two are written in Go, so it is possible; for things that are not written in Go, it's probably harder. And then we had some DaemonSets of our own, and for those we also adapted the Dockerfile. But we were already using a centralized CI script that helped us minimize the cost of this move towards multi-architecture builds; you can think about it as a reusable GitHub Action. We just added support: this reusable GitHub Action detects whether or not the TARGETPLATFORM argument is used. If the TARGETPLATFORM argument is there, build multi-architecture; if not, don't. And that's it. We just had to do those two or three things in our own code base and we had support.
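
A minimal sketch of that pattern for a Go service (module paths and base images are illustrative, not the exact Dockerfiles they patched): the build stage runs natively on the build host thanks to --platform=$BUILDPLATFORM, and Go cross-compiles for the target via the TARGETOS/TARGETARCH arguments that buildx derives from TARGETPLATFORM, so no emulation is involved.

```dockerfile
# syntax=docker/dockerfile:1
# Multi-stage, multi-arch build using Go cross-compilation instead of emulation.
# The build stage always runs on the builder's native platform.
FROM --platform=$BUILDPLATFORM golang:1.22 AS build
ARG TARGETOS
ARG TARGETARCH
WORKDIR /src
COPY . .
# Cross-compile natively for the target platform.
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app ./cmd/app

# The final stage is per target platform and only copies the compiled binary.
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

Built with `docker buildx build --platform linux/amd64,linux/arm64 ...`, the Go compiler runs once per target but always natively, which is where the order-of-magnitude speedup over emulated builds comes from.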

Bart: Were there other lessons that you learned from this migration?

Thibault: The first thing that we realized and learned is that at the beginning, when we started developing the webhook, the webhook started to deny requests, for reasons, because there was a bug or whatnot. We had the webhook in a mode that ignores failures, yet pods couldn't enter the cluster, and this created an incident. So what we learned is that the failure policy only applies if the webhook is not available. If the webhook answers, whatever the mutating or validating webhook says will be applied, even if it's a denial. That's one of the big failures that we had, together with some rate limiting and latency problems from our Docker registries. Noe is actually clever and doesn't pull the whole image, because of course that would be gigabytes of data; we're not doing that, we're inspecting the manifests directly in Noe, so that's much less data. But sometimes the latency is a bit high, and we have rate limits on registries like Docker Hub or our own internal registries, and from time to time we have DDoSed our registries. For this, we implemented a cache. What we also learned is that sometimes the webhook isn't applied. It may not be applied because it had a problem, and that means a pod could be scheduled on a node that doesn't match. Or it may not be applied because the webhook isn't deployed yet, in the context of a cluster bootstrap, and because all of our tooling relies on this, we had pods scheduled on the wrong nodes. For this, we implemented a mechanism where, on startup, we list all the pods and check whether each one is scheduled on the relevant node. Here we faced the registry DDoS, or rate limiting, problem again, because for all the pods that were already running we were fetching the image manifests. So we DDoSed our registry again, twice. What we learned from this is: when you write a webhook, don't expect that everything will go through the webhook unless you force it, and don't even expect that everything will have your webhook applied; check it afterwards. And when you deploy your webhook, also apply it to the resources that were already there, that were injected, created, or updated when the webhook was not there.
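
The failure-policy lesson is worth spelling out with a trimmed sketch of a webhook configuration (names, namespace, and paths are placeholders, not Noe's actual deployment details): failurePolicy only governs what happens when the API server cannot reach the webhook; if the webhook does respond with a denial, the request is rejected regardless.

```yaml
# Illustrative MutatingWebhookConfiguration; service name, namespace and path
# are placeholders.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: architecture-selector
webhooks:
  - name: pods.arch-selector.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Ignore only covers timeouts and unreachable endpoints: a response with
    # "allowed: false" from the webhook still rejects the pod.
    failurePolicy: Ignore
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: arch-selector
        namespace: platform-system
        path: /mutate
```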

Bart: And how did the teams cope with this migration? One thing is the infrastructure, but I'd imagine that some applications needed tweaks and adjustments to run on a different instruction set.

Miguel: So here we are lucky as well, because one of the main philosophies inside the tech organization at Adevinta is: you build it, you run it. So the teams own the requirements for running their workloads, and they own adjusting their applications. The way we designed the migration plan was to enable these teams, so it's the teams that select when they want to do it. We announce it, we tell them, we offer them help. We provide support to all these teams and we can work with them if there's something to adjust, but they choose when they are ready and when they have something to test, something to tweak. And then we provide testing facilities so they can test beforehand and ensure there are no problems. The other thing is that we have a lot of workloads written in Go, and Go mostly runs equally well on AMD and ARM. There were some problems with cryptography in the past, like Thibault was saying, but nowadays it works seamlessly, more or less the same. And we also have a lot of Java, and once you have built with the proper version of the Java runtime, it's not a big issue to migrate the workloads. So far, we haven't faced a lot of issues. If you are using other languages and other runtimes, you may have more work to do on support; it depends on the maturity of the ecosystem. Luckily, we didn't have to deal with that part a lot.

Bart: Good. So you said not too many issues, but I want to know: was this migration successful? What's the feedback from the business, from the end users? A lot of work went into this. What's been the reaction?

Thibault: So before that, we had several people asking for ARM, right? Because, as Miguel mentioned, you build it, you run it; but we added one more: you pay for it. So we, the platform team, are not accountable for the cost of your services. And it's the cost efficiency, not only the pure cost, that we're interested in; the developers are accountable for the cost efficiency of their services. What does this mean? It means they have an interest in going for ARM when they can, because then they reduce the cost of their services, because of how we expose the costs. So yes, we had people asking for this. We managed to prioritize it because we had this team that gave us some lessons learned about migrating to ARM; they were also running on Kubernetes, and we were thinking together with them that it would make sense to converge our Kubernetes clusters. So we prioritized this project mostly because of that, and because of some customers who were actually willing to use ARM. For that other team, the business was pushing to migrate to ARM, so the business was happy because they saved costs on their side. The rest of the business, running on AMD, that has other priorities because they need more speed to develop new features or whatever, they were also happy, because it didn't change anything for them. They could keep business as usual, they could focus on their features, and we didn't disturb them at all, or hardly at all. Well, apart from when we had those incidents that we mentioned, right? Sometimes things fail, and it happens, obviously. When that happened, that's when we actually had some negative feedback from our users. But since we introduced the feature, we have had very little bad feedback reported. And is it successful? Well, when we introduced the feature, less than 2% or 3% of the nodes were actually ARM. We introduced the feature for our users in June, and now we are at 20% of nodes on ARM. That's a huge increase. Onboarding those teams that we mentioned helped us. We also still support the legacy, or let's say AMD, architecture, so we are backward compatible, because we absorb the toil as much as we can. That's one of our goals, to absorb the toil. That's what makes the business happy as well.

Bart: And like you said, with any migration, we can think about it almost like an SLA: there are going to be some things that don't work. It's about giving enough notice ahead of time, saying there are going to be some bumps in the road, because of course there are. But the planning, and the dry runs you mentioned previously, I think these are all very good insights for teams out there thinking about going through a similar process: understanding that this can be done, and the transparency and communication with the different stakeholders about what they can expect, so that there aren't any crazy expectations or disappointment. In hindsight, would you do anything differently if you had to go back and do it all over again?

Thibault: Well, probably, when we enabled it, between the last dry run that we did and the time that we enabled it, there were some new pods, and I think some invalid results were reported. We had some multi-architecture manifests, but with no valid architecture in them. We probably should have detected that before, so, more dry runs. And this also means that we probably should have gone a bit slower in how we enabled it, probably by denying those pods at first, ensuring that every pod that enters the cluster complies with the interface that we need and that we are able to detect an architecture. So we deny, and then we are clear on the interface: if you don't do anything, this will break, so we deny you, because we know that this will break. That's something we would have done differently, I think.

Bart: Miguel, anything you want to add there?

Miguel: Yeah, I think another thing that we didn't do is take a good look at the registries, the Docker image registries that we depend on. This is something we found out later, when we started DDoSing them and hitting the latency. We could have used a bit more time to make sure, instead of taking them for granted. We assumed that they are there and they are going to work, and we needed to be a bit more careful with them at the beginning.

Bart: Now, in the beginning we talked about the non-technical side of a career in tech. Getting a migration like this moving requires a lot of conversations and convincing, making sure that people understand how these things are going to work. How did you convince the teams, and internally the business, to migrate to Graviton instances? What was that like? The powers of persuasion, showing people a brighter future. What was your strategy there?

Miguel: So we already had demand for this, and we were lucky that other parts of the organization that were not running on SCHIP had already made the business case for themselves to migrate to ARM. So we had a solid foundation: this feature was desired and was useful. And so, when we implemented it, we reached out to the people who had requested it first, to be early adopters. And then we could use their savings, because the developers own their own budgets and have to demonstrate the cost efficiency of their workloads, as the business case to show other people: hey, you need to shave your costs by this much; look, these people already did, and they did it like that, and it was just an annotation on a namespace. So, yeah, if you want those savings, just put the annotation on the namespace, let's do a dry run, let's do a try, and if you have problems, reach out to me. We do need to do a bit more evangelism internally for this, because the approach of just enabling teams, of removing all the barriers to adoption, means that the people who are not using it are the people who have not realized how useful it is to them. Or maybe it's never going to be useful to them, so they are not prospective users and we don't need to migrate them. There are workloads that work fine on AMD and maybe cannot be migrated to ARM. Maybe they are legacy applications with no development anymore, and migrating them doesn't make any sense. Okay, that's fine. It's the team's trade-off, the team's decision: whether they need to save costs, or maintain stability, or not touch that piece of code that is working right now. And that's the beauty of this approach for us, because we don't have to make those calls. They make their calls, and we just help them find all the options, putting all the tools at their disposal for solving this.
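
For illustration, a namespace-level opt-in of the kind Miguel describes might look like the snippet below; the annotation key is invented here, since the real key is defined by the platform.

```yaml
# Hypothetical namespace opt-in; the annotation key is a placeholder.
apiVersion: v1
kind: Namespace
metadata:
  name: team-marketplace
  annotations:
    arch.example.com/default-architecture: arm64   # prefer ARM for workloads in this namespace
```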

Bart: Wow, fantastic. Like you said, it's a combination: there was demand for it, and then there are going to be some cases where it's out of sight, out of mind, not relevant, so you don't have to convince those folks. For others, there's the expression, don't knock it until you've tried it: look, if you don't like it, that's okay, but just try it, and then the feedback there is very valuable. And I think it's beneficial for folks out there who might be in a similar situation to understand: how do you build a coalition of people who will be enthusiastic about this and be willing participants? Anything you want to add there, Thibault, in terms of learnings on the human side?

Thibault: Well, I think most of it has been said, right? Our users are also motivated by the framework that the company gives them. It's not the classic legacy setup where the developers write their code and then ship it over the wall to the SRE team that will run it in production. You build it, you run it, and you pay for it. So the framework that the company provides already gives our users an incentive to take a look at this, to say, okay, how can I be more cost efficient in the way I develop? And this removes the burden of chasing our users and saying, hey, did you take a look at this? And as Miguel mentioned, in a company there are a lot of cases where you have a legacy application, or something that you still maintain but where you go for the lowest cost possible, low-cost maintenance, because you know that you're already building the next thing. So as a platform team, we should help the company build the future, and not push a burden on the company to maintain, and increase the cost of maintaining, the past. We should look at the future. And I think that this is what we did with this migration, or this support for ARM, because we named it support for ARM.

Bart: Really solid advice there. And I really like the point, which I think is applicable to all companies, about creating a culture of responsibility: you build it, the ownership aspect, so that people have more of an attachment to something, and it's not just, oh, it's going to be somebody else's problem. And I say that because I've seen it directly in organizations, watching friction build between teams because of this idea of "it's not my responsibility, it's going to be your problem." That's not very conducive to an environment where you want people to participate willingly and collaborate, something based more on empathy. Now, let's shift gears to things that aren't so technical, but are technical in a way. We did talk about images, but I understand, Thibault, you're into photography. Tell us about that.

Thibault: I am, when I have some free time; with Kubernetes and a seven-month-old daughter, it's a bit hard to find time for it. But yes, I do. I went through different stages in photography, from concerts to street photography lately, and studio, and all of this. Hard to find time to do it. My favorite subject right now is basically my seven-month-old daughter.

Bart: That's a beautiful thing to say, and I think that's nice. And obviously, plenty of opportunities to take pictures of her, you know, growing up. Just a quick question: analog, digital, or both?

Thibault: Yeah, I'm mostly on digital, although I have plenty of analog cameras. And I started with analog when I was probably 12 years old, right? So quite a long time in photography in the end. But now, yes, mostly digital.

Bart: Very, very good. And Miguel, I heard you're into Taekwondo. Tell us about that.

Miguel: So, well, I practiced Taekwondo a long time ago. I started in my teens and spent eight to ten years practicing and hopping around. I think my title is still valid as a referee for Taekwondo competitions. But I have been out of practice for the last few years. When I joined the tech industry, it was much harder to attend, and once you get on-call, it can be disruptive to almost any commitment you may have. Getting involved doesn't help either, or moving around a lot. So yeah, that's something that is currently in my past. Now I've changed my mock fighting with people into mock fighting with characters, and I'm usually playing tabletop RPGs. And actually, at Adevinta, we found some people and made a small group, and we meet to play tabletop RPGs and have all these challenges in a bit less complicated fashion.

Bart: There are lots of things we can extract from that which are very applicable to your experience with the migration. So, a couple of things. What's it like being a referee? I don't think I've ever spoken to someone who's been one. It seems really hard because, just like with a lot of things, you have a lot of emotions in your hands: you've got disappointment, you've got expectations, you have family members that might be in the audience and that are going to be very upset with a decision. How do you handle that pressure?

Miguel: It's hard. It's hard. You handle it by having a good team. No referee, or almost no referee, is alone, and especially in Taekwondo the referee is not alone; there are six to ten people on a mat acting as referees at different levels. So you have your team, you have trust in your team, and you work with them on making sure about the decision. The referee who makes the decision can consult everyone, so you can trust that people are going to tell you whatever you're missing. And once you make the decision, well, you have to live with it. There are people that will get upset, but they will get upset no matter what you rule. And it becomes a bit zen in the end. Like, yeah, that's how it is. I did my best and I cannot do better than this.

Bart: I love that, though. Like I said, it's completely applicable to what we're speaking about. And similarly in your case, Thibault, I think, with photography. I make videos and things like that, but not at a very sophisticated level. What a lot of people don't understand is that now, with technology and phones, it's really easy to take pretty decent pictures, so it sort of seems that you don't necessarily need a good camera. But in order to take a good picture with a good camera, there's a lot of time spent calibrating, whether we're talking about aperture or shutter speed, all the different things that go into getting a good picture. There's a lot of prep work, and it relates to things like the dry runs, making sure that everything is going to go right; there's a ton of stuff that goes into that. Do you think there's any connection with how that's helped you approach your job, in getting the right balance of factors for something like this migration to move forward?

Thibault: I don't know if I would relate anything, right? What I know is that my Instagram has maybe a few hundred pictures, and I have 200,000 pictures on my laptop, on a hard drive. So that means a lot of trial and error. Yes, a lot of it. You mentioned the settings. The settings are probably 1% of the picture. 90% or 99%, I like to say 99% because it's a bigger number, but 99% of the picture is actually how you look at the scene, how you evaluate what's going on, what's happening, what the landscape is that you have. And then you think, okay, with this background, if I move there, how will this look? Maybe I need to go closer, maybe I need to go farther, because the perspective I will get is a bit different. And I need to wait for somebody to enter that specific combination that makes the picture. In reality, this is the hard part of photography. The shutter speed and so on, I don't care; a phone can do it very well, and you have a lot of very good photographers who actually use phones nowadays. Because this is technology, and this is not very interesting. The interesting part, if you get into photography, is actually this: how you place yourself, how you evaluate the landscape. And this is something that you cannot replace with phones. Maybe a bit with AI; we can argue about that. But this is the line of thought that I have. I could quote some friends of mine who actually run photography workshops, and they are in the same line of thought: most of the picture is how you look at it. And if you think about it, 50 years ago, when we only had film cameras and we had people like Cartier-Bresson photographing, they didn't have super high tech, and actually their images are often not that sharp, because that is not what photography is about. Photography is more about composition, about the light, understanding the light and how it is going to behave, and how things are going to echo each other depending on the composition that you choose. Probably it relates a tiny bit to orchestrating the SCHIP product in some cases.

Bart: Yeah. No, no, wonderful answer. And for me, I asked about the settings just because, for me, it's completely overwhelming. But it's really refreshing to know that 99% of it is not related to that; I understand it's the framing, like you said, the composition, how the light is going to interact with the elements that are there. Good. Last question that I do have, though. Miguel, since you mentioned tabletop role-playing games, what are some of the things you take from that in terms of building coalitions, thinking about next steps, all the stuff that goes into it, also conflict resolution? And what tabletop role-playing games are you playing?

Miguel: So right now, we are playing Warhammer Fantasy. So we are playing one of the classic campaigns of the Warhammer Fantasy role-playing game. I think it's from the 90s. It's a bit dated in some parts. But we kind of go from game to game, trying different systems and playing around. We are doing it for fun. And instead of getting into just one game, we try to get a good overview of what has been the history of tabletop role-playing games.

Bart: Very cool. I must say, I asked about this when we were getting ready to do the podcast recording. In the last five years, I would say I've gotten into Warhammer 40K, but not so much playing the game as mostly just the lore that surrounds it. And I find it incredible, the amount of storytelling that's gone into it, and the building of characters in these different scenarios. And while, of course, it's fantasy, there's so much of it that relates to conflict resolution and all kinds of different things that we see in modern life. So I find it very relevant and very helpful. That's great to know. Now, if people want to get in touch with either one of you, what's the best way to do so?

Miguel: Well, if someone wants to reach me, I'm on LinkedIn. And also my GitHub account has my public email. So just ping me, email or by LinkedIn.

Bart: Good. Thibault?

Thibault: Well, I'm on most of the social networks, I would say. So LinkedIn, Twitter, X, and Instagram, right? So those are the mediums that I mostly consult. Maybe not every day, but very often.

Bart: Often enough for the folks who want to get in touch with you. I want to know as well: what was the reaction to the blog post? And based on that, can we expect future blog posts? What's next for the two of you?

Miguel: I think the blog post has been quite popular. We've also had other blog posts published around it that have had a high impact, so the metrics are a bit murky, but I've heard from our branding team that the blog post has made huge strides since this summer, and it's very popular right now; it's in the top five of the most popular posts in the tech blog. I don't know if we will get more blog posts about Noe, but there are going to be more things about SCHIP, for sure. We have an internal commitment to disseminate the work we do and show it, because you go to conferences and you see people speaking, and with people who, like Thibault, have been here for a long time, the team is like, okay, yes, we do that, and we do that, and we do that. So why the hell are we not out there speaking about what we do? So we are trying to change that and to speak a lot more about the work that happens in the team. One of our colleagues, Tanat Lokejaroenlarb, and I think I butchered his name, sorry, he's from Thailand and I'm not really sure how it's pronounced, has a bunch of blog posts also on the Adevinta tech blog, and he is starting to give some talks as well in public spaces about the work that we do on SCHIP. He's working a lot as an ambassador for the team right now. Thibault, anything you want to add to that?

Thibault: Well, most of this was said, right? Of course, we will keep writing blog posts, maybe not directly me. I would love to have the time to explain more of what we are doing and where we're heading, because we have plans for it. The team is growing; we are actually almost doubling the size of the team, if we take into account everybody who will contribute to what SCHIP is today. So that means a lot of challenges: a lot of new members coming from different backgrounds, and it will be fascinating to learn from them as well. It's not only what we have at the moment and the direction we are taking at the moment that is important, but also learning from everybody who will be in this team, and in this group of teams, because it will be a group of teams. That will be interesting, and hopefully we can write blog posts about it. Not only Tanat; Tanat is a great writer, and I would definitely encourage reading his blog posts about our failures, because most of the blog posts are about our failures with SCHIP. But also João, our manager, writes a lot about his approach to management on X/Twitter, and that is also pretty interesting to read. So yes, we are in the process of sharing more of what we do, and hopefully we can make it to conferences and basically stop thinking that we're not doing that great, because what we see today is that we are doing great, and we need to make it more known.

Bart: Couldn't agree more. And from listening to your experience and reading the blog posts as well, it's abundantly clear that you are doing fantastic things. Tons of know-how and experience going into this. And I think it's going to be very beneficial for other individuals as well as organizations to hear about your experience. That being said, I hope our paths cross either at KubeCon in Paris or somewhere, some event in Barcelona. Hope to be out there for our CNCF meetups as well too. Keep sharing your knowledge. Thank you for your hard work and dedication. And we'll be speaking soon.

Thibault: Thank you.

Bart: Bye. Thank you.