Tortoise: outpacing the optimization challenges in Kubernetes

Host:

  • Bart Farrell

Guest:

  • Kensei Nakada

In this KubeFM episode, Kensei Nakada discusses Tortoise, an open-source project he developed at Mercari to tackle Kubernetes resource optimization challenges. He explains the limitations of existing solutions like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), and how Tortoise aims to provide a more comprehensive and automated approach to resource management in Kubernetes clusters.

You will learn:

  • The complexities of resource optimization in Kubernetes, including the challenges of managing HPA, VPA, and manual tuning of resource requests and limits

  • How Tortoise automates resource optimization by replacing HPA and VPA, reducing the need for manual intervention and continuous tuning

  • The technical implementation of Tortoise, including its use of Custom Resource Definitions (CRDs) and how it interacts with existing Kubernetes components

  • Strategies for adopting and migrating to new tools like Tortoise in a large-scale Kubernetes environment

Transcription

Bart: What's so tough and tricky about the Horizontal Pod Autoscaler, or HPA? It turns out that things like automatic service parameter optimization can make it easier, among other things. To learn more about that, I spoke to Kensei Nakada, who told me all about a project called Tortoise. Tortoise is a project designed to replace HPA and VPA, automating resource optimization using data from Prometheus and the Metrics Server. Kensei shared insights into Tortoise's adoption at a company called Mercari, including service migration and simplifying internal implementations to enhance collaboration. In an age where we sometimes value speed too much, why the name Tortoise? Because slow and steady wins the race? Check out this episode of KubeFM and find out for yourself. This episode is sponsored by Learnk8s. How do you choose the best instance type for your Kubernetes cluster? Should you use a few very large instances or many small ones? When using an 8 GB, 2 vCPU instance, are all the memory and CPU available to pods? Is running the same Kubernetes node in AWS cheaper or more expensive than in Azure or GCP? The Kubernetes instance calculator answers those questions and a lot more. The calculator is a free tool that lets you estimate costs for your workloads based on requests and instance sizes, explore instance overcommitment and efficiency, identify over- and under-spending by modelling error rates against your actual memory and CPU usage, and compare instances between different cloud providers. It's an easy way to explore cost and efficiency before writing any line of code. You can find the link to the calculator in the comments. Alright, Kensei, what are three emerging Kubernetes tools that are catching your attention?

Kensei: Right now I have, okay, so first, the Gateway API. It's an official project from SIG Network, I guess. It's designed in a different way from other projects: it's just a framework and provides a unified interface to users, and there are many controllers that implement that interface. For example, there are Envoy Gateway, Kong Gateway, Istio Gateway, etc. All of them work with the same unified interface, so users can choose the implementation based on their use case. When they want to use Envoy, they use Envoy Gateway, and so on. That's an interesting one for me. The other one, in-place pod resizing, is not a standalone tool but an official new feature for Kubernetes pods. I've been looking forward to this feature for years. It enables pods to be resized after they start. Currently, pod resource requests and limits are immutable fields, so they cannot be edited after the pod starts. This feature will make those fields editable after creation. That's going to be awesome in terms of autoscaling. We have VPA, which will support this feature and change pod requests while the pods run, based on the resource consumption it measures. That's going to be interesting. The last one is SpinKube. I'm not sure about the pronunciation of it. It's basically a serverless solution on Kubernetes based on Wasm. We know Wasm nowadays is not only for web browsers but also for Kubernetes. Around Kubernetes, there are some initiatives trying to utilize Wasm as a runtime or as modules. I can see many possibilities around SpinKube because it takes less time to start up and finish compared to containers. Maybe we can use resources more wisely, etc. There are many possibilities, I think. Those are the three projects that I'm looking forward to.

Bart: Very good, and very, very thoughtful answers. Particularly that you started out with something related to networking; I've heard some very unpopular opinions, or well, some very widely shared opinions about Ingress, and how people are excited about the arrival of the Gateway API. It's very nice to hear from your perspective as well that you've seen some value added there. Also nice, and we'll touch on this a little bit later, mentioning Wasm: at the last KubeCon it felt like there was more attention on Wasm than at previous ones. It's good to see that more energy is being directed there. To get to know you a little bit better, for people who don't know you, can you tell us about who you are, what you do, and where you work?

Kensei: Sure. So I'm Kensei Nakada, a software engineer based in Japan.

Bart: That's great. How did you get into Cloud Native? What were you doing before Cloud Native?

Kensei: My first step was Google Summer of Code in 2021. Google Summer of Code is a mentoring program for open source projects. I was selected for the Kubernetes project and developed a Kubernetes scheduler simulator. This was a great opportunity for me to get involved in the open source Kubernetes community. Afterwards, I started contributing to the scheduler, not only to the simulator but to the Kubernetes scheduler in general. I barely used Kubernetes at work at that time, but the contributions grew my interest in Kubernetes and Cloud Native technologies, and I started to feel like I wanted to explore other job opportunities. Before Cloud Native, I worked as a backend engineer, mostly with Golang. I got an offer as a backend engineer at Mercari and started working in the search team initially. However, I was very interested in Kubernetes and Cloud Native technologies, so I asked my manager to move me to the platform team. That's how I got into Cloud Native.

Bart: Now, you've been involved since Google Summer of Code, contributing to Kubernetes in different ways for the past few years. How do you keep updated with Kubernetes and the Cloud Native ecosystem? Things move very quickly. What's your strategy for staying on top of things?

Kensei: So basically, I mostly just follow the updates in Kubernetes itself, I mean, upstream Kubernetes. I don't follow other tools on top of Kubernetes actively; I get to know such tools when they gain momentum, like SpinKube. I get up-to-date information about new features, etc. through my contributions. I review some new features and have to look at the changes in Kubernetes itself, and this is how I usually get to know new features, especially around my area of Kubernetes contribution. I also follow Last Week in Kubernetes Development, which is a fairly well-known website. It's a really nice project for following what is going on in other parts of Kubernetes, because Kubernetes is a large project. I'm one of the contributors, but I don't know about all of the changes going on. So I follow the articles from Last Week in Kubernetes Development and try to keep myself up to date with the project.

Bart: Last question in our intro section. If you could go back in time and share one career tip with your younger self, what would it be?

Kensei: I would tell him to start studying English early so that my English today wouldn't have this cute accent. Every time I spoke at a conference, I was like, how should I handle it if I can't understand the questions? Sometimes I feel that language barrier, that language hurdle. Even recently I feel it; Tetrate is a foreign company for me, an American company, but I still feel those language barriers. So that's the one tip I would give him.

Bart: One thing that I would like to say to that, though, as someone who's from the United States but has been living in Spain for almost 13 years and working in different languages: one of the great things about the Kubernetes ecosystem is that we have people who all speak English in their own way. It's an opportunity for people to learn English, as well as for native English speakers to learn how to help others get the things done that they want to do, to respond to their questions, and to learn how to say things in the most direct and clear way possible, because what's obvious to one person may not be obvious to another. I can tell you right now, Kensei, my Japanese would not go very far in this conversation. So I understand why you make that point. One part is about language, but there's another part, like you said, about giving a talk in another language. That's challenging. Giving a talk in any language is challenging; most people are terrified of public speaking. So I congratulate you for doing that. That's no simple task. One part is the technical side, but the other part is sharing that information, organizing it, trying to help people. So kudos to you for doing that.

Kensei: Thanks.

Bart: As part of our monthly content discovery, we found this article you wrote titled Tortoise Outpacing the Optimization Challenges in Kubernetes at Mercari. The questions we're going to ask today will explore what was covered in the article further. You originally wrote the article while employed at Mercari. Before we dive into the actual article, can you explain your role? What were you doing as a platform engineer, and what was your team responsible for?

Kensei: Right. So at Mercari, we have a platform team, which is divided into four sub-teams: network, CI/CD, infra, and platform. Among those, I'm in the platform infra team, and we are responsible for the foundation of our platform, such as the Kubernetes clusters, and for providing other infrastructure from the cloud providers. We are basically using a cloud-based architecture, or you could say a Kubernetes-based architecture. Apart from that, we also have responsibility for observability. Those two are the main responsibilities of our team. Also, at Mercari, we have a company-wide FinOps initiative. Basically, almost all the applications in Mercari run on top of our platform, so the platform team can make a great impact on this initiative: when we make a change in the infrastructure, it impacts all the applications. That's why the platform team contributed a lot to this FinOps initiative in terms of infrastructure.

Bart: And when we're talking about cost optimization in Kubernetes, it's a broad topic.

Kensei: There are many parts that we can optimize in Kubernetes clusters, such as computing resources, networking, and observability tools, I mean, tooling in general. All of those are really expensive. At Mercari, each platform sub-team works on the optimization it is responsible for, and the platform infra team takes care of optimizing our computing resources and observability tools. I focused on computing resource optimization. We considered that there were two main parts to this: computing resource optimization includes node-level optimization and pod-level optimization. The node-level part includes instance type optimization, Cluster Autoscaler configuration, etc. One example is that we try to utilize spot instances a lot, because spot instances are cheaper than on-demand nodes. Another thing is that we moved to other instance types such as T2D in GCP, which also brought a big cost reduction. Given that it's our responsibility to manage such infrastructure, it's easier for us to make those kinds of infrastructure-level changes, so we tried this kind of optimization first. But as I mentioned, there is also pod-level optimization. It includes optimizing the resource requests and limits of each pod, optimizing autoscaling parameters, etc. Those are per-service optimizations, and they are usually the service developer's responsibility, not the platform team's. Ideally, each service developer should be responsible for this kind of optimization, should look at how the resources are consumed, and should optimize them as necessary. That's how we divided computing resource optimization into two groups.

Bart: Optimizing apps requires a deep understanding of the app itself.

Kensei: As I mentioned, ideally it is the developers' responsibility to optimize their own parts. But we know that optimization requires a lot of knowledge about Kubernetes, and not all service developers are Kubernetes experts. That's why we have a platform team that takes care of Kubernetes for them, building tools or an abstraction layer so they can work on top of Kubernetes without in-depth knowledge. The same goes for optimization: we, as the Kubernetes experts in the company, have to be involved in pod-level optimization as well. However, Mercari has tons of microservices, more than 1,000 deployments or something, and it's unrealistic for the platform team to work closely with every team and optimize each service together. Our strategy was not to help with optimization directly; we tried to reduce the engineering cost of optimization and asked each team to work on optimization by themselves. Specifically, we first documented the best practices around optimization, but that alone is not enough for developers because they still have to go through all the steps. We also provided a useful tool, a resource recommender Slack bot. It suggests the best resource request based on the resource consumption history of the service. People can apply the recommendation from the bot, and everything should go well. That was our strategy for pod-level optimization.
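
To make the recommender idea concrete, here is a minimal Go sketch of the kind of logic such a bot could use: take a window of per-pod usage history and suggest a request with some headroom on top of a high percentile. The percentile, the 10% margin, and the sample values are illustrative assumptions, not Mercari's actual implementation.

```go
// A minimal sketch of recommender-style logic: suggest a CPU request from a
// window of usage samples. Illustrative only; not Mercari's resource
// recommender bot.
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0-100) of the samples,
// using linear interpolation between the two nearest ranks.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := (p / 100) * float64(len(sorted)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	if lo == hi {
		return sorted[lo]
	}
	frac := rank - float64(lo)
	return sorted[lo]*(1-frac) + sorted[hi]*frac
}

// recommendRequest suggests a CPU request (in cores): a high percentile of
// observed usage plus a safety margin so short spikes don't cause throttling.
func recommendRequest(usageCores []float64) float64 {
	const margin = 1.1 // 10% headroom on top of the P95 (assumed value)
	return percentile(usageCores, 95) * margin
}

func main() {
	// A month of (fictional) per-pod CPU usage samples, in cores.
	samples := []float64{0.4, 0.5, 0.45, 0.6, 0.55, 0.7, 0.65, 0.8, 0.5, 0.6}
	fmt.Printf("suggested CPU request: %.2f cores\n", recommendRequest(samples))
}
```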

Bart: Can you go a little bit further on that? What exactly is the resource recommender Slack bot? Is it based on the Vertical Pod Autoscaler?

Kensei: Oh yes. So it's based on VPA, the Vertical Pod Autoscaler. VPA is an official sub-project, from SIG Autoscaling. It keeps track of the resource consumption history of each target service, calculates the best resource request based on that history, and eventually mutates each pod's resource request based on the calculation result. VPA is great, but we found that it could cause a disturbance in services, so we decided not to use it directly. Instead, we implemented the resource recommender Slack bot based on VPA. Specifically, VPA has to replace pods when it wants to apply the recommendation: when it wants to change pod resources, it evicts, it deletes, each pod so that the ReplicaSet creates another pod, and for that creation the VPA mutating webhook mutates the resource request on the new pod. That's how it applies the recommendation, but it has to delete some pods for the replacement, which results in some disturbance in services. Another problem is that it can cause frequent restarts when the recommendation changes very often. That's another reason we wanted to avoid using VPA directly. That's why we built the resource recommender not on VPA directly but on VPA's logic: it takes one month of recommendation history from VPA and just suggests it as a monthly Slack message.

Bart: Having a bot that makes those kinds of recommendations around requests and limits would be very attractive to a lot of people. Are there any downsides or shortcomings of it, things that people should be aware of?

Kensei: Actually, we faced some challenges with this strategy, with this resource recommender. The first one is the accuracy of the recommendation. As I mentioned, the recommendation is calculated based on one month of VPA recommendation history. But the problem is that the recommended resource values start to get stale soon after they are sent. Mercari is, of course, developed and changed every day, and any change, such as implementation changes in applications or traffic pattern changes, could influence the value of the recommendation. The recommendation is only accurate at the moment it is calculated. Therefore, if people apply an old recommendation that was sent a week ago, the application could end up out of memory in the worst case. That's the first problem we faced. The next one is that people just ignore the recommendation messages we send. It's their responsibility to use them or not, and because of the first problem, the recommendation is not 100% safe; it's more like a hint. They're uncertain about the actual safety of the recommendation, so they have to check whether it really makes sense based on recent resource consumption, and they also have to make sure it works safely after applying the value. Those steps are troublesome for users and take up each team's engineering time. People got less and less interested in the recommender messages, and eventually some people just ignored them. That's what happened. The last one is fundamental: the optimization never ends as long as the service keeps running. It's not only about the recommender. Every change could change the best value of each parameter, which means every developer has to continuously put effort into tuning parameters. It's literally endless; the optimization will eat the engineering time forever, until Mercari itself ends. Mercari has tons of microservices, so this small engineering time piles up every week and every month. It is a burden to keep up this manual effort to optimize the resources. Those are the challenges we faced with the resource recommender.

Bart: Alright, we can see that there's a balance to keep in mind of the advantages and disadvantages, but it seems that it's better to have those challenges than to not have the bot at all. Despite these shortcomings or, like you said, the difficulties that can be experienced, I imagine that this can still lead to some substantial savings, correct?

Kensei: Yes, of course it's better than not having the bot at all, but it actually couldn't make a huge impact. There's another reason here. At Mercari, we use HPA a lot. HPA is the Horizontal Pod Autoscaler; you set a target CPU utilization, maybe 80% or something, and if HPA detects that the service uses more than 80%, it increases the number of pods. That's how it works. As I mentioned, the platform team documents best practices around Kubernetes, and we recommend people use HPA for scaling, especially on CPU. I would say most of the big services have an HPA; there are several hundred HPAs in the cluster. When HPA manages CPU, the problem is that the recommender cannot really do anything about CPU optimization. The reason is that, for example, if your HPA has an 80% CPU target, then ideally CPU utilization is always 80%. Let's say CPU utilization really is always 80%, and the recommender finds that this application always uses only 8 cores out of the 10 cores it requests. The recommender would then suggest the service reduce its CPU request to 8 cores. But this is wrong, because if we trust this recommendation and reduce the CPU request to 8 cores, then HPA will target 80% of 8 cores instead, so it doesn't make any difference in terms of resource utilization. That's why I said that when HPA manages CPU, the recommender cannot do anything about CPU optimization. We have many HPAs in the cluster, which means the recommender cannot optimize a big part of the CPU in the clusters. You may think that if a service has an HPA for CPU, we don't really need to care about CPU optimization because HPA optimizes it. But that's correct only as long as you keep the HPA itself optimized. HPA can optimize the resources, but that doesn't mean your HPA always does a great job; for HPA to do a great job, we have to do a great job optimizing HPA's parameters, and sometimes the resource requests too. Mercari has a lot of HPAs, but we still had to optimize CPU further, which means HPA alone wasn't doing a great job in terms of CPU utilization. That's why we found that the resource recommender is great, but not at all enough for us.
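
A quick worked example of the arithmetic behind that point, using the documented HPA scaling rule desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization); the numbers are illustrative. Cutting the request from 10 to 8 cores just makes HPA run more replicas, so the total requested CPU barely moves.

```go
// A worked sketch of the standard HPA scaling formula, showing why shrinking
// the CPU request doesn't save anything when HPA already targets a fixed
// utilization. Numbers are illustrative.
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the documented HPA rule:
// desired = ceil(current * currentUtilization / targetUtilization).
func desiredReplicas(current int, currentUtil, targetUtil float64) int {
	return int(math.Ceil(float64(current) * currentUtil / targetUtil))
}

func main() {
	const target = 0.80 // HPA target: 80% CPU utilization
	usagePerPod := 8.0  // each pod actually consumes ~8 cores

	// Case 1: request = 10 cores -> utilization is exactly 80%, HPA is happy.
	before := desiredReplicas(10, usagePerPod/10.0, target)

	// Case 2: request cut to 8 cores -> utilization jumps to 100%,
	// so HPA scales out until utilization falls back to ~80%.
	after := desiredReplicas(10, usagePerPod/8.0, target)

	fmt.Println("replicas with 10-core request:", before) // 10 -> 10*10 = 100 cores requested
	fmt.Println("replicas with 8-core request:", after)   // 13 -> 13*8  = 104 cores requested
}
```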

Bart: So we have two different things going on here.

Kensei: Yes. So first we thought, okay, let's support HPA recommendations in the resource recommender bot. But we knew it would still hit the same problems the resource recommender does, because people ignore the messages, etc. So we started to think about a different approach from the recommender. The core problem is that there are too many parameters that have to be optimized by users: we have HPA parameters, we have resource requests, we have resource limits, and all of them have to be tuned for each service. And we have tons of microservices with those parameters, and every optimization requires knowledge and some human effort. That means we cannot just ask the service owners to optimize every service they own. At the same time, we realized that maybe we could do all of the optimization automatically, with minimal manual effort. As I mentioned, we knew VPA as an automated solution for tuning the resources, but we had found some issues, some challenges, and had decided to avoid using VPA directly. Maybe we could create a wrapper or an alternative to VPA to overcome those challenges. Regarding HPA optimization, we knew several scenarios that leave your HPA unoptimized, and we also knew how to deal with those scenarios. Maybe we could automatically detect these kinds of inefficiencies in HPAs and optimize them. Those are the ideas we had at that moment. Based on them, we started an open-source project called Tortoise. It aims at replacing HPA and VPA and doing the optimization instead of humans. The core point is that Tortoise only exposes a few parameters that don't have to be optimized, such as the target deployment name. It means that once users start to live with Tortoise, they never have to optimize anything in the Tortoise parameters, because there are no parameters in it to optimize. That's the idea of the world we wanted to create with Tortoise.

Bart: Very good. I know you mentioned the concept of being a wrapper. Can you dive into this a little bit deeper? We're thinking about CRDs and data storage. Is it just a wrapper for the HPA?

Kensei: It provides one CRD, named Tortoise, and a Tortoise has an HPA and a VPA under the hood. If you create a Tortoise, it creates the underlying HPA and VPA for the target. But it's not just an alias for setting up an HPA or VPA for the target, because Tortoise operates those two underlying autoscalers to achieve the best utilization, which they cannot do by themselves. Normally, we have to optimize the HPA to make CPU utilization better, but a Tortoise doesn't have to be optimized once it's created. That's the huge difference between the underlying autoscalers and Tortoise. So we can say Tortoise is an abstraction layer over HPA and VPA. Once they start using Tortoise, users don't have to tune the underlying objects; they just concentrate on what Tortoise exposes and don't really have to care about the underlying optimization.
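
As a rough sketch of what that wrapping means in practice (a pattern illustration, not Tortoise's actual CRD or controller code), a controller can take a tiny user-facing spec, here a hypothetical MinimalSpec with just a namespace and deployment name, and own a fully populated HPA underneath, including the target utilization and replica bounds it keeps retuning on the user's behalf.

```go
// A sketch of the "wrapper" pattern: derive and create a full HPA from a
// minimal user-facing spec. The MinimalSpec shape, the 80% target, and the
// replica bounds are assumptions for illustration; this is not Tortoise code.
package main

import (
	"context"
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// MinimalSpec is all the user would have to provide (hypothetical shape).
type MinimalSpec struct {
	Namespace      string
	DeploymentName string
}

// buildHPA derives the underlying HPA entirely from the minimal spec plus a
// target utilization computed by the controller (hard-coded here).
func buildHPA(spec MinimalSpec, targetUtilization int32) *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(3)
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{
			Name:      spec.DeploymentName + "-managed-hpa",
			Namespace: spec.Namespace,
		},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       spec.DeploymentName,
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 50,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &targetUtilization,
					},
				},
			}},
		},
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	hpa := buildHPA(MinimalSpec{Namespace: "default", DeploymentName: "my-service"}, 80)
	created, err := clientset.AutoscalingV2().HorizontalPodAutoscalers(hpa.Namespace).
		Create(context.Background(), hpa, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created managed HPA:", created.Name)
}
```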

Bart: If we consider the standard stack for collecting metrics and making recommendations, we often immediately think of Prometheus. If we're talking about monitoring and observability, things of that nature, we'll think about the Metrics Server and HPA. Does Tortoise replace all of them?

Kensei: Alright, so Tortoise replaces HPA and VPA, but it doesn't replace Prometheus or the Metrics Server. Let me elaborate more on Tortoise's internals. VPA's recommendation is basically a P90 or P95 of resource consumption, and Tortoise has to know the historical resource consumption to generate its recommendations, but Tortoise itself doesn't have any backend for that; it just refers to VPA as a source of resource consumption history. Talking about dependencies, Tortoise replaces HPA and VPA, and it refers to VPA to generate some recommendations. VPA refers to the Metrics Server to know the current resource consumption and uses Prometheus as the storage for the VPA recommender. In that sense, the Metrics Server and Prometheus are indirect dependencies of Tortoise, but Tortoise doesn't aim to replace them at all; it just uses them. The point is how it replaces HPA and VPA. Tortoise receives input from users through the CRD named Tortoise, such as the target deployment name. Users also configure how each resource is to be scaled. For example, you can configure your app container's CPU to be scaled horizontally and its memory to be scaled vertically; you can configure a per-container, per-resource scaling strategy. For the resources scaled vertically, Tortoise keeps collecting the recommendation from VPA, calculates a safe recommendation from it, and applies it to the pods. It behaves similarly to VPA, but VPA had some problems for us, so Tortoise differs from VPA to deal with them. It doesn't use VPA's recommendation as-is, because we want to reduce the frequency of changes in the recommendations; that is part of the reason we wanted to avoid using VPA. Another point is that, instead of eviction, Tortoise performs a rolling upgrade. Some of you may know the Kubernetes CLI command kubectl rollout restart, which restarts the deployment you specify. Tortoise does it exactly the same way as that command, so the replacement is performed following the rolling update strategy defined in the deployment. For example, if you have a deployment with three replicas and use VPA, VPA has to delete one or two pods, depending on your PDB, for the replacement. It deletes one pod so that another pod is created with the new resources, but during the replacement the deployment has to handle all the traffic with fewer replicas. This is the situation we wanted to avoid. Tortoise first adds one replica with the new resources and then deletes one old replica, avoiding the dangerous situation where fewer pods have to handle the same amount of traffic during the replacement. That's another point where Tortoise is different from VPA. Lastly, it supports some Golang environment variables. Mercari is a heavy user of Golang, and Golang has some environment variables that have to be changed together with your resource requests. This is not supported by VPA, so we had to handle it as well in Tortoise's vertical scaling. I also want to add how it scales resources horizontally. Tortoise keeps optimizing the HPA parameters and lets HPA scale pods as it wants. I won't go into much detail on how it calculates the HPA's target utilization because it's complex, but if you are interested, you can check out the documentation in the Tortoise repository; we have a lot of awesome documentation. What Tortoise optimizes is not only the HPA parameters: there are some scenarios where HPA isn't optimized because of the resource request.
To achieve fully optimized CPU, you have to optimize the CPU request in addition to the HPA parameters. That's the difficulty of HPA optimization in general. Tortoise deals with such complex scenarios so users don't have to be aware of them. Tortoise handles both horizontal and vertical scaling.

Bart: A platform engineer often becomes a migration engineer.

Kensei: At Mercari, there were always many ongoing migration initiatives, so we knew the pain of migration, and from the design phase we tried to make Tortoise as easy as possible to migrate to. Many microservices already have an HPA, which would conflict with the HPA created by Tortoise. So we implemented an option that lets people attach an existing HPA to a Tortoise, so they don't have to go through a complicated migration process such as deleting the existing HPA, creating a Tortoise, and making sure the Tortoise creates another HPA. All of those complexities are removed by this strategy. It worked quite well and made our migration path very smooth. We made a bunch of PRs for every developer so they could just merge the PR and migrate their HPAs to Tortoise. This worked quite well, but we had another problem regarding migration. We could make the migration path smooth in terms of functionality, but we saw a lot of hesitation, because Tortoise is a new component, a new tool that we created. I understand the hesitation: people don't want to migrate their services to Tortoise until they see some other major services successfully using it. Many teams want to migrate after other teams have migrated, and they end up waiting for each other. To deal with this, we held some open sessions in the company to describe what Tortoise is, why Tortoise is necessary for us, and how it addresses our pain with resource recommenders and resource optimization in general. We got some early adopters from those sessions, and until I left Mercari, we kept growing the number of adoptions starting from those early adopters. We could then use the early adopters as advertisements, showing that a service was already using Tortoise, to encourage others to try it. These migration efforts aim to eventually achieve a full migration from HPA to Tortoise.
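
Kensei mentions above that Tortoise applies vertical recommendations via a rolling restart, the same mechanism as kubectl rollout restart, rather than VPA-style eviction, and that it also adjusts Golang environment variables (presumably GOMAXPROCS and GOMEMLIMIT) alongside the requests. As a small illustration of the rolling-restart part only: bumping the kubectl.kubernetes.io/restartedAt annotation on the pod template is what triggers the Deployment's own rolling update. The deployment name below is hypothetical.

```go
// A sketch of triggering the same rolling replacement that
// `kubectl rollout restart deployment/<name>` does: refresh an annotation on
// the pod template so the Deployment controller rolls pods using its own
// rolling-update strategy (surge first, then delete) instead of evicting
// live pods the way VPA does. Illustrative only.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// kubectl rollout restart essentially does this: set/refresh the
	// kubectl.kubernetes.io/restartedAt annotation on the pod template.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339),
	)

	_, err = clientset.AppsV1().Deployments("default").Patch(
		context.Background(),
		"my-service", // hypothetical deployment name
		types.StrategicMergePatchType,
		[]byte(patch),
		metav1.PatchOptions{},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("rolling restart triggered for deployment my-service")
}
```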

Bart: Any other reasons that you think made Tortoise so successful at Mercari?

Kensei: One key point is the responsibility shift. Again, Tortoise only exposes a few parameters to users and manages the underlying autoscaling automatically under the hood. It allows the company to shift the responsibility for optimizing each pod from the service developers to the platform. Now it's the platform's responsibility to optimize those pods and their autoscaling; Tortoise's responsibility is the platform team's responsibility. If Tortoise cannot fully optimize one of the microservices, that's not the fault of the service developers; it's our fault, and we would try to improve Tortoise to fit their use cases. That's how we could shift per-service resource optimization from the service owners to the platform developers. On the more technical side, many of Mercari's microservices are gRPC or HTTP services, and they are mostly written in Golang. Many of them are created from an internal template for microservice implementations. That also helped us create one unified solution that automatically fits the use cases of every microservice. We could create one solution for optimizing all the microservices.

Bart: If you had to go back in time and do it all over again, is there anything you would do differently?

Kensei: In terms of designing the interface, we did a great job. We only exposed a few parameters to users, allowing them to easily fill in their microservice name, etc. Internally, though, it has complex flows for calculating recommendations and changing pod resource requests. All of those flows are very complicated, and I, the Tortoise creator, left Mercari. The maintenance cost of Tortoise might be very high for other people on the team, so I had to transfer all my knowledge about Tortoise to other team members when I left. I'm not sure how it went; I hope it went well. Going back to the question: if I could go back in time and do it all over again, I would probably make the internal structure and implementation simpler so that everyone could contribute more easily. That's definitely one thing I should have done.

Bart: That's fair. Now, the question I teased at the beginning: why the name Tortoise?

Kensei: Well, I have two tortoises in my house, and they are sleeping right now, so I can't bring them here. Mercari sometimes has internal hackathon events for all the employees, and Tortoise was an experimental project that I started at one of those events. I had the right to decide the name, so I named it Tortoise without any particular reason. I just love turtles, so I named it Tortoise. That's it.

Bart: All right, that's cool. Would you be willing to share the names of the two tortoises that are sleeping right now?

Kensei: One is Azuki and another one is Okada. Both are Japanese names, but I cannot make a direct translation.

Bart: No, that's fine. It's just cool to know. It's cool to know. It's very, very cool to know. Maybe when we finish the recording or next week, you can send us pictures. We can use those as well. That's super cool.

Kensei: If you have pictures.

Bart: That's super cool. That's great. I was not expecting that. So you did this project, and you keep wanting to stay busy; you're the author of another interesting project. Let's see if I can get this right: the kube-scheduler-wasm-extension. First of all, is that correct? And second, if it is, how did that come to be?

Kensei: All right, so the scheduler currently has two existing extension mechanisms, a webhook-based one and the Go plugin SDK, but both have some downsides. We tried to find another way of providing extensibility from the scheduler so that everyone can more casually extend their scheduler for their custom use cases. That's why we started to look into Wasm as an option. Actually, it's the first attempt in the official Kubernetes community to try this kind of Wasm extensibility. I know Envoy has a similar concept, extensibility provided through a Wasm runtime, but at least in the official Kubernetes community I haven't seen anything like it. Our scheduling team has gained a lot of insight around Wasm, so maybe we can share it with other teams; maybe some other teams want to do similar things.

Bart: Let's see, what's next for you?

Kensei: I just started a new job at Tetrate, so I'm trying to get onboarded; that's my current first priority. In the open source world, we are busy with some huge internal enhancements around the scheduler. We are trying to improve our scheduling throughput. That's another main goal that I want to achieve soon. If you see some crashes in the scheduler in new binaries, then that might be my fault.

Bart: Good to know. And if there is something wrong or if there's something right because of all the cool things you're doing, what's the best way for people to get in touch with you?

Kensei: Alright, you can reach out to me on my Twitter. Sorry, no longer Twitter, it's X. I also have an account on LinkedIn, so feel free to get in touch.

Bart: Kensei, really appreciate your time today. Really appreciate all the work that you're doing. Not only doing the work, but sharing your knowledge in public, which is a great service to people, giving talks, writing blogs. You really can't ask for more. So thank you so much. Really appreciate it.

Kensei: Thanks for this opportunity.