5,000 pods/second and 60% utilization with Gödel and Katalyst

Host:

  • Bart Farrell

Guest:

  • Yue Yin

Learn how ByteDance manages computing resources at scale with custom Kubernetes scheduling solutions that handle millions of pods across thousands of nodes.

Yue Yin, Software Engineer at ByteDance, discusses their open-source Gödel scheduler and Katalyst resource management system. She explains how these tools address the challenges of managing online and offline workloads in large-scale Kubernetes deployments.

You will learn:

  • How Gödel's distributed architecture with dispatcher, scheduler, and binder components enables the scheduling of 5,000 pods per second

  • Why NUMA-aware scheduling and two-layer architecture are crucial for handling complex workloads at scale

  • How Katalyst provides node-level resource insights to enable efficient workload co-location and improve CPU utilization

Transcription

Bart: In this episode of KubeFM, I had a chance to speak with Yue Yin, a software engineer on the cloud-native infrastructure team at ByteDance. Yue specializes in building scalable distributed systems. Today, we will focus on her team's groundbreaking work on the Gödel scheduler, a next-generation Kubernetes scheduler engine. Gödel addresses challenges in resource management at scale, such as inefficiencies in handling both online and offline workloads. By introducing advanced features like NUMA-aware scheduling, a two-layer architecture, and distributed processing, Gödel delivers unmatched performance, scheduling up to 5,000 pods per second across clusters of a million pods. In this episode, we will also discuss ByteDance's Katalyst resource management system and how it complements Gödel by providing node-level resource insights, enabling seamless workload co-location. Drawing from articles on KubeAdmiral and Katalyst, Yue explains the innovations driving these projects, their impact on ByteDance's operations, and how open-sourcing Gödel contributes to the broader cloud-native ecosystem. This episode is sponsored by LearnK8s. LearnK8s has been providing Kubernetes training all over the world since 2017. Courses are instructor-led and are 60% hands-on and 40% theoretical. They are offered to groups and individuals, and students have access to the course materials for the rest of their lives. LearnK8s provides training both in-person and online. To find out more, go to learnk8s.io. Now, let's get into the episode. So, welcome to KubeFM. What are three emerging Kubernetes tools that you are keeping an eye on?

Yue: So first is Grafana for observability. Second is Kueue, a job queueing system for batch services. And the last one is KubeRay, which helps run distributed Ray applications in a Kubernetes cluster.

Bart: Now, to get to know you a little better, can we get a short introduction about what you do and where you work?

Yue: My name is Yue, and I'm working as a software engineer at ByteDance Cloud Native infrastructure team, where my focus is on orchestration and scheduling.

Bart: And how did you first get into Cloud Native?

Yue: I started my career as a software engineer at VMware, where I worked on the product called Tanzu, which integrates Kubernetes into vSphere and enables users to manage Kubernetes workloads on vSphere. Now, on the Cloud Native Infrastructure team at ByteDance, I am focused on building scalable distributed scheduling systems and applying and deepening my expertise in cloud-native practices across large-scale products.

Bart: And what were you doing before becoming cloud native?

Yue: So, I entered the cloud native field immediately after graduating from my master's program.

Bart: The Kubernetes and cloud native ecosystem moves very quickly. How do you stay up to date? Is it with blogs, videos, tutorials, or podcasts? What works best for you?

Yue: So actually, most of my recent updates have come from attending KubeCon North America 2024 this year, which was held in Salt Lake City. I also believe blogs and podcasts are excellent resources for staying informed about the latest updates in the community, so I'd like to dedicate more time to exploring those going forward.

Bart: Now, if you could go back in time and share one career tip with your younger self, what would it be?

Yue: The career tip would be to try to build a strong foundation in core concepts and fundamentals, because mastering these fundamentals makes it so much easier to adapt to new technologies as they emerge, such as KubeRay, Kubeflow, or Kueue. It also provides a deeper understanding that's invaluable as you take on more complex challenges in your career, even though they may seem less useful than some fancy framework or tools that you could put on your resume, like experience with Katalyst, KubeAdmiral, or Gödel scheduler.

Bart: Now, as part of our monthly content discovery, we found three different articles that we will be focusing on in our conversation today. One article is about Katalyst, a management system for workload co-location in Kubernetes. Another article is about KubeAdmiral, a next-generation multi-cluster orchestration engine based on Kubernetes. The last article is about Gödel, an open-sourced and unified scheduler for online and offline workloads. We will be discussing these in more detail. We understand that ByteDance has experienced rapid growth across various business lines. What challenges did this growth present in terms of managing computing resources?

Yue: ByteDance's rapid growth brought about challenges in managing computing resources, particularly with its initial setup of separate resource pools for online and offline workloads. The isolated pools worked well in low-demand scenarios but created serious inefficiencies during peak times. With workloads rising unpredictably, maintaining performance required manual reallocation, which was labor-intensive and prone to errors. This manual setup also led to high allocation costs, as resources often sat idle or were underutilized in some areas. Recognizing these inefficiencies, ByteDance saw a need for a unified scheduling system that could dynamically balance resource demands across both workload types to improve overall efficiency and reduce costs, and also automate reallocation in real time, similar to a Kubernetes Scheduler, and potentially utilizing concepts like Workload co-location and Pod Topology.

Bart: Now you mentioned making optimizations to the Kubernetes Scheduler. What were some of these optimizations and how did they improve your scheduling capabilities?

Yue: We introduced essential features to meet various business requirements. For example, we implemented NUMA topology plugins to enable NUMA-aware scheduling, improving the experience for latency-sensitive workloads, which can be achieved using Pod Topology. We also focused on performance optimizations and achieved a throughput of 300 pods per second in a cluster with 10,000 nodes at that time.
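The NUMA-aware scheduling Yue describes can be sketched as a filter step: a node is feasible for a latency-sensitive pod only if some single NUMA node on it has enough free CPUs, so the pod avoids cross-socket memory access. This is a purely illustrative sketch under assumed data structures; the function names and inputs are hypothetical, not Gödel's actual plugin API.

```python
# Hypothetical sketch of a NUMA-aware filter: a latency-sensitive pod
# should fit entirely within a single NUMA node's free CPUs.

def numa_fits(node_numa_free_cpus, requested_cpus):
    """Return True if any single NUMA node can host the full request."""
    return any(free >= requested_cpus for free in node_numa_free_cpus)

def filter_nodes(nodes, requested_cpus):
    """Keep only nodes where the pod fits on one NUMA node.

    `nodes` maps node name -> list of free CPUs per NUMA node.
    """
    return [name for name, numa_free in nodes.items()
            if numa_fits(numa_free, requested_cpus)]

# Example: node-a has two NUMA nodes with 4 and 6 free CPUs; node-b has
# 3 and 3. A pod requesting 5 CPUs only fits on node-a (on its second
# NUMA node), even though node-b also has 6 free CPUs in total.
nodes = {"node-a": [4, 6], "node-b": [3, 3]}
print(filter_nodes(nodes, 5))  # -> ['node-a']
```

The point of the sketch is that total free CPU is not enough: node-b has 6 CPUs free overall but cannot host the pod on a single NUMA node.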

Bart: Despite these improvements, you ultimately decided to develop Gödel Scheduler. What factors led to this decision?

Yue: Despite the optimizations in capabilities and performance that I just mentioned, the resource isolation issue we discussed earlier still had not been fully resolved by this early approach. We realized that achieving a unified resource pool requires a scheduling system capable of effectively handling both online and offline services. While the native Kubernetes Scheduler is designed for pod-level scheduling, our needs demand support for more complex job-level scheduling semantics with a significantly higher throughput requirement. This realization led us to develop Gödel.

Bart: Now, how does the architecture of Gödel scheduler differ from that of the Kubernetes scheduler?

Yue: So, Gödel scheduler introduces the concept of a scheduling unit that extends beyond the traditional pod-centric approach. This has allowed Gödel scheduler to support offline batch scheduling, where the pods within a specific batch need to be scheduled together. To achieve this, Gödel scheduler implements a two-layer scheduling framework. The first layer is job-level scheduling, which filters and ranks feasible nodes for a group of pods. At the second layer, the pod level, it makes detailed scheduling decisions for each individual pod in that job. Gödel scheduler is also designed as a distributed scheduling system, with the scheduler divided into three components: dispatcher, scheduler, and binder. The dispatcher serves as the entry point to the scheduling system, directing pods to specific scheduler instances. The scheduler handles the core computation, matching pods to nodes. Given the heavy workload of this component, Gödel scheduler spawns multiple scheduler shards to enable parallel processing for complex scheduling decisions. Finally, the binder component performs conflict checks and binds pods to their designated nodes, which were nominated by the scheduler instance, completing the whole scheduling process.
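The two-layer flow Yue describes, job-level node filtering followed by pod-level placement with gang (all-or-nothing) semantics for the batch, can be sketched as below. All names and data shapes here are hypothetical, chosen only to illustrate the control flow, not Gödel's actual implementation.

```python
# Hypothetical sketch of two-layer scheduling for a scheduling unit (job):
# layer 1 filters and ranks feasible nodes for the whole unit,
# layer 2 places each pod of the unit on one of those nodes.

def job_level_filter(nodes, job):
    """Layer 1: nodes that could host at least one pod of the job,
    ranked by free CPU, most free first. `nodes` maps name -> free CPUs."""
    feasible = [n for n in nodes if nodes[n] >= job["cpu_per_pod"]]
    return sorted(feasible, key=lambda n: nodes[n], reverse=True)

def pod_level_assign(nodes, job, feasible):
    """Layer 2: greedily bind each pod to a ranked node.
    Gang semantics: if any pod cannot be placed, the whole job fails."""
    free = dict(nodes)
    placement = {}
    for pod in range(job["replicas"]):
        target = next((n for n in feasible
                       if free[n] >= job["cpu_per_pod"]), None)
        if target is None:
            return None  # the batch must be scheduled together or not at all
        free[target] -= job["cpu_per_pod"]
        placement[f"pod-{pod}"] = target
    return placement

nodes = {"n1": 8, "n2": 4}
job = {"replicas": 3, "cpu_per_pod": 4}
print(pod_level_assign(nodes, job, job_level_filter(nodes, job)))
# -> {'pod-0': 'n1', 'pod-1': 'n1', 'pod-2': 'n2'}
```

The gang-semantics branch is the part that distinguishes this from plain pod-by-pod scheduling: a native pod-centric scheduler would happily place two of three pods and leave the third pending.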

Bart: Now, in terms of performance improvements, what have you seen with the Gödel scheduler and how have these impacted ByteDance's operation?

Yue: With the implementation of a Gödel scheduler system, we have achieved a throughput of 2,000 pods per second with a single shard of scheduler instance. We can achieve 5,000 pods per second with multiple shards of scheduler instances in the production environment. For our cluster scale, we can support up to 20,000 nodes and 1 million pods in production.
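Scaling from 2,000 to 5,000 pods per second by adding shards implies the dispatcher must partition incoming pods across scheduler instances. A minimal, purely illustrative way to do that is a stable hash; Gödel's real dispatcher policy is more sophisticated, and the function below is an assumption for illustration only.

```python
# Illustrative dispatcher step: spread incoming pods across scheduler
# shards. A stable hash keeps a pod pinned to the same shard on retries,
# so per-shard state stays consistent.
import hashlib

def dispatch(pod_name, num_shards):
    """Map a pod to a shard index deterministically."""
    digest = hashlib.sha256(pod_name.encode()).hexdigest()
    return int(digest, 16) % num_shards

pods = [f"pod-{i}" for i in range(6)]
assignments = {p: dispatch(p, 3) for p in pods}
print(assignments)
```

With independent shards computing placements in parallel, a separate conflict check is still needed before binding, which is exactly the role the binder component plays in the architecture described above.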

Bart: Do you think that the Kubernetes Scheduler could eventually achieve similar performance with further optimization?

Yue: That's a good question. With further optimizations, we think the Kubernetes Scheduler could approach higher levels of performance. However, there will be inherent trade-offs because the Kubernetes Scheduler design prioritizes flexibility, extensibility, and portability. Achieving similar throughput as highly optimized schedulers might require some fundamental changes, such as incorporating more efficient algorithms or refactoring the scheduling framework.

Bart: Now that ByteDance has decided to open source the Gödel Scheduler, what motivated this decision?

Yue: The Gödel Scheduler has been rigorously battle-tested in ByteDance's hyperscale production environment and has demonstrated exceptional performance. We published a paper on Gödel: Unified Large-Scale Resource Management and Scheduling at ByteDance, which was accepted at SOCC last year. The paper and the talk sparked significant interest from the industry, and we received a lot of inquiries about Gödel's design and functionality. This initially gave us the idea to open-source Gödel. Since we have greatly benefited from the open-source community, we are excited to give back and hope that Gödel can assist other organizations in tackling similar challenges.

Bart: Now, looking towards the future, what are the future plans for Gödel Scheduler? Are there specific areas you are focusing on for improvement?

Yue: We have been excited to continue work on the open-sourced Gödel scheduler. One focus is the rescheduler, a component that works alongside the Gödel scheduler to handle rescheduling after initial placement, as the cluster state changes over time. This allows the rescheduler to keep scheduling results optimized over a longer period. We are also exploring an all-in-one mode for smaller clusters to simplify Gödel scheduler deployment and make it more portable. Currently, we only support distributed mode in production, so we aim to provide an all-in-one alternative for smaller-scale clusters. Additionally, we are improving the Gödel scheduler framework's performance, standardization, and flexibility. Our goal is to grow the ecosystem by providing APIs compatible with popular upstream frameworks, making it easier for organizations to adopt and integrate Gödel scheduler as part of their toolkit.

Bart: Great, we've also encountered mentions of another system developed by ByteDance called Katalyst. Can you explain why you developed Katalyst alongside Gödel Scheduler?

Yue: So, Katalyst is a resource management system developed in-house at ByteDance, and it is also available as open source. Gödel relies on the node-level resource information provided by Katalyst to make advanced scheduling decisions. For example, Gödel supports NUMA topology scheduling, which requires NUMA-level resource usage data for various workloads to determine whether a workload can be placed on a specific node. These NUMA-level resources include capacity dedicated to offline workloads when online workloads are not using it, and this information is also provided by the Katalyst resource management system. By using this information, Gödel can achieve real co-location of online and offline workloads on the same node. As a result, a unified resource pool is only possible with Gödel and Katalyst working together in the same system. This is why we developed Katalyst at ByteDance, and with Gödel and Katalyst working together we achieved that unified resource pool, which helped us successfully increase our average CPU utilization from 30% to 60%.
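The co-location decision Yue describes, letting offline workloads borrow NUMA-local capacity that online workloads have allocated but are not currently using, can be sketched as below. The function names, data shapes, and the safety buffer are hypothetical illustrations, not Katalyst's real API or policy.

```python
# Hypothetical sketch of a Katalyst-style co-location check: offline pods
# may use the slack between an online workload's allocation and its
# actual usage on a NUMA node, minus a safety buffer to absorb bursts.

def borrowable_cpus(numa_allocated, numa_used, buffer=0.5):
    """Idle online capacity an offline pod may borrow on one NUMA node."""
    return max(0.0, numa_allocated - numa_used - buffer)

def can_colocate(numa_stats, offline_request, buffer=0.5):
    """True if any NUMA node has enough reclaimable slack.

    `numa_stats` is a list of (allocated, used) pairs per NUMA node,
    as a node-level agent might report them.
    """
    return any(borrowable_cpus(alloc, used, buffer) >= offline_request
               for alloc, used in numa_stats)

# Online workloads on NUMA 0 allocated 8 CPUs but are using only 3,
# leaving 4.5 CPUs borrowable after the buffer; NUMA 1 is nearly full.
stats = [(8.0, 3.0), (8.0, 7.8)]
print(can_colocate(stats, 4.0))  # -> True
print(can_colocate(stats, 5.0))  # -> False
```

This is the scheduler-side view of the idea: the utilization gain from 30% to 60% comes from treating allocated-but-idle capacity as schedulable for offline work, while the buffer (and, in the real system, node-level QoS enforcement) protects the latency-sensitive online workloads.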

Bart: What an achievement. Gödel replaces the Kubernetes Scheduler. Katalyst is an operator that wraps requests, limits, and allocations. How much of Kubernetes is left at ByteDance? It sounds like what you've created differs from Kubernetes in many ways.

Yue: At ByteDance, our Kubernetes setup stays true to its core architecture. Projects like Gödel and Katalyst do not disrupt the ecosystem; instead, they can be thought of as powerful add-on systems that tackle specific challenges in large-scale environments. They do not replace Kubernetes' core principles but build on top of them, adding the extra capability needed to handle more complex workloads. Both Gödel and Katalyst are fully compatible within the Kubernetes ecosystem. By providing these systems, we are not only solving our own scalability challenges but also giving back to the open source community, providing other organizations with more options for critical Kubernetes components.

Bart: What's next for you?

Yue: First, I will continue to contribute to our cloud-native orchestration system, which plays a critical role in helping us adapt to evolving business demand. There is still plenty of room to optimize and expand its capabilities to ensure that we meet those business demands effectively. Additionally, we are committed to contributing to the open-source community through projects like the Gödel Scheduler and Katalyst, so we can help drive broader innovation in workload management in the industry.

Bart: Now, Yue, I also understand you have two cats. What are their names?

Yue: My cats are Fubuki and Hello, with Hello being the younger one.

Bart: Very good. And what open source projects are they working on?

Yue: They can probably contribute to the open-source cat photo community, and I hope people will like them.

Bart: I'm sure we'll have to include this in the show notes, so don't forget that. Also, when you're not working on these amazing open source projects, what do you like to do in your free time?

Yue: First, I play with my cats, of course; I spend a lot of time with them. I also live in the Bay Area, where there are a lot of great hiking trails nearby, so on the weekends I explore the hiking trails with my husband.

Bart: We like to ask our guests about the things they find challenging about Kubernetes and the things they would like to see resolved. Kubernetes turned 10 years old this year. If there's one thing you would like to see improved in Kubernetes, what would it be and why?

Yue: One thing that could be improved is having more performance-focused topics in the overall community. I think that would be very valuable.

Bart: I know you mentioned one of the Kubernetes emerging tools that you were interested in was KubeRay. Given that it sounds like you've got some interest in AI and ML, in previous KubeCons, we've heard about the cloud native and the AI and ML worlds coming together. Since you're in the San Francisco Bay Area, the AI community is also very strong there. Is there anything that you'd like to comment on for other folks that might be interested in this topic, such as things that have caught your attention or things that you're looking at, like Kubeflow?

Yue: Do you mean something specific about Kubernetes or something more general?

Bart: For a while, we've been hearing about how AI and Kubernetes are coming together. How are you seeing that happen in a practical day-to-day sense? What are the technologies that you think all Kubernetes engineers should be aware of when it comes to AI and ML, as you mentioned previously with Kubeflow and particularly on the point of performance?

Yue: This came from attending KubeCon North America 2024 this year, where I felt that many talks focused on how Kubernetes can support large-scale machine learning workloads, such as training or inference, within an organization. Kubeflow and KubeRay received significant attention in the talks at KubeCon. These are fairly new concepts to me, and I am still learning. I would encourage those interested to explore how Kubernetes supports machine learning workloads by paying more attention to these new technologies, reading blogs, and experimenting with their code to understand how they manage machine learning workloads on Kubernetes.

Bart: Good.

Yue: AI workloads are different, at least, because they can be very different.

Bart: Oh, different? Different in what way?

Yue: Different from traditional microservice workloads.

Bart: I think that's what a lot of this comes down to - the confusion or doubts about how new this is, how practical it is, and how people can extract value from it. However, there are now more resources available than there were six months to a year ago, so it's great for people to go out and see that for themselves. Now, thank you for making it this far. If people want to get in touch with you to speak about all the work you're doing, whether it's at ByteDance or elsewhere in the ecosystem, what's the best way to get in touch?

Yue: The best way to get in touch is to connect with me on LinkedIn.

Bart: Good. For people who want to check you out, we'll leave a link in the show notes. Thank you very much for joining the podcast today, KubeFM, it was great hearing about your experience, and we're glad to see more open-source tools entering the ecosystem. For folks who want to check it out in more detail, we'll leave all the necessary information for them to do so. Yue, thank you very much. We'll speak to you soon.

Yue: Take care. Thank you. Bye.