Intelligent Kubernetes Load Balancing

Apr 7, 2026

Host:

  • Bart Farrell

Guest:

  • Rohit Agrawal

You're running gRPC services in Kubernetes and load balancing looks fine on the dashboard, but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.

Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.

In this episode:

  • Why KubeProxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per request

  • How Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every client

  • How zone-aware spillover cut cross-availability-zone costs without sacrificing availability

  • Why CPU-based routing failed (monitoring lag creates oscillation) and what signals to use instead

The system has been running in production for three years across hundreds of services, handling millions of requests.

Transcription

Bart Farrell: In this episode of KubeFM, I got a chance to speak to Rohit, who's a software engineer on the traffic platform team at Databricks and also an Envoy maintainer. We dive into large-scale service-to-service communication in Kubernetes and what happens when gRPC, HTTP/2, and persistent connections start pushing beyond KubeProxy's Layer 4 load balancing model. Rohit explains why connection-level routing breaks down under high-throughput gRPC workloads, how Databricks built a proxy-less, client-side load balancing system using xDS, why they chose Power of Two Choices (P2C) for request-level routing, how they implemented zone-aware spillover to reduce cross-AZ traffic, and the trade-offs with headless services and service meshes. This is a technical conversation about Layer 7 load balancing, endpoint discovery, and scaling traffic systems across hundreds of services. If you're operating gRPC workloads in Kubernetes, this one's for you. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has trained Kubernetes engineers all over the world with their training courses, which are instructor-led, 60% practical, and 40% theoretical. They're given in person or online, to groups as well as individuals, and students have access to the course materials for the rest of their lives. For more information, go to learnkube.com. Now, let's get into the episode. Rohit, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?

Rohit Agrawal: There are so many, but if I have to pick three, I would say Envoy Gateway. It's a Kubernetes-native API gateway built on top of Envoy and the Gateway API. I'm personally very close to it because I'm one of the Envoy maintainers, and I have a lot of friends who are working to make it great. It standardizes Ingress across the ecosystem and replaces the patchwork of Ingress controllers. And I would say, if you have done Envoy before, you don't want to manage all those xDS configurations by yourself. If you haven't seen it, please do explore it. The other thing I'm excited about is k8sGPT. It's really good if you have configurations that you're finding hard to interpret; you can just ask it for help. It gives you an LLM interface to talk to Kubernetes. Really great tool. I was exploring it and was absolutely fascinated by what it can do. And I would say the third one is OpenCost. Cost is becoming very important, and each team should have visibility into what they are spending in terms of services, namespaces, and the teams themselves. At Databricks, we are trying to bring that view to people so they can just look at the cost and make more informed decisions. So OpenCost is, again, one of the tools I would recommend.

Bart Farrell: Fantastic. Rohit, you mentioned that you work at Databricks, but can you tell us a little bit more about what you do there, your role, things like that?

Rohit Agrawal: So I work as a software engineer at Databricks. I am part of the traffic platform team, and my team is responsible for everything coming in and out of Databricks. We provide the fabric for all the traffic to come in and hit the services. If you're connecting to Databricks notebooks or running a job submission, all that traffic is going through our systems. My job is to get it to the right place as fast as possible, and securely.

Bart Farrell: Okay. And how did you get into cloud native in the first place?

Rohit Agrawal: It's a very interesting story. When I first joined Databricks in early 2019, the team I was put in, the traffic team, had efforts going on to bring in Envoy. And I'm very lucky that I got involved early in those efforts. My first cloud native, or open source, contribution was to remove a piece of hard-coding in Envoy. What was happening was, if you had a 404 or a 503, Envoy was returning a hard-coded response. But at Databricks, the proxy we were using before was written in-house in Scala, and it used to return a JSON body. Our clients were depending on this behavior, because based on the error code in that JSON, they would retry. So we wanted the same behavior from Envoy, and my first open source contribution was to make that response customizable, so Envoy could return the same JSON that the previous API gateway was returning. I think that was my entry point to cloud native.

Bart Farrell: And in terms of the role and kind of technologies that you were working with before you got into Cloud Native, can you tell us more about that?

Rohit Agrawal: I started my career as a backend engineer. So I was mostly working on Java. And I eventually became a full stack engineer when I joined Amazon. So I picked up React and a few more frontend technologies. But I have mostly been a backend developer. And I would say Java was the primary language that I was using for years.

Bart Farrell: And it's no secret that the Kubernetes and cloud native ecosystem moves very quickly. And on top of it, you're a maintainer. How do you stay up to date with all the different changes that are going on? What are your go-to resources?

Rohit Agrawal: I would say attending KubeCon and EnvoyCon, both as a speaker and as an attendee, is amazing. It gives you a lot of information about what's going on and what the new emerging tools are, and you meet a lot of awesome people. That's a great resource. Lately, I've been getting away from books; it's not that easy. Podcasts are an amazing way. I'll just download a few episodes, and if I'm commuting or on a plane and don't have anything to do, I'll listen to them. I also go to YouTube and start watching videos from KubeCons or other events out there that I have not attended.

Bart Farrell: And Rohit, if you could go back in time and share one career tip with your younger self, what would it be?

Rohit Agrawal: I would say start contributing early, as early as possible. If I could go back and tell my younger self one thing, I would say 100%: start contributing to open source earlier. It's just so important to build that DNA. And it's fascinating: you meet so many people, you make so many decisions, and you see why open source software is so great. People from all these companies bring fresh perspectives that you probably cannot get in your limited day job, where everybody's talking the same language. So yes, 100%, start contributing to open source early.

Bart Farrell: Now, as part of our monthly content discovery, we found an article that you wrote titled Intelligent Kubernetes Load Balancing at Databricks, so we want to get into this topic a little bit more. Databricks operates at a massive scale with hundreds of services running in Kubernetes clusters. Can you set the stage for us? What does service-to-service communication look like at Databricks, and why did the default Kubernetes networking primitives start showing their limits?

Rohit Agrawal: So at Databricks, within a single Kubernetes cluster, we have hundreds of stateless services communicating with each other. These aren't simple REST APIs. They are high-throughput gRPC services powering everything from our notebooks to job scheduling to model serving. For a while, the standard Kubernetes networking stack, CoreDNS for service discovery and KubeProxy for load balancing via iptables, was fine. It's elegant and simple. But as we scaled our traffic, with mixed workloads of REST APIs, gRPC calls, and WebSockets, it started to fall apart. Our gRPC services use HTTP/2, which means we have persistent connections, and KubeProxy picks a backend once per connection, not per request. So traffic gets stuck to whichever pod the connection lands on. We couldn't trust that the load was evenly distributed, and we would see some pods running really hot while others ran cold. We estimated roughly 75K per month per region in over-provisioned capacity; we had to heavily over-provision, I would say 20 to 30%, across all our services.

Bart Farrell: For listeners out there who might not work with gRPC daily, the article highlights that HTTP/2's persistent connections create a specific problem with Kubernetes Layer 4 load balancing. What exactly happens when KubeProxy makes routing decisions at the connection level rather than the request level?

Rohit Agrawal: Right, so let me walk you through an example. Say your service is making a call. You would use the cluster IP address; the name would be something like myservice.namespace.svc.cluster.local, and DNS resolves that to a virtual IP. The request hits the node, and KubeProxy's iptables rules pick a backend pod. So far so good. The problem is when that decision gets made. KubeProxy, as you said, operates at Layer 4, the TCP connection level, so it picks the backend pod once, when the TCP connection gets established. For HTTP/1 that's not a problem, because connections get recycled frequently. But for HTTP/2, you have a lot of requests multiplexed on the same long-lived connection, with clients just making smaller requests over that one connection. So some connections carry a lot of active traffic while others sit idle, and that creates a mismatch. You can no longer rely on connection-based load balancing, because you might end up choosing pods that have no spare capacity over pods that have plenty.
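To make the mismatch concrete, here is a small, self-contained Scala sketch (an illustration, not Databricks code) that simulates a handful of persistent connections pinned to backends at connect time versus picking a backend per request. With only a few long-lived connections, several backends never see traffic at all.

```scala
import scala.util.Random

// Hypothetical simulation: per-connection (L4) vs per-request (L7) backend selection.
object ConnectionVsRequestBalancing extends App {
  val backends = 10      // backend pods
  val clients  = 5       // clients, each holding one persistent HTTP/2 connection
  val requests = 100000  // requests multiplexed over those connections
  val rng      = new Random(42)

  // KubeProxy-style: each connection is pinned to one backend when it is
  // established, so every request on that connection lands on the same pod.
  val connectionBackend = Array.fill(clients)(rng.nextInt(backends))
  val perConnection     = Array.fill(backends)(0)
  for (_ <- 0 until requests)
    perConnection(connectionBackend(rng.nextInt(clients))) += 1

  // Request-level: every request picks a backend independently.
  val perRequest = Array.fill(backends)(0)
  for (_ <- 0 until requests)
    perRequest(rng.nextInt(backends)) += 1

  println(s"per-connection load: ${perConnection.mkString(" ")}")
  println(s"per-request load:    ${perRequest.mkString(" ")}")
}
```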

Bart Farrell: Beyond tail latency, you mentioned inefficient resource usage and limited load balancing strategies as pain points. How did these limitations manifest in practice at Databricks?

Rohit Agrawal: So the most visible symptom for us was the CPU. We would see some pods running over 80% while others were at 20%, so it created an uneven balance. We had to throw more pods at the problem so that we wouldn't run into situations where some of our fleet is running hot while the rest is running cold. KubeProxy also only gives you round-robin or random load balancing. That's it. There is no weighted routing for gradual rollouts. There is no error-aware routing, where if some of my pods are returning errors, I can pick different backends; it would just keep hitting the same pods. When we measured the impact across several Databricks services, we estimated roughly 20-30% over-provisioning just to accommodate the uneven distribution. And if you multiply that across the hundreds of services we were running, it adds up fast.

Bart Farrell: Your team's solution was a proxy-less, client-side load balancing system. This is a significant architectural shift from relying on infrastructure-level routing. What was the core insight that led you toward putting load balancing logic directly in the client?

Rohit Agrawal: I think the fundamental insight was that we needed to make routing decisions at L7, per request, not per connection. Because, as we were just discussing, if you make a decision per connection and stick to it, with so many requests in flight some connections end up overloaded compared to others. So the fundamental requirement was making these decisions at L7; Layer 4 load balancers simply cannot look inside the HTTP or gRPC traffic and route requests. The second decision was removing DNS from the picture. DNS caching, TTLs, stale entries, all of these were creating a lot of lag when the cluster topology changed, so we removed DNS and reinvented that layer. The architecture became: build a control plane that watches Kubernetes directly and pushes endpoint updates to clients in real time. We used the xDS protocol for that, the same discovery protocol that Envoy uses: a client subscribes once and then gets streaming updates as pods come and go.

Bart Farrell: The control plane you built, the Endpoint Discovery Service, continuously watches the Kubernetes API for changes. Can you walk us through how it works and what data it provides to clients?

Rohit Agrawal: So the control plane is a lightweight service that we call the Endpoint Discovery Service. It's actually a pretty standard term: if you look at xDS, it's nothing but a set of discovery services, and Endpoint Discovery, or EDS, is the one used for endpoint discovery. It continuously watches the Kubernetes API, specifically Services and EndpointSlices, those two resources, and maintains a live view of every backend pod behind a service. It's not just the IP address, though. For each endpoint we track metadata: which availability zone it is in, whether it is ready, what shard it belongs to, and its current health status. That metadata is what enables the smart routing strategies. We translate all of this into xDS responses, specifically the ClusterLoadAssignment resource. When a client starts up and needs to talk to a downstream service, it just subscribes to EDS for that service, and from that point on updates are pushed automatically. So if a pod scales up, the client knows within seconds. The protocol, as I said, is xDS, the standard Envoy discovery protocol. This was a deliberate choice because we wanted to be in sync with what we are doing through Envoy: we have an API gateway built on top of Envoy, which we use pretty heavily, plus the internal RPC services, which are written in Scala, and we just wanted everything to be in sync.
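As a rough illustration of the kind of per-endpoint metadata such a control plane might push, a client could keep a versioned snapshot per service and swap it whenever a newer update is streamed. This is a minimal sketch; the field names are illustrative assumptions, not Databricks' actual schema or the Envoy ClusterLoadAssignment proto.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical per-endpoint metadata pushed by an EDS-style control plane.
final case class Endpoint(
  ip: String,
  port: Int,
  zone: String,          // availability zone, used for zone-affinity routing
  shard: Option[String], // logical shard the pod belongs to, if any
  ready: Boolean,        // Kubernetes readiness
  healthy: Boolean       // health as observed by the control plane
)

final case class ServiceEndpoints(
  service: String,
  version: Long,         // monotonically increasing update version
  endpoints: Vector[Endpoint]
)

// Clients keep the latest snapshot per service and replace it whenever the
// control plane streams a newer version.
final class EndpointCache {
  private val snapshots = new ConcurrentHashMap[String, ServiceEndpoints]()

  def onUpdate(update: ServiceEndpoints): Unit =
    snapshots.merge(update.service, update,
      (old, next) => if (next.version > old.version) next else old)

  def lookup(service: String): Vector[Endpoint] =
    Option(snapshots.get(service)).map(_.endpoints).getOrElse(Vector.empty)
}
```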

Bart Farrell: You had a strategic advantage in that Databricks services are predominantly Scala-based with a common RPC framework. How did the shared foundation influence your implementation approach? And what does the client integration actually look like?

Rohit Agrawal: As you said, Databricks has a strategic advantage that made this approach possible. The vast majority of our internal stack is written in Scala, for historical reasons: Spark was written in Scala, and our internal services continued using Scala. They all use a common RPC framework based on something called Armeria, another open source project. Because there is just one shared library, we could embed the service discovery and load balancing logic directly in that library. So one library change, and every service gets intelligent load balancing. There is no sidecar to deploy and no config to manage per service. The client integration is transparent: when a service makes an RPC call, the client subscribes to endpoint updates from EDS, maintains a dynamic list of healthy backends with their metadata, and makes per-request routing decisions. It completely bypasses DNS and KubeProxy, and it always has a live, accurate view of the service topology.

Bart Farrell: For the actual load balancing algorithms, you settled on Power of Two Choices, P2C, as the default strategy. For those unfamiliar with P2C, can you explain how it works and why it proved so effective?

Rohit Agrawal: P2C is beautifully simple. When a request comes in, instead of scanning all the backends to find the least loaded one, which is expensive and creates contention, you randomly pick just two. Then you send the request to whichever of those has fewer active connections or lower load. The math behind it is fascinating: power of two choices is a well-known result in computer science, showing that choosing the better of two random options is exponentially better than choosing one at random. In plain English, it almost completely eliminates hotspots. For us, the results were dramatic. We went from visibly uneven QPS across pods, some getting hammered, others idle, to near-uniform distribution. The before and after charts are striking. We considered more complex algorithms, but P2C struck the right balance: it's simple to implement, has low overhead per request, and it just works. We found that keeping it simple and consistent across services was more valuable than fine-tuning it per service. We did try some fancier load balancing algorithms, but the results were almost the same as what P2C gave us, so we just stuck with P2C.
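For reference, a minimal P2C picker might look like the sketch below, assuming each backend exposes a count of in-flight requests. The names and structure are illustrative, not the actual Databricks/Armeria implementation.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.util.Random

final class Backend(val address: String) {
  val inFlight = new AtomicInteger(0) // requests currently outstanding on this backend
}

final class P2CPicker(backends: IndexedSeq[Backend], rng: Random = new Random()) {
  def pick(): Backend = {
    require(backends.nonEmpty, "no backends available")
    if (backends.size == 1) backends(0)
    else {
      // Choose two distinct backends uniformly at random...
      val i = rng.nextInt(backends.size)
      var j = rng.nextInt(backends.size - 1)
      if (j >= i) j += 1
      val (a, b) = (backends(i), backends(j))
      // ...and route to whichever currently has fewer in-flight requests.
      if (a.inFlight.get() <= b.inFlight.get()) a else b
    }
  }
}

// Usage: increment in-flight before sending, decrement when the response arrives.
// val chosen = picker.pick(); chosen.inFlight.incrementAndGet()
// ... send request ...
// chosen.inFlight.decrementAndGet()
```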

Bart Farrell: Zone affinity routing is another strategy you support, which becomes critical for reducing cross-zone costs and latency. How does zone-aware routing work in your system? And how do you handle scenarios where a zone becomes overloaded or lacks capacity?

Rohit Agrawal: So zone affinity is about minimizing cross-availability-zone traffic. In cloud environments, sending traffic across AZs is expensive, because you pay for everything going in and out, and it also adds extra latency because the call has to cross zones. Our system knows which zone each client and each backend pod is in; that's part of the metadata that EDS provides. So by default, we prefer routing to backends in the same zone. But you can't just blindly pin traffic to the local zone. What if a zone has fewer pods, or some pods are unhealthy? So we built a spillover mechanism: when the local zone can't absorb the load, maybe it's at capacity or has too many unhealthy endpoints, traffic spills over to the healthy zones. It's a balancing act between zone preference and availability. The system continuously evaluates zone capacity and adjusts routing weights accordingly.
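A hypothetical sketch of that idea: prefer same-zone endpoints, but fall back to all healthy endpoints when too few local ones are healthy. The 50% threshold and the all-or-nothing spillover are illustrative assumptions, not Databricks' actual policy, which adjusts weights more gradually.

```scala
final case class ZonedEndpoint(address: String, zone: String, healthy: Boolean)

object ZoneAffinity {
  // Return the candidate set a client in `localZone` should load-balance over.
  def candidates(
      all: Vector[ZonedEndpoint],
      localZone: String,
      minLocalHealthyRatio: Double = 0.5 // illustrative spillover threshold
  ): Vector[ZonedEndpoint] = {
    val healthy       = all.filter(_.healthy)
    val localHealthy  = healthy.filter(_.zone == localZone)
    val localExpected = all.count(_.zone == localZone)

    // Stay local only if enough of the local zone's endpoints are healthy;
    // otherwise spill over to every healthy endpoint, regardless of zone.
    if (localExpected > 0 &&
        localHealthy.size.toDouble / localExpected >= minLocalHealthyRatio)
      localHealthy
    else
      healthy
  }
}
```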

Bart Farrell: Your control plane also speaks XDS to Envoy for managing external and ingress traffic. Why was it important to have both internal clients and gateway-level routing share the same sources of truth?

Rohit Agrawal: I would say consistency. That's the biggest reason. We wanted a single protocol that both Envoy and our internal clients speak. It eliminates the burden on service teams of maintaining two discovery systems, two systems doing the same job. By implementing EDS and speaking the same xDS protocol to both the RPC clients and Envoy, we ensured that internal service-to-service traffic and external traffic see the exact same endpoint data. We have a single source of truth. And I would say, as an Envoy maintainer, this was a natural choice. We also want to contribute back and make xDS better, so if we find any bugs or anything, we can just contribute the fixes back.

Bart Farrell: Now, let's talk a little bit about results. After rolling this out, what improvements did you observe in request distribution, latency profiles, and resource efficiency?

Rohit Agrawal: I would say the biggest thing we noticed is uniform distribution. That was the most visible change in how the QPS distribution looked. Before EDS, if you looked at the dashboard for a service, you would see jagged, uneven waves of requests across pods. After we rolled out the endpoint discovery, it's just a flat line: the same RPC load across the whole fleet, whether that's 20, 30, 40, or 50 pods, we were seeing the exact same load. The other big one was latency. The P90 latency became stable and more predictable, and the long-tail behavior we used to see was gone. Before EDS, some requests were taking 10x longer because something bad had happened to a pod, it would get stuck, and we would keep sending traffic to it. That went away. And efficiency: as mentioned before, we were over-provisioning because of the uneven load balancing. Our fleets are pretty balanced now, so we don't do that over-provisioning anymore. We trust the load balancing and we trust how EDS behaves when more traffic comes in; we have HPA, so it just autoscales. And I would emphasize these aren't just synthetic benchmarks. The system has been running at Databricks for the last three years, and these are real production numbers for services handling literally millions of requests.

Bart Farrell: The rollout wasn't without challenges. You discovered that server cold starts became a bigger issue once client-side load balancing was enabled. What was happening there and how did you address it?

Rohit Agrawal: This was an unexpected consequence that surprised us. Before EDS, with the long-lived connections, new pods would barely receive any traffic, because existing connections were stuck to existing pods. Pods had plenty of time to warm up, fill caches, and initialize connection pools. But once we switched to per-request load balancing, new pods started getting traffic almost immediately after they became ready, and ready in Kubernetes terms doesn't mean fully warmed up. We started seeing elevated error rates from pods that were still warming up their caches or still establishing connections. We solved it in two ways. First, slow start: new pods get a gradual ramp-up in traffic over a configurable window, say five minutes. Second, we biased traffic away from pods showing higher error rates, so if a cold-start pod returned errors, the load balancer would pick other healthy pods. This also reinforced the need for a dedicated warm-up framework, proactively warming up services before they enter the load balancing pool.
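One simple way to express slow start is to scale a new endpoint's effective weight up over a warm-up window so it is not immediately hit with full traffic. The sketch below assumes a linear ramp over five minutes with a small floor weight; the window length, ramp shape, and floor are illustrative assumptions, not the exact behavior described in the episode.

```scala
import java.time.{Duration, Instant}

object SlowStart {
  // Effective weight of an endpoint, ramping from a small floor up to its
  // base weight over the warm-up window after it became ready.
  def effectiveWeight(
      baseWeight: Double,
      readySince: Instant,
      now: Instant,
      window: Duration = Duration.ofMinutes(5),
      floor: Double = 0.1 // never zero, or the pod would never receive warming traffic
  ): Double = {
    val elapsed = Duration.between(readySince, now)
    if (elapsed.compareTo(window) >= 0) baseWeight
    else {
      val fraction = elapsed.toMillis.toDouble / window.toMillis.toDouble
      baseWeight * math.max(floor, fraction)
    }
  }
}
```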

Bart Farrell: You also experimented with metrics-based routing using signals like CPU usage, but ultimately moved away from it. What made that approach unreliable?

Rohit Agrawal: We had this attractive idea: what if we could route based on real-time CPU, sending requests to the least loaded pods as measured by actual resource consumption? It sounded great in theory, but it fell apart in practice. The core problem was the monitoring system. We have different SLOs for the serving workloads, and the Prometheus scrape interval might be 15 or 30 seconds, but in 30 seconds a pod's load can change dramatically. CPU metrics are trailing indicators, not real-time signals. By the time we see high CPU in the metrics, the pod might already be recovering. And by the time we route away from it, we might already be creating oscillation: everyone routes away, the pod goes idle, then everybody rushes back. So we learned to rely on signals that are directly observable in the request path, connection count, error rates, response latency, rather than infrastructure metrics sampled on a different cadence.
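As an example of a request-path signal that never goes stale, a client can keep an exponentially weighted moving average of response latency per backend, updated on every response rather than on a scrape interval. This is a minimal sketch; the decay factor is an illustrative assumption.

```scala
// Per-backend EWMA of observed response latency, updated from the RPC
// completion callback instead of being scraped by a monitoring system.
final class EwmaLatency(alpha: Double = 0.2) {
  @volatile private var value: Double = 0.0

  def record(latencyMillis: Double): Unit = synchronized {
    value =
      if (value == 0.0) latencyMillis
      else alpha * latencyMillis + (1 - alpha) * value
  }

  def current: Double = value
}
```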

Bart Farrell: Before settling on client-side load balancing, you evaluated alternatives like headless services and service meshes like Istio. What were the key limitations you found with headless services?

Rohit Agrawal: So headless services give you direct pod IPs via DNS instead of a cluster IP. I would say that's a step in the right direction, because clients can see the individual pod IPs. But in practice, a few things held us back. First, there are no endpoint weights. With a headless service, every pod is equal in the eyes of DNS, so you can't say this pod is warming up, give it less traffic, or this pod has higher capacity, give it more. Second, DNS caching creates staleness. Clients cache DNS responses, sometimes very aggressively, so a pod becomes unhealthy and gets removed from the endpoint list, but clients keep sending traffic to it because the DNS TTL is, say, 30 seconds. In a fast-moving environment with frequent deployments, this causes real problems. And the third thing, which was actually the deal breaker: DNS records carry no metadata. They're just IP addresses. You can't encode which zone a pod is in, which shard it belongs to, or what its health status is. And without that metadata, zone-aware routing and topology-aware strategies are almost impossible.
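To see that limitation for yourself, a plain DNS lookup of a headless Service from inside a cluster returns nothing but addresses. The service and namespace names below are placeholders, and the snippet only resolves when run inside a cluster where such a Service exists.

```scala
import java.net.InetAddress

object HeadlessLookup extends App {
  // Placeholder name for a headless Service; throws UnknownHostException if it
  // does not resolve (e.g. when run outside the cluster).
  val name = "my-service.my-namespace.svc.cluster.local"
  val ips  = InetAddress.getAllByName(name).map(_.getHostAddress)

  // All you get back is a flat list of pod IPs; any metadata needed for
  // zone-aware or weighted routing has to come from somewhere else.
  ips.foreach(println)
}
```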

Bart Farrell: Istio and service meshes offer powerful layer 7 features, but you decided against them. What factors made the sidecar-based approach unsuitable for Databricks' environment?

Rohit Agrawal: So Istio is really powerful and might be the right choice for many organizations, but it wasn't the right choice for Databricks at that time. Let me explain a little more. The first thing is operational complexity. Istio injects an Envoy sidecar into every pod, and at our scale, hundreds of services and thousands of pods, that means managing thousands of sidecars. The traffic team here is really small, and for a small infrastructure team that operational burden was impossible. The second thing is the performance overhead. Each sidecar adds some CPU and memory cost per pod, plus latency for every request that has to traverse the proxy. At our throughput level, it all adds up. The third thing is our specific environment. Databricks already has proprietary systems for things like certificate distribution and mTLS, and bringing in Istio would mean redoing all those pieces through Istio, because Istio has a very different way of doing certificate provisioning, certificate distribution, and everything around it. And most importantly, our environment is very heavily Scala-based. It's not like we have a mix of languages; it's 95% Scala. With the sidecar strategy, you're attaching a sidecar to every single service. The big advantage of a sidecar mesh is that it's language-agnostic, but in our case we wouldn't have benefited from that, because 95% of the stack is already Scala.

Bart Farrell: Looking ahead, you mentioned cross-cluster and cross-region load balancing as an area of exploration. What challenges does extending this system beyond a single cluster introduce?

Rohit Agrawal: So today our system operates within a single Kubernetes cluster, but Databricks manages thousands of clusters across multiple regions globally. The natural next step for us is extending intelligent load balancing beyond cluster boundaries. We are already exploring a flat L3 network across clusters and multi-region EDS deployments, so a client in one cluster can discover and route to endpoints in another cluster with the same zone-aware and health-aware strategies. The challenges, I would say, are non-trivial. Latencies between regions mean we need more sophisticated failover logic. Network partitions become a real concern. And the state synchronization problem gets much harder. We are also looking at more advanced load balancing strategies for AI workloads, weighted routing, and intelligent scheduling to handle the unique resource demands of model training and inference. The goal for us is robust cross-cluster traffic management that is fault tolerant, with globally efficient resource utilization.

Bart Farrell: For teams facing similar challenges with gRPC and persistent connections in Kubernetes, what key lessons or advice would you share from this journey?

Rohit Agrawal: I would say understand the layer mismatch. If you're running gRPC, or any protocol with persistent connections, on Kubernetes, L4 load balancing fundamentally cannot solve your problem. You need L7, per-request decisions. So don't fight it; embrace it. Then start simple. P2C is a surprisingly effective choice, so don't over-engineer your first load balancing strategy. Get the architecture right, real-time service discovery and per-request routing, and start with a simple algorithm. You can always add sophistication from there if you want. Consistency over customization: we found that keeping the same strategy across most services was more valuable than fine-tuning per service. Consistency also makes debugging easier and reduces the operational surface area. Validate with real data: run simulations, compare before-and-after metrics. We validated that load was evenly distributed and that tail latency, error rates, and cross-zone costs stayed within target thresholds. And consider your language ecosystem. If you're using one language and a shared framework across your environment, like Databricks does, client-side load balancing is incredibly effective.

Bart Farrell: Rohit, what's next for you?

Rohit Agrawal: On the Databricks side, I'm pushing forward on cross-cluster load balancing and building out a dedicated Envoy team. We are also exploring how to optimize traffic routing for AI workloads; the request patterns for model serving are very different from traditional microservices. On the Envoy side, I'm excited about something called dynamic modules. It's a new capability we have been working on with a few companies; it makes Envoy extensible with native Rust code without recompiling. I'm also co-hosting the upcoming EnvoyCon in 2026. And personally, I'm just excited to keep contributing to Envoy and Envoy Gateway.

Bart Farrell: And if people want to contribute to Envoy, or if they'd like to get in touch with you to speak about this more in detail, what's the best way to do that?

Rohit Agrawal: You can find me on LinkedIn. I can share my GitHub handle as well. My email is myfirstname.lastname at databricks.com. And I'm always happy to chat about Envoy, load balancing, networking infrastructure. Please feel free to reach out to me.

Bart Farrell: I can say it definitely worked in our case. Rohit, this is your first podcast, but it looks like you've been doing this for years, so keep up the amazing work. It was great talking to you. I hope our paths cross again in the future, and if not, see you at EnvoyCon. Take care. We'll speak soon. Cheers.

Rohit Agrawal: Thank you so much, Bart. Thank you so much for having me.
