Topology-aware routing: balancing cost savings and reliability
Host:
- Bart Farrell
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
In this episode, William Morgan, CEO of Buoyant, explores the complex trade-offs between cost optimization and reliability in Kubernetes networking. The discussion focuses on Topology-aware routing and why its implementation might not be the silver bullet for managing cross-zone traffic costs.
William shares practical insights from real-world implementations and explains why understanding these trade-offs is crucial for platform teams managing multi-zone Kubernetes clusters.
You will learn:
How Topology-aware routing attempts to reduce cross-zone traffic costs but can compromise reliability by limiting inter-zone communication
Why Layer 7 load balancing offers better traffic management through protocol awareness compared to topology-aware routing's Layer 4 approach
How HAZL (High Availability Zonal Load Balancing) provides a more nuanced solution by balancing cost savings with reliability guarantees through intelligent traffic routing
Transcription
Bart: In this episode of KubeFM, I had the chance to speak to William Morgan, CEO of Buoyant and core contributor to Linkerd, about the challenges and trade-offs of Topology-aware routing, or TAR. Designed to reduce cross-zone traffic costs, TAR can unintentionally undermine system reliability by limiting inter-zone communication, leaving services vulnerable to localized failures. We also discussed how L7 load balancing offers more granular control than L3 and L4 approaches, and why developers should prioritize business logic over navigating Kubernetes complexity. William shared insights on the evolving adoption of Kubernetes, highlighting a shift toward more principled Network Policies and the persistent pain points of managing them. Additionally, we touched on how AI tools can assist in generating Kubernetes configurations, and on the higher-order abstractions, beyond Custom Resource Definitions (CRDs) and YAML, that those tools point to. As the Kubernetes ecosystem matures, we explored how the future of networking aims to balance cost, reliability, and simplicity, so that developers can build resilient applications without unnecessary overhead. This episode is sponsored by LearnK8s, which provides Kubernetes training online and in person to groups and individuals all over the world. The courses are instructor-led, with 60% practical and 40% theoretical content. Students also have access to the course materials forever. For more information, visit LearnK8s.io. Now, let's get into the episode with William.
William: All right.
Bart: William, there are a lot of tools in the Kubernetes world, which three emerging tools are you keeping an eye on?
William: I'm definitely keeping an eye on Linkerd, which is my favorite tool. I hope it does well. I'm going to be pragmatic about it and say Helm and, probably, a 50-50 split between Argo and Flux. This is purely based on the tools I see used with Linkerd every day in production.
Bart: But not taking a side on Argo or Flux, keeping GitOps on an even keel.
William: I'm trying purely to be pragmatic. I don't really care. I'm sure one is vastly better than the other, but I don't know which one.
Bart: We did have a previous conversation about how you got into cloud native and what you did before that. Being in a position of leadership, if there's one piece of advice you could give to other open-source CEOs in this ecosystem, what would it be?
William: The advice I would give is to have a clear model for monetization and be really upfront about that with the community, even though you're going to get pushback about it. I think you have to be out there and say, "This is how this project is going to get funded." If it's an ethical problem for you to pay money for software, maybe this is not the project for you. I think you have to make that really explicit to people in this ecosystem, otherwise all sorts of other dynamics come into play. I will be discussing this topic at KubeCon in Salt Lake City, where I am participating in two panels about this exact same topic.
Bart: For the purpose of today's interview, we came across an article you wrote called "The Trouble with Topology-Aware Routing: Sacrificing Reliability in the Name of Cost Savings". The following questions will focus on that. To get started, could you walk us through a quick explanation of what Topology-aware routing is in the context of Kubernetes and why it's being widely discussed?
William: So, let's look at the situation. This is a tool that was developed to solve a specific problem. The problem is that I am running a multi-zone cluster. I'm using a cloud provider, and cloud providers have a two-tier system: regions and zones. Different cloud providers have different semantics for what these terms mean, but generally speaking, a region is larger than a zone, and a region failure is less common than a zone failure. Regions and zones are the cloud provider's failure domains.
A common model in this world, and a best practice according to some cloud providers' documentation, is to deploy a multi-zone cluster. This means I have a single Kubernetes cluster, but the nodes are spread across zones, typically three. For example, in AWS, the documentation recommends deploying nodes across all three zones. If one zone goes down, the cluster still functions because two zones remain, and Kubernetes is a distributed system. This provides a nice reliability feature.
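To make the multi-zone picture concrete: in a cluster like this, you typically ask the scheduler to spread each workload's replicas across zones. Here is a minimal sketch, assuming a hypothetical `api` workload; the topology key is the standard well-known label that cloud providers set on nodes:

```yaml
# Sketch: ask the scheduler to spread replicas evenly across zones.
# The app name, image, and replica count are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # keep per-zone counts within 1 of each other
          topologyKey: topology.kubernetes.io/zone  # well-known node label set by the cloud provider
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:latest    # hypothetical image
```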
However, from the user's perspective, the problem is that the cloud provider charges for traffic that crosses the zone boundary. This happens because Kubernetes is doing its job: load balancing and spreading traffic across all nodes for reliability, and then you get charged for the portion that goes across zones. If you have a tiny amount of traffic, the cost is negligible, but at high traffic volumes it can be substantial over time. In fact, you could end up paying millions of dollars every year just for cross-zone traffic. That's the situation.
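To put rough numbers on that (an assumption for illustration: AWS's commonly cited inter-AZ rate of $0.01/GB in each direction, roughly $0.02 per GB transferred; rates vary by provider and region, so check current pricing): with traffic spread evenly across three zones, about two-thirds of service-to-service calls cross a zone boundary. At 10 PB of monthly internal traffic, that is about 6.7 PB crossing zones, or roughly $134,000 per month, on the order of $1.6 million per year, purely for cross-zone transfer.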
Bart: In terms of the specific problems that Topology-aware routing is trying to solve regarding costs and cross-zone traffic, could you give some specific examples of real-world cases where you've seen this?
William: So, on top of that problem, we introduce Topology-aware routing, a feature built into Kubernetes that can be enabled to prevent traffic from crossing zones. This solves the problem of cross-zone traffic, but it's not a complete solution because we do want cross-zone traffic in certain situations, such as when something fails and we want traffic to recover across zones. Topology-aware routing has a set of rules that disable it in certain situations. For example, if all instances of a service in one zone suddenly go down, it will turn itself off and allow traffic to happen between zones, ensuring reliability. A concrete example is a distributed database, which typically has a set of database instances per zone. If the database goes down in one zone, it's a difficult situation because none of the services can access the database. In this situation, we want the traffic to be handled by the other zones. Topology-aware routing is a feature that promises to prevent cross-zone traffic when things are normal, reducing cloud spend, and disable itself when things fail, allowing traffic to resume between zones and ensuring a stable cluster even if a zone or database is down.
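For reference, enabling the feature is a single annotation on the Service. A minimal sketch with a hypothetical service name: on Kubernetes 1.27+ the annotation is `service.kubernetes.io/topology-mode`, while versions 1.23 to 1.26 used `service.kubernetes.io/topology-aware-hints` instead:

```yaml
# Sketch: opt a Service into Topology-aware routing.
apiVersion: v1
kind: Service
metadata:
  name: my-service                             # hypothetical
  annotations:
    service.kubernetes.io/topology-mode: Auto  # 1.27+; older versions used topology-aware-hints
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
```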
Bart: Regarding the cost, the FinOps element of this is a topic that has been addressed since the beginning of Kubernetes, as well as cloud spending in general. This year, with IBM acquiring a company like KubeCost, how does this figure into the FinOps framework and landscape, compared to other practices being done to monitor, reduce, and optimize?
William: This is an interesting topic because it's an avoidable cost and an arguable one. If you consider why you're doing multi-zone clusters in the first place, you might ask if it really makes sense. The concept of a zone varies significantly between cloud providers like AWS, Google, and Azure. For example, in Azure, there is no charge for cross-zone traffic, although there is a note in the pricing table stating that this may change with three months' notice. In Google, the network between zones is so fast that Linkerd's latency-aware load balancing won't naturally keep traffic in-zone; that behavior helps in AWS, where crossing a zone carries a high network cost, but in Google it can actually leave you worse off, paying for cross-zone traffic you didn't need to send. My answer to this question is that this is a more nuanced and arguable cost. You need compute, ingress traffic, and egress traffic, and cross-zone traffic is necessary in some cases. However, you might architect your cluster differently, using one cluster per zone. The world of Kubernetes is evolving, and there has been a shift from preferring larger, multi-zone clusters to having multiple smaller clusters and a kind of federation. This is a complicated topic, and the answer is not straightforward.
Bart: And how does Topology-aware routing specifically impact the ability to manage zone-level failures? Could you explain some of the trade-offs it introduces in this regard?
William: I wrote this blog post not because I thought Topology-aware routing was poorly designed. It's actually very well designed, and it uses all of the Kubernetes primitives. Linkerd supports it, so if you enable that feature and you're running Linkerd, then Linkerd will respect its rules. Topology-aware routing works by taking all of these endpoints and restricting them at layer 4: if you are in zone A, you're not even going to know about an endpoint in zone B, and if you're in zone B, you're not going to know about an endpoint in zone A.
Over the past year or two, we started paying attention to what our customers were experiencing. We found that we had a lot of customers who were in the multi-zone case with lots of traffic, and they were not actually able to effectively use Topology-aware routing. There were two reasons for that. One was that it was hard to get it working consistently. The feature disables itself under a whole set of conditions and doesn't compose well with autoscaling, like Horizontal Pod Autoscaling (HPA), or some other features. It was kind of finicky, by design: to keep that reliability guarantee, the logic is essentially, "Hey, if something's going weird, we're going to turn ourselves off, because it's better to be reliable than to be cheap." Those customers basically stopped using it, but they still had the cost problem.
We also had other customers who said it didn't save them under certain conditions. Their apps were failing, but Topology-aware routing didn't turn itself off. When we started digging into that, we realized that because Topology-aware routing is built on top of layer 4 networking, which is fine, it couldn't handle situations that required understanding the protocol, understanding HTTP success rates, or understanding request latency.
To really understand this, you have to understand the model of in-band health checking versus out-of-band health checking. A workload's running in Kubernetes, and we want to health check it. Kubernetes does what's called out-of-band health checking, which means application traffic is happening on port 1234, and the health check is happening on port 4321. The health check is saying, "Hey, are you healthy?" And the app is saying, "Yes, yes, yes, yes."
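In Kubernetes terms, that out-of-band check is the readiness or liveness probe, which can target a completely different port and path from the one serving user traffic. A minimal sketch, with hypothetical names and the port numbers from William's example:

```yaml
# Sketch of out-of-band health checking: user traffic on one port,
# the health check on another. The probe only ever exercises the
# /healthz handler, never the real request path.
apiVersion: v1
kind: Pod
metadata:
  name: app                                     # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest    # hypothetical
      ports:
        - containerPort: 1234                   # application traffic
        - containerPort: 4321                   # health checks
      readinessProbe:
        httpGet:
          path: /healthz
          port: 4321
        periodSeconds: 5
```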
If you look at the actual code, typically there's a handler function for the health check, and then there's a handler function for the actual application code. The handler function for the health check is typically like "return yes" or "return okay." The handler code for the application calls whatever complicated business logic. You kind of have this decoupling. The consequence of that is that if something starts going wrong with the application, it's not always reflected in the health check.
So, for example, suppose our database is just taking forever to respond. It's not failing; it's just taking longer than expected. Our application starts timing out and then starts returning 500 errors. The health checks are going to say, "I'm fine," but the user-facing impact is that this thing is dead. That's the kind of situation that Topology-aware routing cannot capture, not because it's poorly designed, but because it can only make use of out-of-band health checks.
When you move to the service mesh layer, which is true of Linkerd and any service mesh, you suddenly have access to the entire application traffic. You can parse it, you can say this is a 500 versus a 200, or this is a gRPC error versus a gRPC success. Now you can start doing what's called in-band health checking, where you can say, "Hey, this service is running, it's responding okay to its Kubernetes probes, but every time I call it, it's returning a 500." The service is actually not healthy.
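Linkerd surfaces this in-band signal through its circuit-breaking ("failure accrual") annotations, which take an endpoint out of rotation after consecutive failed responses, even while its Kubernetes probes keep passing. A minimal sketch; the annotation names come from Linkerd's circuit-breaking feature (2.13+) and the threshold is illustrative, so verify against your version's docs:

```yaml
# Sketch: in-band health via Linkerd's consecutive-failure accrual
# (circuit breaking). After the configured number of consecutive
# failed responses, the endpoint is backed off even though its
# Kubernetes probes still pass.
apiVersion: v1
kind: Service
metadata:
  name: my-service                                                   # hypothetical
  annotations:
    balancer.linkerd.io/failure-accrual: "consecutive"
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "7"
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
```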
Bart: If we're thinking about dependencies, how does Kubernetes' implementation of Topology-aware routing impact critical dependencies across different zones, and what alternatives exist for maintaining reliability in such setups?
William: Topology-aware routing itself doesn't really know about dependencies, because dependencies are an emergent property of who's communicating with whom. Topology-aware routing can be enabled for a service or for all services, but it doesn't know which service is a dependency of another. In fact, Kubernetes itself doesn't know this either, as it would require knowledge of the call graph, which can be mapped out by tracing TCP connections. However, there's no reason for Kubernetes to do this, so typically this information is deferred to another tool, such as the CNI plugin or the service mesh.
Topology-aware routing doesn't know about dependencies; it only knows whether a service is healthy and has enough endpoints in each zone. If the service is healthy, it prevents TCP connections from crossing the zone boundary. If something goes wrong, it allows TCP connections to cross the zone boundary.
Linkerd, on the other hand, is in a position to know about these types of failures. It can detect when a service is returning 500s and treat it as a failing service. This sparked the interest in introducing a new feature called HAZL (High Availability Zonal Load Balancing). HAZL aims to solve the same problem as Topology-aware routing, but with full access to the traffic semantics between components.
With HAZL, not only can it understand success rates and errors, it can also capture the 500-return-code case. It can be more nuanced and determine how much load a service in a zone is seeing and whether that load is exceeding a threshold. If the load is exceeding the threshold, it might want to broaden the set of endpoints that traffic is sent to.
HAZL can do this because it has access to all the layer 7 semantics of the traffic. It knows how many requests an application is handling and whether requests are queuing up. If a service is overloaded, HAZL can add more endpoints and start sending traffic across zones. When things return to normal, it can shrink the set of endpoints back to the same zone.
With HAZL, you can get the best of both worlds: the cost-savings benefit of not sending traffic across zones when things are fine, and the reliability of sending traffic across zones when the system is under pressure. Being cost-effective is important, but being reliable is more important; reliability is typically the first promise you make as a platform. To use HAZL, you need to install a service mesh, which is a trade-off.
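HAZL ships in Buoyant Enterprise for Linkerd and is configured per service. The sketch below is purely illustrative of the shape of such a load-band policy; the annotation keys are hypothetical placeholders, not Buoyant's actual API, so consult the HAZL documentation for real configuration:

```yaml
# Purely illustrative; these annotation keys are hypothetical
# placeholders, not Buoyant's actual API. The idea: a per-service
# load band. Below the low-water mark, traffic shrinks back in-zone;
# above the high-water mark, endpoints from other zones are added.
apiVersion: v1
kind: Service
metadata:
  name: my-service                       # hypothetical
  annotations:
    example.buoyant.io/load-low: "0.8"   # hypothetical key
    example.buoyant.io/load-high: "2.0"  # hypothetical key
spec:
  selector:
    app: my-service
  ports:
    - port: 80
```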
Bart: We're close to Black Friday and Cyber Monday, days when we often discuss traffic, reliability, scalability, provisioning, and load balancing. All these factors will come into play. Specifically, on the topic of L7 versus L3 and L4 load balancing, in the article, you mentioned layer 7 balancing as an alternative approach. What would be required to make this a viable alternative to Topology-aware routing?
William: I think it's a superior alternative to Topology-aware routing already. Are you talking about L7 balancing or something like HAZL (High Availability Zonal Load Balancing)?
Bart: L7 balancing, which refers to Layer 7 load balancing, involves distributing network traffic across multiple servers based on application-layer data.
William: L7 load balancing simply means deciding where to send requests based on L7 properties. For those not immersed in the network space, L7 refers to understanding the protocol. There is a long and complicated history behind why we arrived at L7, but it essentially means understanding the protocol. Kubernetes does not understand the protocol; it only makes TCP connections. Topology-aware routing, being a Kubernetes subsystem, also does not understand the protocol. To understand the protocol, you effectively need to run a service mesh. For L7 load balancing to be an alternative to Topology-aware routing, you need to be able to balance requests and know which services are in which zone, which endpoints are in which zone, and under what situations you want to send traffic across zones. Normally, you may not want to send traffic across zones, but there may be situations where you do. We built a solution using Linkerd, where we put in enough logic to allow the L7 load balancer to balance across the same endpoints in the zone, unless it detects something unusual, in which case it starts adding more endpoints across zones. You can also put in arbitrary cost weights, making this solution very intricate. I tried to simplify the user-facing surface area, but there are still ways to make it even simpler as we roll this out into production and see people use it.
Bart: Since you mentioned the end users, the people who will be working with this, we're thinking about developer versus operator perspectives. What do you think could be misunderstandings that developers or operators might have when deploying Topology-aware routing in cloud environments?
William: I have a strong opinion that there's a developer side and an operator side, or more accurately, a platform-owner side and a developer side. This model is successful in organizations that use Kubernetes effectively. The platform team owns the platform and is responsible for building an internal platform for developers. The developers' job is to write the business logic that powers the application. In this world, I argue that developers shouldn't know about the technology we're running. They shouldn't know about Kubernetes or write Custom Resource Definitions (CRDs) or YAML. That's the job of the platform. As a platform owner, our contract with developers is to provide a reliable, secure, and observable platform. Developers should write their code, press a button, and have it run on the platform without worrying about the underlying technology. We promise a platform that accomplishes those goals; in return, developers should own their own services in production rather than throwing them over the wall and leaving us on call.
Bart: As a developer, you just focus on building business logic and making sure that it can scale. To take this a little further: when we interviewed 30 CNCF ambassadors about the next 10 years of Kubernetes and asked about their least favorite Kubernetes features, Network Policies management and networking in general came up a lot. Are you surprised by that? What would need to happen for that not to be the case? What are you seeing, both from the technical folks contributing to projects such as Linkerd and from end users, in terms of the challenges they're facing and the questions you find yourself answering a lot? What world do we need to build for networking to become simple?
William: My instinct is that when we start talking about Kubernetes, everything is beautiful and pure. We discuss services, deployments, and objects that make sense to developers, operators, and platform owners. Kubernetes abstracts away individual machines, creating a pool of resources. When using Kubernetes, you don't worry about individual machines; instead, you focus on deploying services and letting the system figure it out. However, with networking, the abstraction melts away, and we start talking about IP addresses, IP tables, and other underlying details like CNI.
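Network Policy is a concrete example of where that abstraction melts: half the object speaks Kubernetes labels, the other half raw CIDR ranges. A minimal sketch with illustrative names:

```yaml
# Sketch: half of this policy speaks in labels (the Kubernetes
# abstraction), half in raw CIDR ranges (the network underneath).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend           # hypothetical
spec:
  podSelector:
    matchLabels:
      app: backend               # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # label-based: the abstraction holding
        - ipBlock:
            cidr: 10.0.0.0/16    # IP-based: the abstraction melting away
      ports:
        - protocol: TCP
          port: 8080
```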
My hope for Linkerd, especially in a multi-cluster environment, is that it can provide a layer of abstraction over the network, similar to how Kubernetes abstracts hardware. In a multi-cluster setup, you have to figure out how a pod in one cluster communicates with a pod in another, which is a challenge because Kubernetes doesn't provide many primitives for doing so. This is where a service mesh like Linkerd can help. We're currently working on adding new multi-cluster features to Linkerd, including something called federated services. This feature will allow you to deploy a service across multiple clusters without having to manage it as individual objects. You can simply talk to the service without worrying about its replication across clusters.
We're trying to explicitly capture the idea that you have a service deployed across multiple clusters, but you don't want to have to interface with it as individual objects. We want to build features in Linkerd that ease this process. Federated services, which will be available in Linkerd 2.17, is an example of the kinds of abstractions we can put on top of the network to make it easier to work with. At the end of the day, you don't really care about the underlying network details; you just want to be able to talk about services communicating with each other without worrying about implementation details like Network Policies.
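Based on the Linkerd 2.17 announcement, joining a service into a federated service is label-driven; the label below comes from that announcement, so verify it against the released documentation:

```yaml
# Sketch (per the Linkerd 2.17 announcement; verify against released
# docs): applying this label to the same Service in each linked
# cluster joins them into one logical federated service that clients
# address by a single name.
apiVersion: v1
kind: Service
metadata:
  name: my-service                       # deployed in every cluster
  labels:
    mirror.linkerd.io/federated: member
spec:
  selector:
    app: my-service
  ports:
    - port: 80
```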
Bart: And looking towards a future that, as you say, could hopefully be boring, do you think that Topology-aware routing will become a standard part of Kubernetes networking or remain niche due to its reliability concerns?
William: Topology-aware routing is going to be a standard offering, just because it's already in Kubernetes. There are situations where you just want to turn it on and everything is fine. However, when you are outside of those situations, you need more nuance, more control, and the ability to handle different types of failure. In those cases, you need to use a service mesh. For all the reasons we've discussed, you have to elevate that logic to Layer 7. My advice to anyone is to be pragmatic about these decisions. If you have a situation where a built-in Kubernetes capability like Topology-aware routing solves your problem, then use it. There's no reason to overcomplicate things. Just be aware of the limitations and the situations where you need to do something more.
Bart: Given your experience, what advice would you offer teams weighing the cost savings of Topology-aware routing against potential reliability risk?
William: I would say that if you are in a situation where the cost of cross-zone traffic is a significant number, the CFO or finance team may come to you saying this is too expensive. In such a case, I would seriously consider HAZL (High Availability Zonal Load Balancing), the Linkerd feature, and Topology-aware routing. I recommend reading through my blog posts to understand the situation and how to address it. The cost is avoidable; you do not have to give up and pay it. Instead, you can fix the issue by being clear on the reliability implications and how they apply to your situation.
Bart: As we get towards the end, William, what's the one thing you normally tell people?
William: Normally, I'm no longer saying this is a silver bullet that will solve all your problems. Now, I'm trying to be low-key about my opinion. Of course, I believe my solution is the best. You should evaluate all the options, and I'm confident you will end up agreeing with me.
Bart: I always tell people that the less you sell, the more you will sell. This approach invites further conversations and shows that you're open to debate. I think this is a very healthy and solid approach. You're very active in this ecosystem, and often people get to know others through their involvement in an open source project or a company. What are things that people don't know about William, the person, that you would like them to know?
William: I have horrible dark secrets, but one thing that's been particularly interesting to me with the resurgence of AI is that it's where I started my career. I went to school for natural language processing, and my first couple of jobs out of grad school were focused on making computers better at handling human speech and text. I got annoyed with that field because it was too hard to create a product, so I moved to infrastructure, where it was easier to solve immediate problems. Now, 20 years later, it's funny to see that AI has actually started to work, thanks to 20 years of serious research. There isn't really a moral to this story.
Bart: Just because we saw many conversations around AI at the last KubeCon in Paris, some people might debate how concrete those conversations were and how realistic or tied to real-life situations they were. Where do you see this going? If we're thinking about stakeholders, we're talking about operators, platform owners, versus developers - your average Kubernetes user. To what extent is AI impacting the work they're doing right now? What do you expect to happen in the coming six months to a year, especially with regards to Custom Resource Definitions (CRDs)?
William: When building a platform, and certainly when building Linkerd, the goal is to make things as predictable, cheap, efficient, and understandable as possible. The operator's mental model is that of someone who has to run Linkerd and may be woken up at 3 AM to deal with alarm bells and red lights. In this situation, Linkerd needs to be simple and easy to understand so that the operator can quickly identify and fix problems. This is a high bar to reach in a complicated ecosystem. On the other hand, AI is often the opposite of predictable, with complex neural networks and uncertain outputs. However, developers are getting significant utility out of code generation, particularly in generating YAML. While this may not be a great solution, it highlights the need for higher-order abstractions to accomplish complex tasks in Kubernetes. Currently, these tasks are often accomplished by writing Custom Resource Definitions (CRDs) and YAML, but this can lead to complexity and difficulty in understanding. The use of AI to generate YAML may solve the generation problem, but it does not address the underlying complexity. This speaks to the need for higher-order abstractions and more expressive tools. The fundamental mismatch between predictability and expressiveness makes it challenging to imagine how useful AI could be in this context. Despite this, it is an exciting area to explore, particularly as customers build large platforms and struggle with significant challenges. The fact that everything ends up being YAML at the end of the day suggests that there is something missing, and this is an area where AI could potentially be useful.
Bart: I'm sure there will be plenty of conversations around it, which is why I asked. I know that I asked about this previously, but Network Policies management really stood out in the conversations we had when people were asked about their least favorite Kubernetes feature. What's the thing that you see people getting wrong most frequently when it comes to networking in Kubernetes?
William: When it comes to networking as a whole, I don't see people getting things wrong that are really networking-specific. I think the pattern that I see happening over and over again is that Kubernetes adoption is going through a transformation now. Kubernetes has been around for 10 years, and Linkerd has been around for 9.5 years. We've seen this from the early days, where people were deploying Kubernetes in a piecemeal manner - one cluster from one team, another cluster from another team, and a third cluster from another team. Now, people are taking another look and saying, "Let's be really principled about this. Kubernetes is here to stay, and it's not going to disappear. The ecosystem is still growing, Linkerd is still here, and Fluentd is still growing." They're taking a different approach and redoing a lot of what they did in the past. I wouldn't call that a failure; I'd argue that's probably the right way to do it. You don't want to jump into the ocean before you're ready. That's a pattern that I'm seeing over and over again, and it's really interesting. It has informed a lot of our thoughts in Linkerd, where I talked about our initial multi-cluster approach, which was built in 2019. It was meant for those piecemeal situations, where you have clusters that need to talk to each other without the developers having to know about it. That worked fine, but those semantics start to break down when you're building 200 clusters and want them all to be the same, with one layer of service across all of them. In that case, the tools seem pretty clunky, and you need a different set of tools and a different set of demands. That's part of what's really exciting to me - I feel like there's a lot of evolution that still has to happen in the Kubernetes space, because the way people are tackling it is evolving in a pretty significant way.
Bart: William, what's next for you?
William: Eat lunch, maybe take a nap.
Bart: That's good. You have to do that before you do anything else. Those should not be forgotten.
William: No, we're at a super exciting time in Linkerd land. Buoyant and Linkerd are kind of two entities, and I've historically tried to separate them. However, what I've learned over the past year and a half is that you have to talk about them as the same story. This is exactly the advice I was giving at the beginning of this conversation. My advice to companies, to CEOs of companies operating in the cloud native space, is to be really explicit about the fact that there's a company and an open source project, and they're interconnected. We just announced that we're profitable, and Linkerd is getting new maintainers, which will supercharge everything. We have a whole set of features that I'm excited about, which will ship next month with Linkerd 2.17. We'll have Egress control, which is the ability to turn off or get visibility into traffic that's leaving your cluster, super important for security reasons. We've also got Rate limiting, and the Federated services feature, which is a direct response to what I was talking about with people trying to build these new Kubernetes platforms. After that, there's a whole roadmap for Linkerd that's full of exciting and interesting features. I look forward to sharing those.
Bart: Fantastic. If people want to get in touch with you, what's the best way to do that?
William: You can email me, William, at Buoyant. Just make sure to spell "buoyant" correctly, as the "u" and the "o" are in a specific order; look it up in the dictionary if necessary. I've been using Bluesky a lot, thanks to Kelsey Hightower, so you can find me there. I've abandoned Twitter, which is emotional for me since I was on it for 15 years and even worked there. I'm giving Bluesky a try, so find me on Bluesky. You can send me an email or come see me in Salt Lake City, where I'll be standing by the Linkerd booth. I've probably done something to upset you, so come on over.
Bart: All right, William. We're looking forward to talking to you soon. Take care. Have a good one.
William: Thanks, Bart.