I just want mTLS on Kubernetes

Host:

  • Bart Farrell

Guest:

  • John Howard

Dive into the world of Kubernetes security with this insightful conversation about securing cluster traffic through encryption.

John Howard, Senior Architect at Solo.io, explains the complexities of implementing Mutual TLS (mTLS) in Kubernetes. He discusses the evolution from DIY approaches to Service Mesh solutions, focusing on Istio's Ambient Mesh as a simplified path to workload encryption.

You will learn:

  • Why DIY mTLS implementation in Kubernetes is challenging at scale, requiring certificate management, application updates, and careful transition planning

  • How Service Mesh solutions offload security concerns from applications, allowing developers to focus on business logic while infrastructure handles encryption

  • The advantages of Ambient Mesh's approach to simplifying mTLS implementation with its node proxy and waypoint proxy architecture

Transcription

Bart: In today's episode of KubeFM, you get a chance to speak to John Howard, who is a senior architect at Solo.io. We talked about Kubernetes security, Service Mesh, and workload encryption. John explains the evolution of Istio, the role of Mutual TLS (mTLS) in securing cluster traffic, and the trade-offs between DIY security implementations and Service Mesh solutions. We also discuss Ambient Mesh, its impact on simplifying encryption, and how it compares to Cilium's network encryption approach. John provides insights into the challenges of scaling security across Kubernetes environments, the complexities of implementing identity-based Network Policy, and how emerging Network Policy standards aim to improve security. We also examine the performance implications of different encryption methods, such as WireGuard, IPsec, and TLS (Transport Layer Security), as well as their impact on Kubernetes networking. Finally, John shares his perspective on the future of Service Mesh adoption and the evolving role of security automation in cloud-native environments. This episode is sponsored by LearnK8s. LearnK8s has provided Kubernetes training all over the world since 2017. Courses are instructor-led and are 60% practical and 40% theoretical. Students also have access to the course materials for the rest of their lives. LearnK8s provides training both online and in-person. For more information, check out LearnK8s.io. Now, let's get into the episode. So, John, welcome to KubeFM. Can you tell me what three emerging Kubernetes tools you're keeping an eye on?

John: Hey, welcome. Thanks for having me. I'm looking at three tools, maybe an odd selection. One of the things I'm most looking forward to is not a new tool at all, but rather Kubernetes itself and the new developments happening within it. I often think, or hope, that Kubernetes is the stable, boring thing that already does everything I need, and that I don't need to worry about what's going on with it anymore. But every release or two, they ship something pretty compelling, which is always exciting. For instance, past releases include native sidecar support, which is huge for me, as well as replacements for webhooks, such as validating admission policies and mutating admission policies, which are also cool. So, there's always new stuff that enhances existing use cases. Another notable mention is Kindnet, a newer CNI (Container Network Interface) that reimagines what Kubernetes networking should look like in 2025. It checks all the boxes of being super simple, feature-rich, and very high performance, in contrast to some previous implementations that may be high performance and feature-rich but are super complex. I'm also looking at older projects, not just newer ones, such as Prometheus, which has been around for many years. I look through their releases to see what issues they're addressing even eight years into the project, what low-hanging fruit they've already knocked out, and what's left. Often, they're still tackling issues that haven't been resolved yet.

Bart: For those who may not know, can you give us a quick introduction - a bit more about who you are, what you do, and where you work, specifically at Solo.io?

John: Hi, I'm John Howard. I am a senior architect at Solo.io. We are a company that works on cloud connectivity, so we do Service Mesh and API gateways. I personally spend most of my time working on Istio, a CNCF Service Mesh project that I've been working on for about six years now, previously at Google and now at Solo.io.

Bart: Fantastic. Can you tell me a little bit about how you got into Cloud Native?

John: I joined Google out of college, and they have a team assignment process. I expressed vague interest in back-end work, and they assigned me to Istio. This was around the time Istio released its 1.0 version, an exciting period to get involved in the project. Although it was solid and had a lot of hype, there was still much work to be done to make it what it is today. It was a pretty interesting thing to get involved in.

Bart: The Kubernetes ecosystem moves very quickly. You mentioned previously taking a closer look at projects over the past year. How do you stay up to date with all the changes that are happening? What resources work best for you in terms of podcasts, blogs, tutorials, or anything similar, such as Hacker News or Reddit?

John: On big projects I'm already aware of, I subscribe to them on GitHub to get notifications about releases. For other things, it's impossible to keep up with all of them, as there are so many. I rely on Hacker News and Reddit to see what people are working on.

Bart: And if you could go back and share one career tip with your younger self, what would it be?

John: Forget about Cloud Native and go work at NVIDIA or something. But more seriously, remember that it's a long journey. Don't worry about optimizing for getting something done this week or next week if it means doing it poorly. Every time I've taken the time to do something that will pay off over a long period, it has paid off. However, it's often the scary choice to make at the time.

Bart: All right. As part of our monthly content discovery, we found an article you wrote called "I just want TLS on Kubernetes". The following questions are designed to explore these topics further. Let's start with Kubernetes security. Many teams want to secure their cluster traffic, but they're unsure where to start. What's a typical request you hear from Kubernetes users regarding traffic encryption, specifically TLS (Transport Layer Security)?

John: The requirements for security vary greatly. Sometimes, clients are unsure about what they want and ask for security without specifying their needs. Other times, they have a detailed document outlining their requirements, including specific encryption protocols, cipher suites, TLS (Transport Layer Security) versions, and FIPS compliance checklists. Ultimately, most clients want their traffic to be authenticated so they can write policies against it and achieve a zero-trust network setup. To me, a zero-trust network means that rather than allowing anyone to connect to anything, every hop along the way authenticates and authorizes who can connect. For example, a single service should only receive traffic from the front end, and we would authenticate the request to ensure it comes from the front end. Additionally, the front end may not be allowed to access certain ports, such as admin ports, which are reserved for administrative purposes. The goal is to have a secure network, but the path to getting there can be long and winding. On the other end of the spectrum, compliance is often about checking boxes on a list, particularly for government compliance, which may involve Network Policy, Mutual TLS (mTLS), and Service Mesh.
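
To make that concrete, here is a toy sketch in Go (not from the episode; the identity names and port numbers are made up). It captures the zero-trust idea John describes: deny by default, and allow a connection only when an authenticated caller identity is explicitly permitted to reach a given port.

    package main

    import "fmt"

    // rule is one allowed (caller identity, destination port) pair.
    type rule struct {
        caller string // authenticated identity of the client, e.g. proven by an mTLS client certificate
        port   int    // destination port on this workload
    }

    // Deny by default: only pairs listed here are allowed.
    // The front end may reach the serving port (8080) but not the admin port (15000).
    var allowed = map[rule]bool{
        rule{caller: "frontend", port: 8080}: true,
    }

    func authorize(caller string, port int) bool {
        return allowed[rule{caller: caller, port: port}]
    }

    func main() {
        fmt.Println(authorize("frontend", 8080))  // true
        fmt.Println(authorize("frontend", 15000)) // false: admin port is off limits
        fmt.Println(authorize("unknown", 8080))   // false: unknown or unauthenticated caller
    }

In a real deployment this decision would be made by the mesh or CNI layer against the identity proven during the mTLS handshake, not hand-rolled in every application.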

Bart: Now, in the article, you mentioned specifically mTLS. For those who might not be familiar, can you explain what mTLS is and why it's considered important in Kubernetes environments?

John: mTLS is mutual TLS, or mutual Transport Layer Security. TLS itself is super common. If you're listening to this, it's probably over TLS; any website you go to these days is over HTTPS, which is TLS. Mutual TLS means we're adding a reverse layer of authentication. When I go to a website like example.com, I'll get a certificate that they present, which has their identity, example.com, in the domain. I verify that and know it's trusted because there's a list of trusted CAs that I have on my machine. Mutual TLS means that as the client, like my browser, I'll also give you a certificate. I say, "Hey, I'm John. Here's my certificate." Then the server can authenticate the client. In the web world, no one ever does this. Technically, you can do it, but if a pop-up came up on my browser saying "Attach your client certificate" when I visit a website, I would be concerned. We've solved web authentication at a different layer with things like logins, but client certificates are still a useful option for the infrastructure layer. In Kubernetes, for example, this would mean that for any workload-to-workload traffic, not only is the server presenting a service certificate, but the client is also presenting one that allows the server to authenticate it. Going back to the front-end to back-end scenario, if we want to authenticate and authorize the front end, it would present its own certificate to the back end to verify who it is and where it came from.
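
To make the handshake roles concrete, here is a minimal Go sketch (not from the episode; the file paths, port, and CA layout are placeholder assumptions). The only difference from ordinary TLS is that the server also requires and verifies a certificate from the client:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // The server presents its own certificate, as in ordinary TLS.
        serverCert, err := tls.LoadX509KeyPair("server.crt", "server.key")
        if err != nil {
            log.Fatal(err)
        }

        // CA bundle used to verify client certificates.
        caPEM, err := os.ReadFile("ca.crt")
        if err != nil {
            log.Fatal(err)
        }
        clientCAs := x509.NewCertPool()
        clientCAs.AppendCertsFromPEM(caPEM)

        server := &http.Server{
            Addr: ":8443",
            TLSConfig: &tls.Config{
                Certificates: []tls.Certificate{serverCert},
                ClientCAs:    clientCAs,
                // This is the "mutual" part: the client must present a certificate
                // signed by a trusted CA, or the handshake fails.
                ClientAuth: tls.RequireAndVerifyClientCert,
            },
        }
        log.Fatal(server.ListenAndServeTLS("", ""))
    }

The client side is the mirror image: a tls.Config with Certificates (its own identity) plus RootCAs (to verify the server), so each side authenticates the other.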

Bart: When it comes to implementing Mutual TLS (mTLS), some teams might consider a do-it-yourself approach. What does a DIY implementation of mTLS look like in Kubernetes, and what challenges might teams face?

John: It's tricky because on the surface, it seems simple. Anyone developing an app can probably figure out how to make it serve TLS (Transport Layer Security) instead of plain text, but the details and reality at broad scale are often far different. At a very high level, you'd first have to get a TLS certificate for all your applications. This alone can be extremely challenging. There are some great tools in the ecosystem, like cert-manager and SPIRE, that can make this much simpler, but often people are DIYing this as well, which adds even more complexity. Then you need to get that certificate to all the applications and update them to use it. Oftentimes, when people are trying to meet their requirements, they don't just require one or two applications to have TLS; they want the entire cluster to have it. This includes third-party applications and legacy stuff that you don't want to touch and update anymore. Everything needs to be updated. You also need to update all your clients to start using TLS, and they're going to have to start attaching a client certificate as well. You'll want to ensure that you don't somehow cause an outage between these steps, because if you're serving TLS, you're not serving plain text. So, how do you manage that transition across the entire cluster, maybe even across the entire company, with hundreds of thousands of workloads? That can be extremely challenging. You'll probably also want to make sure that you've implemented certificate rotation properly. The worst kind of outage is a timer going off at midnight and suddenly your whole environment is down because you forgot to rotate certificates that were humming along fine for 90 days, or whatever the lifetime was. You'll want to understand what traffic is encrypted and what is not. And you'll need to do all of this outside of Kubernetes. Oftentimes, what I would see people do is put a reverse proxy in front of their applications, like nginx or Caddy, and say this will take care of TLS, offloading it from the application. But in Kubernetes, that is not as common to do on every single application, and you still have most of these problems anyway. It's actually even more complex than all this because, in standard TLS, you have an intrinsic link between the domain name and the certificate. If you go to example.com, DNS helps you find where that is, and then it presents a certificate for example.com. But if you're an application and you get a request from someone, how do you know who should be able to talk to you? There's no standard linking of a client identity. There are some de facto standards like SPIFFE, which is tied to the SPIRE project, that define a loose scheme for workload identity. But it's not just out of the box and automatic. You have to come up with a naming scheme. For example, in Istio, we have a naming scheme that's basically the trust domain, slash the namespace, slash the service account, which is fine, but we had to come up with that. And then you need to go update all your applications to understand it. So, it's just a kind of ever-growing list of steps you need to take. I rarely see this succeed in large organizations.
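
As a sketch of that last point, here is roughly what "decide which client identities may talk to me" looks like if you wire it up yourself in Go, building on the server sketch above. The SPIFFE-style identity format is an assumption modeled on Istio's convention, and the helper name is made up:

    package main

    import "net/http"

    // requireClientIdentity only lets through requests whose verified client
    // certificate carries the expected identity, e.g. an Istio/SPIFFE-style ID
    // such as spiffe://cluster.local/ns/frontend/sa/frontend encoded as a URI SAN.
    // The naming scheme itself is a convention you have to pick and roll out
    // to every client and server yourself.
    func requireClientIdentity(allowed string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
                http.Error(w, "client certificate required", http.StatusUnauthorized)
                return
            }
            // The certificate was already verified against ClientCAs during the
            // handshake; here we only check *which* identity it carries.
            for _, uri := range r.TLS.PeerCertificates[0].URIs {
                if uri.String() == allowed {
                    next.ServeHTTP(w, r)
                    return
                }
            }
            http.Error(w, "forbidden", http.StatusForbidden)
        })
    }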

Bart: And given the complexities of the DIY approach, many turn to Service Mesh solutions. Can you explain what a Service Mesh is and how it typically implements Mutual TLS (mTLS) in Kubernetes?

John: We talked about putting a reverse proxy in front of the application and offloading TLS to it instead of doing it in your application. A Service Mesh is kind of like that, but a better, or at least more optimized, way to handle this use case. I view Service Mesh as a way to offload various tasks from your application. We're talking about TLS here, but it could be a whole list of things: observability, logging, metrics, traces, reliability, retries, timeouts, and Service Discovery, as well as authentication with Mutual TLS (mTLS) and authorization. If I were to put all these features into every single one of my applications, it would take a lot of time and effort. If I have a lot of different languages, that amplifies the time and effort. If I have third-party applications, it's even more challenging. What Service Mesh says is: don't worry about doing all that in your application; keep the application simple and focused on the business logic. Alongside your application, we'll put a small proxy that handles all of that for you. All of your traffic in and out of the application goes through this proxy, so you can offload those concerns to it. The proxy can be upgraded outside of the application life cycle and can be dynamically configured, so you get a lot more flexibility and simplification in your application. Specifically with regard to Mutual TLS, the Service Mesh can offload that as well, and your applications don't need to change at all because the infrastructure does it at the Service Mesh layer.

Bart: While Ambient Mesh sounds promising, it is not without criticism. One of our podcast guests, William Morgan from Buoyant, has raised some concerns about this approach. He mentioned potential single points of failure, such as restarting the DaemonSet, which could affect networking for several pods, and the loss of encapsulation, since the pod now depends on a component outside the pod.

John: With Service Mesh, several architectures have been explored. The earliest and most prominent one was the sidecar pattern, which meant that for every single application, like each pod, you would have a separate container called the sidecar container running alongside that application, with a one-to-one relation between the application and the sidecar serving the traffic. A newer approach being explored in Istio, which has been GA for about six months now, is what's called Ambient Mesh. This approach pulls the service mesh out of the pod and splits it into two layers: a node proxy (ztunnel) layer that handles L4 and encryption, and a waypoint proxy layer that handles the rich feature set. The idea behind this was to reduce overhead in terms of resources, complexity, and operations, making it more invisible to users and applications, and not tightly coupled to the application lifecycle. It also aims to increase performance and reliability. Concerns have been raised, and while there are good points to consider, any software has trade-offs. For most people, the trade-offs in Ambient Mesh are beneficial, especially for use cases like enabling mTLS on Kubernetes. For example, the reliability concerns about single points of failure are valid to some extent, as all traffic flows through one binary. However, that binary joins a long list of single points of failure already required to keep a node functioning, such as the kubelet, the kernel, and kube-proxy. In Kubernetes, if any of these components fail, the node is down. But that's why you have multiple nodes - the aggregation of all the nodes is what makes Kubernetes reliable. Kubernetes is designed to have ephemeral nodes, and service meshes can help ensure that traffic is not sent to failed workloads or allow failover to other clusters. While it's true that a node could theoretically go down due to an issue with the node proxy, that component is designed for high availability and reliability. If a node does go down, the cluster is not down, and service meshes can help mitigate the impact.

Bart: Some CNIs (Container Network Interface) offer network encryption options like IPsec or WireGuard. Why aren't these considered equivalent to TLS (Transport Layer Security) for securing Kubernetes traffic?

John: It's an interesting question because oftentimes the protocol itself gets mixed up with the implementation of the protocol. If you're looking at Mutual TLS (mTLS), you're usually looking at a Service Mesh like Istio or Linkerd. In contrast, if you're looking at IPsec or WireGuard, you're looking at something like Cilium or Calico. In general, IPsec and WireGuard tend to be node-to-node encryption, which is different from workload-to-workload encryption. With node-to-node encryption, you don't actually verify the identity of a specific workload; you only have the node's identity. This gives you less of a zero-trust network model, because it's not a fine-grained identity.

There's also a pragmatic consideration: checklist compliance. A common requirement is FIPS compliance, which is needed for many US government use cases. WireGuard is not FIPS compliant and probably never will be. For many people, that alone eliminates it from consideration. IPsec, on the other hand, is, or can be, FIPS compliant, but it's often complex. If you're running it in many of these CNIs (Container Network Interfaces), you have to pick between IPsec and many of the features they offer. It's not like you just turn on IPsec and everything works fine. Some of these features are significant, such as the Kubernetes Gateway API, which is one of the bigger, more exciting projects in the ecosystem. However, if you enable IPsec, some CNIs disable their support for this API.

Another aspect to consider is performance. Often, people make the wrong choice here: they look at Mutual TLS (mTLS) and WireGuard and pick WireGuard for performance reasons, because WireGuard has been hailed as a high-performance VPN implementation and was moved natively into the kernel, which increased its performance. However, if you look at benchmarks comparing mTLS to WireGuard and IPsec, you'll see that mTLS performance is actually largely better than WireGuard in most cases, or equal in others, depending on whether you're looking at throughput or latency. There's a perception that if something's in the kernel, that automatically makes it faster, but that's not the case. The processor is still doing the same amount of work, whether it's in user space or in the kernel. TLS, which is not in the kernel, is able to perform faster because processors have special hardware, like AES instructions, optimized for the ciphers TLS uses. At the protocol layer, it's also able to encrypt more information at a time in bigger chunks, making it more efficient. If you run a benchmark, you'll probably see WireGuard and mTLS have the same latency, but mTLS has three or four times more throughput. That being said, for most people, this doesn't really matter. So, if you're picking your encryption mechanism based on performance, make sure you really care, because for the most part, the difference between two gigabits per second and ten gigabits per second is far from the thing organizations are actually struggling with.
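
A rough way to see the hardware-acceleration point is to benchmark the AEAD ciphers the two protocols typically use: AES-GCM for TLS (accelerated by AES instructions on most modern CPUs) and ChaCha20-Poly1305 for WireGuard. This Go sketch covers only the cipher-level part of the story, not a full TLS-versus-WireGuard benchmark; it assumes the golang.org/x/crypto module is available, and real results depend on the CPU and everything else in the data path:

    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
        "fmt"
        "time"

        "golang.org/x/crypto/chacha20poly1305"
    )

    // throughput seals 16 KiB chunks for one second and reports Gbit/s.
    func throughput(name string, aead cipher.AEAD) {
        plaintext := make([]byte, 16*1024)
        nonce := make([]byte, aead.NonceSize()) // nonce reuse is fine for a speed test, never in real use
        start := time.Now()
        var total int
        for time.Since(start) < time.Second {
            aead.Seal(nil, nonce, plaintext, nil)
            total += len(plaintext)
        }
        secs := time.Since(start).Seconds()
        fmt.Printf("%-20s %6.2f Gbit/s\n", name, float64(total)*8/secs/1e9)
    }

    func main() {
        key := make([]byte, 32)
        rand.Read(key)

        block, _ := aes.NewCipher(key)
        gcm, _ := cipher.NewGCM(block)
        throughput("AES-256-GCM", gcm)

        chacha, _ := chacha20poly1305.New(key)
        throughput("ChaCha20-Poly1305", chacha)
    }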

Bart: Security in Kubernetes often involves multiple layers. How does Mutual TLS (mTLS) interact with other Kubernetes security features, such as Network Policy?

John: It's a good question, and one that's being explored in the Network Policy space. There's a newer effort to evolve network policies, specifically looking at how to integrate mTLS more deeply, and not just mTLS, but really any identity, like pod identity-based network policies. Today, Network Policy and Service Mesh policies interact, but not in an ideal, layered way. Generally, the Service Mesh and the CNI (Container Network Interface) are separate layers. The CNI enforces traffic first based on Network Policy, and then hands it off to the Service Mesh, which enforces its own policies on top of that. The Service Mesh enforces its own because it has more information, including access to the TLS identity and often HTTP attributes or other higher-level protocols on top of TCP. This allows for more specific policies, such as denying POST requests or authenticating a JWT in an Authorization header. The layering is kind of there, but it's not a cleanly layered system where we gradually enforce more and more information. In Ambient Mesh, for example, we have two layers: the node proxy (ztunnel), which handles TCP and mTLS and operates on one layer of identity and authorization; and the waypoint proxy, which can be put in the path and adds another layer of authorization on top of that, more service-based and HTTP-based. However, there's still a bit of an awkward gap between Network Policy and Service Mesh policy. In practice, users either take a defense-in-depth approach and enforce both, or they use one or the other. If they're doing both, they're either enforcing a similar policy at both layers or using Network Policy as a coarse-grained tool and the Service Mesh for fine-grained control. For instance, they might use Network Policy to block internet access for an application or allow communication between namespaces, and then use the Service Mesh to specify more detailed policies, such as which identity can talk to which workload on which port and path.
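
As an illustration of that split, here is what the coarse-grained layer might look like expressed with the Kubernetes Go types (the names, labels, and namespaces are made up for the example): a NetworkPolicy that only lets traffic from the frontend namespace reach pods labeled app=backend. A mesh policy, such as an Istio AuthorizationPolicy, would then add the finer-grained identity, port, and HTTP-level rules on top.

    package main

    import (
        networkingv1 "k8s.io/api/networking/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func backendIngressPolicy() *networkingv1.NetworkPolicy {
        return &networkingv1.NetworkPolicy{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "backend-from-frontend-only",
                Namespace: "backend",
            },
            Spec: networkingv1.NetworkPolicySpec{
                // Which pods this policy protects.
                PodSelector: metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "backend"},
                },
                PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
                Ingress: []networkingv1.NetworkPolicyIngressRule{{
                    // Coarse-grained: any pod in the frontend namespace.
                    From: []networkingv1.NetworkPolicyPeer{{
                        NamespaceSelector: &metav1.LabelSelector{
                            MatchLabels: map[string]string{"kubernetes.io/metadata.name": "frontend"},
                        },
                    }},
                }},
            },
        }
    }

    func main() {
        _ = backendIngressPolicy() // would be applied via client-go or rendered to YAML
    }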

Bart: As Kubernetes and cloud native technologies evolve, so do security practices. What emerging trends or technologies do you see in the realm of Mutual TLS (mTLS) and Kubernetes security, particularly with regards to Service Mesh, TLS (Transport Layer Security), and CNI (Container Network Interface)?

John: The big thing I've found over the years is that it's not so much about the security features themselves, it's about getting people to use them. The biggest issue we had for many years with Istio is that when I list all the things you can do with it, it's a laundry list of features. If anyone adopts even a fraction of those features, they get an incredible amount of value. However, it can take a lot of effort to do that. It takes a decent amount of effort to even get the first feature, and then each incremental feature has a bit of a cost. The problem we saw again and again was people coming to us and saying, "I just want mTLS." This is like the title of the blog post: "I just want TLS on Kubernetes." We have this great tool, the Service Mesh, and we'll give you all this value, and buried in that value is mTLS as well. But you'll pay the cost of all of it, even if you just want mTLS. So, what we did, and this is the origin story for Ambient Mesh, was to ask: what would it look like if we just served the mTLS use case and optimized for that? We made it dead simple to get mTLS, and once you have that, you can use it as a stepping stone to get the rest of the Service Mesh. That's how we came up with Ambient Mesh, where we're hyper-focused on just the simplest use case of mTLS. One of the key goals, for example, was low footprint and compatibility: we don't modify any traffic, we don't break any traffic, and we should be able to work anywhere, with any application, rolled out cluster-wide without fear of application behavior changing. This was a common issue with Service Mesh because we were injecting these great features like HTTP load balancing, which is usually good, but sometimes changing things, even for the better, breaks applications because they expect certain behavior. By default, we don't do that; we don't modify traffic. That's kind of the way to get security - you just need to get it everywhere. That's really step one, and most people aren't at that phase yet. Once you get there, then you can start playing with the fancy features, like what kind of fancy authorization policies are we doing, or maybe I have some crazy AI analysis tool that's telling me this is an anomaly. But you can't do the fancy things when you don't have the baseline, and most users don't have that yet. So, that's kind of where I think the main focus is.

Bart: And we've covered a lot of ground regarding Mutual TLS (mTLS) and Kubernetes. Given all these options and considerations, what would you recommend for teams looking to implement mTLS in their environments?

John: In general, the DIY approach is problematic if you're not aware of all the things it involves and don't have a clear vision of what it looks like to go from start to end and how to get there. For most users, I would not recommend it. I've actually seen few cases where it's been successful in a large organization. Service Mesh is the way to go. Ambient Mesh, in particular, is suitable if you just want mTLS, since it was designed for that case. If you look at the ideal way to get mTLS everywhere in Kubernetes, ignoring everything else, Ambient Mesh is exactly what you'd come up with. So, to me, that's the obvious choice. If you're already on an existing service mesh, such as Istio with sidecars or another service mesh, that's also fine; you can start using it today to get mTLS. In general, I would recommend Ambient Mesh first, then a traditional sidecar mesh, and DIY as the last resort. There are also the CNI encryption layers, but that's not mTLS, which was the question.

Bart: When you're not working with mTLS, Service Mesh, Kubernetes, what do you like to do in your free time?

John: I'm into cooking. I like to switch up what I'm making. Sometimes pizza, sometimes ice cream; it always varies.

Bart: So, what's next for you?

John: That's a great question. For now, it's more Service Mesh. We got Ambient Mesh to GA in Istio 1.24 about three months ago, and that's only the beginning. Now it's about seeing it go into production with real users, getting their feedback, and continuing to iterate to get it everywhere it should be. I'm excited to continue on that. It's been fun since I joined Solo.io, not even a year ago, and we've moved quickly. I've been enjoying that.

Bart: And how can people get in touch with you, perhaps through Solo.io or other communities like Kubernetes Slack, Istio Slack, or CNCF Slack?

John: I'm on all the Slacks. My name is John Howard, and you can find me on the Kubernetes Slack, Istio Slack, and CNCF Slack, as well as GitHub, LinkedIn, Bluesky, and my blog. I write about Go, Kubernetes, Istio, Service Mesh, and related topics.

Bart: Fantastic. John, thank you for joining us in KubeFM. We look forward to speaking to you soon. Take care.

John: See you. Have a good one. Cheers.