Balancing productivity and security in platform engineering

Dec 17, 2024

Guest:

David Sudia

In this interview, David Sudia, Senior Product Engineer at Teleport, discusses:

Systematic debugging approaches including keeping detailed timestamped logs and mapping system layers to troubleshoot complex Kubernetes issues effectively
Zero-trust security implementation combining Network Policies with SPIFFE workload identity to secure service-to-service communication
Platform engineering challenges and why treating internal platforms as products with proper product management is crucial for success

Transcription

Bart: What's your name, where do you work, and what's your role?

David: Hi, I'm Dave Sudia. I'm a Senior Product Engineer at Teleport, which means I have some product management duties. I'm currently focusing on making our day one experience for new users as good as it can be, and I also contribute code to help get these things built.

Bart: What are three Kubernetes emerging tools that you're keeping an eye on?

David: For emerging projects, I will focus on the CNCF sandbox to keep it simple. I really like Cartography, as anything that helps visualize things is great. We are working on a policy visualization project as well. In these environments, everything gets so complex that it is hard to hold a mental model, so being able to picture it is awesome. Eraser is a great project that automatically cleans up old images with security vulnerabilities. Maintenance automation is always a win, especially when it is security-focused. While looking through the sandbox, I was surprised to see that K3s is still there, but it is my absolute favorite tool for local testing and small-scale Kubernetes development. As a bonus, SPIFFE has graduated, and I will discuss it further later.

Bart: Troubleshooting tips and the learning path. One of our podcast guests, Alex, spent several weeks troubleshooting an issue with Kubernetes, which required the team to explore the kernel code. He stressed the importance of learning while troubleshooting. Is there any practical advice you have learned over the years regarding debugging, especially when working with Kubernetes, and tools like Teleport?

David: My first tip is to keep a log, which can be notes in a text editor or a physical notebook. I keep a physical notebook next to my computer because when you're in the heat of the moment, with 10 windows open, it's easy to lose track of notes. I use a pen and paper to timestamp my actions, as it's essential to keep track of the changes made while trying to figure out the problem. You're likely to make 15 to 500 changes, and if you don't know when you made those changes, it's challenging to determine which one actually fixed the issue. Being able to go back and look at the metrics allows you to identify the exact change that resolved the problem. For instance, if things finally calmed down at 2:53, and you made a permissions change at that time, you can conclude that the permissions change was the solution.
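To illustrate the timestamped log described above, here is a minimal Python sketch. It assumes a hypothetical list of logged changes and a recovery time read off a metrics dashboard; the `Change` type, the `latest_change_before` helper, the date, and the notes are all made up for illustration and are not from the interview.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    at: datetime  # when the change was made
    note: str     # what was changed


def latest_change_before(changes, recovered_at, window=timedelta(minutes=10)):
    """Return the most recent change made at or before recovered_at and
    within the window, or None if no change qualifies."""
    candidates = [
        c for c in changes if recovered_at - window <= c.at <= recovered_at
    ]
    return max(candidates, key=lambda c: c.at, default=None)


# Hypothetical notebook entries (date and notes are illustrative only).
log = [
    Change(datetime(2024, 11, 14, 14, 31), "restarted ingress controller"),
    Change(datetime(2024, 11, 14, 14, 47), "raised memory limit"),
    Change(datetime(2024, 11, 14, 14, 53), "changed permissions"),
]

# The metrics show things calming down at 2:53 PM.
suspect = latest_change_before(log, datetime(2024, 11, 14, 14, 53))
print(suspect.note)  # changed permissions
```

The window is an assumption: a fix rarely takes effect instantly, so looking back a few minutes from the recovery time is usually more realistic than requiring an exact match.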

The second tip is to have a map of the layers in your system ahead of time. With many layers, including NAT gateways, API gateways, service meshes, and sidecars, it's crucial to understand how to step through them one by one, turn them on and off, and isolate variables before something goes wrong. This approach is critical in identifying the root cause of the problem. I've seen cases where developers thought the issue was with Linkerd, but it was actually something else. Being able to remove Linkerd from the mix and verify that the problem persists allows you to rule out that variable and move on to the next one, all while maintaining visibility and observability.
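The layer-by-layer isolation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the layer names are hypothetical, and `fake_probe` stands in for whatever real check you would run with one layer bypassed (for example, replaying a failing request).

```python
from typing import Callable, Iterable, List


def find_suspect_layers(
    layers: Iterable[str],
    problem_persists_without: Callable[[str], bool],
) -> List[str]:
    """Bypass each layer in turn. A layer is a suspect only if the
    problem disappears when that layer is taken out of the path."""
    suspects = []
    for layer in layers:
        if not problem_persists_without(layer):
            suspects.append(layer)
    return suspects


# Hypothetical map of layers, written down before anything goes wrong.
LAYERS = ["nat-gateway", "api-gateway", "service-mesh", "sidecar"]


def fake_probe(bypassed_layer: str) -> bool:
    # Stand-in for a real check. Here we pretend the fault lives in the
    # API gateway, so the problem persists unless that layer is bypassed.
    return bypassed_layer != "api-gateway"


print(find_suspect_layers(LAYERS, fake_probe))  # ['api-gateway']
```

In this toy run the service mesh is ruled out because the problem persists without it, which mirrors the Linkerd case mentioned above.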

Bart: Network policies and zero trust. Our podcast guest Jen argued that if you are supporting production infrastructure and don't want someone to compromise a cluster and then be able to travel laterally through it, you should consider investing in Network Policies. What is your advice for securing the network in a Kubernetes cluster?

David: I love Network Policies. Everything is defense in depth, and you need to secure things at the network level. The thing that I'm really excited about right now in the security space within a cluster and beyond is workload identity and the SPIFFE standard. This allows you to give every single workload a cryptographic identity. It gets a certificate saying, "I am the payment service in this region, in this cluster." You can then use that, and when services talk to each other, they both have a cryptographic identity. They can agree, "I'm the client, I should receive a request from you, I am allowed to respond to you." This not only gives you mutual TLS, but also a guaranteed contract between the services. Even if someone is able to get past the network policy and over to the service, the service will still reject the request because it's not from a specific entity that it accepts requests from. At Teleport, we're building a workload identity solution right now that scales really well. Some of our largest customers are putting it into action, and I'm really excited to see where that space goes.
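As a rough illustration of the identity check described above, the following Python sketch authorizes a caller by its SPIFFE ID, which has the form `spiffe://<trust-domain>/<path>`. It assumes the mutual TLS handshake has already happened and the peer's ID has been extracted from its certificate. The trust domain `example.org`, the workload paths, and both helper functions are hypothetical; this is not Teleport's implementation or any SPIFFE library's API.

```python
from urllib.parse import urlparse


def parse_spiffe_id(spiffe_id: str):
    """Split a workload SPIFFE ID into (trust_domain, path).
    Simplified validation; the SPIFFE spec has further rules."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid workload SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc, parsed.path


# Hypothetical allow-list for a payment service: only the checkout
# workload in the example.org trust domain may call it.
ALLOWED_CLIENTS = {("example.org", "/checkout")}


def is_authorized(peer_spiffe_id: str) -> bool:
    """Reject any caller whose identity is not on the allow-list."""
    try:
        return parse_spiffe_id(peer_spiffe_id) in ALLOWED_CLIENTS
    except ValueError:
        return False


print(is_authorized("spiffe://example.org/checkout"))   # True
print(is_authorized("spiffe://example.org/reporting"))  # False
```

The second call shows the point made above: a request that reaches the service is still rejected if it does not come from an identity the service accepts.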

Bart: Bonus question on this one, Dave. When we ask people about their least favorite Kubernetes feature, Network Policies management was at the top. Why do you think that is?

David: I was talking to Zach Butcher, the founding engineer at Tetrate, about workload identity and how to structure SPIFFE IDs. You can keep an ID pretty simple, like "this is the payment service", or you can encode all kinds of metadata in it, such as "this is the payment service in the EU, in this cluster", with as much detail as you want. His recommendation was to avoid putting too much detail in there. I asked him why, and he said it's because it will make authorization more complicated down the line: someone launches a new thing that's supposed to talk to this service, but you've forgotten that you encoded that specific piece of metadata, and the new workload's ID doesn't include it.

This concept also applies to Network Policies. You can get really fine-grained with security, but the more fine-grained you get, the more complexity you're adding to your system and the more things you need to remember. Network Policies are similar to SPIFFE IDs in that they specify what is allowed to communicate with what. For example, you can say that a namespace or a subdivision of IPs inside a cluster may only accept requests from a specific IP range. However, if you launch a new set of nodes and forget to add that pool of IPs to the policy, traffic from those nodes will not be permitted.
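The forgotten node pool scenario can be modelled in a few lines of Python. This is only a sketch of the allow-list logic behind an IP-range rule (in a NetworkPolicy, an `ipBlock` with a `cidr`), not how a cluster actually enforces policies; the CIDR ranges and addresses are hypothetical.

```python
import ipaddress

# Hypothetical range the policy allows ingress from (the original node pool).
ALLOWED_SOURCES = [ipaddress.ip_network("10.0.1.0/24")]


def is_allowed(source_ip: str) -> bool:
    """Return True if source_ip falls inside any allowed range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_SOURCES)


print(is_allowed("10.0.1.17"))  # True: node in the original pool
print(is_allowed("10.0.2.5"))   # False: node in a new pool nobody added
```

The stricter the range, the more likely a new pool falls outside it, which is the trade-off between fine-grained rules and things you need to remember.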

I like Cartography because it provides another visualization, allowing you to see what should connect to what. There's always friction when it comes to security tools or features, and the challenge is balancing the tension between productivity and security. Sometimes it's hard to get the balance right.

Bart: Platform engineering and people. Our podcast guest, Ori, shared that rushing into solutions without understanding the root cause can lead to fixing symptoms instead of the actual problem. He mentioned the case of Network Policies and how sometimes the root cause of a problem is a people problem, and the solution lies in addressing that. What is your experience with providing tooling and platforms on Kubernetes to other engineers? What are some of the soft challenges that you faced?

David: When I was a platform engineer, my main realization was that we were building a product as we tried to put together all these tools, write our own tools, and bring in vendors. However, most organizations, even if platform engineers realize this, do not consider it a product, especially people above them who make resource and budgeting decisions. Platform engineers and backend engineers tend not to be product managers, making it difficult to build a product that meets the actual needs of users. Many platforms end up being driven by features instead of solutions. As a platform engineer, the resource I wanted most was not more money or more platform engineers, but a product manager to talk to all the developers and help me figure out what we should be offering them and what tools they needed. It is hard to get allocation for this because organizations do not think of it in this way, which is how shadow IT emerges - when you are not giving people the tools they need in a way that they like using them. What drew me to Teleport was the tension between productivity and security. We focus on ensuring that developers are happy using our tools, and as an engineering-driven organization, we must give people a tool they want to use, even if it is a tool they have to use. Otherwise, they will find a way around it.

Bart: Now, Kubernetes turned 10 years old this year. What can we expect in the next 10 years?

David: It makes me feel really old. In 10 years, I'll feel even older when you tell me it's 20 years old. More seriously, I like this moment because there was a keynote about the balance between innovation and stability this morning. This was the first KubeCon where I recognized every name of every standard in a talk. I saw OpenTracing, OpenMetrics, and OpenTelemetry, and I finally saw OpenTelemetry established for a long time without turning into something else. This applies to many other standards as well. I recall giving a talk at a conference in 2020 called "If you can wait six months, you should," because the ecosystem was moving so fast that things would shift out from under you while you were building them. It was like saying, "If you can wait, things will get even better, with new innovations emerging." Innovation is good, but I like the fact that things are starting to stabilize. In the next 10 years, I look forward to people being insanely productive because the standards are landing, people are building great tools on these standards, and end-users will be able to use those great tools without having to redeploy every two months. I believe there will still be innovation, but we will see an insane wave of productivity in this space.

Bart: What's next for you?

David: KubeCon has been a blast as always. Next for me will be sitting for a long time and getting back to the day-to-day of making our product better, or simply getting back into it.

Bart: How can people get in touch with you and Teleport?

David: You can reach out to me on LinkedIn, which I check at least once every 30 days, or send me a message at david.sudia@Teleport. I always check my email, but I'm mostly off social media for my own health, so email is the best way to reach me.

Podcast episodes mentioned in this interview

Network Policies are the wrong abstraction
with Ori Shoshan
Troubleshooting a validation webhook all the way down to the kernel
with Alex Movergan
Observability will speed up your Kubernetes troubleshooting
with Jennifer Luther Thomas