Observability will speed up your Kubernetes troubleshooting

Host:

Bart Farrell

Guest:

Jennifer Luther Thomas

This episode is brought to you by Otterize — automate workload IAM policies: zero-friction development, zero-trust security.

With a passion for security and a knack for troubleshooting, Jen discusses the critical role of network policies in Kubernetes security, the complexities involved in their implementation, and the balance between security and manageability.

She also covers the importance of Custom Resource Definitions and shares her perspective on emerging Kubernetes tools.

In this KubeFM episode, you will learn:

The importance of observability in troubleshooting network policies and how it aids in debugging complex issues.
The trade-offs between the complexity of network policies and the security benefits they provide.
The skills, thought process and humility behind troubleshooting technologies you are unfamiliar with.

Transcription

Bart: Age of Empires, looking for scams on Reddit, powerlifting: what do those things have in common? They're all related to the guest we have in today's episode of KubeFM. Her name is Jen, she's an expert on those three things, and she's really into network policy. In this episode, Jen shares her experience with debugging network policies and the importance of observability in troubleshooting. She also discusses the trade-offs of network policies and whether they are worth it, the role of custom resource definitions, something we love to speak about here at KubeFM, and her passion for security. Network policies are essential for securing Kubernetes clusters and controlling network access. Jen spoke about the trade-off between the complexity of network policies and the level of security they provide. You'll find more about that in today's episode. Speaking of network policies, today's episode is sponsored by Otterize. Are you tired of wrestling with AWS IAM, Google Cloud IAM, Azure IAM, network policies, and database access? Say goodbye to that headache and say hello to Otterize. Otterize offers a declarative and zero-trust approach to access management, allowing you to declare workload IAM within your Kubernetes cluster while ensuring maximum security. Visit Otterize.com to get started with their free forever plan today. Thank you. All right, Jen. So what three emerging Kubernetes tools are you keeping an eye on?

Jen: I will admit the last few months have been really busy for me in my personal life. I've had other priorities. So I've been spending less time on Reddit, keeping an eye on what's going on. So I don't have anything specific. But one thing that I am curious to see how it plays out is the talk about AI now. I don't know if you've seen AI and Kubernetes start to fuse together at KubeCon. I can tell by the look on your face, yes. So I definitely want to keep an eye on that. Definitely a good idea to keep an eye on that.

Bart: There's been a lot of conversations at KubeCon about that topic. Many of the keynotes were focused on that. And I think we can expect to see more. So, I think that's a very good thing to keep your eye on. Like you said, you choose Reddit as your way of staying up to date. Other folks might look in other places, but it's certainly a good spot to catch news around that. So now, tell us more about yourself. What do you do and who do you work for?

Jen: Currently, I'm a technical marketing engineer at Tigera, who are the creators and maintainers of Project Calico. Being technical marketing was new to me when I first took the job, as I'd never heard of a technical marketing engineer position before. It was kind of new to me. It's still very hands-on with Kubernetes, learning, getting in with the product, configuring things, but more about educating people about what Calico as a CNI can do.

Bart: Before working, before getting into that role, what were your previous roles? Where were you working? What kind of stuff were you working on?

Jen: I took a weird path to Kubernetes. My previous role was as a technical support engineer specialist. I was working for a software company that made a geospatial ETL tool. So I came from a geospatial background, like GIS, environmental science at university. I ended up getting into software with data conversions. I don't know how many people listening would know what ETL is, but you're basically taking different types of data, different formats, and then you can convert it to other formats of data. You can enrich it as that data gets converted. So you can bring multiple sources of data together, join them, and then get an output that you need. That software, which is called FME, had an enterprise component to it. So instead of just building those workflows, you would then start to automate them. So that's where I started. I moved into supporting the enterprise product, which eventually led me to support the cloud and container version of that product. I started supporting cloud customers who had a SaaS product, as well as anyone installing on Docker and Kubernetes. And admittedly, I knew nothing about Docker and Kubernetes when I had to start supporting these customers. It was a steep learning curve, but I really enjoyed the challenge of Kubernetes. I also really enjoyed making presentations, training people. So that kind of led me into technical marketing.

Bart: Fantastic. And you mentioned previously about Reddit being one source, a resource to find out about different tools and things happening in the ecosystem. Are there other resources that you tend to use to stay up to date? Because as you well know, it's a very quickly moving ecosystem. There's all kinds of stuff coming out all the time. What works best for you?

Jen: I think Reddit is probably the main one just because I spend so much time on Reddit anyway, personally. It's easy to just keep diving in and seeing the Kubernetes stuff go by on your home feed. I used to spend more time on Twitter, back in the day, but I think I was more in the geospatial community. I've tried to find a lot of names more in the Kubernetes space, but I feel like Twitter just maybe isn't as big as it once was, or I'm just following the wrong people. But I know a lot of people at work seem to keep in touch with Hacker News, I believe it is. I personally haven't got into that one yet.

Bart: Fair enough. But yep, like you said, you found something that works for you and you're sticking to it. Now, in terms of the topics that we want to look at today, your journey with Tigera and Calico started before joining the company. Could you share with our audience what happened before you jumped ship?

Jen: It was kind of a good coincidence in a way. Tigera had approached me, and at the time, I thought I knew Kubernetes, but it turns out I didn't really know Kubernetes at all. Which is probably the case for a lot of people. So I started looking more into what Tigera did, what a CNI was, and what network policies are. At my previous job, I was supporting people who were installing the product, like the Kubernetes deployment of the product. And one customer had been using OSM, Open Service Mesh, and they had implemented policies. At the time, I'd never had anyone come to me with a security question in Kubernetes. So this was my first time being exposed to it. The application that I was supporting for their Kubernetes deployment had multiple pods. So, each service or part of the application was a different pod. I knew they would all communicate within a cluster, but I didn't know that you could add network policies and kind of block or control that communication between pods within a cluster. So, at the same time I had a customer coming to me with issues about service mesh, I was also beginning my journey of looking into Tigera and Calico and network policies to interview for them. So a lot of rapid learning all at once. And I still didn't feel super confident with it, but I was at least beginning to get exposed to, okay, you can actually do this, because none of the customers I'd worked with before had ever come to me with those security concerns or network policies.

Bart: And particularly, there was an incident related to OSM and FME. Some of our audience members might not be familiar with those. Can you walk us through and explain exactly what that incident was?

Jen: As I mentioned, I was helping people with their Kubernetes installations of FME. The enterprise version of FME had multiple components. There was a Redis cache, and there was a database, which was Postgres. There was a core pod, which is kind of like the brain of FME, which would delegate tasks. The pods that would actually do the data transformation and the ETL were called engines. So I was supporting people who would use a Helm chart to deploy all of that. And then all of these pods would be inside their cluster. One customer was using Open Service Mesh, and I believe they were using policies with it. As they were doing that, they were locking down the communication between all of those different parts of FME. I think they had opened a majority of the ports that were listed in our documentation but had missed one port pool. So it was easy to target the obvious ports. You talk to Postgres on 5432. Open that. But when the pods would initiate communication on one port and then move off to a port pool to carry on that communication, that was one of the things they missed. So to them, in some circumstances, the application wasn't working properly. And it was really hard to narrow down exactly what it was when they thought they had built the policies to allow all of the communication that was needed.
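The kind of gap Jen describes, where the well-known ports are open but a port pool is not, can be sketched as a simple set comparison. This is only an illustration: port 5432 for Postgres comes from the conversation, while the Redis port is the common default, and the pool range, the variable names, and the `missing_ports` helper are all hypothetical, not FME's documented values.

```python
# Hypothetical port requirements for a multi-pod application.
# Only 5432 (Postgres) is taken from the conversation; 6379 is the usual
# Redis default, and the port pool range below is invented for illustration.
REQUIRED_SINGLE_PORTS = {5432, 6379}
REQUIRED_PORT_POOL = range(40000, 40011)  # made-up pool of 11 ports


def missing_ports(allowed, required_single, required_pool):
    """Return the required ports that the policy does not allow, sorted."""
    required = set(required_single) | set(required_pool)
    return sorted(required - set(allowed))


# A policy that opens the "obvious" ports but forgets the pool.
allowed_by_policy = {5432, 6379}

gaps = missing_ports(allowed_by_policy, REQUIRED_SINGLE_PORTS, REQUIRED_PORT_POOL)
print(f"{len(gaps)} required ports are not allowed: {gaps}")
```

Comparing the documented port list against what the policies actually allow is a cheap first check before trying to reproduce the whole environment.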

Bart: So there's an issue. What's your thought process at this exact moment? How did you address this?

Jen: In the beginning, it was probably like, "Oh my gosh, I've never even heard of Open Service Mesh. I've never heard of network policies. What is going on?" So I think it was more like looking for similar behavior that we had seen in the past from more traditional deployments, where people might be installing different parts of FME on different virtual machines and then using firewalls, which is essentially the same thing as what was going on in their cluster. As it was around the time I was interviewing for Tigera, I think I was starting to be made aware of network policy. So I also tried to install and get Open Service Mesh running to try and reproduce the issue, but I didn't have any success with that. Because that's the other thing I really like to do: set up an environment that matches what the customer is experiencing, then get in there and play with it yourself. But I was unfortunately unable to set that up. It was kind of hard on my end. I had access to the product and the product knowledge, and I knew what ports needed to be open, but I didn't have the experience with Open Service Mesh. They had the experience with Open Service Mesh, but not with FME. So I think there was some disconnect there, which made it hard for both of us to kind of figure out what the issue was.

Bart: And what was the fix and how long did it take to figure that out?

Jen: Embarrassingly, they came up with a fix, not me. I was kind of stuck; I couldn't reproduce it. I was like, "Here are the ports you need." They said they'd opened them. But then, a few weeks later, we all got on a call, and it turns out they had just missed one rule in one policy to allow access on a port that I believe the database needed to communicate over. So once they added that into their network policy, they were good, and the application was working as expected. But I think hunting down that port pool that was blocked took them a while.

Bart: With that in mind, it's come up numerous times in different episodes about network policies. Some people may not be completely familiar with what they are. Can you just establish for our audience what they mean to you, how you define them?

Jen: As in network policies, if you're already in Kubernetes, you're familiar with the YAML manifests. Network policy is a way of describing how you want to control network access to different pods in that YAML manifest. So you would describe how you want one pod to be able to access another pod, or pods in a namespace to access pods in another namespace, by writing that YAML manifest. And it's kind of similar to the way you would set up firewalls in virtual machines, except with Kubernetes, pods don't always stick around for very long. So it's hard to tie things to a static IP address or anything like that. So these network policies work based on labels. It makes it a lot easier, in my opinion, to tie one of these policies to a group of pods with a label, and you know that traffic is secured to or from those groups of pods. And then you can do that as well for ingress rules, egress rules. So you can really control that network access to anything in your cluster.
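The label-based matching Jen describes can be modelled in a few lines. The sketch below is a deliberately simplified model of ingress rules, not the Kubernetes implementation: the `Pod` and `IngressPolicy` classes, the label values, and the pod names are all hypothetical. It assumes the usual behaviour that a pod not selected by any policy accepts all traffic, while a selected pod only accepts traffic that some rule allows.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict


@dataclass
class IngressPolicy:
    pod_selector: dict   # labels the protected pods must carry
    from_selector: dict  # labels the client pods must carry
    ports: set           # destination ports the rule allows


def matches(selector, labels):
    """A selector matches when every key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(src, dst, port, policies):
    """Simplified check: is traffic from src to dst on port allowed?"""
    selecting = [p for p in policies if matches(p.pod_selector, dst.labels)]
    if not selecting:
        return True  # no policy selects the destination, so it is not isolated
    return any(
        matches(p.from_selector, src.labels) and port in p.ports
        for p in selecting
    )


# Hypothetical pods; the rule is tied to labels, not to names or IP addresses.
database = Pod("postgres-0", {"app": "postgres"})
core = Pod("core-7d9f", {"app": "core"})
engine = Pod("engine-5c2a", {"app": "engine"})
policies = [IngressPolicy({"app": "postgres"}, {"app": "core"}, {5432})]

print(ingress_allowed(core, database, 5432, policies))    # core may reach the database
print(ingress_allowed(engine, database, 5432, policies))  # engine is not in the rule
```

Because the rule is expressed in labels, a replacement core pod with a new name and IP address is covered as soon as it carries the same `app: core` label.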

Bart: All right. Now you mentioned in the article several weeks of debugging. It sounds like the problem was pretty difficult to test and then to reproduce. Was that the case?

Jen: I believe so. On my end, it was more difficult setting up Open Service Mesh. As someone who's new to it, I didn't fully know what I was doing. And I feel like it was another one of those steep learning curves with Kubernetes. Now I feel like I get it more; it's easier. And I feel this is similar for the customer. I don't know exactly what products they had apart from OSM, but I imagine there was no easy visibility into what was going on in their cluster. They couldn't see any traffic flows that were trying to be made and then getting denied. So they lacked that visibility and insight into their cluster. They struggled with understanding cluster network traffic to troubleshoot.

Bart: And so, what's the way to fix it? Is there a way to inspect and observe the issue without spending weeks of troubleshooting?

Jen: When I started working at Tigera, I realized they had a product with something called a service graph. As soon as you connect a cluster into Calico, it basically comes up with this topological view of everything inside your cluster. So if you've got pods in there, you've got services, those are all visible with lines going between each object so you can see that there is traffic flowing. When I first started, I remembered this problem I had back when I was working at Safe. I wanted to test it and see how easy it would be to solve the issue. With just a visual representation of all the traffic going on in your cluster, it was really easy to spot a policy that you'd accidentally set up to deny traffic, because you just get a big red line. So you can see that traffic is not communicating with the database. And then you solve it basically in minutes. I think anyone using any kind of observability would solve that so much faster. But even now, depending on the CNI, Project Calico enables you to set one of the actions in your network policies to log. So that could have been another sort of workaround: we've implemented all these policies, it's not working, so let's add more policies just with that log action and see what is coming out of those logs. It would take you a bit longer to read through a whole bunch of logs, but hopefully it would help you troubleshoot that faster if you didn't have observability.

Bart: Calico Cloud is a paid product. Have you seen or used open source tools that have some overlap with it and can help engineers perform the same tasks?

Jen: I haven't come across any or used any. But I believe there are tools like Fluent Bit, which works with logs. You can correct me if I'm wrong. I haven't tried it myself personally. Some people mention Prometheus when discussing observability, but I feel like that's more of a metrics type thing rather than monitoring network traffic in the same way.

Bart: I personally haven't. One of our previous guests, Ori, expressed unique opinions on network policies, mentioning that the usage of network policies is inverted. If you need to contact a pod in a different namespace, you need to contact the other team responsible for it and have their network policy amended instead of modifying your own network policies. Do you agree with Ori?

Jen: I don't know. I think it's hard, and I remember reading that when it first came out a little while back. There would still be a learning curve to what they're proposing as their solution, because the problem that took me so long to troubleshoot was that it was a completely new domain for me. Had I had prior experience with network policies, I don't think they're hard. It just takes a little while to get the hang of. So, that might be the same if you're achieving the same thing in a different way. You're still going to have to learn what you're doing. And actually, I was curious about this because I'm not on the ground in an organization implementing this for anyone. So I posted on Reddit asking if people are using network policies and if not, why not, or if you are, why. A lot of the opinions from people implementing network policies suggest that once you get it going, it's not that hard. If your organization is set up correctly, someone with an overall or holistic view of the different policies can manage the interactions instead of having developers do it themselves for their applications. It might depend on who is trying to do this, the size of the organization, the skill level in the organization, and how much support they get. That seemed to be the theme from the people replying on Reddit about what the right approach is.

Bart: It's interesting that you asked that because our previous guest, Ori, argues that network policies are not the right abstraction. Hearing your story about spending weeks debugging a port mismatch might make some people think that we should just give up on them entirely. But as you saw from Reddit, seeing different responses, are they worth it at all? Would you say that there's always a trade-off, or are we going to rule that out? What do you think about that?

Jen: It depends on what you're actually putting in Kubernetes, to be honest. If you're deploying production-level applications with sensitive data that you need to make sure doesn't go down, you don't want someone to compromise that cluster and then be able to travel laterally through it, for example by compromising your front end, which is in the same namespace or cluster as your backend, and then getting access to that. In that case, you're probably going to want to spend the time upfront to learn network policies and implement them correctly. Otherwise, the time and money that you're going to spend correcting an issue, if something goes wrong, are probably way worse than just the initial effort. So, I think if you need it, learning and setting up a network policy is definitely worth it. But at least for me, if I'm setting up something quickly to test, I'm not going to spend a long time writing network policies. You're going to do that based on how secure you think you need something to be, or even what the organization's security strategy is. If some people are going very zero trust and want everything locked down, I think network policies are going to be worth it.

Bart: Now, Calico, Cilium, and maybe a few other network plugins have custom resource definitions, a superset of the Kubernetes network policies. In the case of Calico, we're talking about network policy and global network policy objects. Do you think network policy and their APIs are too limited and should be redesigned?

Jen: I don't know if they're too limited. But I guess some of the features that come with using Calico as a CNI, like being able to do global policies, make network policies easier because now instead of defining lots of network policies to target each thing, you can apply more blanket policies, which might make things easier. And if it was in Kubernetes, it might be more approachable. People might find that more appealing because I feel like once people pick a CNI, they tend to stick with it. I don't think people are having five different CNIs and then getting confused about what to do. So I don't really see that as an issue.

Bart: Totally fair. No right or wrong answer. From these two weeks of debugging, what do you take away from it? What are your key takeaways? What could you have done differently if you had to go back in time?

Jen: I guess if I went back in time with the knowledge I have now, that would be handy. I think just knowing other ways to troubleshoot it. Like back then, I didn't have access to Calico or any other project that had observability. I think it would just be knowing, okay, give me all your policies if they could, so I could read through them, double-check. And even one of the tools that we use when we present workshops to test and prove communication is just going inside the pods and using something like netcat. Can I go and communicate with this other pod? And then it might take a little while, but going inside each of the pods that I suspect based on my product knowledge is maybe the issue and just going and testing all of that. One by one. But if I didn't have the knowledge I have now, it would be another case of going and spending a few days intensively learning network policies and then re-approaching the issue instead of trying to reproduce it without fully understanding what I'm doing. But I guess when you're doing customer support, you don't have the luxury of time. Certainly not.
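The pod-by-pod connectivity test Jen mentions, which she does with netcat, can be approximated with a short script when netcat is not available in the container image. This is a minimal sketch using only Python's standard library; the `can_connect` helper is hypothetical, and the demo talks to a throwaway local listener rather than real cluster services so that it runs anywhere.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Self-contained demo: open a local listener on a free port, test it, close it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
open_port = listener.getsockname()[1]

print("listener up:  ", can_connect("127.0.0.1", open_port))
listener.close()
print("listener down:", can_connect("127.0.0.1", open_port))
```

Inside a pod, the same function would be called with the service name and port of each suspected dependency, one by one. A connection blocked by a policy often shows up as a timeout rather than an immediate refusal, so a short timeout keeps the checks quick.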

Bart: So you think on your feet and make moves. That all makes sense. Now, where did your passion for security start? Where and when? How did that happen? What brought you into it?

Jen: I think it was just partly interesting because it was something new. Like I mentioned, when I was at Safe, none of the customers ever came to me, apart from this one case, with security issues. So I guess it was ignorance on my part; I didn't really know or understand what security looked like in Kubernetes. So when I started to look into Tigera, I was like, "Whoa, there's actually a lot here and it makes sense." So I think part of it was the challenge and just being interested to go and learn something new. I had read things like The Phoenix Project and a book on security. I can't remember what it was now, but it was about cybersecurity and how sophisticated that all is and how many things go on that you probably have no idea about. So I think from that standpoint, it's interesting, and I feel like it's also a bit of a hot topic. You always see people getting hacked, and there was a big one in Vegas recently. I actually really enjoyed going into the cybersecurity subreddit and reading what everyone was saying was going on, like how they got breached, how they got taken down so badly. So I find that quite interesting.

Bart: And following up with the Reddit point, from what we saw, it seems that you're an avid reader of Reddit scams. How does someone work in security and then spend time browsing that kind of content? Is there something we should know, Jen?

Jen: No. I wouldn't be working for Tigera if I were a big scam person.

Bart:

Jen: Yeah. I don't know. I just find it interesting. And I guess I almost feel bad. I guess we come with a privilege. We have pretty advanced technology. When we see these sketchy emails coming from people like, "Here's a delivery from some mail provider you're not receiving a package from. Go and give all your information," I feel like we can spot phishing links more easily than the average person, like my nan, for example. So, I like to sort of see what the latest scams are, and I guess it would be cool to educate more people who maybe aren't as tech-savvy as we might be on how to avoid those things, so it's good.

Bart: Alright, when you're not doing that, it looks like you're into Age of Empires 4 and powerlifting. Tell me more about that.

Jen: So, Age of Empires is actually something I used to play with my dad as a kid. And now I've moved to Canada. My dad and brother are still in England. It's actually a really good way for us to connect and just play games online and keep in touch. But I will admit I'm not the best. I tried playing online against strangers once. And that was the only time. I got defeated so fast. As for powerlifting, I'm actually the president of the British Columbia Powerlifting Association. So, aside from competing and training, I spend a lot of time running the organization, putting on events, doing all of that. So it takes up a lot of my free time.

Bart: That's a good way to have no free time, between scams, Age of Empires, the time change from British Columbia to the UK, and running powerlifting events. I'm impressed. So, what's next for you?

Jen: The powerlifting scene has calmed down a little bit, so I'd like to go back to writing a few more blogs on Medium. Kubernetes definitely has a steep learning curve, and I found that when I was at Safe Software supporting customers, a lot of them were having to use this technology without fully understanding it. My goal with those blogs was to share my learnings in a very accessible way for people. So I would like to get back to that, write a few more blogs, and still stay busy with the powerlifting side of things.

Bart: Well, that sounds good. And if people want to get in touch with you, whether it's about scams, network policies, powerlifting, etc., what's the best way to do it?

Jen: I check LinkedIn more. I'll occasionally still go on Twitter if people want to connect with me there, but probably less so. Like I mentioned, maybe the right people will start following me now, and that'll lead to better discussions.

Bart: But yes, Jen, thank you very much for your time today and your hard work on getting this information out there. Like you said, there is a steep learning curve, and by having these kinds of resources, it makes it that much easier for folks out there to not have to suffer so much when learning some of these challenging topics.

Jen: Definitely, that's the goal.

Bart: All right. Thank you very much. Cheers.
