Metrics ingestion, security tools, and effective debugging

Guest:

Felipe Martinez Amaral

Felipe Martinez Amaral, Cloud Architect at DoiT and Google Cloud specialist, discusses:

When debugging, it's essential to understand the domain well. While using AI might be tempting, Felipe advises caution and not to take it at face value.
It is crucial to ingest relevant metrics, take actionable steps, manage log levels, and carefully consider data storage and retention.
The future of Kubernetes, which he sees being improved rather than replaced, with a focus on easier onboarding, AI integration, and enhanced security measures.

Transcription

Bart: Who are you? What's your role? And who do you work for?

Felipe: My name is Felipe Martinez Amaral. I'm a cloud architect, working for DoiT, as you can see here. I've worked with Kubernetes for a long time. I was a Java developer in the past, but now I focus on the cloud, mostly on Google Cloud.

Bart: What are three emerging Kubernetes tools that you are keeping an eye on?

Felipe: The first one is Falco, which is a security tool and really interesting. Tetragon from Cilium is also nice and focuses on security and observability. The third one is OpenTelemetry, which I think is maybe going to be better than Prometheus, and it is gaining more traction nowadays.

Bart: One of our guests, Alex, spent several weeks troubleshooting an issue with Kubernetes, which required the team to explore the kernel code. He stressed the importance of learning while troubleshooting. Is there any practical advice you've learned over the years regarding debugging?

Felipe: Debugging. That's my day-to-day, actually. I work with lots of customers, and every day I see new troubleshooting issues. Often I find issues that I don't immediately know how to fix. The first thing I would say is that you need to understand the domain you are looking at, right? You need to understand, for example, whether it's security-related, network-related, or, let's say, some timeouts. You need to know where you are looking: for example, look at DNS, look at iptables, or maybe the CNI. That's the domain. Another thing to consider is AI. AI is everywhere, right? It should be really helpful for you. But be careful, because you need to understand hallucinations as well. Don't take what it says for granted, but use it as a reference to get faster results. Use it to find documentation or source code, and go from there.
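To illustrate the "know which domain you are in" advice, here is a minimal Python sketch that checks a single domain, DNS, before anything else is investigated. The function name check_dns and the shape of the returned dictionary are hypothetical and not something discussed in the interview. It only relies on the standard library's socket.getaddrinfo.

```python
import socket
import time


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Try to resolve a hostname and report addresses, errors, and timing.

    `resolver` is injectable so the check can be exercised without a network;
    by default it uses the system resolver via socket.getaddrinfo.
    """
    start = time.perf_counter()
    try:
        results = resolver(hostname, None)
    except socket.gaierror as exc:
        # Resolution failed: the problem is likely in the DNS domain.
        elapsed = time.perf_counter() - start
        return {"host": hostname, "ok": False, "error": str(exc), "seconds": elapsed}
    elapsed = time.perf_counter() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved IP address.
    addresses = sorted({entry[4][0] for entry in results})
    return {"host": hostname, "ok": True, "addresses": addresses, "seconds": elapsed}
```

Run from inside a pod, a call such as check_dns("kubernetes.default.svc.cluster.local") (assuming the default cluster domain) gives a quick signal: a failure or a slow answer points at DNS, while a fast, correct answer suggests looking at another domain such as iptables or the CNI.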

Bart: Another guest of ours, Mat, shared that most teams store logs and metrics in Kubernetes without considering the implications of the data they collect. Consequently, they end up paying a hefty price for data that is not actually used. What's your advice on ingesting, storing, and querying metrics in Kubernetes?

Felipe: There are three things here. First, ingesting: are you ingesting the metrics that you actually need? Sometimes you don't need to ingest everything that the applications are exposing. Make sure that you are only getting the right metrics, and that you are getting actions from them. Observability is only good when you take actions; metrics just for the sake of having them don't make any sense. Also, at the application level, be careful about logging too many things, for example debug-level or trace-level logs. Log just enough information. You should be able to filter at the application level, but also at other levels. For example, if you're using Google Cloud, you can filter on the log bucket. When you collect and query metrics, be careful with cardinality. If you have metrics with lots of different points of information, higher cardinality means more storage and more data. And of course there is retention as well: how long are you going to retain this metric? Do you need to keep it for a year, six months, a month, a week? Be mindful when you choose those numbers.
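The cardinality and retention points can be made concrete with some rough arithmetic. The sketch below is an illustration only: the label names, the counts of distinct values, and the 30-second scrape interval are all made-up assumptions, and the product of label values is an upper bound rather than what any particular backend would actually store.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric.

    `label_values` maps each label name to its number of distinct values;
    the worst case is every combination appearing, i.e. the product.
    """
    return prod(label_values.values())


def stored_samples(series, scrape_interval_s, retention_days):
    """Number of samples kept over the whole retention window."""
    samples_per_day = 86_400 // scrape_interval_s
    return series * samples_per_day * retention_days


# Hypothetical labels for an HTTP request counter.
low = series_count({"method": 5, "status": 6, "service": 20})
# Same metric with one unbounded label added (assumed 10,000 users).
high = series_count({"method": 5, "status": 6, "service": 20, "user_id": 10_000})

print(low, high)
print(stored_samples(low, scrape_interval_s=30, retention_days=30))
print(stored_samples(low, scrape_interval_s=30, retention_days=365))
```

Under these assumptions, adding a single high-cardinality label multiplies the series count by the number of its distinct values, and moving from a month to a year of retention multiplies the stored samples by roughly twelve. Both choices compound, which is why they are worth making deliberately.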

Bart: Kubernetes is turning 10 years old this year.

Felipe: I think Kubernetes will not be replaced, but it will be improved. In general, one thing the community needs to do is make things easier for those who are onboarding to Kubernetes, because it's really complex. The landscape is huge in the CNCF, but also in the Kubernetes area. There are lots of things that you can learn there, and it takes a long time. I believe that over the years AI will be introduced as well. That can help you, for example, to manage your clusters. Of course, we will always need someone there to make the infrastructure work, even with AI.

Bart: What's next for you?

Felipe: Always new things to learn, right? I'm currently focused on security and trying to get more knowledge of the different tools that we have here. For example, understanding how we can implement hardening in your cluster and how we can protect your cluster even more. I'm trying to learn, from the bottom up, all the security tools that can help us and our customers.

Bart: How can people get in touch with you?

Felipe: To get in touch with me, you can go to LinkedIn. My LinkedIn is femrtnz: "fe" from Felipe, "mrtnz" from Martinez without the vowels. Or you can find me by my name, Felipe Martinez Amaral.

Podcast episodes mentioned in this interview

Foolproof Kubernetes with GKE
with Mathew Duggan
Troubleshooting a validation webhook all the way down to the kernel
with Alex Movergan