Metrics ingestion, data decay and entity-centric observability
Miguel's unique perspective offers fresh insights into cloud-native observability and strategically managing a large volume of metrics.
In this interview, Miguel Luna (Principal Product Manager at Elastic) discusses:
The challenges of managing metrics in Kubernetes environments, highlighting the importance of metric ingestion, data decay consideration, entity-centric observability, and user role-centric approaches.
The crucial need to bridge the knowledge gap between AI experts and Kubernetes.
How applying industrial engineering and supply chain principles can significantly improve cloud-native observability.
Transcription
Bart: Who are you? What's your role? And who do you work for?
Miguel: My name is Miguel Luna. I'm a Principal Product Manager at Elastic, the leading search analytics company. I'm originally from Colombia. I lived in London for close to 20 years. I've been involved in the cloud native community for about six to seven years, working at VMware, Telefonica, and most recently with Elastic.
Bart: What are three Kubernetes emerging tools that you're keeping an eye on?
Miguel: First of all, I would say OpenTelemetry. Being on the Elastic Observability side, I've been involved with OpenTelemetry, not just because it standardizes data collection, but also because it gives you correlation out of the box and lets users deal with fewer standards. I think this is finally the tool that is tipping the balance toward having one less standard. Second, Carvel, one of the CNCF projects, because it's very close to my heart. When I was at VMware, I worked with Carvel to provide a consistent experience for deploying applications to any Kubernetes distro. We did it using an operator called kapp-controller. It is a fantastic project and I think it has a lot to bring to the community. The last one is Crossplane, because I feel it's one of those tools that extends the Kubernetes concept, not just to infrastructure but to services as well, and truly enables Kubernetes to deliver a hybrid, multi-cloud experience.
Bart: One of our guests, Matt, shared that most teams store logs and metrics in Kubernetes without considering the implications of the data they collect. Consequently, they end up paying a hefty price for data that is not actually used. What's your advice on ingesting, storing, and querying metrics in Kubernetes?
Miguel: I like to think of it from three perspectives. The first is cardinality and data decay. On cardinality: it used to be that only the largest companies could deploy 100 servers, 1,000 servers, 1,000 VMs. Nowadays with Kubernetes, any of the folks in this room can open up their laptop and deploy 30,000 pods. All of a sudden, you have 30,000 things to monitor. If, let's say, each pod is emitting 50 metrics, you're going to have 1.5 million metrics every 5-10 seconds. So you've got to be careful about what you ingest. Data decay is the same idea: as time passes, data loses value. So it's quite important to understand at what point you want to start rolling up the data, sacrificing granularity for the sake of storage. The second aspect, and one where I think a lot of the vendors, as well as the OpenTelemetry project, have a lot to bring to the table, is entity-centric observability. We need to make it easier for users to have what they actually care about reflected in the configuration of metrics ingestion. It's not just about wanting to monitor Kubernetes and opening the floodgates. No, I want to monitor Kubernetes, but perhaps I only work on the application layer. So I don't want to ingest all the granular metrics of the control plane. I just want to ingest the workload objects, because these are the things that I really care about. This brings me to the last concept, which is user role-centric observability. Many users now like to use managed Kubernetes, and there are even some distros that abstract away the management of nodes. So you perhaps don't need that sort of data, or maybe you want it, but only at a very high level. It's basically the concept of "I don't need to monitor this until I do." And when you do, you start collecting more data.
So just try to be careful with the data that you ingest, and be a little more conscious that every metric is going to cost you.
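As a rough illustration of the cardinality and data decay points, the sketch below estimates how many series a cluster emits per scrape and then rolls raw samples up into coarser time buckets. It is a minimal Python sketch, not tied to Elastic or OpenTelemetry: the function names (`series_per_scrape`, `samples_per_day`, `roll_up`), the choice of averaging as the rollup, and the 60-second bucket size are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def series_per_scrape(pods: int, metrics_per_pod: int) -> int:
    """Number of metric series emitted each time every pod is scraped."""
    return pods * metrics_per_pod


def samples_per_day(pods: int, metrics_per_pod: int, interval_seconds: int) -> int:
    """Total samples ingested per day at a fixed scrape interval."""
    scrapes_per_day = 86_400 // interval_seconds
    return series_per_scrape(pods, metrics_per_pod) * scrapes_per_day


def roll_up(samples, bucket_seconds: int):
    """Downsample (timestamp_seconds, value) pairs to one average per bucket.

    Averaging is an assumption here; a real rollup might also keep
    min, max, and count so that less information is lost.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    # 30,000 pods emitting 50 metrics each.
    print(series_per_scrape(30_000, 50))
    # The same cluster scraped every 10 seconds for a day.
    print(samples_per_day(30_000, 50, 10))
    # Five minutes of 10-second samples rolled up into 1-minute buckets.
    raw = [(t, float(t % 60)) for t in range(0, 300, 10)]
    print(len(raw), "raw samples ->", len(roll_up(raw, 60)), "rolled-up samples")
```

With 30,000 pods and 50 metrics per pod, each scrape produces 1.5 million series, which is why the choice of what to ingest, and when to trade granularity for storage, matters so much.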
Bart: Kubernetes is turning 10 years old this year. What should we expect in the next 10 years to come?
Miguel: If you saw the keynote, of course, AI. But one of the things I found very interesting in the keynote is that one of the people who has done the most in AI only met Kubernetes a year ago. When you consider this, there is a huge gap that needs to be bridged between the AI community and the Kubernetes and infrastructure orchestration community, to make sure that we help the revolution happen. Those folks already have enough on their minds figuring out how to make AI work, so I wouldn't ask them to suddenly become Kubernetes experts. We need to bridge that gap and make it easier for these two worlds to help each other. The second aspect is that, through standardization and abstraction, operating Kubernetes has to become easier, so that a broader range of users can use it. I see a lot of the cloud providers doing this, but I think Kubernetes itself can also do some of this work, so that it is standardized across distributions rather than relying on each vendor adding its own abstraction to Kubernetes.
Bart: What's next for you?
Miguel: Like I said, I work on the cloud-native observability side. I've been deeply involved with OpenTelemetry in recent months. One of the things I really like is that I'm an industrial engineer by training. When I was working on supply chain systems, I did a lot of work around systems thinking, entities, and balancing supply chains. So now that I'm getting involved in the OpenTelemetry community, I've been talking a lot with the folks there about how we can bring some of that systems thinking into the telemetry itself. It's not just about collecting signals, but also about being able to capture the relationships between the entities. For example, understanding the relationship between a pod and a node, and, if one of these entities goes down, what the impact is on the overall health of your system. That's what I keep thinking about these days.
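The idea of capturing relationships between entities, and asking what is affected when one goes down, can be sketched as a small dependency graph. This is a hypothetical Python illustration only: the `EntityGraph` class, its methods, and the entity names are invented for the example and do not reflect any OpenTelemetry or Elastic data model.

```python
from collections import defaultdict, deque


class EntityGraph:
    """Tracks which entities depend on which (e.g. a pod depends on a node)."""

    def __init__(self):
        # Maps an entity to the set of entities that directly depend on it.
        self._dependents = defaultdict(set)

    def add_dependency(self, dependent: str, dependency: str) -> None:
        """Record that `dependent` relies on `dependency`."""
        self._dependents[dependency].add(dependent)

    def impacted_by(self, failed: str) -> set:
        """Return every entity that directly or transitively depends on `failed`."""
        seen = set()
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            for dependent in self._dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        seen.discard(failed)  # Guard against cycles reporting the failed entity itself.
        return seen


if __name__ == "__main__":
    graph = EntityGraph()
    # Hypothetical entities: two pods scheduled on one node, one pod on another.
    graph.add_dependency("pod/checkout-1", "node/node-a")
    graph.add_dependency("pod/checkout-2", "node/node-a")
    graph.add_dependency("pod/search-1", "node/node-b")
    # A service depends on the pods that back it.
    graph.add_dependency("service/checkout", "pod/checkout-1")
    graph.add_dependency("service/checkout", "pod/checkout-2")
    print(sorted(graph.impacted_by("node/node-a")))
```

Storing the edges from dependency to dependent makes the "what breaks if this fails" question a simple breadth-first walk; signals such as metrics or logs could then be attached to each entity rather than treated in isolation.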
Bart: How can people get in touch with you?
Miguel: I would say just look me up on LinkedIn. I have a Twitter account, but I'm not as active there. Just add me, or ping me on the CNCF Slack. You're going to find me. For some reason I have two users there, but I monitor both of them. Otherwise, add me or follow me on LinkedIn.