Running Kubernetes at the edge: scaling to 15,000 clusters and beyond

Guest:

  • Raghushankar Vatte

This interview explores how Kubernetes at the edge is transforming from a complex infrastructure component to an essential platform for distributed computing.

In this interview, Raghushankar Vatte, Field CTO and VP Strategy at ZEDEDA, discusses:

  • Managing distributed Kubernetes at scale, with real-world examples from deployments across 15,000+ dealerships, maritime vessels, and automated farming systems.

  • The transition from treating clusters as pets to cattle, highlighting how automation and standardized policies are crucial for managing thousands of remote deployments.

  • How observability and monitoring are evolving beyond basic health checks to provide predictive insights across entire fleets of edge clusters.

Transcription

Bart: Welcome to KubeFM. Can you tell our audience who you are, what your role is, and where you work, specifically at ZEDEDA?

Raghu: First of all, thanks for having me on KubeFM. I'm Raghushankar Vatte, the Field CTO and VP Strategy for ZEDEDA, based out of San Francisco.

Bart: Now, what are three emerging Kubernetes tools that you are keeping an eye on?

Raghu: It's an interesting question. To give you context, ZEDEDA enables edge orchestration and management. We have customers like the number two automaker in the world, where we will ultimately enable around 70,000 dealerships; we are already deployed in 15,000 of them and are ramping up to the full 70,000. We are also deployed on commercial maritime vessels that travel across the globe, as well as cruise ships, oil and gas fields, and retail stores.

The reason I find this interesting is that when people think about Kubernetes, they often think about large, sophisticated clusters doing complex tasks, from inventory to AI. However, what we've seen at the edge is a major paradigm shift. There is a need to deploy and operate containers, and once the number of containers grows to 10 or more, people start thinking about Kubernetes because it is proven to work well in the cloud and in data centers.

However, when they start using Kubernetes at the edge, they find that it's not the same as running it in the cloud. Instead of running a large cluster, they are running 15,000 clusters dispersed across the globe, some with connectivity and some without. They need to set policies, including security policies and admission policies, not just for Kubernetes but also for the assets running on it, from hardware to applications.
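
(To make "admission policies" concrete: below is a minimal sketch of a validating admission webhook that rejects privileged containers. It is illustrative only, not ZEDEDA's implementation; a real webhook must be served over TLS and registered with the API server via a ValidatingWebhookConfiguration, and the port and rule here are arbitrary choices.)

```python
# A sketch of a validating admission webhook, not a production server: real
# webhooks must be served over TLS and registered via a
# ValidatingWebhookConfiguration; the port and the policy are arbitrary.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AdmissionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        review = json.loads(self.rfile.read(length))
        request = review["request"]
        pod = request["object"]
        # Reject pods that ask for privileged containers.
        privileged = any(
            (c.get("securityContext") or {}).get("privileged", False)
            for c in pod["spec"]["containers"]
        )
        response = {"uid": request["uid"], "allowed": not privileged}
        if privileged:
            response["status"] = {
                "message": "privileged containers are not allowed at the edge"
            }
        body = json.dumps({
            "apiVersion": "admission.k8s.io/v1",
            "kind": "AdmissionReview",
            "response": response,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8443), AdmissionHandler).serve_forever()
```

(The same rule could equally be expressed declaratively with tools like Kyverno or Kubernetes' built-in ValidatingAdmissionPolicy; the point is that the policy is enforced by machinery, uniformly across clusters, rather than by a person at each site.)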

When my customers ask if we can scale to 70,000 easily or to a few thousand on ships roaming in the middle of nowhere, I need to solve that problem. That's why I started by saying it's an interesting problem. We are dealing with a different side of Kubernetes, where we need to develop new tooling or variations of existing tooling to make it work in this environment. This is edge computing, from edge AI to basic applications running in remote locations.

It's hard for me to name three specific tools, but I'm interested in anything that can orchestrate a fleet of Kubernetes clusters, provide handles to enforce different policies, and offer visibility. ZEDEDA is investing in some of these areas and has a roadmap for them, but if there are tools out there that can do this, we don't want to reinvent the wheel. We would be very interested in hearing about those tools and bringing them on board.
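
(For a flavor of what fleet-level visibility can look like, here is a hedged sketch using the official Kubernetes Python client, `pip install kubernetes`. It assumes every edge cluster appears as a context in a local kubeconfig, which a real fleet manager would replace with its own inventory and credential handling.)

```python
# A hedged sketch of fleet-wide visibility: sweep every cluster context and
# report node readiness, tolerating the unreachable ones.
from kubernetes import client, config

def cluster_ready_nodes(context_name):
    """Return (ready, total) node counts for one cluster."""
    api = client.CoreV1Api(config.new_client_from_config(context=context_name))
    nodes = api.list_node().items
    ready = sum(
        1 for n in nodes
        for cond in n.status.conditions
        if cond.type == "Ready" and cond.status == "True"
    )
    return ready, len(nodes)

contexts, _active = config.list_kube_config_contexts()
for ctx in contexts:
    name = ctx["name"]
    try:
        ready, total = cluster_ready_nodes(name)
        print(f"{name}: {ready}/{total} nodes ready")
    except Exception as exc:  # ships lose connectivity; don't abort the sweep
        print(f"{name}: unreachable ({exc})")
```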

Bart: One of our guests, Dan Garfield, shared that your cluster only feels real once you set up ingress. Before that, it was just a playground where it didn't matter what you did. What are your thoughts?

Raghu: I completely agree. When you have a pet, you name it, feed it, and give it specific exercises; you are very particular about its care. When I look at very large clusters, I think of them as pets, with people assigned to take care of them. However, in the environment ZEDEDA operates in, we cannot afford to think of our assets as pets, because we have a large number of them; 20,000 assets require a different approach. When you have that many assets, you should be able to separate them into fleets or sub-fleets, set policy, and receive alerts when things don't work. You should then be able to react to those alerts.

When you have something working in your lab, you have full access to it and everything is good. However, when you put it out in the field and encounter issues like DNS problems, you need to take additional steps to ensure it can operate in the wild. That's where you transition from treating assets as pets to treating them as cattle. When you can repeat a process for a large fleet of assets without personalizing every single instance, that's where you can scale both operationally and economically. I completely agree with the idea of pets versus cattle. If you start treating everything as a pet, you won't be able to scale. Once you start treating things as cattle, you can identify what you need to do from a technology, operational, and security-policy perspective, and that's when you can scale. You can only enable scaling when you can do it in an economical fashion; otherwise, it's not feasible.
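
(A minimal, self-contained sketch of the cattle mindset: compare every cluster against one golden configuration and alert on drift, rather than hand-tending each instance. The fleet data and configuration fields below are hypothetical.)

```python
# "Cattle" thinking in miniature: fingerprint each cluster's effective
# configuration and flag any that drift from the golden baseline.
import hashlib
import json

def config_fingerprint(cfg: dict) -> str:
    """Stable hash of a cluster's effective configuration."""
    return hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()

golden = {"runtime": "containerd-1.7", "cni": "flannel", "admission": "deny-privileged"}
fleet = {
    "dealership-0001": golden,
    "dealership-0002": {**golden, "admission": "allow-all"},  # drifted
}

baseline = config_fingerprint(golden)
drifted = [name for name, cfg in fleet.items()
           if config_fingerprint(cfg) != baseline]
print("out-of-policy clusters:", drifted)  # alert on these; don't ssh into them
```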

Bart: In terms of automation and resource management, one of our guests, Alexandre, expressed that having an automated mechanism is better than enforcing processes. What automation tools or approaches do you recommend for managing Kubernetes resources?

Raghu: One of the things we have seen is that when you are running in a data center or in the cloud, you have a workforce managing these assets. If you have an IT team managing them, they will have a regular schedule, such as a calendar that says when to perform checks, along with triggers and alerts. However, when you have 100 different locations, it becomes more complex. For example, we have a deployment with the third-largest maritime company in the world, which has air-cooled shipping containers. If the device or the container is not working properly, the perishable food inside spoils, resulting in economic losses.

They have solutions where they run a Kubernetes cluster on the ship, not just one, but several, for inventory, sensors, and connectivity. However, when changes or maintenance are required, they are at the mercy of the IT staff on the ship. Unfortunately, it is unlikely that the IT staff on the ship will fully understand the complexities of the system. As a result, the system is only as good as the skills of the IT staff on the ground.

We can put hundreds of admission policies in place for security, capacity management, and application distribution, but we are still dependent on another entity to take care of them. At ZEDEDA, we believe in automating as much as possible to minimize the need for manual processes and hands-on security work. When dealing with a large number of assets, uniformity is key, and the only way to achieve this is through automation.

Automation is necessary to manage Kubernetes clusters and the assets hosting them, as well as to continuously monitor the health of the runtime, hardware, and applications. This allows for the identification of issues, such as three "cattle" that are not functioning properly, which can then be isolated and addressed by experts. Automation is essential, especially when dealing with 15,000 Kubernetes clusters spread across the globe, where language barriers and other challenges come into play.
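
(A toy sketch of that triage loop: automation scans health reports across hardware, runtime, and applications, and only the few unhealthy "cattle" are escalated to experts. The report format and cluster names are invented for illustration.)

```python
# Automated triage in miniature: scan per-cluster health reports layer by
# layer and escalate only the clusters that need expert attention.
LAYERS = ("hardware", "runtime", "applications")

def needs_attention(report: dict) -> list[str]:
    """Return the layers that are unhealthy for one cluster."""
    return [layer for layer in LAYERS if report.get(layer) != "ok"]

reports = {
    "vessel-aurora": {"hardware": "ok", "runtime": "ok", "applications": "ok"},
    "vessel-borealis": {"hardware": "ok", "runtime": "crashloop", "applications": "degraded"},
}

for cluster, report in reports.items():
    bad = needs_attention(report)
    if bad:
        # Isolate and escalate instead of paging someone for the whole fleet.
        print(f"escalate {cluster}: unhealthy layers {bad}")
```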

We have dealt with these challenges firsthand and have implemented automation in our product to ensure manageability and allow our customers to focus on outcomes. For example, the folks on the ships are concerned about the temperature variation during transport, and they need to ensure that it does not fluctuate more than 10 degrees to prevent spoilage. They should be able to focus on these concerns while automation takes care of the underlying systems.
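
(The temperature requirement translates into a very small outcome-level check. The sketch below alerts when readings swing more than 10 degrees over a rolling window; the window size, units, and alert hook are assumptions, not the customer's actual logic.)

```python
# A toy outcome-level check: alert when the temperature inside a container
# swings more than 10 degrees over a rolling window of readings.
from collections import deque

WINDOW = 12          # e.g. the last 12 readings
MAX_SWING = 10.0     # degrees, per the spoilage requirement

readings: deque[float] = deque(maxlen=WINDOW)

def record(temp: float) -> None:
    readings.append(temp)
    if len(readings) == WINDOW and max(readings) - min(readings) > MAX_SWING:
        print(f"ALERT: swing {max(readings) - min(readings):.1f} exceeds {MAX_SWING}")

for t in [4.0, 4.2, 3.9, 4.1, 4.0, 3.8, 4.3, 4.1, 4.0, 3.9, 4.2, 15.5]:
    record(t)
```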

Bart: Looking at observability and monitoring, one of our guests, Miguel, explained that while monitoring deals with problems we can anticipate, for example running out of disk space, observability goes beyond that and answers questions you didn't even know you needed to ask. Does this statement match your experience adopting observability in your stack?

Raghu: Yes, 100%. This is the story of the past five years, and it has nothing to do with what we in particular are doing. There is already a lot of monitoring: we all know how to collect logs, parse them, and gain insights from what is happening. What has changed is that we are not only monitoring but also starting to correlate, and this is where AI is helping quite a bit. By the way, in every interview I have to say at least once that AI is really helping, because now you can correlate beyond what is happening on a single site, which matters a great deal to us when we are dealing with 15,000 to 20,000 sites or clusters.

Monitoring, in the sense of checking the health of the system, might seem simple, but at the edge that in itself is very valuable. Knowing how hot a device is running, whether your applications are healthy, or whether they are overloaded is part of monitoring. But I can go further: I can gather not only that information but also information about the assets everything runs on, whether hardware or other stacks, and about the network traffic going in and out. When I can gather all of that and make sense of it, and because we have access to such a large number of endpoints, we can build good correlations about what is really happening.

For example, we can see that dealerships in a particular region are not performing as well as they should because there is an outage, say a DNS outage. We can sense all of that and surface actionable items that operators can fix before things break. This is again why I said it is the story of the past five years: we started with preemptive maintenance and with predicting when things would fail, but now it is about understanding the full stack, from silicon all the way to your applications, including the policies and processes around them, and figuring out what is pushing things one way or the other.

It is not always about the bad, either; sometimes it is the positive things. Some sites are working really well, so how can we replicate that at other sites? Monitoring versus observability makes a lot of sense to me. When you have the luxury of a large data set, you can learn, you can make inferences, and the accuracy of those inferences increases dramatically. That is why observability and monitoring are both very important, and why we are moving into the observability space. Monitoring is not going to go away; it is the base for foundational things, and anybody relying on it will keep it, but we are getting a lot more value out of the same bits we used to see. At ZEDEDA, we see this every day.
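
(A small illustration of the cross-fleet correlation Raghu describes, using only the Python standard library; `statistics.correlation` requires Python 3.10+. The per-site DNS failure and application success rates are made-up numbers.)

```python
# Correlating two fleet-wide signals: per-site DNS failure rate versus an
# application success metric. A strong negative correlation is an actionable
# hint that DNS trouble is driving application trouble.
from statistics import correlation

dns_failure_rate = [0.01, 0.02, 0.15, 0.03, 0.22, 0.01]   # one value per site
app_success_rate = [0.99, 0.98, 0.81, 0.97, 0.74, 0.99]

r = correlation(dns_failure_rate, app_success_rate)
print(f"Pearson r = {r:.2f}")
if r < -0.8:
    print("actionable insight: investigate DNS before it degrades more sites")
```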

Bart: I know you've given some context about what's been happening over the past five years. Thinking about the fact that Kubernetes turned 10 years old this year, what can we expect from the next decade of Kubernetes?

Raghu: I don't want to prophesy and say this is exactly what is going to happen, but I have a pretty good idea of where the trend is, and I have a clear wish for what it should be. The trend is that Kubernetes is going to become more and more sophisticated in the data center, in the cloud, and at the edge. Right now, those are three distinct areas: running Kubernetes at the edge is different from running it in the cloud or in a data center, but that's going to change. It's going to be a very smooth transition from one to another because, at the end of the day, you're doing all these things to analyze something or provide a service.

For example, when a car pulls up to a mechanical arm in your garage, the arm goes up when you badge in. There's a lot of processing happening, from number-plate reading to scanning your badge, verifying that it's valid, and then telling the mechanical arm to go up. All of those things happen locally, but at the end of the day, the data being collected is sent back. I don't think Kubernetes is going to run at the edge in a vacuum; it's going to connect back. As more and more people start using it, which is already happening, it's going to get very sophisticated in terms of becoming seamless.

I also think Kubernetes is going to effectively disappear, even though it's still there. The way I like to think about it is like the microprocessor in your laptop. When I talk about your laptop, I talk about whether it's an Apple or a Dell, but very few people talk about whether it's based on Intel. They just expect a processing unit to be there and do its work. I think Kubernetes will become an integral part of all the compute happening both in the cloud and at the edge, but because of its sophistication and evolution, it's going to become almost transparent. That's what we want to do at ZEDEDA, where we really care about enabling the edge and enabling outcomes. No one wakes up and says, "I'm going to enable Kubernetes at the edge today." People wake up and say, "I'm going to create value at the edge by running these applications and delivering these outcomes."

Kubernetes is the way they're going to do that, just like how silicon became completely transparent to people. They don't think about silicon; they think about the color of the laptop or the resolution of the camera. I see that trend happening, and I wish it becomes like that sooner, hopefully not in 10 years, because there are many cool things happening at the edge in terms of applications. Most of them are bottlenecked by security policies, where people are not comfortable putting a Kubernetes stack or any other stack at the edge to manage outcomes and trust it to work day in and day out.

If that mechanical arm in your garage doesn't open, you're not going to dinner, or you're not getting out of your office. Those are the kind of things that have a big effect when they stop working. As Kubernetes evolves, both from the cloud side and the edge side, it becomes transparent, and people will start thinking more about the applications because they trust it. Just like we trust the cloud to do its thing when we launch, we're only worried about upgrading or paying more, not whether it will work or not. That's the same thing that's going to happen at the edge, and that will lead to more applications and sophisticated applications being brought down to the edge.

Bart: Now, getting towards the end, what's next for you and how can people get in touch with you?

Raghu: We announced the ZEDEDA Kubernetes Service in June last year. Since then, we have done a lot of work and acquired multiple customers. This is not just software for us; it is a real product. We are talking about 15,000 Kubernetes clusters in dealerships and tens of thousands more elsewhere. Believe it or not, there are actual robots that help milk cows on farms, and we have a customer with 20,000 of these farms, each running a Kubernetes cluster. They want to manage those clusters, and we have a running joke that we are in the full supply chain of meat and dairy, from the cows to the retail store.

We are just getting started. What we really want to do is get to a state where our customers can enable these outcomes at the edge by looking at Kubernetes as just another platform. They should be able to deploy Kubernetes directly using ZEDEDA, including the runtime security and applications, and focus on their applications and performance. We are also investing heavily in AI at the edge, not just in terms of Kubernetes, but also outside of it. However, it is hard to imagine AI at the edge running without Kubernetes, as most generic applications require Kubernetes.

We will be busy for the next one or two years, and our goal is to make Kubernetes at the edge something that our customers trust and consider common. We want to eliminate the risk of them worrying about losing their job because they installed Kubernetes on 10,000 sites and it doesn't work. We are also looking at security policies very closely, including application security, runtime security, hardware security, and network security. We are partnering with multiple players in this area to provide our customers with a simple way to enable their edge applications and solutions using Kubernetes.

The best way for people to get in touch is to reach out to anyone at ZEDEDA. I can be contacted at [email protected], or you can visit the ZEDEDA website and click on "Contact Us". If you are interested in a demo, you can request one there as well.

Bart: Thank you so much.

Raghu: Take care. Thank you.
