Taming tool sprawl: building mission-critical platforms on Kubernetes
From tool sprawl to the future of Kubernetes, this interview explores the evolution and challenges of cloud-native infrastructure.
In this interview, Karthik Ranganathan, Co-founder & Co-CEO at YugabyteDB, discusses:
How to tackle tool sprawl in Kubernetes by focusing on areas of concern rather than individual tools.
The future of Kubernetes in the next decade, highlighting the need for simplification, better cloud neutrality, and improved support for hybrid deployments.
Making databases cloud-native through improved deployment patterns, resilience, and observability, with insights from YugabyteDB's journey with PostgreSQL.
Transcription
Bart: Hi Karthik. Can you tell us who you are, what your role is, and who you work for?
Karthik: Hi Bart. I'm Karthik Ranganathan, and I'm the co-founder and a co-CEO at YugabyteDB (that's the company name). We build a relational database for the cloud. We're simplifying data.
Bart: What are three emerging Kubernetes tools that you're keeping an eye on?
Karthik: The fun thing about Kubernetes is that it takes about 300 tools, put together, to make a functional solution. So if I mention just three, it doesn't mean those are the only important ones. The three I'll talk about are: first, in the area of observability, Prometheus and, as a bonus, Thanos. The second is around how network traffic gets shaped, with tools such as Istio and Ingress controllers. The third consists of the most fundamental tools in Kubernetes itself: things like kubectl, Rancher, and the operators that form the core fabric of how we use Kubernetes. As a bonus, I also track k3s, a lightweight Kubernetes distribution for edge deployments. Those are the top three tools I think about and follow quite a bit, but there are many more that matter as well.
Bart: While scaling their Jenkins platform on Kubernetes to 10,000 builds per week, one of our guests, Stéphane, and his team found that there are limits to network, CPU, I/O, disk space, and memory. As they hit them, they restricted usage and worked around them. What's your approach to managing constrained resources in a Kubernetes cluster?
Karthik: I'd offer a three-part mental model for thinking this through. First, think in terms of repeatable units. Instead of making one giant deployment where things get out of control, it's better to make mid-sized deployments, not super large ones where nothing is constrained. Think in terms of blueprints that are repeated: constrain the size and create more repeated units, but don't make the units so small that you end up with a thousand or ten thousand of them, which becomes unmanageable in a different way.
Second, use the limits feature in Kubernetes intelligently. Software built on top of Kubernetes with a cloud-native architecture should inherently understand resource pressure and throttle back when constrained. It should change its behavior to protect the service it provides to users while surfacing meaningful information. Pick sensible limits, and pick software that's architected for the cloud-native world; this is the difference between pre-cloud-native and cloud-native software.
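As an illustration of picking sensible limits, here's a minimal sketch using the official Kubernetes Go API types to declare requests and limits for a container; the container name, image, and values are hypothetical:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A container that requests a baseline of CPU/memory and is capped
	// by limits, so the kubelet can throttle or evict it under pressure.
	container := corev1.Container{
		Name:  "worker",                  // hypothetical name
		Image: "example.com/worker:latest", // hypothetical image
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("500m"),
				corev1.ResourceMemory: resource.MustParse("512Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("1"),
				corev1.ResourceMemory: resource.MustParse("1Gi"),
			},
		},
	}
	fmt.Printf("requests=%v limits=%v\n",
		container.Resources.Requests, container.Resources.Limits)
}
```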
Third, invest in the right kind of actionable observability. When you run out of resources, the root cause is usually simple: you ran out of a specific resource, say disk or network. But getting to that answer can be tough, because you see symptoms all over the place and it's very hard to zero in. It's essential to have the right metrics: not so many that they obscure the story, and not so few that they don't capture the whole story. You need to make sense of all the metrics quickly, know what happens in these cases in your stack, and be able to get to the root cause.
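To make that concrete, here is a minimal sketch of surfacing one such pressure signal with the Prometheus Go client library; the metric name, value, and port are assumptions for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// queueDepth is a hypothetical gauge that surfaces internal backpressure,
// so dashboards and alerts can point at the root cause directly.
var queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "app_work_queue_depth", // assumed metric name
	Help: "Number of items waiting in the internal work queue.",
})

func main() {
	prometheus.MustRegister(queueDepth)
	queueDepth.Set(42) // in real code, update this from the queue itself

	// Prometheus scrapes this endpoint to collect the metric above.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```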
In summary, the three key points are: think in terms of repeatable units of deployment, use software architected for these kinds of deployments that understands resource pressure natively, and invest in observability that exposes these pressure points clearly rather than forcing you to sift through a lot of data.
Bart: With a 15,000-pod cluster, one of our guests, Faris, explained their transition from Prometheus to Thanos, and eventually their decision to adopt VictoriaMetrics for their metrics stack. What's your experience in monitoring and managing large-scale clusters?
Karthik: As you know, two of the top tools I watch are Prometheus and Thanos, and I'm also following VictoriaMetrics. Ultimately, with large-scale systems, as we discussed in the previous question, managing a large-scale deployment is crucial. Whether you break it down into many units of deployment or a single unit, you want a single pane of glass for the whole system. You don't want many panes for each of those components, probed in arbitrary ways. You want to know if there's a problem, where the problem is, whether things are inefficient, and what the utilization is.
Prometheus is fantastic as software that brings system data together, but it's not very stable at large scale; at least, that's what we've found. You need to set up an elaborate arrangement of aggregators, push and pull, and memory allocation. Thanos is a project that helps you scale that out pretty well. Our experience with Thanos has been good, and we use it for our fully managed cloud. We keep an eye on the ecosystem at large to see what improvements we can adopt.
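Because Thanos fronts many Prometheus instances behind one Prometheus-compatible API, a single query endpoint can cover the whole deployment. Here's a minimal sketch using the Prometheus Go client; the address and query are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// A Thanos Querier (or plain Prometheus) serves the same HTTP API,
	// so one endpoint can front many underlying Prometheus instances.
	client, err := api.NewClient(api.Config{
		Address: "http://thanos-query.monitoring:9090", // assumed address
	})
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query for per-pod CPU usage across the whole fleet.
	result, warnings, err := v1.NewAPI(client).Query(ctx,
		`sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```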
The problems they're addressing are spot on. No data means you're blind, and you cannot automate operations. The more you learn about the shape of the data when you hit a problem, the easier it is to build detection, remediation, and auto-resolution. The point is to eliminate the need for people to run things when systems can do it themselves, especially given the complexity of these autonomous cloud-native systems.
Bart: One of our guests, Hans, compared delivering software now to 20 years ago. He mentioned that while downtime was acceptable in the past, it isn't today. Hence, building platforms on top of Kubernetes requires more tooling than ever. Is it possible to keep tools from sprawling out of control? What kind of tools are essential for building mission-critical platforms?
Karthik: I think this is a great question. The issue of tool sprawl in the Kubernetes world is less about the sprawl itself and more about how Kubernetes is architected and how the ecosystem has developed. At its core, Kubernetes does a few things and provides many hooks for people to add on to it. Everyone takes on one area of concern, much like microservices in the software development world, where many small tools come together to achieve a specific goal.
Rather than thinking in terms of individual tools, it's better to think in terms of areas and the tool stacks within those areas. This approach breaks the overall stack down into manageable layers, or areas of concern. For example, one of the most basic areas of concern is infrastructure as code: how to deploy your infrastructure declaratively. There are several options available, such as Helm, operators, and Terraform. The choice of tool depends on business needs, technical requirements, and organizational culture.
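As a sketch of one of those options, the operator pattern: the heart of an operator is a reconcile loop that drives actual state toward declared state. A minimal, illustrative skeleton using controller-runtime follows (the reconciler here does nothing beyond fetching the object; a real operator would compare spec and status and act):

```go
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DeploymentReconciler is a toy reconciler: it re-reads the Deployment
// each time it changes, so the control loop converges on declared state.
type DeploymentReconciler struct {
	client.Client
}

func (r *DeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var dep appsv1.Deployment
	if err := r.Get(ctx, req.NamespacedName, &dep); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// A real operator would diff desired vs. observed state here and act.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	r := &DeploymentReconciler{Client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}). // watch Deployments
		Complete(r); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```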
Another building block is resilience and uptime, since downtime is no longer acceptable. Closely tied to it is observability, which is critical for detecting issues quickly; this is where systems like monitoring and logging come into play. The last area is portability and open standards, which is crucial for ensuring that systems can be deployed anywhere. This is why many Kubernetes-based systems are open source and adhere to open standards.
On top of these four building blocks (infrastructure as code, resilience and uptime, observability, and open standards), the next layer to consider is scalability. This includes cost efficiency and the ability to scale horizontally or vertically without requiring significant effort. Systems that can scale transparently will be more successful.
Another important aspect is resource utilization, which involves setting limits and allocating resources efficiently to maximize the number of applications that can be run in a cluster. This is similar to bin packing, where the goal is to fit as many items as possible into a box.
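To illustrate the bin-packing analogy, here's a toy first-fit-decreasing sketch in Go that packs pod CPU requests onto nodes; real schedulers weigh many more dimensions, so this is purely illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// firstFitDecreasing packs pod CPU requests (in millicores) onto as few
// nodes of a given capacity as it can: sort descending, then place each
// pod on the first node with enough room left.
func firstFitDecreasing(requests []int, nodeCapacity int) [][]int {
	sort.Sort(sort.Reverse(sort.IntSlice(requests)))
	var nodes [][]int // pods placed on each node
	var free []int    // remaining capacity per node
	for _, r := range requests {
		placed := false
		for i := range nodes {
			if free[i] >= r {
				nodes[i] = append(nodes[i], r)
				free[i] -= r
				placed = true
				break
			}
		}
		if !placed { // no node fits: open a new one
			nodes = append(nodes, []int{r})
			free = append(free, nodeCapacity-r)
		}
	}
	return nodes
}

func main() {
	pods := []int{500, 2000, 1500, 700, 300, 1000} // millicores (made up)
	for i, n := range firstFitDecreasing(pods, 4000) {
		fmt.Printf("node %d: %v\n", i+1, n)
	}
}
```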
The final layer to consider is the integration of AI/ML and auto-remediation systems. Kubernetes can enhance these systems by automating reactions and remediations, enabling faster detection and resolution. This requires careful planning of the overall tech stack, infrastructure stack, and deployment stack to support AI, ML, and auto-remediation.
Bart: Kubernetes is turning 10 years old this year. What do you expect in the next 10 years?
Karthik: Great, great question, and one I often think about. I'd say a few things about Kubernetes. The fact that you had to ask how many tools it takes to run Kubernetes, and whether it's possible to reduce them, means there are too many. I think one thing that will happen to Kubernetes is an insane level of simplification. I'm hoping to see this, and I've started to notice it. It could be achieved by reducing the number of tools, or by creating tool stacks that let people take a stack of tools and roll them out without caring what's inside.
Another area I'm interested in is Kubernetes' cloud neutrality. While Kubernetes is nominally cloud-neutral, it isn't always so in practice, and this is particularly true for network architecture. Everything else is cloud-neutral, but network architecture is not, due to Kubernetes' encapsulation design philosophy: encapsulation makes the system difficult to extend and effectively closed. In cloud deployments, data tends to be duplicated or replicated across regions, across clouds, on-premises, and other locations. Multi-cloud, hybrid-cloud, and multi-region deployments are becoming a reality, and Kubernetes needs to evolve to support these applications. From a data perspective, there are very few established patterns, and it's tough to get Kubernetes networking to work in these scenarios.
This makes Kubernetes deployments cloud-specific. For example, if you use GKE, there are certain patterns for multi-region deployments; if you use EKS or AKS, it's completely different; and if you roll your own Kubernetes, it's yet another thing. There are many different tools, such as MCS, Istio, Envoy, and Egress, which makes it a genuine design challenge rather than a cookie-cutter exercise. I'm interested in seeing this portion simplified, as it would make our lives easier. People often ask us which approach to take, and we have to work with them to understand their specific needs before we can answer.
The third area I think will see maturity is on-prem and off-prem deployments. I see a lot of repatriation, with people moving back to on-prem, although it's not as common as moving to the cloud. Being able to support hybrid deployments and private cloud deployments is another area of maturity. This includes all kinds of workloads, including AI/ML.
Bart: And with that in mind, what's next for you?
Karthik: With respect to the database, our mission is to make Postgres cloud-native. When you break it down, it means keeping everything in Postgres as it is, without removing any functionality, and making it work in a cloud-native environment. This includes making it easy to deploy and redeploy, resilient, scalable, and multi-region, with observability. We're a good chunk of the way into that journey, and our customers and users are seeing a lot of value, but there's still a lot to be done.
We're also seeing a lot of requests from folks asking us to help modernize existing applications. This can be done in two ways: one, moving the application and the data into Kubernetes; or two, moving the application into Kubernetes while keeping the data elsewhere, but with the same properties. We're building an exciting open-source tool called Voyager, which takes a traditional application and automatically figures out how to modernize it and make it cloud-native by analyzing data, query patterns, IOPS, and access patterns, and suggesting changes.
We're building Voyager as a co-pilot, and another co-pilot that sits alongside the application, analyzing access patterns and suggesting improvements. This co-pilot can identify non-cloud-native patterns and suggest better ways to write the application. We're seeing interest in this from people using mainframes, traditional databases like Oracle and SQL Server, and even Postgres and MySQL.
The third piece is innovating on the data side with GenAI, as we're seeing a lot of interest in this area.
If people want to get in touch with me, I'm fairly active on LinkedIn, somewhat active on Twitter, and I'm also active in our community Slack, which has over 10,000 members.
Bart: People are asking all sorts of interesting questions and sharing use cases, so any of those ways works perfectly. Thank you so much for your time and for sharing your knowledge with us today. We look forward to seeing you in person at KubeCon. Take care.
Karthik: Thank you for having me on KubeFM. Bart Farrell, I really enjoyed it. Thank you for all the great insight and questions. My pleasure. Take care. Cheers.