Managing tool sprawl in modern platform engineering
In this interview, Nathan Goulding, Senior Vice President of Engineering at Vultr, discusses:
The launch of SLIK (Slurm in Kubernetes), an open-source tool that enables running Slurm workloads natively on Kubernetes clusters, bridging the gap between traditional HPC and cloud-native environments
The challenges of platform engineering at scale, emphasizing how the right balance of tools like Argo CD, Prometheus, and Grafana can prevent tool sprawl while maintaining development velocity
His vision for Kubernetes' next decade, predicting expansion beyond dev teams into traditional IT, with a stronger focus on organizational compliance and governance frameworks
Relevant links
Transcription
Bart: The host is Bart Farrell. The speaker is Nathan Goulding (works for Vultr).
Nathan: I'm Nathan Goulding, Senior Vice President of Engineering at Vultr. Vultr is the world's largest independent cloud provider. We deploy fundamental cloud infrastructure into 32 data centers around the world, and we offer a suite of services from cloud compute to bare metal to GPUs to our managed Vultr Kubernetes Engine. I'm excited to be here today.
Bart: What are three Kubernetes emerging tools that you're keeping an eye on?
Nathan: I'm going to discuss one technology that I'm really excited about, called SLIK, which stands for Slurm in Kubernetes. It's Apache 2.0-licensed software that we released at Vultr this past year. SLIK allows Slurm workloads to run on top of Kubernetes. It's a cloud-native suite of tools aimed particularly at AI and machine learning engineers who come from a Slurm environment, where they've been running Slurm on their bare metal infrastructure, and who want to consume it in a cloud-native way. They can take their native Slurm jobs and run them directly on top of our Vultr Kubernetes Engine or any Kubernetes platform. It's open source and available on GitHub at github.com/slik. I'm excited for you to check it out.
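For readers coming from a Slurm background, the workloads Nathan describes are ordinary batch job definitions. The sketch below shows the kind of script that could be submitted unchanged to a Slurm control plane running inside Kubernetes; the partition name, resource requests, and training command are illustrative placeholders, not taken from SLIK's documentation:

```bash
#!/bin/bash
# Illustrative Slurm batch script. The partition, resource
# requests, and command below are placeholders, not SLIK defaults.
#SBATCH --job-name=train-demo
#SBATCH --partition=gpu          # hypothetical GPU partition
#SBATCH --nodes=2
#SBATCH --gres=gpu:4             # request 4 GPUs per node
#SBATCH --time=02:00:00
#SBATCH --output=train-%j.log    # %j expands to the Slurm job ID

# Launch the workload across the allocated nodes.
srun python train.py
```

Submitting it is the usual `sbatch job.sh`; the point of a tool like SLIK is that the scheduling underneath is ultimately backed by Kubernetes rather than a bare metal Slurm cluster.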
Bart: Taking a look at some questions related to previous podcast guests, the first topic is availability and platform engineering. Our guest Hans compared delivering software now to 20 years ago. He mentioned that while downtime was acceptable in the past, it isn't today; hence, building platforms on top of Kubernetes requires more tooling than ever. Is it possible to keep tool sprawl at bay? What kind of tools are essential for building mission-critical platforms?
Nathan: That's a great question, and it reflects the maturity of the application development lifecycle and cloud delivery. Picking the right tools for the job upfront is really important, and you should not add new tools and software to your application stack without properly understanding how they fit into the overall architecture. In terms of the actual delivery of applications, Vultr is the cloud platform for platform engineering teams. When it comes to picking the right tools, it's essential to consider both the tools and software applications you use and the infrastructure you deploy them on. Choosing a limited set of mature software stacks to deploy and run applications can lead to great success in terms of what a platform team can offer its application developers. Kubernetes, as the job scheduler and container runtime environment, is the underpinning of a lot of it; however, how you integrate it into your overall CI/CD pipelines is crucial. Many people leverage Argo CD for deploying their applications and upgrading their pods and clusters in production. In terms of observability and monitoring, having a great observability stack is essential: we use Grafana Cloud for that, as well as our own internal logging and monitoring, and we use Prometheus extensively. It's really important for platform engineering teams to offer their application developers a core set of software that lets them deliver applications reliably in production. This also allows developers to experiment with emerging technologies in lower environments, giving them the freedom to try new technologies that solve specific use cases, even if those technologies are not ready to move into production.
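As a concrete illustration of the Argo CD pattern Nathan mentions, a GitOps deployment is typically declared with an `Application` resource along these lines; the repository URL, paths, and namespaces here are placeholders for illustration, not a specific setup described in the interview:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-service            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/demo-service.git  # placeholder repo
    targetRevision: main
    path: deploy/overlays/production
  destination:
    server: https://kubernetes.default.svc   # deploy into the same cluster
    namespace: demo
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert drift back to the Git-declared state
```

With `syncPolicy.automated` set, Argo CD continuously reconciles the cluster against Git, which is what makes it a natural fit for the pod and cluster upgrades in production that Nathan describes.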
Bart: Platform engineering and people. One of our guests, Ori, shared that rushing into solutions without understanding the root cause can lead to fixing symptoms instead of the actual problem. He mentioned the case of network policies and how sometimes the root cause of a problem is a people problem and the solution lies in addressing that. What is your experience with providing tooling and platforms on Kubernetes to other engineers? What are some of the soft challenges that you faced?
Nathan: What a great question. I think a lot of the problems we run into actually do boil down to organizational or people problems. An organization's technology structure tends to mirror its communication structure, and this manifests itself in many different ways. People tend to understand the technology they or their team have used themselves, and they naturally gravitate toward deploying something for themselves within their own autonomous corner of the organization, without fully understanding the broader context of the entire platform. As a platform engineering team, it is really important to have proper socialization and education around the tools provided to the entire application developer ecosystem. That way, if an issue arises, it is easier to troubleshoot and the root cause can be identified. With the proliferation of microservices, there are famous examples of large companies that have opened up their entire platform to all their developers. This can lead to a situation where the application development organization ends up with a thousand or two thousand different microservices, with a ton of duplicated code across them, because fully autonomous development teams are completely empowered to make their own decisions and roll out their applications as they see fit. This often results in a lack of optimization and difficulty in finding the root cause of a problem when there are many interconnected, dependent services with duplicate functionality. It's essential to strike the right balance between a strong platform engineering team that provides a common set of services to the entire application development organization, and a level of autonomy and empowerment that lets development teams make the decisions they feel are best for themselves.
Bart: Learning the hard way, one of our guests, Luca, decided to learn how containers work by building his own Docker in C in his spare time. What's your strategy for learning new things, and how do you keep up with all the Kubernetes tools being released daily?
Nathan: I don't recommend reinventing the wheel, given the plethora of tools available to us. At a conference a couple of years back, someone joked that a software development team found many tools that provided 90% of the capability they needed, but because none provided the last 10%, they decided to write the entire 100% themselves. This scenario happens a lot. There are environments where it is appropriate to experiment with new technologies and tools and introduce them into the existing environment. However, with the number of tools available to developers, it's essential to be deliberate about where to spend precious development resources: determine whether it really makes sense to build from scratch, or whether mature software platforms and solutions can meet enough of the need that you can integrate them and add the missing capabilities on top.
Bart: Kubernetes turned 10 years old this year. What can we expect in the next 10 years?
Nathan: Kubernetes has established itself as probably one of, if not the, fundamental underpinnings of how workloads get scheduled today. The first 10 years were defined by the proliferation of Kubernetes, the maturity of the platform, and its adoption. In the next 10 years, this maturation will continue, extending into parts of the organization where Kubernetes has historically been isolated to the application development teams. Kubernetes will extend into more traditional IT environments and other areas, introducing new requirements around compliance and governance. The platform already has a great technical underpinning, and extending it to allow controls and policies to be applied at the organizational level will enable broader adoption across the organization. This is an area of growth Kubernetes will likely see over the next decade.
Bart: With that in mind, what is your least favorite Kubernetes feature? What feature would you like to see improved or removed in the coming months or years?
Nathan: That's a tough question. I think Kubernetes is an incredibly powerful platform, and it's hard to pin down one exact thing that I would request to be removed. Every feature was developed for a very specific reason. It comes down to using the features and capabilities you actually need, without a lot of the extra that comes with them. Honestly, I think Kubernetes is a great fundamental technology, and it's really hard to pinpoint one feature that I would say is unnecessary or went in the wrong direction. Generally speaking, it has good technical governance and oversight, and a good roadmap.
Bart: What's next for you?
Nathan: At Vultr, we have a global cloud infrastructure platform operating in 32 data centers around the world. What's next for us is continuing to bring the latest technology to our customers, including the latest CPU technology from AMD and Intel, as well as GPU infrastructure from both NVIDIA and AMD. We are in a unique position as a modern hyperscaler to deliver large quantities of cloud infrastructure to application development and platform engineering teams at a speed other companies cannot match. There's a lot of work ahead of us: we operate a global platform with a huge infrastructure fleet that we manage on a day-to-day basis. We will continue to add capability to the platform, including expanded Identity and Access Management (IAM) integrated into our account model, covering how instances are managed and how our Vultr Kubernetes Engine integrates with IAM roles for fine-grained access. This will be an incredibly powerful feature for our customers. We will also continue to expand our cloud platform in the ways our customers are asking for, which will be exciting to work on for the next year or two.
Bart: How can people get in touch with you?
Nathan: Reach me at ngoulding@Vultr