Platform engineering challenges: balancing simplicity and autonomy in Kubernetes

Guest:

  • Roland Barcia

This interview explores how Kubernetes is evolving to support modern workloads while addressing platform engineering challenges.

In this interview, Roland Barcia, Director at AWS leading the specialist technology team, discusses:

  • How emerging tools like Karpenter and Argo CD are adapting to support diverse workloads from LLMs to data processing

  • The balance between platform standardization and team autonomy in Kubernetes environments

  • The future of Kubernetes and its evolution to support new workloads like LLMs and stateful applications

Transcription

Bart: First question: who are you, what is your role, and who do you work for?

Roland: Hello, my name is Roland Barcia. I am a director at Amazon Web Services (AWS) and lead our specialist technology team, which includes specialists who help customers adopt containers and Kubernetes for various workloads.

Bart: What are three Kubernetes emerging tools that you are keeping an eye on?

Roland: At this conference, there are a lot of CNCF projects on display. One is Karpenter, which I think is an exciting project, especially since it just became part of the Kubernetes autoscaling SIG. With so many diverse workloads, such as agents, LLM agents, data processing, and stateful workloads, a dynamic scaler that can pick different instance types (GPU instances, transaction-oriented instances) arrives at an opportune time. Another project I'm watching is Argo CD. It has been around for a while, but the expansion of that platform, including Argo Workflows and Argo Events, and the way it has grown into an orchestration platform beyond just DevOps and GitOps, is notable. I'm also watching what emerges around MLOps tooling, because I think that space is still in its early days, especially for MLOps and LLMs on Kubernetes.
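Roland's point about a dynamic scaler picking instance types per workload can be sketched in a few lines. This is a hedged illustration of the idea only, not Karpenter's actual API: the instance catalog, names, and prices below are made up.

```python
# Hypothetical sketch of the instance-selection idea behind a dynamic
# scaler like Karpenter: given a workload's resource needs, pick the
# cheapest instance type that satisfies them. Catalog and prices are
# illustrative, not real AWS instance types.
from dataclasses import dataclass


@dataclass
class InstanceType:
    name: str
    cpus: int
    memory_gib: int
    gpus: int
    hourly_cost: float


CATALOG = [
    InstanceType("general-small", 2, 8, 0, 0.10),
    InstanceType("general-large", 8, 32, 0, 0.40),
    InstanceType("gpu-large", 16, 64, 1, 1.20),
]


def pick_instance(cpus: int, memory_gib: int, gpus: int = 0) -> str:
    """Return the cheapest instance type that fits the requested resources."""
    candidates = [
        i for i in CATALOG
        if i.cpus >= cpus and i.memory_gib >= memory_gib and i.gpus >= gpus
    ]
    if not candidates:
        raise ValueError("no instance type satisfies the request")
    return min(candidates, key=lambda i: i.hourly_cost).name


# A stateless microservice fits a small general-purpose node,
# while an LLM inference pod needs a GPU node.
print(pick_instance(cpus=1, memory_gib=4))           # general-small
print(pick_instance(cpus=8, memory_gib=32, gpus=1))  # gpu-large
```

The real scheduler-and-provisioner interaction is far richer (spot pricing, consolidation, topology), but the core fit-then-optimize loop is the part that makes diverse workloads like LLMs and data processing practical on one cluster.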

Bart: Now we're going to get into some questions based on podcast episodes we've had. One of our guests, when discussing platform engineering and people, highlighted that the root cause of a problem is sometimes a people issue, not a technical one. What challenges have you faced when providing Kubernetes tools and platforms to engineers, and how did you address them?

Roland: I saw that interview; he got very quickly into Network Policy and how different workloads had different behaviors and effects. The root issue is a people problem, but it's also a platform problem. Kubernetes is a highly configurable platform, designed to support a wide range of workloads and run in many environments, in the cloud or on-premises, and that abstraction comes at a cost. The same configurability that supports different workloads can create friction between the platform teams building the tools and the development or data science teams consuming them. Where you draw the line of abstraction matters. With cluster-as-a-service, giving workload teams more control may require them to maintain their own tools, which means a lot of autonomy. Many developers and data scientists like that autonomy and those tools, but it can make monitoring difficult. At the other end of the spectrum, code-as-a-service or portals like Backstage abstract the platform away completely. That is a huge range, and building a product across it is challenging, especially as more workloads land on the platform. For example, I have a customer in Israel that runs many standard microservices on a default cluster, while another team building a streaming platform with Kafka and data analytics requires a full cluster of its own. My best practice is to ship good defaults but also provide escape hatches for autonomy.
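The "good defaults with escape hatches" pattern can be sketched as platform-wide defaults that a workload team selectively overrides. The keys and values below are purely illustrative, not any real platform's schema.

```python
# Minimal sketch of "good defaults with escape hatches": the platform
# team ships sensible defaults; workload teams override only what they
# need. All keys and values here are hypothetical.
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "cpu_limit": "500m",
    "network_policy": "restricted",
}


def render_config(team_overrides: dict) -> dict:
    """Merge a team's overrides over the platform defaults."""
    return {**PLATFORM_DEFAULTS, **team_overrides}


# A standard microservice takes the platform's settings as-is...
print(render_config({}))
# ...while a Kafka/streaming team uses the escape hatch for more capacity,
# yet still inherits the restricted network policy.
print(render_config({"cpu_limit": "4", "replicas": 6}))
```

The important property is that the escape hatch is scoped: teams that need autonomy get it per setting, while unspecified settings keep the platform's guardrails.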

Bart: When it comes to the topic of using boring, simple tech, one of our guests, Martin Clausen, prefers simplicity and fewer abstractions in his Kubernetes setup to reduce cognitive load. What's your approach to maintaining tools in your Kubernetes cluster? Do you prefer a minimalist approach or a more expansive one?

Roland: It really depends on how big your platform team is. Teams with a lot of default tools and standards typically have large platform engineering teams; the smaller your platform engineering team, the harder it is to provide that. It comes back to my previous answer: you probably want to be as boring as possible for well-understood workloads. For emerging patterns like LLM agents or stateful processing, you might want to allow some autonomy, perhaps by providing a sandbox where tools can prove themselves. That allows for experimentation, and as workloads graduate to the production platform, you want to prioritize stability, because you want to avoid outages and ensure things work.

Bart: In terms of learning by doing, one of our guests, Mathias, emphasized that hands-on experience is key to learning Kubernetes. What's your strategy for learning new Kubernetes tools and features?

Roland: I really like the CNCF certifications because they emphasize hands-on tests, as opposed to other types of certifications, which tend to be multiple choice. With a Kubernetes tool, the best approach is to spin up a cluster, install it, and run through the default use cases. I ask myself why the tool was built and what problem it solves, and work through that. Then I think of a problem in my own domain and try to apply the tool to it, because you truly learn by applying solutions to specific problems. Integrating different tools is another way to learn. Sometimes you adopt one tool, like Kyverno for policy, then realize you need policy plus scaling, and you also have a new workflow tool, so you explore how they work together. You can learn a lot that way. So integrating tools and getting hands-on with the default use cases are the two methods I use most.

Bart: Kubernetes turned 10 years old earlier this year. What should we expect in the next 10 years?

Roland: If you had asked that question three years ago, nobody would have answered with things like LLMs, generative AI, and prompt engineering. I think we're going to see Kubernetes mature and become more stable at the cluster level. Many problems at the cluster and node level, such as those addressed by Karpenter, will be solved and become easier to manage, providing the abstraction needed for different types of workloads. The key to Kubernetes staying relevant is its ability to evolve to support new workloads, especially now that it's not just microservices or stateless applications but also stateful applications. If Kubernetes can keep evolving to support that variety, it will continue to be relevant.

Bart: Bonus question. What's your least favorite Kubernetes feature or feature that you would like to see improved in the future?

Roland: I don't know if it's a feature so much as how quickly I need to upgrade; the sheer number of upgrades you go through, and the changes between versions, have been a problem. This has been happening a lot recently, and solving the upgrade problem would be beneficial. I also think things like CRDs are overly complex, and I hope they become less verbose over time.

Bart: What's next for you?

Roland: Next week I'm flying out to some customers who are building around large language models and deploying them, and I'll be helping them build applications in that area.

Bart: And with re:Invent coming up, can we expect a lot of conversations around LLMs, MLOps, and the stateful workloads you mentioned earlier?

Roland: Go online and check out the schedule; there will be a lot of sessions at re:Invent on managed Kubernetes (EKS) covering these use cases.

Bart: How can people get in touch with you?

Roland: LinkedIn. I have my LinkedIn handle, which is roland-barcia-aws. Just reach out to me. I'm pretty responsive there.

Podcast episodes mentioned in this interview