Building enterprise platforms: taming Kubernetes complexity
A deep dive into modern Kubernetes platform engineering and the challenges of managing complex deployments at scale.
In this interview, Alex Chircop, Chief Architect at Akamai Technologies, discusses:
The importance of curated platform stacks to manage the growing ecosystem of tools, with insights on selecting and integrating essential components for enterprise-grade deployments.
Best practices for upgrading Kubernetes clusters at scale, including strategies for managing API changes and testing interconnected dependencies.
How to approach multi-tenancy in modern platforms, ensuring that core components like observability, CI/CD, and security work cohesively across team boundaries.
Transcription
Bart: Who are you, what's your role, and where do you work?
Alex: Hi, my name is Alex Chircop. I'm a Chief Architect with Akamai, working on Akamai's Cloud.
Bart: What are three emerging Kubernetes tools that you're keeping an eye on?
Alex: I'm currently focused on three main areas. The first is observability, because it has huge scale implications. Some of the new versions of Prometheus, OpenTelemetry, M3, and VictoriaMetrics are particularly interesting. The second area is stateful workloads in Kubernetes. We're seeing exciting uses, such as TiKV, which allows massive scale within Kubernetes, as well as database operators like CloudNativePG, which enable complex disaster recovery and failover. Finally, I'm looking at the maturing platform engineering space, where there are numerous tools to help developers build applications more quickly and reduce the toil of managing Kubernetes platforms. We saw a huge uptick in this area during Platform Engineering Day, with keynote sessions touching on the subject of stateful workloads.
Since we've discussed running stateful workloads on Kubernetes multiple times, it seems that it's still problematic. However, it feels like we've come a long way. For a long time, there was no such thing as a truly stateless workload, as almost every application has to store state somewhere. Yet we've managed stateful workloads very differently from the way we manage applications in Kubernetes. I think now we're converging, as we don't want to have two ways of doing CI/CD, two ways of doing observability, two ways of doing logging, and two ways of doing security policies. We want to combine those two approaches. Plus, Kubernetes offers incredible scaling capabilities, with Kubernetes Operators automating complex workloads, such as scale-out databases, and advanced features like disaster recovery, multi-factor replication, and similar services.
Bart: Since you mentioned platform engineering, one of our podcast guests, Hans, compared delivering software 20 years ago to now. He mentioned that while downtime was acceptable in the past, it isn't today. Hence, building platforms on top of Kubernetes requires more tooling than ever. Is it possible to keep tools from sprawling out of control? What kind of tools are essential for building mission-critical platforms?
Alex: That's a very topical question. The reality is that the cloud-native ecosystem in the CNCF today has over 200 projects. We've added 19 new ones since KubeCon Paris just a few months ago. What users need is an application platform built on platform engineering disciplines and a curated stack. These curated stacks take out a lot of the work involved in figuring out how to implement a service mesh, security policies, logging, and observability, for example. This is something we're very focused on at Akamai, where we've just launched our Akamai Application Platform as a free tool to give you a curated stack of applications within your Kubernetes platform.
Bart: On the subject of multi-tenancy, one of our guests, Artem, shared that once you slice your Kubernetes clusters into a multi-tenant environment, you should consider multi-tenancy for every tool you install, such as logging, monitoring, and scaling. What's your advice on building Kubernetes platforms that are shared within an organization?
Alex: So I'm going to come back to platform engineering and the application platform. When people are choosing their stack and curated platform, they need to ensure that all components work together with multi-tenancy in mind. For example, when you define your users and their teams, you can integrate that into your CI/CD, GitOps, container repos, logging, observability, and secrets management. It is essential that when engineering your platforms and developing your application platform, all curated items work together in a multi-tenanted way.
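As a rough illustration of defining a team once and reusing that definition across the stack, the Python sketch below derives per-tool tenant settings from a single team record. The `Team` type, the `tenant_resources` function, and every naming convention in it are hypothetical; they are not taken from the Akamai Application Platform or any other specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    """A single source of truth for one tenant (hypothetical shape)."""
    name: str
    members: tuple


def tenant_resources(team):
    """Derive the identifiers each platform component would use for a team.

    The naming scheme is an assumption made for this sketch; the point is
    that every tool is keyed off the same team definition.
    """
    namespace = f"team-{team.name}"
    return {
        "namespace": namespace,                  # Kubernetes namespace
        "gitops_repo": f"{namespace}-gitops",    # GitOps / CI/CD repository
        "registry_project": namespace,           # container registry project
        "log_labels": {"tenant": team.name},     # logging and metrics label
        "secrets_path": f"secret/{namespace}",   # secrets management path
        "members": sorted(team.members),         # users granted access
    }


if __name__ == "__main__":
    payments = Team(name="payments", members=("bo", "al"))
    for key, value in tenant_resources(payments).items():
        print(f"{key}: {value}")
```

Because every value is computed from the same `Team` record, adding a tool to the stack means adding one more derived entry rather than maintaining a separate list of tenants per tool.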
Bart: Upgrading clusters. One of our guests, Pierre, stresses the significance of tooling and automation in managing Kubernetes clusters at scale. He and his team built tooling and developed procedures to test, manage, and upgrade hundreds of Kubernetes clusters. What is your strategy and process for upgrading a Kubernetes cluster?
Alex: Upgrading Kubernetes is hard. Every time the Kubernetes API changes, some things in your application will need changes. Large parts of the stack might need changes. To address this, I'll take it back to the application platform and platform engineering principles. You need to have an application platform that takes the toil out of testing the interconnected web of dependencies to ensure compatibility. It's not just about Kubernetes API compatibility, but also how the components in the stack work together, such as the service mesh with Keycloak and the observability platform. Having a pre-tested environment lets developers simply upgrade to the next version of the application platform, rather than figuring out the entire stack themselves.
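One small piece of that compatibility testing can be sketched in Python: before upgrading, scan the manifests in use and flag any whose API version is no longer served at the target Kubernetes release. The `Manifest` type and `blocked_by_upgrade` function are hypothetical names for this sketch, and the `REMOVED_IN` table is only a small illustrative subset of the removals listed in the Kubernetes deprecated API migration guide, not a complete list.

```python
from dataclasses import dataclass

# Illustrative subset: (apiVersion, kind) -> first Kubernetes (major, minor)
# release that no longer serves that API version.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


@dataclass(frozen=True)
class Manifest:
    """Minimal view of a deployed object (hypothetical shape)."""
    api_version: str
    kind: str
    name: str


def blocked_by_upgrade(manifests, target):
    """Return the manifests whose API version is removed at `target`.

    `target` is a (major, minor) tuple such as (1, 25).
    """
    blocked = []
    for manifest in manifests:
        removed = REMOVED_IN.get((manifest.api_version, manifest.kind))
        # Tuples compare element by element, so (1, 25) >= (1, 25) is True.
        if removed is not None and target >= removed:
            blocked.append(manifest)
    return blocked


if __name__ == "__main__":
    deployed = [
        Manifest("apps/v1", "Deployment", "web"),
        Manifest("batch/v1beta1", "CronJob", "nightly-report"),
        Manifest("autoscaling/v2beta2", "HorizontalPodAutoscaler", "web-hpa"),
    ]
    for manifest in blocked_by_upgrade(deployed, (1, 25)):
        print(f"{manifest.kind}/{manifest.name} uses removed API {manifest.api_version}")
```

A check like this covers only Kubernetes API removals. The compatibility between stack components that Alex describes still needs integration testing of the whole platform version.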
Bart: Kubernetes turned 10 years old this year. What should we expect in the next 10 years to come?
Alex: Lots of exciting things are happening. There are still lots of security challenges as Kubernetes continues to grow in scale. Tons of new technology is coming out around AI workloads, along with the integrations that need to happen in terms of scheduling, workload distribution, hardware efficiency, sustainability, and energy efficiency. Additionally, we need to figure out how to expand Kubernetes across clusters, dealing with multi-cluster sprawl and the distribution of workloads at the edge and in various environments.
Bart: What is your least favorite Kubernetes feature?
Alex: That's a trick question. I love it so much.
Bart: Well played. And even trickier, you are a highly skilled technical person, but you're also one of the nicest people I've ever met in my life. What's your secret?
Alex: What's my secret? I think it's about being open. Despite the fact that we have lots of experts in this field, I always stay humble. There's always stuff that you can learn from people in the community, people in your team, and people in your company. Always be humble and be open to learning.
Bart: How can people get in touch with you?
Alex: I am on the CNCF Slack, as well as Twitter and LinkedIn, so feel free to get in touch.
Bart: And what's next for you?
Alex: What's next for me professionally is that I'm very excited to continue working on Akamai Application Platform, where we're building some of the more complex and scalable solutions for some of Akamai's largest customers.