The future of Kubernetes: From Gateway API to AI integration
In this interview, Lior Lieberman, SRE at Google Cloud, discusses:
Emerging Kubernetes tools he's watching closely, particularly KRO for managing Kubernetes and cloud resources as cohesive units, Argo CD for GitOps workflows, and Gateway API as a vendor-neutral solution positioned to transform networking in Kubernetes
The distinction between Crossplane and Terraform for infrastructure management, highlighting how Crossplane's operator-driven approach with reconciliation loops provides "the missing piece for end-to-end developer independence" when combined with CEL validations
The evolution of service meshes and how ambient mesh addresses adoption barriers by separating functionality into L4 and L7 layers, predicting increased adoption as vendor-neutral APIs like Gateway API reduce lock-in concerns
Transcription
Bart: All right, Lior, welcome to KubeFM. First and foremost, who are you? What do you do? Where do you work?
Lior: Thanks for having me. I work for Google Cloud. I've been doing a lot of reliability work on Google Compute Engine (GCE) and Cloud Service Mesh, focusing on vendor-neutral APIs.
Bart: Now, what are three emerging Kubernetes tools that you're keeping an eye on?
Lior: I think the first one is KRO, which was recently launched. I have high expectations for it. It lets you define a resource called a ResourceGraphDefinition that manages Kubernetes resources and cloud resources as cohesive units. This can be a great improvement for developer independence, removing bottlenecks from infrastructure teams.
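To make the idea concrete, here is a minimal sketch of a KRO ResourceGraphDefinition. Field names follow KRO's early v1alpha1 examples and may differ in current releases, and all resource names are hypothetical:

```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: web-application
spec:
  # The simplified API exposed to developers.
  schema:
    apiVersion: v1alpha1
    kind: WebApplication
    spec:
      name: string
      image: string
  # The underlying resources KRO manages as one cohesive unit.
  resources:
    - id: deployment
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: ${schema.spec.name}
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: ${schema.spec.name}
          template:
            metadata:
              labels:
                app: ${schema.spec.name}
            spec:
              containers:
                - name: app
                  image: ${schema.spec.image}
```

Developers then create a simple `WebApplication` object, and KRO reconciles the full graph of underlying resources on their behalf.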
I also keep an eye on Argo CD, which is very close to my heart. Back when I worked at Riskified in 2020, we were one of the early adopters of Argo. I think it has a great user experience, and I'm watching its advancements closely.
The third technology I'm excited about is Gateway API. We did Gateway API for Mesh, and the recent Gateway API Inference Extension. I'm excited to see these pieces coming together to provide a vendor-neutral API experience for users.
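For readers new to Gateway API, the core north-south pattern pairs a Gateway (infrastructure) with routes (application teams). A minimal sketch, where the gateway class, hostname, and service names are hypothetical:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
spec:
  gatewayClassName: example-lb   # provided by the implementation
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
spec:
  parentRefs:
    - name: external-gateway
  hostnames:
    - shop.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: store-service
          port: 8080
```

Because the same resources are implemented by many gateways and meshes, the route definition stays portable across vendors.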
Bart: To touch on that further, when we interviewed 30 CNCF ambassadors last year about their least favorite Kubernetes feature, network policy management and networking in general were mentioned. What role do you think Gateway API is going to play?
Lior: I plan to keep an eye on network policies. A lot of smart folks are exploring identity-focused network authorization, likely building on Gateway API in conjunction with network policies. Gateway API is positioned at the center and could end up owning those APIs; most Gateway API implementations, both mesh and north-south, will likely implement this. Obviously, many CNIs are welcome, but going one layer above L3 gives us the opportunity to get more information about identity, whether from a TLS handshake or mTLS certificates. I do see the API positioned at the center of this in the near future.
Bart: Taking a look at a couple of different things that came up in some of our podcast episodes: Dan Garfield said that using Kubernetes as a central data store allows tools like Argo to detect and sync drift in your infrastructure. In comparison, tools like Terraform externalize their state and are harder to track. Is the market moving to tools like Crossplane to provision infrastructure and away from Terraform?
Lior: I can't tell if the market's moving or not. To the best of my knowledge, I don't think Crossplane is widely adopted yet. However, there are distinct use cases for Terraform and Crossplane.
I like Crossplane because it provides an operator-driven way to manage infrastructure as code. The reconciliation loop that comes with this pattern does not exist in Terraform out of the box, which is quite nice. In my mind, Crossplane is the missing piece for end-to-end developer independence, especially when combined with CEL validations on composition resources.
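The pattern Lior describes can be sketched as a Crossplane CompositeResourceDefinition with a CEL rule on the claim schema, so invalid infrastructure requests are rejected at admission time. The group, kinds, and the specific rule are hypothetical, and CEL validation assumes a cluster version that supports `x-kubernetes-validations`:

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xbuckets.example.org
spec:
  group: example.org
  names:
    kind: XBucket
    plural: xbuckets
  claimNames:
    kind: Bucket        # what developers request in their namespace
    plural: buckets
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                region:
                  type: string
                  # CEL guardrail: only approved regions pass admission.
                  x-kubernetes-validations:
                    - rule: "self in ['us-east-1', 'eu-west-1']"
                      message: "region must be an approved region"
              required:
                - region
```

A developer creates a `Bucket` claim alongside their application manifests, and Crossplane's reconciliation loop keeps the real cloud resource converged on that spec.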
One use case we built at Riskified was providing full independence for developers with Helm and Argo CD. We would develop basic Helm charts, and developers would combine them with their application using umbrella charts. The missing piece was how to define IAM roles, S3 buckets, Kafka topics, and other resources.
While you could use self-service methods like Slack or web pages for automation, we envisioned having IAM roles and S3 as resources. This approach lets us create a Helm chart as one cohesive unit containing everything the application needs. I believe this is a true game-changer for developer independence in organizations.
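The umbrella-chart pattern described here boils down to a `Chart.yaml` that pulls the platform team's base charts together with infrastructure pieces, so one Helm release carries everything the application needs. Chart names and the repository URL below are hypothetical:

```yaml
apiVersion: v2
name: checkout-service
version: 0.1.0
dependencies:
  # Platform-provided base chart for Deployments, Services, etc.
  - name: base-app
    version: "1.x"
    repository: https://charts.internal.example.com
  # Infrastructure the app needs, e.g. provisioned via Crossplane claims.
  - name: s3-bucket
    version: "1.x"
    repository: https://charts.internal.example.com
  - name: kafka-topic
    version: "1.x"
    repository: https://charts.internal.example.com
```

With this in place, `helm dependency update` pulls the pieces and Argo CD can sync the whole unit as a single application.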
That said, Terraform remains valid for many cases, such as building VPCs, subnets, and foundational configurations. You would likely want to combine this with drift detection tooling. We also did an interesting project with Crossplane for cluster provisioning, but we can discuss that another time.
Bart: Regarding service meshes, William Morgan explained that ambient mesh is a viable alternative to having too many sidecar containers in service meshes, but it comes at the expense of not having the pod as a single independent unit. What are your thoughts on service meshes? Should you use one or not, and when should you use them?
Lior: I think there's a delicate balance between having small enough microservices and understanding that when you have many, migration becomes painful. It reminds me of the question: How early should I bring a platform or DevOps engineer into my organization? How long should I rely on an infrastructure-aware developer?
I like service meshes and believe they provide real value when you have a decent number of microservices. They help you understand request paths through your cluster, care about security, and verify connections are mTLS encrypted. They enable identity-based authorization policies instead of IP-based ones, and provide advanced L7 routing features like canary splitting and rate limiting—all without changing application code.
In the past, service mesh adoption was hindered by vendor lock-in. However, with Gateway API and its mesh support, this concern is diminishing. Users value portability, and I anticipate increased service mesh adoption.
This isn't necessarily because sidecar-less approaches are inherently better, but because many users were deterred by the overhead and cost of deploying sidecars across their entire fleet. Ambient mesh has addressed this by breaking the problem into two layers: an L4 layer focused on zero-trust features like mTLS, authorization, and telemetry, and an L7 layer for advanced routing.
Going forward, we'll likely see more organizations adopting service meshes using vendor-neutral APIs, moving beyond traditional sidecar-based implementations.
Bart: One of the topics that's come up a lot in the last year is LLMs and AI making their way into the Kubernetes ecosystem. However, it feels like we haven't had as many concrete examples as we would like. One of our guests, Brian, noted that we're starting to see LLMs do Kubernetes configuration generation. How do you envision AI shaping Kubernetes configuration in the near future?
Lior: I've lost track of all the AI work streams going in parallel. There are a lot. DRA (Dynamic Resource Allocation) is obviously positioned at the center, enabling GPU allocation and device management, and it's expanding into networking use cases right now.
There's the Gateway API Inference Extension that just landed a few weeks ago. This is a really exciting enhancement for those with inference needs in Kubernetes, especially those serving LoRA adapters. It increases throughput by smart-routing requests to pods that already have the right adapters loaded.
I hope to see this extension evolve with more people bringing diverse use cases. In my opinion, there are tons of areas where it could provide real value. At Keep Consulting, we did a proof of concept that used AI to validate networking configuration, stress-test network policies, and understand drift, kind of like a chaos-driven experiment.
Looking forward, I envision AI/ML tools being used for right-sizing workloads, potentially replacing VPA and HPA, as well as tools for debugging and troubleshooting in Kubernetes. As an SRE, one concept I've been exploring is advanced anomaly detection. Often, we ignore 400-level errors because they don't count toward our SLOs. Imagine an AI model that could detect anomalies in those 400-level errors and identify issues introduced by backend developers rather than user problems; that would provide real value for reliability.
Bart: As Kubernetes turned 10 years old last year, looking towards the future, what do you expect to happen in the next 10 years?
Lior: I think we are going to see more infrastructure improvements, particularly in GPU allocation and resource allocation, to ensure Kubernetes continues to be the driver for these technologies. Inference is an area of focus, with many people opting for inference in Kubernetes. I anticipate training will also get a fresh look. We'll likely see more development through extensions and sub-projects, with less development in Kubernetes core, which provides a better way to iterate. Multi-cluster is another area I'm eager to see grow.
Bart: What's next for you?
Lior: I am planning to focus on network policies and authorization. I know there's been a lot of confusion among users and other ambassadors, and I believe we could provide significant UX improvements. Obviously, vendor-neutral APIs in service mesh are something I want to see growing. Many projects are already adopting them: some offer first-class Gateway API support, and projects like Linkerd and Istio Ambient support the API for some of their resources.
I also plan to keep an eye on and get more involved in AI networking-related topics, specifically AI and reliability on Kubernetes. I think there's plenty of room for improvement in this area, and I intend to remain actively involved.
Bart: How can people get in touch?
Lior: LinkedIn and Slack are the first go-to platforms. You can search by my first name and last name, and I also leave my email below. That's another good option.