Evolving Kubernetes platforms: From infrastructure as code to developer experience

Guest:

  • Arshad Sayyad

From infrastructure as code to autonomous systems: exploring the future of Kubernetes platforms.

In this interview, Arshad Sayyad, Chief Business Officer and co-founder of StackGen, discusses:

  • The evolution beyond infrastructure as code toward autonomous infrastructure using AI/ML for predictive scaling, policy engines like OPA and Kyverno, and serverless architectures to reduce operational overhead

  • How to balance standardization with flexibility when building platforms for multiple teams through hierarchical namespaces, policy as code, self-service portals, and immutable infrastructure

  • The growing challenges of multi-cloud environments and regionalization as organizations need to maintain consistent platforms across multiple cloud providers and geographic regions due to data residency requirements

Relevant links
Transcription

Bart: So, first things first: Who are you? What's your role? And where do you work?

Arshad: Hi, Bart. Thank you very much for having me on the podcast. I really enjoyed our last conversation. My name is Arshad Sayyad. I am the Chief Business Officer and a co-founder of StackGen, a generative infrastructure platform company.

Bart: Very good. Now, what are three Kubernetes emerging tools that you are keeping an eye on?

Arshad: The Kubernetes world is emerging and growing rapidly. I think one of the things we're excited about is the scale at which Kubernetes has proliferated across the enterprise. There are lots of fantastic open source tools being leveraged by the community. The one we are really excited about is Crossplane because it provides a wonderful abstraction layer for both platform engineers and developers to amplify their impact within the value chain from a cloud deployment perspective.

The other tool we've been watching, which still has more runway, is Backstage. There are similar platforms emerging in the marketplace like Cortex and others. But Backstage, from an internal developer portal perspective, is something we are watching very closely. As more platform engineering teams emerge, internal developer portals (IDPs) are going to become more important. However, we still haven't seen the hockey stick growth for a tool like Backstage.

Bart: One of our podcast guests, Brian, suggested that the current infrastructure as code paradigm has reached diminishing returns. What new approaches could revolutionize infrastructure management in Kubernetes?

Arshad: Infrastructure as code has penetrated the enterprise world quite rapidly over the last seven or eight years, whether it's Terraform, CloudFormation, or other tools. But in Kubernetes, things get complex very fast. You start to see the limitations. Consider this: you've got all these YAML files, and when you try to manage them with GitOps, figuring out the combined effect of all these configurations can be a nightmare. Plus, Kubernetes is so dynamic. People make changes outside of the CI/CD pipeline, and suddenly you've got configuration drift—it's really a headache.
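To make the "combined effect" problem concrete, here is a toy sketch (not any particular tool's implementation) of the deep merge that overlay tools such as Kustomize perform for you. Working out this result by hand across dozens of YAML files is where the nightmare begins:

```python
# Illustrative sketch: computing the effective configuration from a base
# manifest plus an environment overlay via a recursive deep merge.

def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values win, recursing into nested dicts."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"spec": {"replicas": 2, "image": "app:1.0"}}
prod_overlay = {"spec": {"replicas": 5}}

effective = deep_merge(base, prod_overlay)
print(effective["spec"])  # replicas from the overlay, image from the base
```

Multiply this by many overlays, plus changes applied outside the pipeline, and the live state quickly diverges from what anyone can reason about from the files alone.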

So what can we do about it? I think we need to move towards more of an intelligent and adaptive systems world. From a StackGen perspective, we believe we are eventually going to move to an autonomous infrastructure world in the next two to three years.

Here are some ideas: First, declarative intent with policy engines. More and more policy and risk compliance has to move to the left. Instead of writing scripts that say how to do something, we should just declare what we want the state to be. Tools like OPA or Kyverno can then enforce those policies automatically. Within our platform, we actually enforce over 300 risk compliance policies automatically. Imagine saying all deployments must have resource limits, and the system just makes it happen—no more manual checking.
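To illustrate the kind of rule a policy engine like Kyverno or OPA enforces automatically, here is a simplified sketch in plain Python (real policies are written in the engine's own language, e.g. Rego or Kyverno YAML; the deployment shape below follows the Kubernetes Deployment structure, abbreviated):

```python
# Sketch of a "all deployments must declare resource limits" check,
# the sort of declarative rule a policy engine evaluates at admission time.

def violations(deployment: dict) -> list:
    """Return the names of containers that declare no resource limits."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    return [
        c.get("name", "<unnamed>")
        for c in containers
        if not c.get("resources", {}).get("limits")
    ]

deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "resources": {"limits": {"cpu": "500m"}}},
        {"name": "sidecar"},  # no limits declared: a violation
    ]}}}
}
print(violations(deployment))  # ['sidecar']
```

The point of moving this left is that the rule runs on every deployment automatically, instead of depending on a human remembering to check.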

The second idea involves AI/ML for predictive scaling and self-healing. Kubernetes generates a ton of data. Why not use AI/ML to analyze that data and predict when we'll need more resources? Or better yet, let the system automatically adjust resources before we even notice a problem. We're a little away from that becoming a reality, but we're heading in that direction quickly. Self-healing is another key area—if a pod crashes, the system should automatically know how to recover.
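As a toy illustration of predictive scaling, the sketch below extrapolates recent CPU usage with a least-squares line and sizes replicas before the trend crosses capacity. Real systems (VPA, KEDA, or custom ML pipelines) are far more sophisticated; the per-pod capacity figure here is a made-up assumption:

```python
import math

def predict_next(samples: list) -> float:
    """Fit a least-squares line over the samples and evaluate one step ahead."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope * n + intercept

def replicas_for(cpu_millicores: float, per_pod: float = 500.0) -> int:
    """Round up to the number of pods needed at an assumed 500m-per-pod capacity."""
    return max(1, math.ceil(cpu_millicores / per_pod))

usage = [800, 900, 1000, 1100, 1200]  # steadily climbing CPU (millicores)
print(replicas_for(predict_next(usage)))  # scale to 3 pods before hitting 1300m
```

The same pattern underlies self-healing: the system observes a trend or a failure signal and acts on it before an operator is paged.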

The third area is service mesh integration. Service meshes like Istio and Linkerd are becoming essential for managing microservices. We should treat their configurations—routing, security policies, traffic management—as code. This gives us better control and visibility.

Kubernetes is designed to be extensible. We can create Custom Resource Definitions and Operators to automate application-specific tasks. Have a complex database deployment? Build an operator to handle backups, failovers, and upgrades.
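At the heart of any operator is a reconcile loop: observe the current state, compare it with the declared spec, and take actions to close the gap. The sketch below captures that shape for the hypothetical database example; the field names and actions are illustrative stand-ins, not a real operator API:

```python
# Minimal sketch of an operator's reconcile step for a database resource:
# diff desired vs. observed state and emit the actions needed to converge.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    if observed.get("replicas", 0) != desired["replicas"]:
        actions.append(f"scale to {desired['replicas']} replicas")
    if observed.get("version") != desired["version"]:
        actions.append(f"upgrade to {desired['version']}")
    if not observed.get("backup_recent", False):
        actions.append("trigger backup")
    return actions

desired = {"replicas": 3, "version": "14.2"}
observed = {"replicas": 3, "version": "13.8", "backup_recent": False}
print(reconcile(desired, observed))  # ['upgrade to 14.2', 'trigger backup']
```

A real operator runs this loop continuously against the cluster API, which is what lets it handle backups, failovers, and upgrades without human intervention.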

Finally, serverless architectures. Today, serverless is on the edge of most enterprise architectures, but it's starting to proliferate. Tools like StackGen and Knative are taking things further by abstracting away the underlying infrastructure. You should just focus on your functions and services, and the platform handles the rest. This can significantly reduce operational overhead.

Bart: One of our previous podcast guests stressed that standardizing everything makes cluster management easier. What's your advice for building platforms that several teams can use in an organization?

Arshad: This is an interesting topic. I think there will always be two schools of thought. One school of thought would be that standardization makes things more templatized, slower, and may not provide agility—"let me do what I want to do." Typically, that's what most teams think about as they're going through growing pains.

But standardization is crucial when building a platform for multiple teams. I've always believed that standardization helps you gain more velocity in a multi-team environment. You must make things consistent and manageable while not stifling innovation. It's a balancing act.

Here are some ways to do that:

  1. Hierarchical namespaces: Kubernetes namespaces are great for isolating teams, but they're flat and can become hard to manage at scale. A hierarchy helps you create a namespace structure where you can delegate control to teams while maintaining policy control at the top level.
  2. Policy as code: Centralized policies are essential but shouldn't be a bottleneck. Tools like Gatekeeper and StackGen let you define policies as code and allow teams to request exceptions through pull requests.
  3. Self-service portals: These are becoming more prevalent. Using Internal Developer Platforms (IDPs) like Backstage or orchestration layers like Crossplane can reduce the platform team's burden and empower developers.

  4. Observability and cost management: Everyone recognizes the importance of monitoring and logging tools like Prometheus and Grafana. Ensure each team has access to dashboards and alerts, enabling self-service and cost control. This provides teams visibility into their resource usage for application optimization.

  5. Immutable infrastructure: Have teams use containers and GitOps to deploy applications, or use an automated value chain like StackGen to deploy more consistently and reduce configuration drift. One of the biggest challenges for CIOs is the mismatch between runtime cloud configurations and expected cloud configurations, which often causes production support issues.
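The drift problem in that last point can be made concrete with a small sketch: diff the expected (Git-declared) configuration against what is actually running. The keys and values below are illustrative:

```python
# Sketch of a configuration-drift check: walk both configs and report every
# path where the runtime state diverges from the declared state.

def drift(expected: dict, runtime: dict, prefix: str = "") -> list:
    """List every path where the runtime config diverges from the expected one."""
    diffs = []
    for key in expected.keys() | runtime.keys():
        path = f"{prefix}{key}"
        exp, run = expected.get(key), runtime.get(key)
        if isinstance(exp, dict) and isinstance(run, dict):
            diffs.extend(drift(exp, run, prefix=path + "."))
        elif exp != run:
            diffs.append(f"{path}: expected {exp!r}, found {run!r}")
    return diffs

expected = {"image": "app:1.4.2", "env": {"LOG_LEVEL": "info"}}
runtime = {"image": "app:1.4.2", "env": {"LOG_LEVEL": "debug"}}  # hotfixed by hand
print(drift(expected, runtime))
```

GitOps controllers such as Argo CD run this kind of comparison continuously and can either alert on the divergence or revert it automatically.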
Ultimately, you want to build a platform that's standardized yet flexible. Give teams the tools and autonomy, but provide guardrails and best practices to ensure stability and security.

Bart: You mentioned GitOps, and one of our guests, Hans, argues that GitOps is an excellent building block for building platforms with great developer experience. He highlighted the ability to merge, review, and discuss code changes and PRs, and the additional benefit of not granting permissions. Should all platforms use GitOps? What's your experience here?

Arshad: GitOps is fantastic. Most clients are starting to leverage GitOps significantly, both upstream and downstream. We integrate with GitOps, which provides an audit trail of all changes, allows review and discussion in pull requests, and eliminates the need to give everyone direct cluster access. These are huge wins that improve security and make rollbacks easier.

But should every platform use it? That's a more nuanced question. The pros of GitOps include great auditability, version control, and ease of rollbacks. By using it as a single source of truth, you avoid giving developers direct cluster access. However, the cons are that not all infrastructure changes fit neatly into a Git workflow. What about emergencies or quick rollouts? Legacy systems that aren't designed for GitOps pose challenges.

Most enterprises between $2 billion to $15 billion have about 30% to 40% legacy infrastructure. Organizations larger than that typically have 50% to 60% legacy systems that need to make changes quickly and directly. GitOps may not be the best approach in these cases. But for cloud-native workloads, it's a game-changer that significantly reduces operational toil.

Here are three suggestions:

  1. Hybridize edge cases: Use escape hatches like Argo CD sync hooks or direct kubectl apply for situations requiring imperative commands—such as legacy systems or ephemeral jobs that don't fit the declarative model.
  2. Security hardening: GitOps doesn't automatically secure your platform. Require signed commits using tools like Cosign, set up automated drift detection (for example, via Argo CD's diff-based comparison against the desired state), use SSO, and implement pull request approvals.
  3. Handle critical emergencies: In true emergencies, teams may not have time for pull requests. Create pre-approved break-glass roles for direct changes in critical situations. Use Vault or similar tools to manage temporary credentials for these roles.
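For the break-glass suggestion, the essential property is that emergency access is pre-approved but short-lived. The sketch below shows that idea with a time-boxed token; all names are hypothetical, and a real system (Vault or similar) would also sign, scope, and audit the grant:

```python
from dataclasses import dataclass

@dataclass
class BreakGlassToken:
    """A pre-approved emergency credential that expires on its own."""
    role: str
    issued_at: float   # epoch seconds when the grant was issued
    ttl_seconds: int   # how long the grant remains usable

    def is_valid(self, now: float) -> bool:
        return now < self.issued_at + self.ttl_seconds

# A 15-minute emergency-admin grant issued at t=0.
token = BreakGlassToken(role="emergency-admin", issued_at=0.0, ttl_seconds=900)
print(token.is_valid(now=600.0))   # True: within the 15-minute window
print(token.is_valid(now=1000.0))  # False: the grant has expired
```

Because the credential expires by construction, the emergency path never becomes a permanent bypass of the pull request workflow.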

Overall, GitOps can cut down operational work by about 40% or more, especially in cloud-native environments. It does require a cultural shift, but teams that embrace the pull request workflow can save significant time and effort.

Bart: In this paradigm of platform engineering, we're hearing more about thinking of platform as a product, where those building it are serving customers who are developers. How do you see this, particularly with the ideas of developer experience versus platform engineering? Are they one and the same? What are the folks building these platforms thinking about platform as a product, and who are they there to serve?

Arshad: I think this particular value chain is evolving quite a bit. Empirically, we believe there are probably four types of customer profiles:

The first is very large enterprises north of $15-$20 billion, with large engineering teams, DevOps teams, and platform engineering teams. For them, development engineering talent is a critical enabler and significant competitive advantage. They want to provide velocity, agility, and self-service, clearing the runway for their developers. The platform engineering team focuses on developer experience because their engineering talent is extremely valuable in the marketplace.

The second demographic includes firms between $2 billion to $15 billion that are still evolving—maybe 30-50% in the cloud, without all tooling in place. They typically have an engineering and DevOps team. Platform engineering is something they're considering, but they would ideally like to automate without significant investment.

Companies below $2 billion typically have thin infrastructure teams and are not very evolved from a DevOps or platform perspective. Frankly, they don't need to be, as they're using cloud platforms like Salesforce and NetSuite, primarily for standard applications.

The fourth demographic is digital native companies growing rapidly. They want to leapfrog DevOps and platform engineering, aiming for a no-ops world. Their goal is to shift power to developers and provide a stellar experience where developers can build, test, and deploy in minutes.

Platform engineering is prominent in large organizations, becomes more automated in mid-market enterprises, and digital native companies seek to automate the entire value chain through the latest tools and platforms, avoiding direct investment in infrastructure.

Bart: Arshad, what's next for you?

Arshad: We are excited about our journey. We are seeing tremendous growth with StackGen and scaling on multiple fronts. I'd like to share two things from a market perspective:

First, more clients are multi-cloud, and they need a platform to move workloads quickly between different clouds, depending on optionality, cost, agility, or available AI infrastructure. This trend will continue to grow.

Second, we're seeing significant regionalization. Most countries now require data to be locally resident. For a software or SaaS platform provider, imagine maintaining a platform in 8, 10, or 16 regions. One of our large clients must keep their platform available in 16 regions concurrently, harmonized and consistent across two or three different cloud providers like AWS, GCP, and Azure. It's a complex challenge.

Fortunately, we can help our customers automate most of this process significantly. This is something to watch in the marketplace, given the current geopolitical environment and emerging data privacy laws. Platform engineers must start thinking about these considerations.

Bart: And if people want to get in touch with you, what's the best way to do that?

Arshad: People can email me at [email protected] or shoot me a message on LinkedIn or Twitter. I'm happy to connect with anybody. Thank you.

Bart: Fantastic. Arshad, thank you so much for your time today.

Arshad: Thanks a lot, Bart. Have a good day. Goodbye.

Podcast episodes mentioned in this interview