Platform engineering: learning from the Kubernetes API
This episode is sponsored by Sysdig — 5 Steps to Securing Kubernetes
In this KubeFM episode, Hans, a Principal Cloud engineer, shares his experiences empowering teams to use, build and manage platforms built on Kubernetes.
You will learn:
How OpenTelemetry and Prometheus shape cluster management and observability.
The role of tools like ArgoCD and Flux in enabling GitOps and streamlining deployment processes.
The significance of governance tools such as Gatekeeper and OPA for secure and validated resource creation.
The benefits of Custom Resource Definitions (CRDs) and operators in automating processes and enhancing the developer experience.
Transcription
Bart: In this episode of KubeFM, I get a chance to speak to Hans, a seasoned site reliability engineer. Hans shared his journey with me about OpenTelemetry and Prometheus, discussing how these tools have shaped his approach to cluster management and observability. We explore the concept of shifting down in platform engineering, highlighting how Kubernetes facilitates self-service infrastructure through a declarative API and fault tolerance. Hans sheds light on essential tools like ArgoCD and Flux, which enable GitOps and streamlined deployment processes. He emphasizes the importance of governance with tools like Gatekeeper and OPA to ensure secure and validated resource creation. We also touch on the power of custom resource definitions (CRDs) and operators in automating processes and enhancing the developer experience. Finally, Hans offers insights into emerging tools such as eBPF-based observability and VictoriaMetrics. Check out this episode for a comprehensive look at simplifying Kubernetes complexities and achieving scalable, reliable software delivery. We know that Kubernetes has become the cloud's de facto operating system, transforming application packaging with microservices. Yet at the same time, complexities often delay crucial security measures until production. With 70% of containers lasting five minutes or less, rapid detection of anomalies is vital. Time is critical for DevOps teams. Get Sysdig's 5 Steps to Securing Kubernetes checklist and learn how to be ready to tackle challenges at cloud speed. Now, let's take a look at the episode. All right, Hans, welcome to the KubeFM podcast. I want to start by asking you, what are three emerging Kubernetes tools that you're keeping an eye on?
Hans: That's a great question. Thanks for having me. I focus a lot on observability and how we build applications for Kubernetes. The three areas I'm focused on are, first, eBPF-based tooling, like Groundcover and Retina, which helps us get instrumentation without needing much effort on our side or the application side. Second, tools like Kapp and related tooling around abstracting applications themselves, building a better language for that. Third, and this one isn't terribly new, it's been out for a while but is slowly gaining traction: tooling around VictoriaMetrics and Prometheus alternatives, and how we think about metric collection and cardinality in general.
Bart: Okay. One of our guests talked about how she preferred OTEL over Prometheus. What's your experience been there?
Hans: So I think I agree. I really like OTEL. I like that it is not just metrics. Prometheus does metrics really well, but it also is kind of a single box and sharding and all of that gets really complex. And OTEL kind of makes that a data pipeline. Kelsey Hightower, I once listened to him talk about specifically managing Kubernetes, managing telemetry data. He said, once you start thinking about your stuff as a data pipeline, it will start making a lot more sense because then you're just moving data from here, transforming it, and putting it over here. And so all of a sudden, using OTEL to do that, it makes it far easier, far simpler, and you can have the same pipeline for all of your telemetry data.
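As a rough illustration of the "telemetry as a data pipeline" idea Hans describes, here is a minimal Python sketch: signals are received, passed through a chain of processors, and collected for export. The `Signal` shape, the processor names, and the `prod-1` cluster name are all hypothetical; this is not the OpenTelemetry Collector API, only the receive, transform, export pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Signal:
    """One piece of telemetry. The shape is hypothetical, not OTLP."""
    kind: str  # "metric", "log", or "trace"
    name: str
    attributes: dict = field(default_factory=dict)


Processor = Callable[[Signal], Optional[Signal]]


def add_cluster(cluster: str) -> Processor:
    """Build a processor that stamps every signal with its cluster name."""
    def process(signal: Signal) -> Optional[Signal]:
        signal.attributes["cluster"] = cluster
        return signal
    return process


def drop_attribute(key: str) -> Processor:
    """Build a processor that removes a high-cardinality attribute."""
    def process(signal: Signal) -> Optional[Signal]:
        signal.attributes.pop(key, None)
        return signal
    return process


def run_pipeline(signals: Iterable[Signal], processors: List[Processor]) -> List[Signal]:
    """Move each signal through every processor, then collect it for export."""
    exported = []
    for signal in signals:
        current: Optional[Signal] = signal
        for processor in processors:
            current = processor(current)
            if current is None:  # a processor may drop a signal entirely
                break
        if current is not None:
            exported.append(current)
    return exported


received = [
    Signal("metric", "http_requests_total", {"pod": "web-7f9c", "code": "200"}),
    Signal("log", "request handled", {"pod": "web-7f9c"}),
]
exported = run_pipeline(received, [add_cluster("prod-1"), drop_attribute("pod")])
```

The same list of processors handles metrics and logs alike, which is the point being made: one pipeline for all telemetry rather than one system per signal type.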
Bart: In terms of context, can you tell us a little bit about what you do? We know about the tools you're interested in, but how do you apply them? What kind of stuff are you working on?
Hans: Sure. I'm actually a staff-level SRE, helping build Kubernetes platforms for teams. I've worked at both large and small companies, spending most of my day improving how we manage Kubernetes clusters. A new and emerging area is managing multiple clusters, sometimes 50 or 100, as needed. I focus on observing what our Kubernetes clusters and services are doing, working on production engineering, thinking about how we release our services safely and quickly, and determining the right level of abstraction for our developers. I've done this for companies including banks, satellite companies, healthcare, credit cards, and various smaller niche startups.
Bart: You've really touched a lot of different areas in that regard. What about your experience getting into cloud native? Tell me about that.
Hans: I discovered that I was by nature a lazy person. I started off as a tier three help desk for big ISPs and telcos doing business stuff. I moved into a small software company as a sysadmin managing their data center, and they had a colo. We racked and stacked servers. I managed all their networking. I built their servers, handled the imaging, and all of this. This was back in 2014. Clouds had started to take off, but they were the furthest thing from what we were doing. We were still using PXE boot to manage all of our stuff. I didn't want to have to do things more than once, and the way to solve that in the computer world is to learn how to program and learn how to script. You take something that was me sitting there clicking through a UI to install software or an operating system and turn it into: cool, how do we do a PXE boot across the network? How do we start setting up specific drives without having to manage hosts directly? Things like that. Having spent a year or two building out this private cloud in the colo, it was pretty easy to step into the cloud world and have a much better understanding of the challenges, benefits, and pitfalls of moving to the cloud. I had built this private cloud for the company on Proxmox, on our own hardware, and that gave me a ton of great insight. I then used all those programming skills to never have to touch a server again.
Bart: It's the first time we've heard someone approach this from the idea of being lazy, but I really like how you framed that. How do you work yourself out of a job? In terms of the Kubernetes ecosystem, now that you're working on Kubernetes platforms, how do you stay up to date with all the changes that are going on constantly? What works best for you?
Hans: I mean, sometimes I think we give the front-end JavaScript world a run for their money on how many updates we ship and how many new tools. I tend to read a lot and listen to podcasts, and I tend to be able to read while I'm doing other things. I have a couple of recommendations. There's Platform Engineering, which does a weekly one, and there's DevOps Weekly; both help out. There's also a Substack, or a newsletter, that goes out from... Quick Early about excellent engineering and what that looks like, which helps me keep track. I also spend time listening to talks from re:Invent or KubeCon and things like that. But the honest answer is no one can keep up with all of it. So, yeah.
Bart: I agree. Still, finding those go-to resources and what works best for you is important. I really liked the JavaScript example. That's something that had been mentioned before. I think it's a worthy example. It's worth keeping in mind. If you could go back in time and give one piece of advice or a career tip to your younger self, what would it be?
Hans: I think taking more risks. When you're a junior engineer and just starting off, it can be hard to feel empowered to take risks on what you're proposing, what you're solving, and how you're doing it. If I could go back and talk to myself in 2014 or 2011, I would say, take more risks—not necessarily in who you work for, but in the projects you take on or how you approach those projects. Don't get stuck in thinking you have to do it a certain way because someone in the company said so or because it's always been done that way. Take the time to see how you can step out and improve. Do this with some caution, obviously, but in general, junior engineers, myself included, tend to focus on how to get things done within the narrow context given to them versus solving the problem in the best way. This could be in a way that will scale and perhaps be new and inventive, which can be applied to other resources or areas as well.
Bart: As part of our monthly content discovery, we found an article you wrote titled "Platform Engineering: Learning from the Kubernetes API." We want to dive into that a little bit deeper. You've been working in cloud and infrastructure for over 10 years. You mentioned going back to 2014. Walk us through how things have changed since you started.
Hans: When I first started back in 2011 and began doing stuff even before I actually got to put hands on the keyboard in the corporate world, we were beginning to see cloud take off. And we saw none of the tools that we have today. Kubernetes would not be released for another few years. We had almost none of the AWS tools. AWS offered EC2, S3, and RDS as primary services. All the rest of it was barely starting to flesh out. All of the tools that came next were about how to properly abstract things we were doing on EC2. How do we abstract things we were doing in RDS and S3? For a long time, we had many conversations about shifting left. There was a whole DevOps revolution about how to shift things left in the CI/CD pipeline, to move them closer to the beginning so that we learn about problems earlier in the process. We empowered developers to make changes. That went really well. It obviously abstracted a lot of things. We became a lot more productive. But towards the end of the 2010s and into the beginning of this decade, we saw that it doesn't work all the time and adds a lot of complexity for developers who have to interact with all of these things. Google's Richard Seroter put out an article mid last year that was all about how shifting left is for suckers, shift down instead. It described a new paradigm, and I really latched onto that because it described a similar challenge I had encountered multiple times over the last decade. Developers don't need more complexity. Developers care about taking their code from their laptop to production as quickly and safely as possible. They don't want to manage ports on a Kubernetes service. They don't want to manage how big their RDS instance is or how much memory they need to assign. They don't want to manage those things if they don't have to. There are times when you have to, but if they don't have to, they don't want to. 
Shifting down versus shifting left acknowledges that and describes a model where, instead of taking all of the customization options possible in the cloud, which are infinite and vast, and just moving them left in the process, it says: okay, how do we give you an abstract, opinionated version of this that allows you to quickly iterate? You should still be able to tweak knobs and turn things when you need to, but 95% of the time, you don't need to. You can just ship an app. We've seen this take other forms. People sometimes call this a golden path or a paved path approach, where a platform team says this is how you do things most of the time. Being able to abstract what you're doing and raise the layer of abstraction so that people don't have to think about it is the principle of shifting down. The primary change was, how do we go from everyone gets everything late in the process to now we get everything early in the process? That was shift left, finding problems early. Now we're moving to, how do we abstract this so that you don't have to think about it if you don't want to?
Bart: The industry is moving, as a general trend, more and more towards self-serve. We see that as a way for companies to create fewer dependencies between teams. Even roles such as SREs blend the skills of developers and operations even more. How does shifting down help with self-service?
Hans: Every developer and virtually every company I've ever talked to has been like, "Why can't our clouds just be like Heroku? Why can't we just use Heroku? All I want to do is put my code in a thing and have it serve and scale." Heroku doesn't scale the way you necessarily need it to in an enterprise environment, but there's something there that says, "Hey, we want to be able to have the smallest amount of scaffolding and extra stuff to be able to ship our code." This is one of the reasons why I think Kubernetes has been really successful. The Kubernetes API allows you to provide a common interface. Even inside Kubernetes, we've built these abstractions. We take a pod into a replica set or a stateful set, and then a replica set into a deployment, and then a deployment into something higher that you're abstracting as a platform team. From an SRE perspective, for example, we're blending developers and operations where we have an understanding of what developers want to achieve, what their developer experience concerns are, and how to implement those in the Kubernetes API or any API in an effective way. How do we deliver a Heroku-type model for engineers so that they can ship quickly, safely, and efficiently? Because no engineer wants to get stuck in a security review process or an architecture discussion about how many pods to run or what metrics to scale off of or things like that. This is, again, like I talked about in the article, where the Kubernetes API really shines. It lets you take all of these abstractions that we have, a service and a deployment and all of these custom operator resources. Then you can write your own operator. Or you can use something like Crossplane to abstract and build out. Or you can use something like Kapp to build these out and say, "Hey, I want to deploy a version of this microservice." 
Then something else does the interpolation and the logic behind how we actually ship that and reconciles it so that it's always there, so that it doesn't disappear on you just because you deleted a pod, and makes intelligent decisions and has correct defaults and all of that. Shifting down should help you let developers make informed choices, but make informed choices quickly. That is the primary goal there.
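To make the "shift down" idea concrete, here is a minimal Python sketch of an opinionated abstraction: a developer supplies a tiny app spec, and a platform-owned function expands it into a Deployment and a Service with defaults filled in. The `expand_app` function, the input fields, the default values, and the image name are all hypothetical; real platforms would do this with an operator, Crossplane, or a chart rather than a standalone function.

```python
def expand_app(app: dict) -> list:
    """Expand a small, opinionated app spec into Kubernetes-style manifests."""
    name = app["name"]
    port = app.get("port", 8080)       # assumed platform default
    replicas = app.get("replicas", 2)  # assumed platform default
    labels = {"app": name}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": app["image"],
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": port}]},
    }
    return [deployment, service]


# The developer only states what they care about: a name and an image.
manifests = expand_app({
    "name": "checkout",
    "image": "registry.example.com/checkout:1.4.2",
})
```

The knobs are still there (`port`, `replicas`), but the developer only touches them when the defaults do not fit.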
Bart: Considering that Kubernetes isn't exactly the easiest platform to learn and something that a lot of developers would probably rather just avoid entirely, how can Kubernetes help with self-service then? What's the role of Kubernetes here?
Hans: Kubernetes is actually both really, really simple and really complex. It's simple in the sense of, hey, I just want to run this container: if you're using it, you say, give me this pod, and it runs and works. But from an administration standpoint, or from the standpoint of how do I define that container and make all these choices, it becomes, like you mentioned, really complex really quickly. And so there are a couple of things that become really important. When we think about the Kubernetes API, the first thing we think about is that it's declarative. I, as a developer, don't have to say, first do this, then do this, then do that. I can simply say, here's my deployment, run it. And Kubernetes will take care of that. And if I want to change something in my deployment spec, I can go change that. The same is true for other custom resources that we're interacting with in Kubernetes and that we're managing. It's a very declarative system, which helps remove some of the complexity. I can't even imagine how complex Kubernetes would be if you had to do imperative design and scripting of the various parts. The second is idempotency. You can submit the same request over and over and not have a problem. This enables things like GitOps, where we use a tool like Argo CD to abstract some of that complexity and ship on a regular basis. You say, hey, I want to put this manifest in here, and it can sync it over and over. It can validate that it's still going. And every time you do it, you're not worried about breaking your service. I've written plenty of Ansible and Puppet and Chef and even Terraform and other configuration or infrastructure-as-code languages. And they all run into a very similar problem: at some point, someone is going to write a little loop that is going to break because it's not actually idempotent. It depends on something else being there. It depends on requests arriving in a particular order. Things like that. 
You don't have to worry about that with Kubernetes. You say, I want these three things. If two of them already exist, it only creates the last one. And everything works and handles that. And the third one is fault tolerance. Kubernetes is not just a Docker Compose stack running on an EC2 instance. It allows you to scale across multiple machines and, from a developer experience, really treat all of the compute, all of the memory, all of the storage in your Kubernetes cluster as fungible. You don't have to worry about, can this VPC talk to this VPC? If the Kubernetes cluster spans it and has the routing set up, then all you have to worry about is: let me call this endpoint, let me talk to this service, how I go from service A to service B, and how I deploy those out and how I manage them. And Kubernetes isn't the only tool here. It certainly has benefits, but there are other container orchestrators that do similar things. I think the big benefit that Kubernetes gives you is, hey, I can extend this. I can write my own custom operator that does the idempotent thing or the declarative thing and builds that out so that a developer doesn't have to think about it. You can write a Helm chart, which is a nice little API for how to deploy a particular service. There are all these very, very useful tools that can be built on top of it because of these three core principles of the Kubernetes API.
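The "I want these three things; if two already exist, it only creates the last one" behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions: the cluster is a plain dictionary, objects are identified by kind and name, and the `apply` function is a hypothetical stand-in for what an API server and its controllers do, not real client code.

```python
def apply(cluster: dict, desired: list) -> list:
    """Converge `cluster` toward `desired` and return the actions taken.

    Declarative: the caller states what should exist, not the steps.
    Idempotent: applying the same list again results in no actions.
    """
    actions = []
    for obj in desired:
        key = (obj["kind"], obj["name"])
        current = cluster.get(key)
        if current is None:
            cluster[key] = dict(obj)
            actions.append(("create", key))
        elif current != obj:
            cluster[key] = dict(obj)
            actions.append(("update", key))
        # If the object already matches, there is nothing to do.
    return actions


# Two of the three desired objects already exist in the (toy) cluster.
cluster = {
    ("Deployment", "api"): {"kind": "Deployment", "name": "api", "replicas": 3},
    ("Service", "api"): {"kind": "Service", "name": "api", "port": 80},
}
desired = [
    {"kind": "Deployment", "name": "api", "replicas": 3},
    {"kind": "Service", "name": "api", "port": 80},
    {"kind": "ConfigMap", "name": "api-config", "data": {"LOG_LEVEL": "info"}},
]

first = apply(cluster, desired)   # only the missing ConfigMap is created
second = apply(cluster, desired)  # same request again: no actions
```

Because the result depends only on the desired state and the current state, the order of requests and the number of times they are submitted stop mattering, which is what makes repeated syncing safe.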
Bart: You mentioned talking about infrastructure as code and Terraform. Some properties, such as idempotency, are found in other tools, with Terraform being one of them. Does that make a tool like Terraform suitable for self-service?
Hans: So, kind of. I have written Terraform now for just under a decade. I've written a lot of Terraform. I've written a lot of configuration as code. The big problem that comes into play with Terraform or any configuration language is the lack of reconciliation and the ability for stuff to drift. There are tools that help you solve this, like Atlantis, Spacelift, Terraform Cloud, or even Crossplane or a Terraform operator inside Kubernetes. But the fundamental problem remains that with Terraform, you need something external to the Terraform itself running the Terraform on a regular basis. There is no standard for that. There is no standard API. There's no standard runtime. It's very easy for you to write the best Terraform in the world, deploy it, and then someone goes and clicks a button in the UI and breaks all of it without intending to, because there's no reconciliation, and there's no ability to easily write tooling that says, "Hey, you shouldn't do this," or "You can't do that." For Kubernetes, another area where it shines is that I can write tooling that I know will parse the spec of the Kubernetes API resources. All of those CRDs are described with OpenAPI schemas. I can write CI tooling that says, "Hey, you probably shouldn't be doing this," or "We're not going to let you do this." I can write things in Kyverno or Gatekeeper that act as an admission webhook that says, "Nope, you're not allowed to do this." Or, "Yes, you're allowed to do this, but we're going to modify this." The concept of admission webhooks, both validating and mutating, in Kubernetes allows you to take what someone meant to do and improve that semantically and meaningfully. You can't do that with Terraform very easily. There's no way to abstract on top of it. The state file is stored separately, and you have to start referring to other objects and managing permissions on S3 buckets. Terraform is just a language. It's not a runtime. 
It's not an API server. It's none of those things, and Kubernetes really shines when you hook all of that together.
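The "nope, you're not allowed" versus "yes, but we're going to modify this" distinction maps onto validating and mutating admission steps. The following Python sketch shows only the shape of that flow on plain dictionaries: the `mutate`, `validate`, and `admit` functions, the `team` label default, the memory limit, and the rules themselves are hypothetical examples, not Kyverno or Gatekeeper policy syntax and not the AdmissionReview wire format.

```python
import copy


def mutate(resource: dict) -> dict:
    """Mutating step: fill in platform defaults the author left out."""
    patched = copy.deepcopy(resource)
    labels = patched.setdefault("metadata", {}).setdefault("labels", {})
    labels.setdefault("team", "unassigned")  # assumed default label
    for container in patched["spec"]["containers"]:
        container.setdefault("resources", {"limits": {"memory": "256Mi"}})
    return patched


def validate(resource: dict) -> list:
    """Validating step: return the reasons to reject, empty if acceptable."""
    errors = []
    for container in resource["spec"]["containers"]:
        if container["image"].endswith(":latest"):
            errors.append(f"{container['name']}: image must be pinned, not :latest")
        if container.get("securityContext", {}).get("privileged"):
            errors.append(f"{container['name']}: privileged containers are not allowed")
    return errors


def admit(resource: dict) -> dict:
    """Run mutation first, then validation, as Kubernetes admission does."""
    patched = mutate(resource)
    errors = validate(patched)
    return {"allowed": not errors, "errors": errors, "resource": patched}


good = {
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {"containers": [{"name": "web", "image": "registry.example.com/web:1.2.3"}]},
}
bad = {
    "kind": "Pod",
    "metadata": {"name": "debug"},
    "spec": {"containers": [{
        "name": "debug",
        "image": "registry.example.com/debug:latest",
        "securityContext": {"privileged": True},
    }]},
}

good_result = admit(good)  # allowed, with defaults added
bad_result = admit(bad)    # rejected with two reasons
```

Because every resource arrives in a known, structured shape, the same two functions could equally run in a CI job before anything reaches the cluster.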
Bart: You can also have state drift in Kubernetes if you use kubectl and don't store your YAML file in Git.
Hans: Absolutely. I once joined a company that was already using Kubernetes, and I was like, how do you deploy your software? And they go, no, we just log in via the command line and run the helm upgrade or kubectl update. And that was including in prod. If someone remembered to go do that, that was basically how they handled it. So that's absolutely true, but the important part is to recognize where drift is happening, because there are two kinds. There's drift between what you have declared in your Git repo and what you have actually submitted to the cluster. And then there's drift between what you have submitted to a reconciling API server and what is actually running. Kubernetes really solves the second one conveniently, right? If you tell it, hey, I want a deployment with three pods, it will do its best to put three pods out there and run those three pods. To your point, you still could have drift. I could go into the Kubernetes server and delete that deployment, and then that deployment won't exist anymore. Kubernetes will be like, oh, I don't need that deployment. So then we invented this concept of GitOps. We were like, hey, we are no longer just uploading files for our services. As we write our web application and our application code, we're no longer just uploading that to S3 and hoping people can get it and then deleting the files. We're no longer just SFTPing into a server and copying files over and being like, cool, we updated our stuff. And that's kind of the equivalent, right? Using just kubectl is like SFTPing into a server and copying some files from your local machine. We said, hey, we're using a version control system for our application code. We should use that for our infrastructure code. And if we're using it for infrastructure code, we should just have something that listens and says, I need these objects in my cluster. Here's my desired state. That gives you all the benefits of pull requests. 
A couple of really good tools for that for Kubernetes are ArgoCD and Flux, but there are other ones. They have their strengths and benefits. The big driver for both of them is that you remove this drift. You run a single set of pods or a small set of pods inside your cluster. And all of a sudden, everything you do for all of your infrastructure can happen via PR, via the regular approval process. It can extend past your Kubernetes cluster. One of the big benefits is that there are tools like Crossplane or Config Connector for Google that allow you to manage your cloud resources from your Kubernetes cluster. You can then start writing a consistent developer experience for, hey, I need to create a database. Instead of having to write a developer experience and security pipeline for both Terraform and your Kubernetes resources, you can do it for one, because developers can just declare their need for a database, an RDS instance or a Cloud SQL instance, as a Kubernetes resource. You can then apply the correct transformations and security rules to that. And then it goes and creates it. All of a sudden, not only are you managing in-cluster resources, you're managing out-of-cluster resources. The developer experience is better. And you're doing it in a very controlled and compliant manner. I think GitOps is amazing for when you want to be compliant with various frameworks, because you can say, no, nobody can make a change outside of this PR being properly reviewed. When it is reviewed and merged, then that change goes and happens. You can scope all of the resources and permissions, and no one needs write access to a UI.
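The core of what a GitOps controller does, comparing the desired state in Git with the live state in the cluster and correcting the difference, can be sketched as a diff followed by a sync. This is a simplified Python model: both states are plain dictionaries keyed by hypothetical resource names, and `diff` and `sync` are illustrative functions, not the Argo CD or Flux APIs. Pruning is applied unconditionally here for brevity; real tools typically make it configurable.

```python
def diff(desired: dict, live: dict) -> dict:
    """Compare the state declared in Git with what is live in the cluster."""
    return {
        "create": sorted(k for k in desired if k not in live),
        "update": sorted(k for k in desired if k in live and live[k] != desired[k]),
        "prune": sorted(k for k in live if k not in desired),
    }


def sync(desired: dict, live: dict) -> dict:
    """Bring `live` in line with `desired` and return the plan that was applied."""
    plan = diff(desired, live)
    for name in plan["create"] + plan["update"]:
        live[name] = dict(desired[name])
    for name in plan["prune"]:
        del live[name]
    return plan


# Desired state, as merged into the Git repository.
desired = {
    "deployment/api": {"replicas": 3},
    "service/api": {"port": 80},
    "configmap/api": {"LOG_LEVEL": "info"},
}
# Live state after some manual changes: a scaled-down deployment,
# a missing configmap, and a hand-created debug job.
live = {
    "deployment/api": {"replicas": 1},
    "service/api": {"port": 80},
    "job/debug": {"image": "busybox"},
}

plan = sync(desired, live)
after = diff(desired, live)  # nothing left to do once synced
```

Run on a schedule, this loop is what turns Git into the only place where a lasting change can be made.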
Bart: You mentioned developers and developer experience, focusing on that point for a second. If we already have Kubernetes and then add additional tools like Argo CD or Flux to facilitate syncing resources, it can become quite a slippery slope from here. Then we'll have Prometheus, Istio, OpenTelemetry, and all the other tools we've been speaking about already. Isn't this the reason why some developers are drowning in complexity to begin with? Do you feel like we might be setting developers up for some kind of self-sabotage with the amount of tooling that goes into it?
Hans: This is a great question. It really gets to the heart of the difficulty of running software at the scale that we do today. Twenty years ago, no one cared whether your software was down for a few minutes in a day. They would just refresh their browser. But now if your software is down for a few minutes, that could mean that somebody's emergency call doesn't go through, somebody's healthcare data doesn't get uploaded, or someone can't communicate critical life-saving information. Or it could cost you half a million, a million, or a hundred million dollars, depending on how big your company is. There's inherent complexity and added complexity. If we say, "I want my service, my application to run at this amount of uptime," then there are inherent complexities. To achieve that, you have to monitor your systems, release safely, and handle an availability zone going down or being unavailable. These are inherent to your desire to run software at a certain availability. If you want to monitor your software and know when it goes down, you need some kind of monitoring software. You could run something like Zabbix or Prometheus, but something has to be running to monitor that software. If you want to handle an availability zone going down, you need something that handles the shifting of a pod from AZ1 to AZ2. Something has to handle that. If you want to handle certificate renewals, something has to do that. There's inherent complexity in the leveling up of our software writing skills and the demands we have on delivering software in a reliable, scalable method. Those demands have inherent complexity. I have found that Kubernetes is the best way to abstract those complexities. At a former company, we wrote a Helm chart that every service—250 to 300 microservices—all used. Why? Because it gave them a single values file to interact with. 
They got a service monitor by default, canary releases by default using a tool called Flagger, a Cloud SQL database set up correctly with a Cloud SQL proxy, the correct service accounts, and the correct IAM permissions. All of this was wrapped up in a single Helm chart that someone could interact with. If you start separating those pieces of tooling into different platforms besides Kubernetes, you can do that, but you need something that interacts with all of those to provide a consistent developer experience. In my opinion, that is way more complex. There's always a question of when you should introduce those pieces of technology. For example, if you need mTLS, maybe you don't go straight to Istio. Istio is a big, complex thing to manage. It has amazing features. I love Istio. But it is very complex, especially to administer and run. Maybe you step to something like Linkerd that is expressly for mTLS. It's a little easier to set up and requires less complexity for the developers. At a later point, you can move to Istio. The same thing applies to monitoring. Maybe you start with Prometheus. Prometheus comes installed in my EKS cluster as an add-on. Then you can move to something like OpenTelemetry. Or maybe you start with kubectl apply and helm install. Then you move to GitOps once you have more than one or two people interacting with your cluster. There's always a give and take to when you introduce that complexity. In my experience, it's less the ecosystem that you're in that adds complexity and more the underlying problem you're trying to solve: delivering reliable, scalable software automatically. How do you do that? You need a lot of these tools.
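The "single values file" pattern works because a service only states what differs from the platform's defaults. Here is a minimal Python sketch of that layering, in the spirit of how Helm merges user-supplied values over a chart's defaults. The `DEFAULTS` keys and values are hypothetical and are not taken from the chart Hans describes.

```python
import copy

# Hypothetical platform defaults shipped with the shared chart.
DEFAULTS = {
    "replicas": 2,
    "monitoring": {"serviceMonitor": True, "scrapeInterval": "30s"},
    "canary": {"enabled": True, "stepWeight": 10},
}


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively layer a service's overrides on top of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key, so a partial override
            # keeps the sibling defaults intact.
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A service's values file only mentions what it wants to change.
values = merge_values(DEFAULTS, {"replicas": 5, "canary": {"stepWeight": 25}})
```

Everything the service did not mention, such as the service monitor and the canary being enabled, still comes from the platform team's defaults.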
Bart: I think that's good. It's not so much about freaking out about the problem, just getting the right framing on the solution. That's a refreshing take. Are there other aspects of Kubernetes that you think work well with self-serve?
Hans: Like I mentioned earlier, the ability to abstract manual human processes. For a long time, at former companies I worked for, Capital One and then Mission Lane, we ran Elasticsearch for logging, search capabilities, and a variety of other things. At Capital One, when I was there, the team I was on ran Elasticsearch on EC2 instances provisioned by Terraform, using a variety of scripts and auto-scaling groups to manage the cluster. At Mission Lane, we ran a much larger Elasticsearch cluster that was handled automatically, because the Elastic operator exists. Elastic had written a tool that lets you say, "I want an Elasticsearch cluster." It handles the deprovisioning of a node, the movement of data, the segregation of data, the sharding of data, RBAC, certificate provisioning, and all of that. You say, "I want this cluster." And it goes and does that. That's where Kubernetes shines: the ability to take what otherwise is a bunch of manual processes and imperative scripts written in Terraform or Ansible, and say, "I can genericize this, put it in an operator, and then say I would like an Elasticsearch cluster." Another good example is the VictoriaMetrics operator or the Prometheus operator. You can go to one of those and say, "I want a Prometheus instance" or "I want a VictoriaMetrics cluster." It will give you the VictoriaMetrics cluster. It will give you the Prometheus instance with all the correct configurations. It takes care of all the previous manual steps, which is where the term operator comes from. It's replacing a human operator with automation. And I think that really shines.
Bart: Those are very good examples of how to take something that is really complex to operationalize if you're not aware of it and instead make it a single, easy way to interact with, scale, and manage that resource. The people who write the Elastic operator have a far better understanding of when to call an API endpoint to deregister a node, etc. You get the idea. And, mentioning resources, someone said that CRDs, custom resource definitions, are one of the best features of Kubernetes. What's your opinion? Do you agree or disagree? Is there any other feature you like more?
Hans: No, I mean, I think this is by far my favorite feature. This is the feature that separates it from something like Nomad or something else: the ability to write these custom resources, have an API, and have an operator that says, "Hey, here's a spec. I want to be notified when this thing happens, and I will provide that consistent messaging and all of that." This is the big difference here. This is what allows you to abstract. It allows you to move faster than the Kubernetes core team, who by themselves move very quickly, right? We have a six-month release cadence for Kubernetes. Or three months, whatever, quarterly or bi-annually, I don't know. Anyone who keeps up with that is doing better than we are. But CRDs allow you to move faster and say, "I want to do this other thing that wasn't intended. I want to do something new and unique and provide this abstraction to my developers, to other engineers, or to other users in a way that we otherwise can't." If you're not using Kubernetes for this, you have to stand up some Lambdas, or you have to have your own UI, or you have to build a CLI to interact with your stuff. There are all kinds of different things. AWS doesn't allow you, for example, to easily build abstractions on top of their infrastructure. You have to stand up your own EC2 instances to listen and provide a hostname for people to call in. You have to write your own CLI. Kubernetes gives you all of that. You just say, "Here's my CRD." People can now interact in a common language. You can get webhook validation and all that. Anyway, CRDs are my favorite. I mentioned in the article using Crossplane or Config Connector. These are some of my favorite examples of this. It is mind-blowing, especially to someone who's been around for a while, that we can, in the Kubernetes cluster, say, "Give me a Cloud SQL database," and have that Cloud SQL database just pop up in Google Cloud properly, or an Elasticsearch cluster, or an S3 bucket, or all of those things. 
This means I can abstract all of my app concerns, put all of that infrastructure together, and I don't have to worry about writing Terraform for one thing, Python for another, Ansible for another. I can just write Kubernetes. I can ship containers that are going to work and know that my CRD and my operator will take care of it. Obviously, from the management of the cluster side, CRDs and operators take more work because we have to manage them as a platform. But from a developer experience, I think there's very little that is at that level.
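To make the "here's my CRD, people can now interact in a common language" idea concrete, here is a minimal sketch in plain Python. It is not the Kubernetes API or any operator SDK: the group `platform.example.com`, the kind `Database`, the field names, and the helper `validate_spec` are all hypothetical, and the schema is a toy stand-in for the validation schema a real CRD would declare.

```python
# Hypothetical custom resource, written as the dict a YAML manifest would parse into.
# The group "platform.example.com" and the kind "Database" are invented for illustration.
database_claim = {
    "apiVersion": "platform.example.com/v1",
    "kind": "Database",
    "metadata": {"name": "orders-db", "namespace": "orders"},
    "spec": {"engine": "postgres", "tier": "small", "storageGb": 20},
}

# Toy stand-in for the schema a CRD would declare for `spec`:
# field name -> (expected type, allowed values or None for "any value of that type").
SPEC_SCHEMA = {
    "engine": (str, {"postgres", "mysql"}),
    "tier": (str, {"small", "medium", "large"}),
    "storageGb": (int, None),
}


def validate_spec(resource: dict) -> list[str]:
    """Return a list of readable errors; an empty list means the object is accepted."""
    errors = []
    spec = resource.get("spec", {})
    # Check every declared field for presence, type, and allowed values.
    for field, (expected_type, allowed) in SPEC_SCHEMA.items():
        if field not in spec:
            errors.append(f"spec.{field} is required")
            continue
        value = spec[field]
        if not isinstance(value, expected_type):
            errors.append(f"spec.{field} must be of type {expected_type.__name__}")
        elif allowed is not None and value not in allowed:
            errors.append(f"spec.{field} must be one of {sorted(allowed)}")
    # Reject fields the schema does not know about.
    for field in spec:
        if field not in SPEC_SCHEMA:
            errors.append(f"spec.{field} is not a known field")
    return errors


if __name__ == "__main__":
    print(validate_spec(database_claim) or "accepted")
```

The point of the sketch is the shape of the interaction: the developer submits a small declarative object, and the platform rejects bad input at submission time with a consistent message instead of failing later during provisioning.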
Bart: From what I gather, being able to self-serve and create CRDs comes with great power. I assume that teams doing this must also bear significant responsibilities.
Hans: As a sysadmin, it's always a little scary to give developers the ability to create IAM roles, S3 buckets, RDS instances, or potentially even more complex pieces of software on a self-service basis. Fundamentally, they are also engineers. They have slightly less context than us sometimes about all of the considerations that go into it. They have slightly different incentives, but they're also engineers. We trust them; they're delivering the product. The way to do it in Kubernetes is governance. Use a tool like Gatekeeper and OPA, or Kyverno, to restrict the things you don't want them to do. If you do that, you can provide consistent tooling for all of the cloud resources. A common issue, which AWS has now fixed, was that for a long time, if you created an S3 bucket, it would be public. If you don't have a CRD or a Kubernetes pipeline, or a pipeline in general, for creating those S3 buckets, you need other tooling, and a lot of tooling was built to manage that, to make sure developers don't click the wrong settings and expose an S3 bucket with all of the company's information. OPA and Kyverno solve that inside the Kubernetes world. They say, just write a policy that blocks that and says you're not allowed to create the object with that configuration. Capital One, as an example, wrote Cloud Custodian. It's a massive project that runs in Lambdas whose sole purpose is to catch these misconfigurations. This was before AWS released AWS Config and some of these other tools. AWS released a bunch of tooling around this. You may still want to run them for compliance purposes, but from a developer experience point of view, those go away when you can just have a Kyverno webhook that says yes or no to your S3 bucket as you submit it. You can also run Kyverno policies outside your Kubernetes cluster. You can run them as part of your CI pipeline when developers open a PR that says, "I want to add this S3 bucket."
You can say you're going to be allowed to or you're not going to be allowed to create this resource. That is really powerful to be able to surface those policy violations and requirements in a consistent manner inside the cluster and outside the cluster to developers.
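The "same policy inside the cluster and in CI" idea can be sketched as one check function called from two places. This is plain Python standing in for a policy engine, not Kyverno or Rego syntax. The `Bucket` kind, the `acl` and `encryption` fields, and both function names are assumptions made up for the example.

```python
def check_bucket_policy(resource: dict) -> tuple[bool, str]:
    """Return (allowed, message) for a hypothetical Bucket resource."""
    if resource.get("kind") != "Bucket":
        return True, "not a Bucket; policy does not apply"
    spec = resource.get("spec", {})
    # Treat a missing ACL as private; anything else is rejected.
    if spec.get("acl", "private") != "private":
        return False, f"acl '{spec['acl']}' is not allowed; buckets must be private"
    if not spec.get("encryption", False):
        return False, "encryption must be enabled"
    return True, "ok"


def review_pull_request(manifests: list[dict]) -> list[str]:
    """CI-side use: collect every violation so the developer sees them all at once."""
    violations = []
    for manifest in manifests:
        allowed, message = check_bucket_policy(manifest)
        if not allowed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            violations.append(f"{name}: {message}")
    return violations


if __name__ == "__main__":
    proposed = [
        {"kind": "Bucket", "metadata": {"name": "reports"},
         "spec": {"acl": "public-read", "encryption": True}},
        {"kind": "Bucket", "metadata": {"name": "invoices"},
         "spec": {"acl": "private", "encryption": True}},
    ]
    for line in review_pull_request(proposed):
        print(line)
```

An admission webhook would call the same check on each object at submission time and return its yes or no, while the CI job runs it over every manifest in the pull request, so developers get the same answer in both places.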
Bart: The things that you're describing sound very promising. Can you tell us how you've seen it in real life? Have you seen it in real life?
Hans: I have. I was lucky enough to work for Mission Lane with a group of very talented platform engineers and engineers overall. Like I mentioned earlier, we had between 200 and 250 microservices, all using a Helm chart that we had written. All of the cloud resources for their applications and services were deployed via Kubernetes objects. We had Kyverno pipelines and admission controllers with policies that said yes or no to particular resources, checked for particular fields, and validated that what we were doing was correct. We had subcharts to handle requests like, "I want a CloudSQL instance," or "I want a Google Cloud Storage (GCS) bucket." We had a template repo that developers could clone to get a default installation of the Helm chart. They got all the default code needed to stand up a microservice and ship it in CI. Most developers interacted with that values file once or twice in a six-month period as they adjusted something or turned on a feature. We went through three major versions and 200-plus, 300-plus minor versions of that Helm chart as we iterated. We started small with a deployment, a service, and a CloudSQL instance, and then we iterated from there and improved it. By the end, all of our services were using that Helm chart. We used something like Renovate to submit updates for it. Developers didn't have to worry about most parts of their Kubernetes cluster working. They got their metrics monitoring and service monitoring for free. They could have their dashboards submitted to Grafana as part of that. All of these different parts improved. For example, we would auto-instrument with OTel. When they got their Helm chart, we would ship the init container that would automatically install OTel into their container and add the correct arguments to run it. All of these different things meant that we could roll out tooling without developers having to worry about it or deal with the complexity.
They could just opt in or opt out. They could opt in and say, "Yep, I want this piece," or "Nope, I don't need this piece." That was really powerful. We had teams shipping multiple times a day to their clusters or services. It was a lot of fun. It was really effective and efficient at what we were doing.
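The opt-in and opt-out model described here boils down to layering a team's small values file over the chart's defaults. The sketch below imitates that layering in plain Python, assuming the usual behavior where nested maps are merged key by key. The default keys (`otel`, `serviceMonitor`, `cloudsql`) and the helper names are hypothetical and are not taken from the chart Hans describes.

```python
import copy

# Hypothetical chart defaults: the platform team decides what is on by default.
CHART_DEFAULTS = {
    "replicas": 2,
    "otel": {"enabled": True},
    "serviceMonitor": {"enabled": True},
    "cloudsql": {"enabled": False, "tier": "small"},
}


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Deep-merge overrides onto defaults without mutating either input."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key so untouched defaults survive.
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def enabled_features(values: dict) -> list[str]:
    """List the feature blocks whose 'enabled' flag is truthy, sorted by name."""
    return sorted(
        name for name, block in values.items()
        if isinstance(block, dict) and block.get("enabled")
    )


if __name__ == "__main__":
    # A team's values file: opt in to a database, opt out of auto-instrumentation.
    team_values = {"cloudsql": {"enabled": True}, "otel": {"enabled": False}}
    print(enabled_features(merge_values(CHART_DEFAULTS, team_values)))
```

Because the team file only states the differences, the platform team can change a default or add a new feature block in a later chart version, and every service picks it up on the next chart update unless it has explicitly opted out.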
Bart: Does this Kubernetes API mental model apply to other tools?
Hans: We're seeing it more and more. When Terraform was first released, for example, it was very much "just run it from your CLI, just run it from your local machine, don't worry about it." Since then, we've seen more and more tools get built out to support this model, whether that's things like Terragrunt, which allows you to abstract your Terraform modules, or new configuration languages like Pkl or CUE. We've also seen tools like Atlantis and Spacelift come out to handle the reconciliation and the GitOps model for Terraform. Kubernetes still stands above the others because of CRDs. Other tools can replicate all the other parts: idempotency, fault tolerance, and the declarative nature. They can even handle the reconciliation loop, but CRDs are a really big benefit, allowing you as a platform team to ship basically a Heroku-type model for your engineers. When you're building out a new piece of tooling, consider how to do this in the easiest way for your developers. How do you make it extensible? How do you make it easy to interact with? How do you make it so that developers have to make as few choices as possible? Alternatively, how can they make their choices as late in the pipeline as possible, and be able to edit those and figure out what's needed later on?
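The properties named here (declarative, idempotent, reconciled in a loop) can be shown in a few lines. The following is a minimal sketch in plain Python, assuming desired and actual state are simple name-to-spec dictionaries. The function names `plan` and `reconcile` are hypothetical and do not belong to Kubernetes, Terraform, or any of the tools mentioned.

```python
def plan(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Compute the actions needed to move actual state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the plan and return the new actual state; the inputs are not mutated."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            # Both create and update converge on the declared spec.
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    desired = {"api": {"replicas": 3}, "worker": {"replicas": 1}}
    actual = {"api": {"replicas": 2}, "legacy": {"replicas": 1}}
    print(plan(desired, actual))
    state = reconcile(desired, actual)
    # Running the plan again against the converged state yields no actions.
    print(plan(desired, state))
```

The second `plan` call is the idempotency property: once actual matches desired, running the loop again does nothing, which is what makes it safe to retry after a failure.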
Bart: Up until now, we've been discussing tooling. What about the engineers working in these teams? In your experience, do they want to self-serve? How do you evangelize these practices to other folks out there? I don't imagine it's as easy as just walking up and saying, "Hey, everyone, from now on, you're all on your own." How do you go about doing that? We talked about the technical challenges, but the organizational challenges can sometimes be even bigger.
Hans: No, you can't solve organizational problems with technology. And that just remains true no matter what you do. You can't just rock up and be like, "Hey, you're on your own, good luck." It's about building and demonstrating that developer experience. It's about showing them that this way is easier. I've seen this go right and I've seen this go wrong. I've seen it go wrong when you mandate the use of particular tools, when you get top-down buy-in and say to engineers, "You have to use our tools." If you have to compete for their usage of your tools, if you don't have a mandate that says they have to use this particular set of tools, then all of a sudden this is easier, because then you actually have a product that you're building for engineers and you'll actually solve their problems. You're incentivized to listen to them and fix their pain points. Especially when you're first starting out, the goal is to go out and talk to them and see what is taking a long time. Are you, as the infrastructure team, for example, a choke point in them getting their stuff out, because you can't provision their RDS instances or their S3 buckets fast enough, or because you're going through too many rounds of security reviews? And you ask, do they want self-service? And the answer is yes. I have never met a single engineer in my entire life who, when offered, was like, "I don't even want to care about how much memory my app uses, just deal with it for me." Everyone wants to be able to control how their app runs, where their app runs, and what it interacts with. What they don't want is to have to become Kubernetes experts to do that. That is where you can't just be like, "Hey, here's Kubernetes, have fun." You have to think about it from their perspective and figure out a way to abstract what is going on and what is required. Start small. Start with one team. Think about how to build it for the others. Start with one team and slowly work your way through. Solve their problems.
The best way to evangelize this is to not have to do it on your own but to get them to go out and sell it for you. For them to tell the other developers at that company, "Oh hey, we're doing it this way, it makes this really easy, you should go talk to the platform team to get on board as well." Then they get on board. Having to compete is really the answer there.
Bart: If you could only give one piece of advice to a team that's getting started with self-service, what would it be?
Hans: The same one that I said: start small. Start with something you know you can deliver and iterate through solving specific problems as you actually run into them. Don't try to do everything all at once. It's going to be overwhelming to try to shift an entire team's infrastructure into your pipeline. Spend the time to say, "Hey, we want to standardize and abstract this one piece, then the next piece, and then the next piece," and slowly work your way up. But don't start with trying to do all of it, because odds are you're going to get it wrong and you're going to burn up your political capital and developers' desire to move onto your platform. Your first impression is the most important, so don't try to do all of it all at once. Start small and work through that.
Bart: What's next for you?
Hans: I have a bunch of stuff going on. I just moved to Sweden, for example. My wife has a baby on the way, so next for me is to discover the challenges of being a dad and figuring out the rest of life, perhaps how to scale child rearing horizontally or something. I don't know. How do I abstract raising a child? No, I don't know. We'll see what happens.
Bart: I think if you listen to a lot of the stuff mentioned throughout this conversation, there are probably many things that can be applied to various aspects of life. As someone who's not a parent but has many friends who are, what you said about not trying to do everything at once, scaling things out, being realistic with what you have, and not responding to a people problem with technology, despite the many wonderful things we have nowadays, is very insightful. There's a lot of other stuff that's not necessarily technological that will come in handy. So, that sounds like plenty. If people want to get in touch with you, what's the best way to do that?
Hans: Sure. You can find me on LinkedIn and on Medium. I'm sure there'll be a link to the post in question that spawned this. You can reach out to me via email at [email protected]. I'm happy to chat.
Bart: Okay, that's good. Considering your background and how long you've been working with Kubernetes, and with Kubernetes turning 10 years old this year, what do you expect to happen in the next 10 years?
Hans: That's a great question. Now you're asking me to predict the future.
Bart: And this is off script.
Hans: This is terribly off script. No, this is great. In the next 10 years, I think we will see a lot more development of tools that build on top of things like cluster API. I saw a really interesting article the other day about how to provision hundreds of Kubernetes clusters. They talked about the challenges of using cluster API to build these out. Especially as we step into generative AI, as we've seen over the last year, how do we empower developers to really self-service? They don't have to come to the infrastructure platform and ask questions. They can talk to their own little generative AI, their own little ChatGPT for your platform specifically, that will help answer questions about how to run it. I think we'll see a lot of build-out of eBPF, tools like Kapp, and others that are trying to abstract, reconcile the principle of a service or a microservice or an application with all of the various parts of Kubernetes and how to do that properly. I think we'll see a ton of movement there. We'll see what else will come along. We've seen a lot of change even in the last four years of how to run Kubernetes and in the tech world. I think anyone would be hard-pressed three years ago at the peak of the pandemic to have predicted ChatGPT and the generative AI developments we've seen, like Copilot, and how much they've helped. So I won't sit here and try to predict 10 years into the future too far. Yeah.
Bart: Fair enough. Good. Well, that being said, thank you very much for sharing your time with us today. Really, really like the insights and how you're looking at this. You've spoken to a lot of people, but I do find your way of looking at this to be quite unique. Looking forward to seeing future work as it comes out and best of luck to you in the next steps.
Hans: Thank you very much. Have a great rest of your day.
Bart: All right. Take care.
Kubernetes experts reacting to this episode
Balancing tooling and developer experience: building mission-critical platforms
with Katie Lamkin-Fulsher