Foolproof Kubernetes with GKE

Jan 16, 2024

Host:

Bart Farrell

Guest:

Mathew Duggan

What if Kubernetes were so easy to install and manage that it was foolproof?

In this KubeFM episode, Mat argues that GKE is the only managed Kubernetes service that offers a beginner-friendly, well-thought-out experience of running a Kubernetes cluster.

Follow Mat's journey to AKS, GKE and EKS and learn:

How GKE Autopilot can help you optimize costs and reduce underutilized node resources.
How GKE's Container-Optimized OS eliminates an entire class of security misconfigurations in node management.
How GCP's application of machine learning to IAM permissions can help you gradually refine security permissions as applications are deployed.

But Mat didn't stop there and had more food for thought:

Are we over-logging and over-monitoring in Kubernetes?
CNI and Ingress have evolved since their inception. What happens now that we are stuck with those early decisions?
Is there a simpler alternative to Kubernetes that is multi-cloud and cloud agnostic, and what could it look like?

Transcription

Bart: A good way to start some controversy in the Kubernetes ecosystem is to talk about the differences, the pros and cons, the trade-offs between AKS, EKS, and GKE. In this episode of KubeFM, I got a chance to speak to Mathew Duggan, who lives in Denmark and works with a lot of startups. Although he doesn't use Gmail, he wrote a post talking about why he likes GKE. But of course, he's tested out the other platforms and had some positive things to say about them as well. Let's take a look at what he had to say in this episode of KubeFM, the podcast that broadcasts the latest and greatest trends in the Kubernetes ecosystem, helping you level up by hearing directly from practitioners. Mat, very nice to be with you today. Welcome to KubeFM. Just to get things started, how are you doing today?

Mat: Great, how are you?

Bart: Good, thank you. So before we start talking about the article that you wrote, I just want to get a little bit of background information about how you got started in tech and what you were doing before cloud native.

Mat: Sure. So my name's Mathew Duggan. My background is sysadmin and Linux admin work, with a lot of data center work. You know, it was running around the racks, replacing hard drives, and handling that sort of low-level stuff. I was part of a startup that ran a data center and then started the transition into the cloud and used Kubernetes. My first exposure was as a bridge technology, right, to reuse the resources that we had in the data center and combine them with the assets that were in the cloud. Because who wants to run a SAN? No one. So yeah, that was my introduction to it. And then over the years, it's just continuously been a skill set that's come up a lot. Adoption of Kubernetes is really high, but there's still that gap between the availability of experts and the desire to deploy it. So that's where I've transitioned with my career.

Bart: Fantastic. That being said, if you were given a brand new Kubernetes cluster, which three tools would you install first?

Mat: Cert-manager, Prometheus, and Grafana.

Bart: Okay. And to dive in more specifically on Prometheus, why Prometheus and not OpenTelemetry?

Mat: Sure, it's a great question. My experience with Prometheus has been that it's really simple to explain to people how the model works. It's very simple and it's very easy to troubleshoot. And the combination of Prometheus and Grafana has been this amazing one-two punch: you get the raw web interface of Prometheus with port forwarding through Kubernetes, and you can create those graphs really nicely. The Grafana team has done such great work in terms of troubleshooting and making it very simple to put together these amazing-looking graphs and data points. And then combining that with both Alertmanager and now Grafana's built-in alerts, it's just very simple for teams to get started with that metrics journey. But I think that metrics journey likely ends with OpenTelemetry. I view Prometheus as a bit of a bridge technology at this point.

Bart: And cert-manager?

Mat: Do you need certs? Cert-manager does it great.

Bart: Sometimes it does.

Mat: At some point, someone's going to ask me for a cert and it just does a great job. Okay.

Bart: Any tools that you absolutely would not install on a new Kubernetes cluster?

Mat: Tools that I would not? I think at this point, especially with Gateway API, I would ask a lot of hard questions before we set up Nginx Ingress. Not just because of the recent security exploits, which are obviously on everyone's mind, but because I think the concept of CNIs and ingress controllers has changed a lot since the early days of Kubernetes. I think there is a default to use Nginx because we all know it as a web server, and right now I couldn't really imagine doing that. If I had to use some sort of hosted ingress controller, I would probably end up with HAProxy. I've just had much better luck with it.

Bart: All right. And as you were saying about the early days of Kubernetes, a frequent conversation in the ecosystem is that, you know, eventually Kubernetes will become so boring, it'll sort of just blend into the background or it'll become a platform for building platforms. As someone who's been using it for a while, what do you think about that?

Mat: I think that really underplays the level of complexity, and how easily you can shoot yourself in the foot with Kubernetes from the beginning. I think that there's a desire to get there, but I don't think that we're very close to that dream yet. What I'm seeing more, talking to startups around Europe, is a pretty common pattern: someone is passionate about Kubernetes, they set up Kubernetes, and then no one ever touches it again out of fear of breaking it or fear of what an upgrade means. So I think that we're still pretty far from Kubernetes as a boring technology. And there are just some things about Kubernetes that are maybe never going to get boring, right? The API and the use of etcd, right? Is it something that we should be storing a lot of data in? Not really. Can we design controllers around that? Possibly. There are a lot of pieces and design components in Kubernetes that made perfect sense at the beginning. But now, as we look back on them, I think those are going to be foot guns for a long time. And I think that's something that we're going to have to learn how to design around as we evolve.

Bart: It's nice to know that we're looking forward to an exciting future in that regard. That's not going to become boring.

Mat: I think we'll all be employed. I think we'll all be employed.

Bart: That's good. Fingers crossed. And with that in mind, though, too, you mentioned about startups in Europe. Tell me more about where you're based and where you're working, what you're doing right now.

Mat: Sure. So I'm based in Copenhagen. I moved to Copenhagen from Chicago. And I work for a mobile games company, SYBO. We make a number of games, but the biggest one is Subway Surfers. We have over 4 billion installs. So I get the unique experience of getting to see traffic from around the world, see high spikes in usage, and really stress test these technologies in a somewhat unique way. Games are a little bit different from my traditional software-as-a-service background, so it's been an exciting journey. And I'm lucky here in Denmark that it's a relatively small community of startups, so we all kind of know each other. It's been great to connect with folks and talk about their journeys, right? These are five-person startups, 10-person startups, getting out of data centers and transitioning from there.

Bart: In terms of what we want to talk about today, you wrote an article about the different cloud providers and the services they offer around Kubernetes. Oftentimes, folks are left with the choice of, is it going to be AKS? Is it going to be EKS? Is it going to be GKE? Walk me through your background with this and how you got to the point where you decided, you know what? It's not enough just to have lived through these experiences. I want to put my ideas down in an article and share this with the rest of the world.

Mat: So my background starts very basically: containers are new, Kubernetes is new, we need to get things into containers, and everyone says the word Kubernetes a lot at tech conferences. Me and my infrastructure team, knowing nothing, wandered into the woods like everyone does at the beginning of this journey. We were setting up inside of a data center and really learning, oh, there's actually quite a bit of complexity here. It's its own network overlay. There are all these other elements to it. And then when I got out of that experience of trying to run my own Kubernetes across the cloud and across several data centers, I was like, okay, I've learned my lesson. I'm not going to do that again. I'm going to do something different. We're going to use Amazon's. It's all set up. It's all ready to go. I may have gotten involved in that a little bit too early. EKS as a product has evolved quite a bit. And so my experience was, wow, it's running in the cloud. That's great. Amazon's Linux is powering these nodes, but there are still a lot of decisions to be made. I was surprised that it felt very close to running it in my own data center myself. There were some components that they were handling, mostly the control plane. But in my experience, running the control plane is less difficult than perhaps some of the cloud documentation implies, and most of the complexity happens in the cluster management components. So I felt like we were still doing a lot of managing the load balancers. This was before all of the ingress annotation stuff came to fruition. And then from there, I went to a place that was all Microsoft, tried it out on Azure, and was impressed. The product had evolved since then. But I really started to develop this feeling that Kubernetes, when I come into a new job, is very complicated. I have to go through it. My first learning exercise is always to set up a new cluster. Where have we fallen down on Terraform? Where have we fallen down on these configurations? There are always gaps in expectations. There are always secrets stored that no one actually put into configuration or infrastructure as code, all these things.
So I showed up at a place that was all GCP. I had never used it before, never really spent any time in the GCP console. I sat down with GKE and I was like, oh my God, people aren't talking about this, but they should be, because this is a much different, much simpler experience than any of the other products that I've tried. And it is also capable of cost management in a way that none of the other products that I tried were. I was so impressed by the experience that I was like, I just feel like it needs to be on the table, especially as more organizations are saying, we're going all in on Kubernetes, every single retail store has a cluster, every single branch office has a cluster. And I was like, it's a shame that this product is just not really being discussed in the startup circles that I'm seeing as a real contender for evaluation. It felt like the conversation was: run our own, you know, Rancher, then Kubernetes on Azure. And I was like, I just think GKE should be a part of the conversation.

Bart: And is there any particular reason why you feel like it's not part of the conversation? Is it a question of marketing? Is it a question of visibility? Why is it not getting there, given what you mentioned around simplicity and particularly the element of cost?

Mat: I think part of it is that GCP doesn't do a great job of explaining what they mean by Autopilot. A lot of DevOps-type people see the word Autopilot and think to themselves, that means something like Heroku, or some level of management that hides everything away from me. I also think that there is still this back-and-forth pressure in the DevOps world of, yes, it might be Kubernetes, but really I'm a Linux sysadmin. And so I still want that level of low-level fidelity and control on the nodes. I still want to be able to debug kernel problems. I still want to be able to do some of those things. And while those things are possible on GKE, they are more difficult because of the way that the Container-Optimized OS is designed. So I think that some of the loss of visibility down into the kernel layer, plus some of the naming that is perhaps more CTO-friendly than DevOps-friendly, means that this product is skipped in the evaluation. Plus, GCP falls into a weird space. For whatever reason, it's sort of the telecom cloud provider, the retail store cloud provider. It doesn't really come up when most web startups have the conversation about where to put their platform.

Bart: Also, you mentioned in the article, as a disclaimer, that you are not employed by Google. But I also like that you mentioned that, in general, you don't use Google software. You don't use Gmail or other Google services. Walk me through that. How did that come about?

Mat: Yeah. So part of it was that I saw some horror stories about people getting their Google accounts suspended. And I was like, oh, that would be really catastrophic for me. My entire life is in this Google account. And then I started to think about it from the perspective of, are these really the best-in-class services? And also, do I want to be so locked into one company, or do I want some flexibility? And so I started to migrate my own personal life away from Google. I use Kagi for search instead of Google Search. I use iOS. I use Fastmail for email, all those things. And the reason why I wanted to put that disclaimer in there is that one piece of feedback I got was, this reads sort of like a Google fanboy wrote it. And I was like, that's not really the position that I'm in. I think it's mostly just a risk diversification thing. I thought, well, if I lose one account, I don't lose everything in my entire life.

Bart: You also mentioned one thing I would like to say, if anybody from Google is listening or watching, you can bring back Google Reader if you want to. It was wonderful. In 2007, 2008, my news consumption was consolidated. Everything was really easy. I really, really miss those days.

Mat: I mean, I've gone all the way back to NetNewsWire, which is the classic macOS app that has been rewritten. So I'm okay now. But if they wanted to bring it back, I wouldn't complain.

Bart: Good. Now, another thing you talked about, we were speaking previously about how will Kubernetes become more boring, will it maintain some of the complexities? You identified something in your article about a traditional Kubernetes setup, right? You don't have to walk me through necessarily all the components that are there. But if we're talking about which version to start running on, if we're talking about secrets, if we're talking about CNI, CSI drivers, what are things that, let's say, a company's out there and they're going to start working with Kubernetes, what are things that they absolutely cannot possibly forget about if they're thinking about this kind of a setup?

Mat: Sure. So I think the big ones for me, the foot guns that I've seen, are pretty common. I think that everyone sees a lot of these. The first one is always going to be the over-reliance on the Kubernetes web dashboard. Right. You turn it on. You rely on it for authentication. You don't really secure it correctly. Bing, bang, boom, you're exploited through it. It's a common vector for attack. I know they've made changes to the defaults, but it comes up a lot. I think for me too, there are a lot of misunderstandings in the business about secrets management. I think that the word secret in Kubernetes is really misleading. A lot of businesses assume: yes, we understand that secrets must be secret for CI/CD systems or for GitHub or whatever, but this thing says it's a secret, so it must be secret and it must be doing something to keep it secret. And I think that there's a lot of confusion about the best way to do that. I also think that there are a lot of misunderstandings about what we mean when we say a node OS. There's this assumption that we are really talking about a server OS, and I think that's a bad way to think of it. I think you should think of it as simply the smallest possible platform that you can run containers on. And when you're making the transition to Kubernetes, there is a desire to say, well, we're going to manage our nodes through Ansible in the same way that we manage servers. And you're just like, it's a different model. You don't want to be able to SSH into the node itself if you can avoid it. You want to lock these things down as much as possible. And you don't want to be doing this sort of drift management. So part of it is getting out of the mindset that the node matters. That's probably the biggest hurdle that people initially have.
And then also, the CNI is so important, and it has such profound performance and usage impacts. Take the AWS CNI, where it's reusing the EC2 network interfaces and reattaching them, and that's how it's assigning the IP addresses. But there's drift in when that assignment happens. And so you have to be aware of how many IP addresses each machine type can be assigned. There are just nested layers of complexity with the choice of CNI. And I find that often people will pick whatever is the hot one on blog posts when they're setting up the cluster. They're not necessarily asking, is there a way for us to stress test this one or use it in a realistic way, to say this is the one that is the best for us or this one makes sense. So I think that was part of the point of the article: many of these decisions you have to make at the beginning of setting up your cluster are hard to change after you set it up, but often they can be the wrong decision, or they don't fully express what your organization thought it was going to do or need. So many times I show up and people are like, we installed Istio because we wanted a service mesh, but Istio is actually very hard to manage, or we don't understand it, or we only use a fraction of its features. It's a great example of, now we have this complicated beast that we have to take care of, and it ended up not really paying off in the way that we hoped.
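
Note: the per-machine-type limit Mat mentions for the AWS CNI can be sketched in Python. This is a minimal illustration of the commonly documented rule for the AWS VPC CNI in its default secondary-IP mode, not something taken from the episode. The function name and the small table of instance limits are assumptions added for the example; verify the limits against current AWS documentation before relying on them.

```python
def max_pods_per_node(enis: int, ipv4_per_eni: int) -> int:
    """Approximate pod limit for a node using the AWS VPC CNI (default mode).

    Each network interface (ENI) keeps one address for itself, so it can
    hand out (ipv4_per_eni - 1) pod IPs. The extra 2 accounts for pods
    that use host networking and do not consume a pod IP.
    """
    if enis < 1 or ipv4_per_eni < 1:
        raise ValueError("enis and ipv4_per_eni must be positive")
    return enis * (ipv4_per_eni - 1) + 2


# Illustrative (ENIs, IPv4 addresses per ENI) pairs; treat as assumptions.
INSTANCE_LIMITS = {
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}

for name, (enis, ips) in INSTANCE_LIMITS.items():
    print(name, max_pods_per_node(enis, ips))
```

The point of the sketch is that the pod ceiling is a property of the machine type rather than of CPU or memory, which is why the choice of CNI and instance type interact.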

Bart: With that in mind, we often talk about day zero, day one, and day two operations. Having worked with AKS, EKS, and GKE, if you could go back, is there anything that you would do differently in terms of your approach? Something where you'd say, if I had to do it all over again, I wouldn't have worried about this now, I would have worried about it later on? As you just said, there are some things that you simply cannot put off until later. But maybe there are things that you know now, in terms of priority, that can wait and don't have to be dealt with immediately. Is there any advice you would give to your previous self?

Mat: Yeah. When you look at the way that cloud providers do ingress now, it's very simple. One thing that I do now that I wish I had done from the very beginning is to use Cloudflare, use their certificate that lasts for 15 years, plug that in as a static value, and then use the cloud provider's load balancers as the ingress controllers. And then don't worry about SSL rotation for a long time. Don't worry about any of those components for years and years and years. It's managed on the edge and you don't have to think about it. I think the other thing for me would be, before you set up a service mesh, and they're great and I love them, think it through: is anyone in your organization going to care about these complicated topology maps of inter-service communications? Or is this a bunch of metrics and logs that are going to get transmitted to a dashboard and no one's ever going to look at them? Because in my experience, the biggest mistake I've made in Kubernetes is over-monitoring and over-logging and paying a huge price for those metrics and logs, only to then do very little useful with them, because the only things that I ever alert on are crash loop backoffs or failed deployments or things like that. It's a natural inclination, when you're talking about servers that don't replace themselves, to say it is very important for me to know about node health and kernel errors and all these other sorts of things. But I think the biggest thing for me was going way overboard on capturing metrics and logs, and then also overcomplicating my ingress story from day one. You may get to a point where you need it, but that point is often years and years down the line.
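
Note: the narrow kind of alerting Mat describes, where the only signal that matters is a crash loop, can be sketched in Python against the JSON that `kubectl get pods -A -o json` produces. This is a minimal sketch, not a description of his setup. The function name and the sample pod data are hypothetical; the field path (`status.containerStatuses[].state.waiting.reason`) follows the Kubernetes pod status schema.

```python
from __future__ import annotations


def crash_looping(pod_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, container) for containers in CrashLoopBackOff."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            # "waiting" is only present while the container is not running.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append(
                    (
                        meta.get("namespace", "default"),
                        meta.get("name", ""),
                        status.get("name", ""),
                    )
                )
    return found


# Hypothetical sample, shaped like `kubectl get pods -A -o json` output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "games", "name": "api-7d9f"},
            "status": {
                "containerStatuses": [
                    {"name": "api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
                    {"name": "sidecar", "state": {"running": {}}},
                ]
            },
        },
        {
            "metadata": {"namespace": "games", "name": "worker-1"},
            "status": {"containerStatuses": [{"name": "worker", "state": {"running": {}}}]},
        },
    ]
}

print(crash_looping(SAMPLE))
```

A check this small covers one of the two alerts Mat says he actually acts on, which is the argument for starting with little and adding telemetry only when a concrete need appears.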

Bart: With regard to logs and metrics specifically, and you also mentioned alerts: observability is a big buzzword and a hot topic. Do you feel the tooling is going too far and becoming overly complex? Or do you think this is something that could get simplified in the future?

Mat: Right now, I think we're too complicated. And I think the default is that we're going to have to do a lot more work. There was this trend, the data lake trend, right? Where we said logs are not just logs. They're not just used for diagnostic purposes, they're also used for analyzing customer behavior. Now this trend has fallen a bit out of favor, I think. It's not something that I have done for a few jobs now, but it is something that I still hear about. And I think the complexities of metrics are similar. A time series database on the micro level is very easy to manage, right? It's very easy to cost control and it's very easy to understand what it's doing. With the very high degree of cardinality that people often think they want, the cost and the complexity explode, just because of the nature of it: these are different series of data, we have many of them, and they have to get stored and processed. And so you start to see technologies like Thanos and other monitoring platforms which attempt to split this problem, or diversify it, or load balance across it. And I think one thing that we need to get a little bit better about as an organization is to say, now that we have the capacity, which we've never had before, to get incredibly low-level metrics on every single route, every single pod, every single service, every single container inside that pod, do we need all of this? Or can we dump a lot of them and be satisfied with a higher-level abstraction? And what is our actual loss from that? Because that's the dialogue that I don't see happening. As we grow, we generate more logs, we generate more metrics, they cost more to store, they cost more to transmit, they cost more to parse. What's the ROI? Is your uptime that much better? Are we talking five minutes a week? Are we talking an hour a year?
And I don't think that there's really a great feedback loop on that stuff. So I think at some point we're going to recognize that metrics are just like databases: storing a bunch of them is really expensive, it's hard, it's complicated, and it requires a certain level of expertise. And similar to what we've done with database technologies, I think we're going to say the real solution is optimizing earlier in the chain and saying we can simplify, we can capture less, we need less cardinality, that sort of thing.
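
Note: the cardinality explosion Mat describes is multiplicative, which a few lines of Python make concrete. This is a minimal sketch with made-up label counts. It computes a worst-case upper bound (every combination of label values actually occurring), and the 15-second scrape interval is an assumption rather than a figure from the episode.

```python
from math import prod


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric.

    Every distinct combination of label values is its own series, so the
    bound is the product of the number of values each label can take.
    """
    return prod(label_cardinalities.values())


def samples_per_day(series: int, scrape_interval_s: int = 15) -> int:
    """Samples stored per day if every series is scraped at a fixed interval."""
    return series * (86_400 // scrape_interval_s)


# Hypothetical label counts for a single request-latency metric.
detailed = {"route": 50, "pod": 200, "container": 3, "status_code": 8}
coarse = {"route": 50, "status_code": 8}

print(worst_case_series(detailed))                   # 240000
print(worst_case_series(coarse))                     # 400
print(samples_per_day(worst_case_series(detailed)))  # 1382400000
```

Dropping the per-pod and per-container labels cuts the bound by a factor of 600 in this example, which is the "capture less, need less cardinality" trade-off in numeric form.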

Bart: With this in mind, and trying to keep things as simple as possible, you explained this notion of a traditional Kubernetes setup. Where does GKE come into all this? And thinking about the trade-offs, what are the things that people are going to gain with a simpler setup that's going to be less painful, that's going to make stakeholders calmer, keeping things on a need-to-know basis? Like you said, sometimes it's not necessary to go into so much detail. How does GKE play into those?

Mat: Sure. So the first thing that impressed me, right, was that the first conversation you have is, what's the node OS? And GKE has a solution to this. It's Container-Optimized OS. It's effectively a fork of Chromium OS. It's designed to do one thing and one thing really well, which is to be a read-only file system that runs containers. And so you don't need to think about updates. You don't need to think about end of life. You don't need to think about LTS. And because your team cannot muck with the node OS too much, that's a whole realm of concerns that you no longer have. The second thing is the CNI that you get just by running the defaults. They give you a ton of IP addresses, more than you could possibly need. You could grow to a massive, massive scale. You don't have to do any sort of customization. So CNI is off the table, right? The storage is also baked in. You don't need to make any decisions around that. Pass the request. You get the object. It starts allowing you to store things and allocate disk space. With the adoption of Gateway API, obviously, everyone has this now, so it is not a special GKE feature. But the flexibility of that ingress technology, internal load balancers, external load balancers, integrating that, I thought that was all super simple. And basically the decisions that you're making are: what version of Kubernetes do I want? How aggressive do I want to be with upgrades? And how many different machine types do I want for nodes? And I was like, this is great. It's a great concealed package. But the one that really stood out to me, for startups especially, was the Autopilot component, where you don't even make a decision about machine types. Because the big thing is, right, you pick a machine type. Everyone does this. And we all pick the wrong one, because we're like, it is going to be a very large instance.
And then you run it for three months, you look at your usage data, and you're like, I wasted $10,000. The thing that really impressed me about Autopilot, especially for startups, was that you aren't making any of those decisions. You run the deployment. The machine appears mysteriously in the background. And it starts receiving requests. In our own testing, we had tried to optimize node size and machine type ourselves, but with Autopilot we were seeing huge savings. And then there's also the ability to say, now I have Spot components, I can have GPUs, I can have persistent deployments. I have all these same technologies, but I don't have to worry about almost any of it. This was such a hands-off Kubernetes experience that I couldn't believe it. I didn't have to think about whether I could upgrade. I would go into the web console, click the dropdown, and it would say, yes, you can, or no, you can't. And then I would simply apply the Terraform, wait, and watch it proceed on its way. And that's when I was so impressed, because I was like, this is something that a sole operator could really manage. You don't even need an infrastructure team. You don't really even need to know anything about Kubernetes. You need to copy and paste four files, deploy, and then you're pretty much good to go.
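
Note: the "we all pick the wrong machine type" cost is simple arithmetic, sketched here in Python. All numbers are hypothetical (node count, monthly price, and utilization are invented for the example and are not GCP prices or figures from the episode), and the function name is an assumption.

```python
def idle_spend(node_count: int, node_monthly_cost: float, avg_utilization: float) -> float:
    """Monthly spend on capacity that is provisioned but not used.

    avg_utilization is the fraction (0.0 to 1.0) of the provisioned
    node resources that workloads actually consume.
    """
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    return node_count * node_monthly_cost * (1.0 - avg_utilization)


# Hypothetical cluster: 10 nodes at $400/month each, 30% average utilization.
monthly = idle_spend(node_count=10, node_monthly_cost=400.0, avg_utilization=0.30)
print(f"idle spend per month:    ${monthly:,.0f}")
print(f"idle spend over 3 months: ${3 * monthly:,.0f}")
```

Under a model that bills for what workloads request rather than for whole nodes, the `1 - avg_utilization` term is what shrinks, which is the saving Mat attributes to Autopilot.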

Bart: So like you said, in the case of startups that don't have, we can say, the luxury of lots of, you know, hands and minds to be working on these kinds of issues, it does provide a certain amount of comfort, like you said, abstracting a lot of these processes away, things like autopilot. One of the things as well, too, for a lot of startups that they'll struggle with is how much time they can really spend on security. And, you know, this has been a developing trend we've seen in the recent KubeCons, an increasing concern around security, software supply chain, et cetera, et cetera. From a security perspective, what do you find from GKE that perhaps you don't find from Azure, from EKS?

Mat: I mean, I think that the security model across all of them is getting much better. The primary value of GKE for me is that, because you are pushed into this OS that doesn't allow you to make these modifications, you get all of the node security that we've talked about before. They also have shielded nodes, so on startup it's all signed. There's no chance that there have been any modifications to this node OS, and I can proceed with complete certainty. And combining RBAC with Google Groups makes it super simple to say which groups of my users have access to what. It comes across as a very simple way of doing this, versus the common foot gun of RBAC in general, which is: I have a group of developers, I'm going to give them access to everything, and I assume that's the correct way to go. Plus, Google has a lot of really clever security recommendations. So you go through and it says your cluster is misconfigured, and it's misconfigured in these six ways. Now, this is something that is happening more and more. I think Azure is a little bit ahead of Amazon on this in terms of actively notifying you that something is a problem. But I was really impressed by how simple it was to say, hey, you should really ensure that users cannot escalate their permissions. You should really ensure that your containers are set up not to run as root and that they can't escalate their permissions. These are baseline checks. These are all of the deployments that have them incorrectly set. And I was like, this is great. I don't have to think about it at all. And then the other thing that I talked about was, obviously, we're all attempting to bring the concept of IAM inside of Kubernetes. We don't want to use the default node service account. It was very easy to set up inside of GKE, the same way it is inside of EKS, the same way it is inside of AKS.
But the primary value for me as a small team was GCP's application of machine learning to the IAM permissions, to say: you have made service accounts for these applications and you gave them too many permissions. And so we can roll these back, we can modify these, we can shrink these, because we know that they're not using them. And for me, that's a huge weight off my mind. Because I can start with a really big set of permissions for every single application, see what they're using in production, and then safely scale that back down over time, versus the model that I find so many organizations go with, which is either too much, or developers have to laboriously apply for every single permission set that they want, which is then transcribed into infrastructure as code.
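
Note: the workflow Mat describes, starting broad and shrinking to what production actually uses, reduces to a set difference. The Python below is a minimal sketch of that idea only. It is not how GCP's recommender works internally, and the function name and permission lists are illustrative assumptions.

```python
from __future__ import annotations


def trim_permissions(granted: set[str], used: set[str]) -> tuple[set[str], set[str]]:
    """Split granted permissions into (keep, removable).

    keep      = permissions that were granted and observed in use
    removable = permissions that were granted but never observed in use
    """
    keep = granted & used
    removable = granted - used
    return keep, removable


# Illustrative permission names for one hypothetical service account.
GRANTED = {
    "storage.objects.get",
    "storage.objects.create",
    "storage.objects.delete",
    "pubsub.topics.publish",
}
# Permissions observed in production over some review window (assumed data).
USED = {"storage.objects.get", "pubsub.topics.publish"}

keep, removable = trim_permissions(GRANTED, USED)
print("keep:     ", sorted(keep))
print("removable:", sorted(removable))
```

The hard part in practice is the observation window: a permission that is used once a quarter looks unused for months, so any automated trimming needs a review step before it is applied.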

Bart: Like you said, around the... the part of access control and permissions, these are things that we're hearing more and more about in terms of Kubernetes policy is providing guardrails to make sure that people are only focused on the things they should be focused on. And also to your point about misconfigurations, I think this came out a couple of years ago, maybe there was a report from Red Hat that said 71% of Kubernetes vulnerabilities were created because of misconfigurations, because of precisely what you were referring to. And for small startups, once again, you don't have enough time, you don't have enough budget to be able to have eyes and ears everywhere. So making that simpler, I can imagine, is a positive attribute. What about, you know, let's turn to the other side. When shouldn't an organization use GKE? When would you not recommend it?
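
Editor's sketch: the baseline checks Mat mentions (containers not running as root, containers unable to escalate privileges) map onto the Kubernetes `securityContext` fields `runAsNonRoot` and `allowPrivilegeEscalation`. The Python below is a minimal, assumption-laden illustration of such a check over a Deployment manifest that has already been loaded into a dictionary. The function name and sample manifest are hypothetical, and this is not the logic GKE itself uses.

```python
def baseline_findings(deployment: dict) -> list[str]:
    """List containers that may run as root or may escalate privileges."""
    findings = []
    name = deployment.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    pod_sc = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext") or {}
        # A container-level runAsNonRoot overrides the pod-level value.
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False))
        if not non_root:
            findings.append(f"{name}/{container['name']}: may run as root")
        # Treated as allowed unless explicitly set to False.
        if sc.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}/{container['name']}: may escalate privileges")
    return findings


# Hypothetical manifest: one hardened container, one left at defaults.
example = {
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "securityContext": {
            "runAsNonRoot": True, "allowPrivilegeEscalation": False}},
        {"name": "sidecar"},
    ]}}},
}
for finding in baseline_findings(example):
    print(finding)
```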

Mat: So like I tried to outline in the article, my experience at GCP has been a mixed bag. Some things about GCP I really enjoy and some things about GCP, like, clearly, like, don't get a lot of love as an organization. And so I think my concern with GKE would be... if you are heavily dependent on, let's say, Redis, your organization's huge, you need a lot of complicated features, it's real expensive in GCP, and it's real bare bones. So the service consistency, say what you want about AWS, but typically my experience has been that their service consistency is quite high. And if they launch a service, it's pretty good. Some exceptions: Elastic Beanstalk. But in general, they're pretty good. They have a high degree of success. GCP is a very mixed bag, right? Like, this is the example that I always tell people: how comfortable are you running Vault? And if you're like, I don't want to run Vault, I want to use my platform's secrets management, I'm like, GCP's is very bare. The first version was effectively an S3 bucket with secrets that were encrypted, and then you sort of handled the decryption part yourself. And I was like, this is incredibly bad compared to where Secrets Manager is in AWS, like compared to sort of where the rest of the organization is going. The second version is better, but not by much. And so again, that's a classic case of if you're using a few secrets or if you're okay with a lot of manual management, GKE is a good choice and you can use this platform. If you're not, then a lot of these things, if they're not giving what you want, your only solution is to run them yourself. And that would be where I would get a little bit concerned. If I was an organization that was like, I have a ton of needs for secrets and I don't have any headcount to manage something as complicated as Vault or I don't have the budget to do hosted Vault, I would say like, I don't know about GKE. Think long and hard about it. 
Try out the Secrets Manager. Like as an example, right? Secrets Manager in AWS has the rotation functionality which triggers off a Lambda. GCP's is, it puts a message into a Pub/Sub queue and then that's your problem. And a lot of stuff works like that where it sort of just becomes your problem. So I think that in that case, if you are looking at creating a traditional lift and shift, moving a bunch of stuff from a data center into the cloud and recreating that, and then adding GKE on top, so I mean, tons of VMs, tons of running Postgres, tons of running Redis, and then also Kubernetes for the new stuff, I don't know about GKE. GKE is better as a full buy-in. And it's best if you use it as like, you know, these are the parts: we're big into data analysis, we need lots of data sets, because they have advantages, data store and stuff like that. Like you have all these sort of backend technologies that are great. But yeah, I think that that would be my primary hesitation. Like I outlined it in the post, I was like, you know, the GKE setting up the load balancers is great, but the SSL certificates are a disaster. Like I tried to provision this public certificate. It took four hours. And I was like, we live in a universe of Let's Encrypt, and SSL certificates don't typically take four hours to get. So I don't know what's happening. And it's like the most hilarious error message because it's like contacting CA and you're like, I don't know what that means. Like, how long do you wait? Is it broken?
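
Editor's sketch: "it puts a message into a Pub/Sub queue and then that's your problem" means the rotation logic itself is code you own. The Python below is a minimal illustration of the consuming side, written against a plain dictionary of message attributes so that it runs without any cloud SDK. The attribute names `eventType` and `secretId` and the value `SECRET_ROTATE` are assumptions about the notification format and should be checked against the Secret Manager documentation; `handle_notification` and the `rotate` callback are hypothetical names.

```python
from typing import Callable, Mapping


def handle_notification(
    attributes: Mapping[str, str],
    rotate: Callable[[str], None],
) -> bool:
    """Call `rotate` for rotation notifications; ignore everything else.

    Returns True when a rotation was triggered.
    """
    # Assumed attribute name and value; verify against the real message format.
    if attributes.get("eventType") != "SECRET_ROTATE":
        return False
    secret_id = attributes.get("secretId")
    if not secret_id:
        raise ValueError("rotation notification without a secretId attribute")
    # Generating the new credential, storing it as a new secret version and
    # updating the dependent system is all the caller's responsibility.
    rotate(secret_id)
    return True


# Usage with a stand-in rotate function that only records what it was asked to do.
rotated: list[str] = []
handled = handle_notification(
    {"eventType": "SECRET_ROTATE", "secretId": "projects/123/secrets/db-password"},
    rotated.append,
)
print(handled, rotated)
```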

Bart: Good. No, I think that that, you know, the approach is sound. I think going back to a previous point that you mentioned, though, is like, from a positioning, from a branding perspective, that GCP, you said, sometimes is maybe siloed in this idea of telecommunications or that we're going to have labeling that's more CTO friendly. Do you anticipate in the future that some of these things will get smoothed out, or will they continue to focus on a particular sector or with a particular end user in mind, like you said, like a CTO?

Mat: I think that GCP is a little bit in a fight for its life. I think that it is... it's very clear from the outside, with no sort of inside information, but just as someone who follows a lot of their stuff, I get the sense that Google has adopted a much more aggressive approach towards profitability. And I think profitability in the cloud and increased revenue in the cloud is a major driver. I think that telecoms... AWS has a long series of sort of relying heavily and maybe abusing HTTP a little bit as like the way that its structure works, right? Like the way its sort of global system works is like a lot of network forgiveness. I think that the GCP networking is overall in better shape because it's sort of written on and running on Google's rails. But I also think that this is a little bit of a foot gun because I think that it increases this emphasis on more and more and more complicated networking features, which are just not relevant to... I was talking to a group of people and a telecom that uses GCP was like, we use all of this reserved IP space that Google frees up for us. And we were all looking around in the room and we were like, that's baffling. Why would you ever need more IP addresses? And so I think that there's a little bit of that, right? Where there's a little bit of focus placed on the telecom needs and not as much on the web developer needs. But my hope is that we can start to see GKE escape the shadow of this, because I think that there is a desperate need for it. I think that there is a desperate need to say, I'm making a new business. I want to start it on a scalable platform. I want to use Kubernetes because I know I'll be able to hire for it in the future. I also know I'll be able to transition to new providers in the future if I want to. And I think Autopilot gets you there. And I think GKE in general gets you there without locking you in too much. So I hope it gets better. So I hope that it gets a little bit more friendly. 
I think that there would have to be a lot more community outreach at this point to get the news out, where I think it's falling down right now.

Bart: All right. Well, Google folks, I could mention some by name. I hope you're listening. No, but really, I think it's an open invitation for those conversations. Do you anticipate to do a deep dive around Amazon or around EKS or AKS in the future like you've done for GKE?

Mat: Yeah, actually, I'm working on the AKS one right now. So I'm really excited to get that out. It's a little bit longer. So there was a lot of feedback of like, we'd like to read this for other things. And so I have one for DigitalOcean and then I also have one for AKS coming out.

Bart: Fantastic. In terms of the feedback that you got for this one about GKE, what was the response from the community?

Mat: I think there was a very small but vocal component that was like, I would never do this because it's not really Linux and it's not really open source technology. And I think that part of that boils down to the container optimized OS is not like an OS that you can go out and get. You can compile it, but you can't install it and test it on your own machines. And I also think that there is a bit of hesitation about... taking a platform as flexible as Kubernetes and doing as much vendor lock-in as you can sometimes get with GKE. Now, you don't have to, right? Like you have the ability to sort of opt out of those technologies. But I think EKS's whole thing was you opt in or you run it yourself. And I think GKE's is you opt out. And I think that that like initial approach is a delineation that makes it much more simple, but also creates like a little bit of tension when people look at it. But overwhelmingly, I think the feedback that I got was like, I have wanted to run Kubernetes, but I have been afraid of it. And that was, I think, overwhelmingly what I heard, which was people were like, I wrote six files and some Terraform. I turned on autopilot and it just started to get traffic from the Internet. And I'm really happy. So I was super pleased by that. That's exactly what I want to hear.

Bart: That's great. Like you said, the 20% of the haters out there are always going to make a lot of noise. But no, but I think that's, once again, you're putting yourself out there. You took the time to do it. For the people that are going to complain, it's like, well, feel free to write your own article and I'll be happy to give you feedback on it as well.

Mat: Yeah. I mean, I'd love that. You know, one of the big reasons why I write this stuff is because I am so desperately interested in more open communication between tech companies. Because I think that we're all recreating the same experiments and we're all trying the same technologies. And it's just like, I joke that one of the primary reasons to go to tech conferences is to get two beers in the guy from X big corp or the woman from X big corp and just be like, what doesn't work? And they're like, oh, let me tell you the secrets of Cassandra at scale or whatever, you know, like any technology. And you're like, oh my God, this is amazing. Like the secrets, like I know what it's going to do at scale. Like I've learned so much. And I just would love to see more of that and people just saying like, this works, this doesn't work, it's exciting or it's not exciting. It would just be great to not have to like sort of learn it yourself every time.

Bart: Yeah. And it seems like you've done a lot of learning on your own, right? In this process. No, but I think it's, I think it's a sign, you know, of an active mind and being hungry for knowledge, sharing the knowledge that you wish you had had, you know, before going into all of this, you know, being, being an answer to the questions that you had previously. And I think it's, I think it's a valuable service to people out there and also an encouraging sign to others that they can do the same thing, you know, whatever it is that they're working on, that they can put those things out in the open. And what we get from all these conversations too, is that so much of this is about learning how to learn and what's gonna work best for you. In your case, you mentioned, really using a lot of the Kubernetes documentation. I think you talked about blue hyperlinks that are now purple, from clicking on all the different things there. Everybody learns in a different way. So whether you go in person to a conference or attend virtually, take advantage and ask the speakers questions. Like they want to be asked questions. That's why you will go and give a talk. So I think that's a great thing to keep in mind. You said you're working on the post about AKS. Apart from that, any other hobbies or side activities you have going on that you'd like to share?

Mat: Yeah. I mean, I think in general, right now there's been a lot of... I've recently written a post about, so I went to this startup meetup and it was like very young people. And I wrote this post that was like idiot proof infrastructure. And the idea was, what is the absolute minimum that you would have to construct in order to set up what I would consider to be baseline efficient infrastructure. And it was about like using Cloudflare as a load balancer, as sort of an abstraction level, and then relying heavily on cloud-init to sort of stand up these servers with the understanding that like they're disposable. And then using this great technology called Watchtower, which pulls Docker containers from private registries without you having to do any sort of deploy step. But it also has like an HTTP endpoint that you can set up to ping it to trigger a deployment. And at the end of all this experimentation, I was like, this might be sort of something. Like, this feels like it might be a functional product. So what I've started to do is started to write effectively this Golang CLI. And the idea is that, like, everyone talks about multi-cloud. Everyone talks about AZs. What if you could do, because Cloudflare load balancers are at an abstraction level, what if we could make it so that, like, with one CLI, you could split your server groups between, like, Hetzner and DigitalOcean or AWS and whatever? And sort of through the use of Cloudflare tunnels, like, you could even have them be private VPCs. And so we could really completely remove the surface area of attack. We could really completely remove the concept of, you know, auditing, so much of the stuff would go away. And so we tried it out with a couple folks in Copenhagen. I took this horrible monstrosity of a CLI. And I sat them down and I was like, does it work? And the feedback was like, kind of. But that was sort of the idea. I was like, this is actually maybe kind of interesting. 
And it's been a super interesting learning journey for me to talk to people because it helps me understand what startups are doing now, which is so much different than what it was like when I was doing it seven years ago. The landscape has really completely changed, which is great, and it's why this field is a great field. But yeah, that's been sort of my hobby right now, which is trying to come up with, is there a way of saying we have a completely cloud platform agnostic idiot-proof infrastructure that relies on containers, that could potentially be a stepping stone into Kubernetes? That is sort of the long-term dream.
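
Editor's sketch: the "HTTP endpoint that you can set up to ping it to trigger a deployment" can be pictured as a single authenticated request sent from a CI job. The Python below only builds that request and does not send it. The `/v1/update` path, port 8080 and bearer-token header reflect the editor's understanding of Watchtower's HTTP API mode and should be verified against the Watchtower documentation; the host name and token are hypothetical.

```python
import urllib.request


def build_update_request(
    host: str, token: str, port: int = 8080
) -> urllib.request.Request:
    """Build (but do not send) a request asking Watchtower to check for updates."""
    # Path, port and auth scheme are assumptions; confirm them for your version.
    url = f"http://{host}:{port}/v1/update"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical host and token, for illustration only.
req = build_update_request("app-server.internal", "example-token")
print(req.get_method(), req.full_url)

# A CI job would send it with: urllib.request.urlopen(req, timeout=30)
```

Keeping the trigger to one request is what removes the separate deploy step: the pipeline pushes an image, pings the endpoint, and the server pulls the new container itself.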

Bart: Once again, I think it's of enormous value for people that are terrified of not even knowing which questions to ask or where to start. And it's true. And a lot of people may not even want to admit it. They know that they're going to need this, whether it's now or in six months. But because of the learning curve, barrier to entry, it's something I actually talk a lot about in the CNCF: how can you get non-technical stakeholders onboarded, whether it's through the Cloud Native Glossary or whether it's through certifications like the KCNA? And there's still a lot of work to be done. And so I think it's great that you're out there making it easier for folks to get involved. If people want to reach out to you, if someone from AWS or from Azure watches this and wants to hire you, what's the best way to do so?

Mat: The one and only social media for me is Mastodon. Or my website is my name, matduggan, with one T. And like I try to say in every article, I love feedback. I love finding out that I was wrong because it means that like, it's great and the thing can evolve and develop into something new. So if folks find out about this from your podcast and go there and find out something wrong, just... totally open, totally receptive to feedback, not pushing for new readers. You read whatever you want, but if you happen to go there and you find something wrong, I'd love a, I love a note. It can even be a public note. I don't even mind. I just had the, I had this great experience with an Amazon engineer who was like very polite, but was like on another post was like, I need to chat with you for 15 minutes. And I was like, that was very, that was very educational. Thank you so much.

Bart: Was it only 15 minutes? I have to ask.

Mat: Hey, no, it went a little long. I had some questions that evolved into other questions that then hit some, you know, I can't violate my, you know, privacy, an NDA and whatever, stuff like that. But I was like, I have to ask more questions about what you're talking about because you made references to like some load balancing structure. And I was like, I've never heard this before. But yeah, I think in general, like the other thing that I would love to get out to the audience is just like, you know, I meet so many interesting people that have such interesting stories to tell at tech conferences, at startup meetups. And it's such a shame that we never hear from any of them. And if you are a person out there who has a very unusual story about technology or how you got involved or your career path, I would hope and I just want to encourage people, we need more new voices desperately.

Bart: We need...

Mat: As a tech community, I guess I'm part of the problem, part of the solution or whatever. But if you're listening out there, you're starting out new in this field, don't think that nobody wants to hear from you. I'd love to hear from you. And I think everyone else would too. I would love to get a fresh set of perspective on these problems, fresh new ideas. So that's just my note of encouragement for anyone who's ever been like, my experience never matches up with any of my coworkers. No one ever seems to talk about what I'm going through. I'd love to read your blog. I'd love to read your articles. Like, please. And if, you know, if I could be of any assistance in that whatsoever, like don't hesitate to reach out.

Bart: You know what? This is about the 10th episode that we're recording for this podcast. And that's by far one of the best lessons that anyone's shared, which is not a technical one. And, and I really, I thank you for saying that, this is going to be a great clip because the idea of this podcast as well is to get those voices, the folks that aren't necessarily out there giving talks all the time or writing blogs, saying like, hey, there's really not that much to lose here. And as long as you contextualize things and say, look, this is just my experience. And if someone responds in a bad way, that's their problem. And because sometimes people are afraid of the call out culture, that things can be very toxic and common. And I completely understand that. And I've seen that and I've gone through it. But in general, most of the people you meet are very supportive. And if you frame things correctly, you can most likely, like you said, people will help you maybe find a mistake that you made and guide you into a more correct answer. And for some people, giving a talk in public is terrifying. Most people don't like public speaking, which is totally normal. But that's why you can write a blog. You can contribute to documentation. You can do tutorials. There are lots of different ways to contribute. So it's just about figuring out what's going to work best for you. Does everybody need to become a YouTuber or a content creator or something? Hopefully not, all right? Because we're already saturated. But I think it's a wonderful thing of you to share. And for people that are maybe a little bit shy or hesitant, if you've gone through something as a difficulty, somebody else has too. So if you share your experience about what worked best for you in terms of troubleshooting and resources that you might recommend, that's a gift to somebody else to help them get through the same obstacle that you've been dealing with as well. So I really liked that reflection. 
That being said, you can expect to hear from all the cloud providers, whether it's on your blog or on Mastodon. I definitely look forward to the AKS post coming out. I think it could be a good opportunity to do some follow-up. And I also really appreciated the amount of detail that you shared in the article. We'll be linking that when we upload the podcast so people can check that out. And I hope you'll hear from some of our listeners. So thank you very much for all the work that you're doing and keep it up. I just, I don't know where you get the time to do this stuff, but whatever your secret is, keep going because it's working.

Mat: Well, thank you so much, Bart. I really appreciate the time and the opportunity. And it's been a lot of fun. And I love that you're in the same time zone as me. This has been amazing.

Bart: Long may it continue. Yes. Yeah, don't go anywhere. So when we do the next call, we can stay in the same time zone. But yeah, we will be in touch. I look forward to your next steps and the future content you'll be creating.

Mat: Okay, great. Thank you so much.

Bart: Cheers.

Mat: Cheers.
