Platform engineering: balancing tools, teams and real-world needs
This interview explores how to build effective internal platforms and make informed decisions about cloud-native tooling.
In this interview, Billy Thompson, Head of Global DevOps & Platform Engineering at Akamai Technologies, discusses:
Why organizations should validate the need for platform engineering by identifying specific toil to reduce and ensuring proper scale before implementation
How to build internal platforms that get adopted by involving all stakeholders and leveraging existing open-source tools instead of custom solutions
The importance of choosing infrastructure tools like Terraform or Crossplane based on actual requirements rather than following industry trends
Transcription
Bart: So, first things first: who are you, and what do you do at Akamai Technologies?
Billy: My name is Billy Thompson, and I work for Akamai Cloud Computing (formerly Linode). I am a DevOps and platform engineering specialist on the cloud CTO team. This means that for any customer engagement involving the team, if the workload involves DevOps or platform engineering, I am the resource they bring in.
Bart: Sounds good. Now, Kubernetes has many different features, but which one is your least favorite?
Billy: The most recent pain point I ran into was getting cert-manager to pass HTTP-01 validation when using proxy mode with the load balancer. I had it working before, but it doesn't work all the time. I spent a few days digging deep into troubleshooting it, which was great overall, because every time that happens I learn a lot. The problem wasn't unique to me; many others have hit the same issue. I find it frustrating that there hasn't been a more universal approach to dealing with it, especially when discussing observability and security: when using proxy mode, it's essential to see real client traffic in your web logs. Although the issue was frustrating, I got some good ideas for working around it, and I might turn it into a course or share it on Dev.to or LinkedIn later. There's always something good to find in the challenges we face.
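One common workaround (an assumption on our part; Billy doesn't spell out his fix in the interview) is to sidestep the HTTP-01 self-check entirely by switching the issuer to the DNS-01 solver, so validation traffic never passes through the load balancer at all. A minimal sketch, assuming Let's Encrypt and a Cloudflare API token stored in a Secret named `cloudflare-token`:

```yaml
# Sketch of a cert-manager ClusterIssuer using the DNS-01 solver instead of
# HTTP-01. The email address and the choice of Cloudflare as DNS provider
# are illustrative assumptions, not details from the interview.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-dns-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-token
            key: api-token
```

With this issuer, Certificate resources validate via DNS records, so the load balancer's proxy-mode behavior no longer matters for issuance.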
Bart: Now, in terms of the questions we're going to cover today, the first one is around tooling and platform engineering. One of our guests stressed that standardizing everything makes cluster management easier. What's your advice for building platforms that can be used by several teams in an organization?
Billy: First and foremost, it's essential to answer the question of why you need to build an internal platform. What part of your toil does it reduce? Is it your toil or someone else's? If it's not going to resolve the core of some toil that you have, then it's probably just going to add more silos. Platform engineering may be one way to address that. However, it's also worth noting that platform engineering becomes more effective at a certain level of scale. A lot of advice suggests that this is what you should do, but are you always at the scale where you necessarily need to do that? Trying to do it too soon may also add more toil.
Understanding what you're trying to solve and being able to answer the question of whether it's what you need is crucial. If you've determined that building an internal platform is a good idea, it's still likely to struggle to be effective if you don't get the love and adoption from the platform users, such as SREs, dev teams, and internal customers. My advice is to make sure they all have a seat at the table and that this isn't just something coming from the top down.
Every company culture is different, and there's more to this than it may seem. However, a key piece of advice is not to reinvent the wheel. Use things like Kubernetes, whose reconciliation loops and orchestration do most of the heavy lifting to maintain a desired state. We should leverage open-source tooling that is platform- and cloud-agnostic, community-supported, flexible, battle-tested, and works wherever we deploy, rather than writing our own custom code.
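As a concrete illustration of that reconciliation model (our example, not one from the interview): declaring a desired state in a Deployment is all it takes for Kubernetes to keep three replicas running, restarting or rescheduling pods as needed.

```yaml
# Minimal Deployment: the controller continuously reconciles the cluster
# toward this declared state -- delete a pod and a replacement is created.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27   # pinned tag; any image works here
        ports:
        - containerPort: 80
```

This is the "heavy lifting" an internal platform can lean on instead of custom code: you describe the end state, and the control loop converges on it.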
While there are scenarios where writing custom code makes sense, consider that every time it's your code or a managed service layer that you use on top of this, it's a dependency on a particular vendor, which eats into flexibility and portability. These may be problems downstream, and when you hit those problems, if you can't move, it's an even bigger problem.
Billy: For learning purposes, I think it's great to roll your own IDP from scratch. However, when piecing something together for production, ask what is easiest to maintain. My best advice is to use whatever is flexible and portable and doesn't make you work any harder than you already do. That, to me, is a good strategy.
Bart: Just a quick follow-up question on the subject of silos. We had a podcast guest, Michael Levan, who wrote an article that stirred the pot a bit, arguing that silos are a good thing: specialization inside a team is helpful, and people shouldn't be expected to be jacks-of-all-trades who get spread too thin. What's your take on that?
Billy: That's an interesting perspective. When we talk about eliminating silos, there's a lot of dogmatism in that approach that may not be necessary. The question is: what part of the silo is the problem? This goes back to what we're trying to solve. Looking at the origins of DevOps, the problem was the friction between the people in the software development cycle and the people maintaining the infrastructure, which made it difficult to move forward. A continuous feedback loop can help, where structurally and culturally each person's problem is also the other's problem. That means needing visibility into each other's world to make things easier. However, it doesn't mean eliminating silos to the point where everyone can see everything and nobody specializes in what they do. Instead, it's about identifying the barriers that prevent people from doing what they're good at and letting their specialties complement each other. Specialization is still necessary, but being overly dogmatic or hyped up about terminology like "break the silos," "multi-cloud," or "GitOps," or even good methodologies like test-driven development, can cause more friction than intended. Does that answer the question?
Bart: So, the next question is about GitOps and infrastructure as code. One of our guests, Dan Garfield, said that using Kubernetes as a central data store allows tools like Argo to detect and sync drift in your infrastructure, also known as configuration drift. In comparison, tools like Terraform externalize their state and are harder to track. Is the market moving to tools like Crossplane to provision infrastructure and away from Terraform?
Billy: I want to say yes and no, because it depends. For all intents and purposes, Terraform can also reconcile drift in infrastructure. The core of this question is automation: automatically detecting and reconciling drift, which Terraform on its own was not purpose-built for, unlike Crossplane.

At a certain level of scale, absolutely. In practice, I've seen Crossplane rise to the occasion and take over, especially in enormous organizations with geographically diverse teams and a large amount of tooling, managed services, and different cloud providers in use across various teams. A little configuration drift can really cause problems there. Having something that runs a continuous reconciliation loop and facilitates a lot of that, rather than relying on large Terraform states, is the way things are going in that context.

However, questions like this, and the best practices around them, usually come up when looking at large-scale deployments. Even at the enterprise level, not everything is at that scale or needs that degree of continuous reconciliation. Many workloads across staging, development, testing, internal products, and B2B products can run in Docker containers with a daily backup strategy and a nightly Terraform state refresh without overcomplicating things. There are also use cases in the SMB and mid-market, and even in startup land, where things are not at a scale that requires Crossplane.

The last team I worked on supported a customer-facing product generating a few million in revenue, but due to organizational constraints it doesn't run in Kubernetes; it runs on VMs. Our choice of tooling was Ansible, because it's good at both provisioning infrastructure and configuration management: easy to learn, easy to maintain, and consistent across deployments. It can also detect and resolve configuration drift, making it a great choice in a scenario where Crossplane was not an option. There are plenty of other examples like that. I wouldn't say the industry is moving away from tools like Terraform. In certain circumstances, at a particular scale, and when heavily relying on the automation of Kubernetes, absolutely, but not entirely.
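To make the contrast concrete (our own sketch, not an example from the interview): with Crossplane, infrastructure is declared as a Kubernetes resource, so a provider controller continuously reconciles the real cloud resource against the spec, whereas a Terraform state only converges when someone runs a plan and apply. The API group, kind, and fields below are illustrative, modeled loosely on typical Crossplane providers rather than taken from any real one:

```yaml
# Hypothetical Crossplane-style managed resource: the provider's controller
# watches this object and continuously reconciles the actual VM toward it.
apiVersion: compute.example.org/v1alpha1   # illustrative API group
kind: Instance
metadata:
  name: web-01
spec:
  forProvider:
    region: us-east
    type: g6-standard-2     # instance size; real field names vary by provider
  providerConfigRef:
    name: default           # credentials/config for the target cloud
```

If someone resizes the VM out of band, the controller detects the drift on its next reconcile and reverts it; with plain Terraform, that drift goes unnoticed until the next plan is run.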
Bart: Cattle versus pets, infrastructure as code. Dan Garfield shared that your cluster only feels real once you set up ingress and DNS. Before that, it was just a playground where it didn't matter when stuff broke. What are your thoughts on this? To me, just having ingress and DNS still isn't quite enough to make it more than a playground.
Billy: When you can hit that "Welcome to nginx!" page and it has SSL, that's still a "hello world" from my point of view. The point at which it really becomes something serious is when it's more than exploration: when you're using it because you've identified a problem it can solve for you.

As a nerd, I love setting things up, breaking them, and figuring out how to fix them, even on my personal laptop. That's why I started using Arch Linux in the first place: I wanted something that would break on me so I could learn to fix it. Back when Arch was more difficult to use, it was a great learning experience.

However, if I'm looking at a real use case, such as deploying Keycloak to issue just-in-time tokens for authentication and using OpenID Connect to access an internal Git repository, then it becomes more serious. For instance, I want to control which Kubernetes resources and namespaces different team members have access to, so they have what they need, their own dashboards, and so on. Now we have a real use case, something we actually want to deploy and use internally to help us succeed, identifying choke points and areas where things are not going smoothly. At that point, I'm taking my cluster with my ingress and my welcome page and looking at adding Git, Argo, and Keycloak, turning it into something we can use in a real scenario. And then it gets serious. Troubleshooting is not as fun when it's not going well.
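The namespace-scoped access Billy describes maps naturally onto Kubernetes RBAC once the API server trusts Keycloak as an OIDC provider. A minimal sketch (the group name, prefix, and namespace are our illustrative assumptions):

```yaml
# Sketch: granting an OIDC group (e.g., one defined in Keycloak) edit access
# to a single namespace. Group and namespace names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a
subjects:
- kind: Group
  name: oidc:team-a        # group claim as mapped by the API server
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit               # built-in aggregate role
  apiGroup: rbac.authorization.k8s.io
```

This grants members of the Keycloak group `team-a` the built-in `edit` role in the `team-a` namespace only; the `oidc:` prefix assumes the API server was configured with `--oidc-groups-prefix=oidc:`.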
(Note: this transcript was autogenerated from an audio file.)
Bart: I talk to a lot of people and do a lot of interviews. You have obviously thought about a lot of these things and express yourself in a very thoughtful way. Who are the people in the ecosystem who you look up to the most?
Billy: I think the person who inspired my particular style of troubleshooting is Elvis Segura, someone I worked with in professional services. His troubleshooting methodology, which involves dividing the problem in half repeatedly, inspired me to find joy in troubleshooting and thrive in chaotic situations. I enjoy certain levels of complexity, such as Kubernetes, which can have a difficult learning curve. However, I was excited to learn about it and didn't find it hard because I was eager for the challenge.
I also had a former manager who was very experienced and would often pick my brain on particular ideas or issues. He believed that regardless of seniority or experience, there's always something to learn from others, even junior engineers, who can offer a different perspective. This attribute has resonated with me, and I look up to him for it.
In terms of resources, I find that the best conversations and most exposure come from interacting with customers, especially when things aren't going well. I also attend conferences, where I try to keep an ear to the ground and look for inspiration about new things to try. I enjoy hearing stories about how people solved problems and learning from their experiences. While I have trouble listening to podcasts due to my ADHD, I find that one-on-one conversations at conferences are very valuable.
I recommend trying to get your company to sponsor conference attendance or submitting proposals to give talks about topics you're passionate about. You can also look for student or learner discounts. Meetups are another great way to learn from others. I also find the website Dev.to to be a great resource for technical content. It features posts from people at various stages in their careers, including those who are just starting out and excited to share what they've learned. The site also has more advanced material and cross-pollinates different topics.
One thing to consider is that even if you're an expert in one area, such as Kubernetes, you may still need to start from the basics when learning something new, like a distributed message queue. Dev.to is a great place to find articles that cover fundamental concepts and provide a clear understanding of new topics. I think it's a great resource, and I'm surprised that many people in the industry don't know about it. I often recommend it to others, and I encourage everyone to check it out, create a profile, and start writing.
Bart: We have done four seasons of our podcast, and this is a question we ask almost everybody. In terms of resources, this is only the second time Dev.to has ever been mentioned, and it may be the only time. People refer to media like Reddit and Hacker News, among other places, but Dev.to rarely comes up. I have seen it, but across the ton of people we have interviewed, it hadn't been mentioned. So I think that's very useful. Billy, what's the best way for people to get in touch with you? You work for Akamai Cloud Computing (formerly Linode).
Billy: The best way to get in touch with me is honestly just LinkedIn, as it's what I check most often. I'm always happy to receive feedback or ideas, and if you disagree with something I've said, I'd love to hear your thoughts. Please don't hesitate to reach out to me on LinkedIn. As I mentioned, I travel to a lot of conferences and am on the road frequently, visiting many cities in the US and Europe, and occasionally the APJ region. Before I attend a conference, I usually post about it, so you can meet me in person if you're interested. I'm working on improving my communication about my travel schedule.
Bart: Perfect. Billy, thank you very much for your time today. I look forward to seeing you in the future.