Balancing tooling and developer experience: building mission-critical platforms

Guest:

  • Katie Lamkin-Fulsher

From platform engineering to developer experience: exploring modern Kubernetes practices at scale.

In this interview, Katie Lamkin-Fulsher, Staff Product Manager of Platform and Open Source at Intuit, discusses:

  • How GitOps practices shape platform engineering at Intuit, focusing on auditability and maintaining a single source of truth for deployments

  • Their approach to Progressive Delivery which led to a 50% reduction in incidents through automated canary deployments and rollbacks

  • Improving developer experience through AI-powered tooling, including automated log analysis and build failure summaries

Transcription

Bart: Who are you? What's your role, and where do you work?

Katie: My name is Katie. I am a staff product manager of platform and open source at Intuit.

Bart: What are three Kubernetes emerging tools that you are keeping an eye on?

Katie: As a product manager, specifically of Argo CD and Argo Rollouts at Intuit, the number one emerging project I'm keeping track of is our new Argo Promotions project, which focuses on environment promotion in a declarative way. The second project I'm tracking is K8sGPT. One of the biggest problems we have internally is summarizing the output of our build logs and deploy logs when failures occur, and using GenAI to present that output in a form our developers can consume. The fact that K8sGPT works this way for Kubernetes issues is really interesting to us. The third project I'm tracking is a new one announced by Reddit a couple of days ago, called Achilles SDK, which is an easier way to write Kubernetes controllers.

Bart: Next questions are based on comments given by our podcast guests. We are talking about GitOps and platform engineering. One of our guests, Hans, argued that GitOps is an excellent building block for building platforms with great developer experience. He mentioned the ability to merge, review, and discuss code changes and pull requests, and the additional benefit of not granting permissions. Should all platforms use GitOps? What's your experience?

Katie: That's a great question. At Intuit, we use a flavor of GitOps, which has worked out successfully for us. However, whether every company should use GitOps really depends on the company's situation. Certain features, such as auditability and having a single source of truth for all resources deployed into different environments, come out of the box with GitOps and are desirable for every company. Nevertheless, GitOps is not the only way to accomplish these things. Companies can achieve the same results using other tools, although this may require building custom solutions.

Bart: Speaking about availability and platform engineering, our guest Hans compared delivering software now to 20 years ago. He mentioned that while downtime was acceptable in the past, it isn't today. Hence, building platforms on top of Kubernetes requires more tooling than ever, which can be managed using practices like GitOps and tools such as Argo CD. Is it possible to keep tooling sprawl at bay? What kind of tools are essential for building mission-critical platforms, especially when considering Progressive Delivery strategies like canary deployments?

Katie: As the Kubernetes landscape continues to evolve, more and more tools are going to come alongside it. In terms of availability, the number one thing I've seen success with at Intuit is Progressive Delivery. An example of this is Argo Rollouts, which a majority of our service workloads run on today. We use a stepwise canary deployment to improve availability in production. When we make a change and deploy it into production, we send 10% of our production traffic to the new change. If the metrics tell us something is wrong, we're able to roll back automatically. With the adoption of Progressive Delivery, we've seen a 50% decrease in incidents compared to workloads that haven't adopted it. As we continue to advance our availability strategy, we are looking to use a new open source tool called Numaproj to generate an anomaly score. This will make our rollback process more sophisticated and quicker, moving beyond static metrics to a more dynamic approach.
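
To make the canary flow Katie describes more concrete, here is a minimal Python sketch of a stepwise rollout with an automated rollback. It is an illustration only, not Argo Rollouts' implementation or Intuit's configuration: the only figure taken from the interview is the initial 10% step, while the later steps (50, 100), the error-rate metric, the threshold, and the names `run_canary` and `CanaryResult` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CanaryResult:
    promoted: bool
    final_weight: int  # percent of traffic on the new version when the run ended


def run_canary(
    weights: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> CanaryResult:
    """Walk through traffic weights; roll back on the first unhealthy step."""
    for weight in weights:
        # Shift `weight` percent of traffic to the new version, then measure.
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            # The metric breached the threshold: send all traffic back to stable.
            return CanaryResult(promoted=False, final_weight=0)
    # Every step stayed healthy, so the new version takes all traffic.
    return CanaryResult(promoted=True, final_weight=100)


# Hypothetical metric sources: one healthy release, one that errors immediately.
healthy = run_canary([10, 50, 100], lambda w: 0.001, max_error_rate=0.01)
broken = run_canary([10, 50, 100], lambda w: 0.05, max_error_rate=0.01)
print(healthy)
print(broken)
```

In this sketch the threshold is static. The anomaly-score approach mentioned above would, presumably, replace the fixed `max_error_rate` comparison with a score computed from live metrics.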

Bart: Platform engineering and people. Our guest already shared that rushing into solutions without understanding the root cause can lead to fixing symptoms instead of the actual problem. He mentioned the case of network policies and how sometimes the root cause of a problem is a people problem, and the solution lies in addressing that. What is your experience with providing tooling and platforms on Kubernetes to other engineers? What are some of the soft challenges that you've faced?

Katie: That's a great question, and it relates directly to my talk at ArgoCon yesterday. We've noticed that platform engineers find the Argo CD UI extremely intuitive and easy to use. However, for application developers who aren't as experienced with Kubernetes, it can be daunting and confusing, and that can lead to people causing outages. As a platform team producing technology and tools to make developers more productive, it's essential that we understand their perspective and prevent them from making mistakes. To achieve this, we've introduced fine-grained RBAC policies to ensure good coverage. Another area we're exploring, related to K8sGPT, is using GenAI to summarize issues with our builds and deployments within Argo CD. Currently, developers have to sift through thousands of lines of build logs when they experience a build failure. We've built tools that take these logs, generate a summary, and provide feedback to developers, giving them a concise knowledge base to solve issues instead of having to review the extensive logs. We've accomplished the same with Kubernetes logs within Argo CD.
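
As a rough illustration of the "thousands of lines down to something readable" problem, the Python sketch below trims a build log to the lines around apparent failures. It is not Intuit's tooling and involves no GenAI: it is a plain keyword filter of the kind that might run before a summarizer, and the pattern, the function name `extract_failure_excerpt`, and the sample log are all hypothetical.

```python
import re
from typing import List

# Assumed failure markers; a real build system would need its own list.
ERROR_PATTERN = re.compile(r"\b(error|failed|exception)\b", re.IGNORECASE)


def extract_failure_excerpt(
    log_lines: List[str], context: int = 1, limit: int = 20
) -> List[str]:
    """Return matching lines plus `context` lines on each side, in log order."""
    keep = set()
    for i, line in enumerate(log_lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(log_lines), i + context + 1)
            keep.update(range(start, end))
    # Cap the excerpt so downstream consumers get a bounded input.
    return [log_lines[i] for i in sorted(keep)][:limit]


sample = [
    "step 1: checkout ok",
    "step 2: compile started",
    "ERROR: missing symbol Foo",
    "step 2: compile aborted",
    "step 3: skipped",
]
excerpt = extract_failure_excerpt(sample)
print("\n".join(excerpt))
```

The excerpt keeps the failing line with its immediate neighbours, which is often enough context for a person, or a summarization model, to start from.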

Bart: Kubernetes turned 10 years old this year. What should we expect in the next 10 years to come?

Katie: The landscape has completely exploded over the past 10 years of Kubernetes, with numerous companies now providing a wide range of open-source tooling for configuration, security, controllers, and operators. As the landscape continues to grow, so does the complexity of the project. In my opinion, over the next 10 years, a strategy will form to keep Kubernetes usable for both platform and application engineers. This will involve building layers on top of Kubernetes to abstract and simplify it as an orchestration tool, allowing it to continue evolving with the times, possibly incorporating concepts like GitOps, and other tools such as Argo CD, Argo Rollouts, or Progressive Delivery to manage the complexity.

Bart: Katie, what's next for you?

Katie: What's next for me is looking into that problem internally. At Intuit, we are building an AI-powered runtime tool to abstract Kubernetes from our application developers, allowing them to focus on writing code without worrying about the underlying infrastructure. This is a huge priority for Intuit, and I'm excited to be an active part of it.

Bart: How can people get in touch with you?

Katie: I am on CNCF Slack. You can reach out to Katie Lamkin-Fulsher there, or you can reach out to me on LinkedIn as well.
