How We Cut Build Debugging Time by 75% with AI

Mar 17, 2026

Host:

  • Bart Farrell

Guest:

  • Ron Matsliah

Build failures in Kubernetes CI/CD pipelines are a silent productivity killer. Developers spend 45+ minutes scrolling through cryptic logs, often just hitting rerun and hoping for the best.

Ron Matsliah, DevOps engineer at Next Insurance, built an AI-powered assistant that cut build debugging time by 75% — not as a dashboard, but delivered directly in Slack where developers already work.

In this episode:

  • Why combining deterministic rules with AI produces better results than letting an LLM guess alone

  • How correlating Kubernetes events with build logs catches spot instance terminations that produce misleading errors

  • Why integrating into existing workflows and building feedback loops from day one drove adoption

  • The prompt engineering lessons learned from testing with real production data instead of synthetic examples

The takeaway: simple rules plus rich context consistently outperform complex AI queries on their own.

Subscribe to KubeFM Weekly

Get the latest Kubernetes videos delivered to your inbox every week.


Transcription

Bart Farrell: Build failures are one of those quiet productivity killers in engineering teams. Logs are messy. Errors are cryptic. Developers rerun pipelines and hope for the best. And platform teams pay the cost in time, frustration, and wasted compute. Today, we're talking about how to actually fix that. My guest on KubeFM is Ron Matsliah, a DevOps engineer on the developer experience team at Next Insurance, where he works on platform engineering, CI/CD, and Kubernetes at scale. Ron recently wrote about how his team cut build debugging time by 75% by combining Kubernetes context, rule-based classification, and AI, not as a shiny dashboard, but directly inside the tools developers already use, like Slack. In this episode, we get into why logs aren't enough in complex Kubernetes CI/CD pipelines, how spot instance interruptions and infrastructure issues create misleading failures, why they combined deterministic rules with AI instead of letting an LLM guess, and what actually changes when developers get clear, actionable feedback instead of noise. If you care about developer experience, platform engineering, or making Kubernetes-based CI/CD less painful, this one's for you. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped thousands of Kubernetes engineers all over the world level up through their training courses. They are instructor-led and are 60% practical, 40% theoretical. They are given to groups as well as to individuals, in person and online. Students have access to course materials for the rest of their lives. If you want to find out more about how you can level up, go to learnkube.com. Now, let's get into the episode. Ron, welcome to KubeFM. What are three emerging Kubernetes tools that you are keeping an eye on?

Ron Matsliah: Hey, thanks for inviting me. The three tools I'm keeping an eye on are Karpenter for node scheduling and management, the LGTM stack, which is the Grafana stack, and Argo CD.

Bart Farrell: Very good. And for people who don't know you, Ron, can you just tell us a little bit more about what you do and where you work?

Ron Matsliah: Sure. I'm a DevOps engineer on the Developer Experience team at Next Insurance. The Developer Experience team is part of the Platform Engineering group. Next Insurance is a digital-first insurance company helping small businesses get tailored insurance through a seamless online experience. My job covers the whole developer ecosystem, from local development through the CI/CD pipelines to production deployments, helping developers move fast without issues. My main focus is on building resilient systems that are easy to use and maintain.

Bart Farrell: Fantastic. And Ron, how did you get into Cloud Native?

Ron Matsliah: Cloud native and automation have always been my passion. I love the idea of infrastructure as code and resilient systems that can scale automatically without me needing to do anything.

Bart Farrell: Very good. And what were you before getting into cloud native?

Ron Matsliah: I started my career 10 years ago as an automation engineer, mainly focusing on developing infrastructure frameworks and CI/CD pipelines. Then I found the opportunity to move forward and take on the role of a DevOps engineer. I took that opportunity, and I'm really happy with it.

Bart Farrell: And Ron, the Kubernetes ecosystem moves very quickly. How do you keep up to date with all the changes that are going on? What resources work best for you?

Ron Matsliah: First of all, my group runs internal tech talks once a week, where each group member can present a technical session about something they did. I also have a subscription to the TLDR newsletter, and I follow some online communities around cloud native, Kubernetes, and CI/CD.

Bart Farrell: And if you could go back in time and share one career tip with your younger self, what would it be?

Ron Matsliah: Focus on solving real issues, real problems, and not just on the new technology.

Bart Farrell: Okay. As part of our monthly content discovery, we found an article that you wrote titled How We Cut Build Debugging Time by 75% with the DevEx AI Assistant, so we want to dig into this topic a little more with the following questions. First of all, build failures are one of those daily frustrations that silently drain developer productivity. Before we dive into the solution, can you paint a picture of your infrastructure? What does your CI/CD setup look like with Jenkins and Kubernetes?

Ron Matsliah: Jenkins is our main CI/CD system. We're on AWS with multiple accounts, a few tens of them, and each account has a Kubernetes cluster. We're fully automated; the CI/CD process is seamless. Once you merge your change into the main branch, a pipeline is triggered that builds, tests, and deploys whatever is needed. Our main challenge is spot terminations during CI/CD builds and the cryptic errors they cause.

Bart Farrell: So within that environment, what was actually happening when builds failed that made you tackle this problem?

Ron Matsliah: We ran a few surveys, and we saw that because of the complexity of the build process, it's hard for a developer who isn't familiar with the ecosystem in depth. Jenkins supports parallel steps, which makes the logs look messy, and a lot of active components are involved. Developers answered that it took them more than 45 minutes to debug an issue. Most of the time, they approached the on-call platform engineer asking for guidance. A lot of the time, they also said they just hit the rerun button hoping it would solve the issue, so we never learned about many of the issues, and the reruns cost us money. If you have a compilation error, hitting rerun will not help you; if you have a configuration issue, it will not help you either. The core issue was the lack of visibility across this whole ecosystem.

Bart Farrell: You built an AI-powered assistant rather than just better logging or documentation. Walk us through how it works when a build fails.

Ron Matsliah: A Jenkins pipeline has a post section that is divided into success, unsuccessful, failure, unstable, and so on; unsuccessful is any status that is not green. When one of these statuses is encountered, we make an API request to the service we built, the DevEx Assistant, providing the job name, the job number, and the ID of the person responsible for running the job. We immediately return a response saying we got the request and started analyzing it, which releases Jenkins to finish the pipeline. After that, the service starts gathering context in parallel. It fetches the Jenkins log, but we don't want the full log; it's long, and it has parallel steps. Jenkins has APIs that tell you which stage failed, so we take only the relevant logs from there. Our backend is in Kotlin. We also collect JUnit reports and Kubernetes events to understand spot interruptions, out-of-memory kills, things like that. We combine all of it with predefined rules and classification, feed it to the AI, get an answer, and the result is sent to the relevant person, group, or channel within a few seconds.
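To make the flow concrete, here is a minimal Python sketch of the shape Ron describes: acknowledge the failed build immediately, then gather context from several sources in parallel. This is not the team's actual Kotlin service; the `fetch_*` helpers and payload fields are hypothetical stand-ins that return canned data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical context fetchers -- in a real service these would call
# the Jenkins and Kubernetes APIs; here they return canned data.
def fetch_failed_stage_log(job, build):
    return "ERROR: compilation failed in module billing"

def fetch_junit_reports(job, build):
    return []

def fetch_k8s_events(job, build):
    return ["Node ip-10-0-1-7 terminated (spot interruption)"]

def analyze_build(payload):
    """Gather context in parallel, then hand it to rules + AI."""
    job, build = payload["job_name"], payload["build_number"]
    with ThreadPoolExecutor() as pool:
        log = pool.submit(fetch_failed_stage_log, job, build)
        junit = pool.submit(fetch_junit_reports, job, build)
        events = pool.submit(fetch_k8s_events, job, build)
        context = {
            "log": log.result(),
            "junit": junit.result(),
            "k8s_events": events.result(),
        }
    return context  # next step: rule-based classification, then the LLM

def handle_failure_webhook(payload):
    """Ack immediately so Jenkins can finish; analysis runs afterwards."""
    ack = {"status": "accepted", "job": payload["job_name"]}
    context = analyze_build(payload)  # in production: run in background
    return ack, context
```

The early acknowledgment is the key design point: the pipeline is never blocked waiting for analysis.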

Bart Farrell: You categorize failures into groups like compilation, tests, and infrastructure. Why was this classification important rather than just showing raw AI analysis?

Ron Matsliah: For us on the DevEx team, first, we are data-driven. If we know that most of our failures are related to spot terminations and infrastructure issues, we want to know about it in order to fix it. On the developer side, they instantly know what happened. If it's an infrastructure issue, they see right away that it's unrelated to their changes. If it's a configuration or compilation issue, they know what they need to do and where to go. We have six categories mapping to different actions, and from the surveys we did, it saves a lot of time; we decreased the debug time dramatically.

Bart Farrell: One interesting design choice was combining rule-based classification with AI. What led you to this hybrid approach?

Ron Matsliah: Feeding the AI a lot of logs can produce unpredictable analysis and unpredictable answers. To reduce the chance of that happening, we found that rules help us be deterministic. If we know from the cluster events that our node failed because of a spot interruption, for example, we can say it's a spot termination; there's no other reason, and we don't need to hand it to the AI to analyze. We give the AI the more complex things, like test failures or compilation errors, and that's where it shines more. The rules also guide it: if you see something like this, classify it as X, confirm, and add a short explanation. It saves us a lot in token costs because we send fewer logs, we set how much we want back in the response, and we get better consistency and correctness.
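The hybrid approach can be sketched as a rule-first classifier with an LLM fallback. The patterns and category names below are illustrative, not the team's actual six categories; the real system also matches on Kubernetes events, not just log text.

```python
import re

# Hypothetical deterministic rules: pattern -> category.
RULES = [
    (re.compile(r"spot.*(interrupt|terminat)", re.I), "spot-termination"),
    (re.compile(r"OOMKilled|out of memory", re.I), "out-of-memory"),
    (re.compile(r"error: compilation|cannot find symbol", re.I), "compilation"),
]

def classify(log_excerpt, k8s_events, llm=None):
    """Try deterministic rules first; only fall back to the LLM for
    the ambiguous cases (test failures, odd config errors, etc.)."""
    for text in k8s_events + [log_excerpt]:
        for pattern, category in RULES:
            if pattern.search(text):
                return category, "rule"
    # Ambiguous: let the model decide, constrained to known categories.
    if llm is not None:
        return llm(log_excerpt), "ai"
    return "unknown", "none"
```

Note how the cluster events are checked before the log: a spot-terminated node explains the cryptic log errors, so the deterministic signal wins.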

Bart Farrell: Rather than building a dashboard, you integrated directly into Slack. How did that impact adoption?

Ron Matsliah: Our developers were using Slack day to day before the assistant era; we already sent a notification from Jenkins, "your build succeeded/failed," with a link to the build to see what happened. So they were already there, familiar with the tool and using it. We just enhanced the existing solution to bring them something more. They don't need to context switch, and you don't need to send them to other tools they don't know how to use. From the message itself, they can send feedback on whether the analysis was accurate or not and give ideas to improve it. And if they need help, they forward the message to the platform support channel and ask for it, which also helps the on-call engineer know what to look for.

Bart Farrell: The feedback loop seems central to improving the system. How do developers provide feedback and what do you do with it?

Ron Matsliah: On Slack, we have something called a modal, which adds a button to the message and opens a pop-up where you can choose things. We wanted to keep the feedback as simple as possible, so they have a thumbs up, an X for incorrect, and a free-text field if they want to suggest something or say why it's no good. We store it in a MySQL database and analyze it later on internal dashboards. We also get the feedback in another Slack channel where our team sits, so if we see a correctness problem or some other issue, we can address it as soon as possible with immediate fixes, category improvements, and accuracy work.
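A minimal sketch of the storage side of that loop, with SQLite standing in for the team's MySQL store; the payload shape is a simplified, hypothetical version of a Slack interaction payload, not the real Slack API schema.

```python
import sqlite3

def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback ("
        "build_id TEXT, user TEXT, rating TEXT, comment TEXT)"
    )

def record_feedback(conn, payload):
    """Persist one feedback event from the Slack modal."""
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?, ?)",
        (
            payload["build_id"],
            payload["user"],
            payload["rating"],         # e.g. "accurate" / "incorrect"
            payload.get("comment", ""),
        ),
    )
    conn.commit()

def accuracy_rate(conn):
    """Share of feedback marked accurate -- the number a team would track."""
    total = conn.execute("SELECT COUNT(*) FROM feedback").fetchone()[0]
    good = conn.execute(
        "SELECT COUNT(*) FROM feedback WHERE rating = 'accurate'"
    ).fetchone()[0]
    return good / total if total else 0.0
```

Keeping the schema this small is deliberate: one rating plus optional free text is enough to drive dashboards and alerting without adding friction for developers.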

Bart Farrell: One feature that stood out is detecting AWS spot instance terminations in your Kubernetes clusters. How does that work?

Ron Matsliah: At the start of the pipeline, we query the cluster: which node are we running on, which of our pods are on which nodes. We create a map between the Jenkins agent pod and its node; if you created a developer environment for testing, we write a file with all the mappings: pod A belongs to node A. Then, on a failure, we query the Kubernetes events and get the machines that were terminated during the run. We try to correlate the two lists, and if we see a correlation, we know for sure that our agent pod failed because of a spot termination. Before, developers saw many cryptic errors in the logs that they couldn't understand, like "channel closed unexpectedly," or Jenkins tried to store something on the file system and couldn't find it. Now they just get: you had a spot termination, please try to rerun.
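The correlation step itself is simple once the two lists exist. Here is an illustrative sketch, assuming a pod-to-node map recorded at pipeline start and a list of termination events queried on failure; the field names and event shape are hypothetical.

```python
def detect_spot_termination(pod_to_node, termination_events, agent_pod):
    """Return True if the build agent's node appears among the nodes
    reported as terminated by Kubernetes events during the run."""
    node = pod_to_node.get(agent_pod)
    if node is None:
        return False
    terminated = {
        e["node"]
        for e in termination_events
        if "spot" in e.get("reason", "").lower()
    }
    return node in terminated
```

For example, with `{"jenkins-agent-abc": "ip-10-0-1-7"}` as the map and an event reporting that node terminated with a spot interruption reason, the check flags the failure as a spot termination instead of surfacing the downstream cryptic errors.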

Bart Farrell: Getting consistent output from an LLM is not easy. What did you learn about prompt engineering?

Ron Matsliah: It was trial and error. As I said before, when you give the AI freedom and just feed it logs, it's not consistent. So we put constraints on it. We give it the format we want back: the category, the root cause, the stage that failed. We give it real examples that happened: if you see this kind of error, categorize it as spot termination or compilation or whatever. We also found that feeding it real data rather than synthetic data helped it be more accurate. It's an ongoing task; we iterate again and again to make it more accurate.
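A hedged sketch of the kind of constrained prompt this describes: a fixed output format, an allowed category list, and a few-shot example. The category names and the example log line are illustrative, not the team's actual prompt.

```python
# Illustrative category list -- the real system has its own six.
CATEGORIES = ["compilation", "test-failure", "configuration",
              "infrastructure", "spot-termination", "other"]

# A few-shot example drawn from a (made-up) real-looking failure.
FEW_SHOT = (
    "Example:\n"
    "Log: 'error: cannot find symbol PolicyService'\n"
    "Output: category=compilation; stage=Build; "
    "root_cause=missing symbol PolicyService"
)

def build_prompt(stage, log_excerpt):
    """Constrain the model: fixed categories, fixed one-line format."""
    return (
        "You are a CI failure analyst. Classify the failure.\n"
        f"Allowed categories: {', '.join(CATEGORIES)}.\n"
        "Respond in exactly one line: "
        "category=<category>; stage=<stage>; root_cause=<one sentence>.\n"
        f"{FEW_SHOT}\n"
        f"Failed stage: {stage}\n"
        f"Log excerpt:\n{log_excerpt}\n"
    )
```

The one-line response format doubles as an output-length cap, which is part of how the token cost stays low.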

Bart Farrell: What additional context beyond the log itself improves the AI analysis?

Ron Matsliah: First of all, the infrastructure events, like spot interruptions, out of memory, and node status, help us be consistent. The metadata of the build itself matters too: if the build was aborted because the user decided so, it's important to send that, and the same goes for a timeout. Also metadata like the command that failed and the stage name. In some cases we trigger downstream builds (you can trigger one Jenkins build from another), so we also feed it relevant information from the downstream build to make the analysis more accurate. Plus JUnit reports and a lot more. The key insight here is that simple rules plus rich context are much better than just a complex query to the AI alone.
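Pulling the pieces from that answer together, assembling the context the model sees might look like this; the field names are hypothetical, but each corresponds to a signal Ron lists.

```python
def build_context(log_excerpt, build_meta, k8s_events,
                  junit_summary=None, downstream=None):
    """Merge everything the model sees beyond the raw log."""
    context = {
        "log": log_excerpt,
        "stage": build_meta.get("stage"),           # which stage failed
        "failed_command": build_meta.get("command"),
        "aborted_by_user": build_meta.get("aborted", False),
        "timed_out": build_meta.get("timeout", False),
        "k8s_events": k8s_events,                   # spot/OOM/node status
    }
    if junit_summary:
        context["junit"] = junit_summary
    if downstream:  # the failure may come from a triggered child build
        context["downstream"] = downstream
    return context
```

An aborted or timed-out build can be answered without any model call at all, which is exactly the "simple rules plus rich context" point.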

Bart Farrell: Now let's talk about results. What impact did you see after deploying the DevEx Assistant?

Ron Matsliah: We have a big reduction in the time our developers need to debug an issue. Most of the time, they know what the issue is and fix it, so it takes a few minutes. It has helped us on the platform side too. For example, we saw that one of our pipelines used an old Karpenter "do not evict" annotation instead of "do not disrupt" to keep the node from being removed. We forgot to update it, so Karpenter decided the node was in drift and terminated it, and the job was killed in the process. The assistant pointed us in that direction. Also, more than 85 suggestions were rated with a thumbs up as accurate, and we save a lot of developer hours per week.

Bart Farrell: Beyond the numbers, how did this change the developer experience when builds failed?

Ron Matsliah: Before, a lot of them didn't want to go deep and search, because they don't have the permissions and can't enter the cluster. Now it piques their curiosity; a lot of the time, mostly the more junior developers come to us saying, "I saw I had an issue with the node, can you explain what happened?" They also have clarity now: they can understand for sure what happened, and they have actionable steps instead of a vague direction of where to go. Everyone benefits from it, because now they can understand the issue themselves.

Bart Farrell: If someone wanted to build a similar tool for a Kubernetes-based CI/CD setup, what are your top recommendations?

Ron Matsliah: What we learned during the process is that it's preferable to integrate into an existing workflow, like Slack or Teams or whatever you're using, instead of creating a new application that developers need to enter, log into, look at boards in, and figure out. Make the analysis actionable, with specific next steps on what to do to solve the issue. Build the feedback loop inside the solution itself, in our case Slack, so people don't have to context switch to provide feedback. And one of the most important things: test with real data. Feed it real data so it's as accurate as possible for your company's use case.

Bart Farrell: Ron, what's next for you?

Ron Matsliah: For me, I want to keep focusing on developing practical tools that enhance the developer experience by leveraging cutting-edge technology like AI and whatever else is out there today.

Bart Farrell: And what's the best way for people to get in touch with you?

Ron Matsliah: So you can find me on LinkedIn. My name is Ron Matsliah. You can find me there. I will be happy to connect and keep in touch.

Bart Farrell: Fantastic. Thank you so much for sharing your time and experience with us today. I'm sure our audience will very much enjoy hearing how you did all of this and will be curious about how they can build these things themselves. Take care, and I hope our paths cross again soon.

Ron Matsliah: Thank you very much. Bye-bye.

Bart Farrell: Cheers.
