Bart Farrell: Who are you, what's your role, and where do you work?
Asaf Savich: Hello, my name is Asaf Savich. I'm the director of AI engineering at Komodor. Happy to meet you guys.
Bart Farrell: So, a Kubernetes setting can look wrong but still feel risky to change once it's already in production: requests, limits, autoscaling, or probes. What would you tell a team that sees the problem but is nervous the fix could cause an outage?
Asaf Savich: That's a really good question, because any change can do a lot of good, but it can also do a lot of damage to our production environment. So what we suggest is to follow the best practices we have today: don't push directly to production. Put the change in a staging, dev, or sandbox environment first, so we can test and check it. And even when we push it to production, if we feel the fix could be risky, we suggest releasing it gradually: to 10% of the traffic, then 20% of the traffic. As we gain enough trust that the fix is doing what we intended, we roll it out to 100% of production. Along with that, we keep a kill switch, so that if things hit the fan we can turn the change off and go back to the previous version.
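The gradual release described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Komodor's implementation: `staged_rollout`, `set_traffic_percent`, and `is_healthy` are hypothetical names, and the final 100% stage is the full rollout mentioned in the answer. In practice the two callables would wrap whatever traffic-splitting and monitoring tools a team already uses.

```python
from typing import Callable, Sequence


def staged_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    stages: Sequence[int] = (10, 20, 100),
) -> bool:
    """Shift traffic to the new version stage by stage.

    Returns True if every stage passed its health check, False if the
    kill switch fired and traffic was sent back to the previous version.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy():
            # Kill switch: route all traffic back to the previous version.
            set_traffic_percent(0)
            return False
    return True


if __name__ == "__main__":
    # Hypothetical stand-ins: record each traffic change, fail the second check.
    history: list[int] = []
    checks = iter([True, False])
    succeeded = staged_rollout(history.append, lambda: next(checks))
    print(succeeded, history)
```

The design choice worth noting is that the health check runs after every stage, so a bad change is only ever exposed to the share of traffic in the current stage before it is turned off.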
Bart Farrell: Missing readiness checks usually show up through something concrete. Traffic reaches a pod too early, autoscaling behaves strangely, or users report errors. If a team wanted to catch this before users do, where would you have them look first?
Asaf Savich: That's a really good question as well. First, we suggest having a solid monitoring system that keeps you guarded, and having a control system other than Kubernetes run the checks as well. Relying on Kubernetes and its readiness checks, probes, and so on is great, but if the checks only run once the change reaches the actual resource, and only then do you understand there's an issue, you have a problem. So we suggest a few things. One, use some kind of blue-green deployment, so that if the readiness checks don't pass, the previous deployment is not scaled down. Second, run the change in another environment as well, so you can check that everything is working properly. And third, and maybe the most interesting in today's landscape, have an AI agent do these checks for you. If an AI agent runs the same checks Kubernetes would have done, it removes a lot of the risk around readiness checks and similar signals that could indicate the system is behaving incorrectly and is about to cause downtime or something even worse.
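The blue-green idea above, where the previous deployment stays up until the new one has proven itself, can be sketched as follows. This is a simplified illustration, assuming an external check that lives outside Kubernetes: `wait_until_ready`, `blue_green_cutover`, and the three callables are hypothetical names, and the thresholds are arbitrary defaults rather than recommendations.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 10,
    delay_seconds: float = 0.0,
) -> bool:
    """Return True once the probe succeeds `required_successes` times in a row."""
    streak = 0
    for _ in range(max_attempts):
        # A single failure resets the streak, so a flapping service never passes.
        streak = streak + 1 if probe() else 0
        if streak >= required_successes:
            return True
        time.sleep(delay_seconds)
    return False


def blue_green_cutover(
    probe_green: Callable[[], bool],
    switch_to_green: Callable[[], None],
    scale_down_blue: Callable[[], None],
) -> str:
    """Move traffic to green only after it passes the external readiness check."""
    if not wait_until_ready(probe_green):
        # Green never became ready: blue keeps serving and is left untouched.
        return "blue"
    switch_to_green()
    scale_down_blue()
    return "green"


if __name__ == "__main__":
    # Hypothetical stand-ins for a real probe and real traffic controls.
    serving = blue_green_cutover(lambda: False, lambda: None, lambda: None)
    print(serving)
```

Requiring several consecutive successes is one way to avoid cutting over to a version that answers once and then falls over, which is the "traffic reaches a pod too early" symptom raised in the question.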
Bart Farrell: Now, production readiness reviews can happen before launch, after incidents, during audits, or not formally at all. What would you put in place so Kubernetes readiness gets reviewed before it becomes urgent?
Asaf Savich: This is a difficult question, and the reason it's difficult is that nobody likes production reviews. They always come at a very bad time. You curse yourself, you curse the world: why did it come to this? On one hand, it's very important, right? We want our production to be as stable as possible. On the other hand, we have our day-to-day, and urgent stuff is always cutting in line. So my recommendation is to make things much simpler: have an AI agent that knows your organization, knows your infrastructure, understands the pain points, and understands the questions and inquiries that get raised in production reviews like this, whether they take place informally within the organization or formally, as something you need to provide to a vendor or third party, or as a yearly review. If you're always prepared, and you have an agent so that you only need minor optimizations here and there, you are safe all the time, and when that formal, scary review comes, you are ready. You don't need to do more than this. It makes your life much easier and puts you in a secure place without compromising on quality, security, and so on.
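One small piece of "always being prepared" can be automated without any AI at all. The sketch below is a plain Python check, not the agent described above: it scans a Deployment manifest, already loaded into a dictionary, for the settings named earlier in the conversation (readiness probes, requests, limits). The function name `audit_deployment` and the sample manifest are hypothetical; the field names follow the standard Kubernetes Deployment layout.

```python
from typing import Any


def audit_deployment(manifest: dict[str, Any]) -> list[str]:
    """Return findings for containers missing a readiness probe or resource settings."""
    findings: list[str] = []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        label = f"{name}/{container.get('name', '<unnamed>')}"
        if "readinessProbe" not in container:
            findings.append(f"{label}: missing readinessProbe")
        resources = container.get("resources", {})
        if "requests" not in resources:
            findings.append(f"{label}: missing resources.requests")
        if "limits" not in resources:
            findings.append(f"{label}: missing resources.limits")
    return findings


if __name__ == "__main__":
    # Hypothetical manifest: one container with requests but no probe or limits.
    sample = {
        "metadata": {"name": "checkout"},
        "spec": {"template": {"spec": {"containers": [
            {"name": "api", "resources": {"requests": {"cpu": "100m"}}},
        ]}}},
    }
    for finding in audit_deployment(sample):
        print(finding)
```

Run on a schedule or in CI, a check like this surfaces gaps as small, routine fixes rather than as a pile of findings when a formal review arrives.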
Bart Farrell: We're hearing all about AI all the time in this conference and every conference we go to. I know that Komodor does a lot of work on the AI SRE. There are still a lot of folks in the Kubernetes ecosystem that aren't 100% sure that AI is a great thing to be using on Kubernetes. Where do you feel that it's most useful? And where do you feel like it's still just not there yet?
Asaf Savich: That's also a hard question. First, as the models keep improving, the volume of issues and problems that AI can solve keeps getting bigger. So I think dismissing it is doing AI an injustice. We should use AI and we should put it to the test, but we need to put the correct guardrails in place. I think AI should be in every investigation. I think an SRE or a platform engineer who is not using AI to solve issues today is doing something wrong. You don't have to accept its result. You don't have to rely on it. But I don't see any reason not to use it in your investigation. So that's first. I think it's amazing. In most cases, it will help you resolve the issue. In other cases, you just got another really good piece of advice from someone, which you don't have to take. AI is getting better and better, and it's already at the point where it can resolve 90-95% of the issues. So let's talk about the interesting part, the remaining 5-10%, where it can't. What we see today is that the context window is getting bigger and bigger, but it's not yet at the point where the AI can read tons of context and understand what is going on. In a very complex production environment, a very complex system with a lot of logs, events, metrics, and numbers, the AI agent is not that good so far. So in this case, I would add another integration agent, MCP, that can take this big, bombastic error, narrow it down, summarize it, and bring it to a point where the main AI agent investigating the issue can understand the context without having to read a million lines of logs.
So the job is not only about investigating and having a main agent that troubleshoots. It is also about having sub-agents that tackle mini processes such as summarizing, understanding metrics, and understanding the timeline: taking this big context and turning it into something digestible that the main agent, the orchestrator, can tackle and investigate properly.
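The summarizing step can be illustrated with a deliberately simple stand-in. The sketch below is not an AI sub-agent and not Komodor's approach; it is a deterministic Python function, with the hypothetical name `summarize_logs`, that shows the shape of the idea: collapse a large volume of raw log lines into a short digest before anything is handed to the main investigating agent. Replacing digits with a placeholder is an assumption made here so that lines differing only in numbers count as one pattern.

```python
import re
from collections import Counter

# Digits are treated as the variable part of a log line (an assumption of this sketch).
_VARIABLE_PARTS = re.compile(r"\d+")


def summarize_logs(lines: list[str], top_n: int = 5) -> str:
    """Collapse repeated log lines into a digest of the most frequent patterns."""
    patterns = Counter(
        _VARIABLE_PARTS.sub("<n>", line.strip())
        for line in lines
        if line.strip()  # skip blank lines
    )
    return "\n".join(
        f"{count}x {pattern}" for pattern, count in patterns.most_common(top_n)
    )


if __name__ == "__main__":
    # Hypothetical log lines for illustration only.
    raw = [
        "timeout after 30 ms on pod-1",
        "timeout after 45 ms on pod-2",
        "connection refused",
    ]
    print(summarize_logs(raw))
```

A real sub-agent would do far more, such as correlating events with metrics and timelines, but the contract is the same: a million lines go in, and a digest small enough for the orchestrator's context window comes out.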