Fail Forward in Kubernetes

Jul 3, 2026

Guest:

  • Konrad Eriksson

Settings in Kubernetes, such as requests, limits, autoscaling, and probes, might seem off, but changing them in production often feels risky.

Konrad Eriksson suggests that teams can build confidence by making small changes, getting quick feedback, and adopting a fail-forward mindset, rather than waiting for perfect pre-production checks.

In this interview:

  • Why production-like readiness is hard to validate outside production

  • How missing readiness checks show up through traffic, scaling, and user errors

  • Why stable and fast-changing environments need different monitoring defaults

  • How quick feedback loops help teams fix problems without relying on rollbacks

Transcription

Bart Farrell: Who are you, what's your role and where do you work?

Konrad Eriksson: My name is Konrad Eriksson. I work at Bifrost Security, and I'm one of the co-founders. Bifrost Security does runtime security for containerized workloads.

Bart Farrell: A Kubernetes setting can look wrong but still feel risky to change once it's already in production. Requests, limits, auto-scaling or probes. What would you tell a team that sees the problem but is nervous the fix could cause an outage?

Konrad Eriksson: I would say deploy anyway and fix it. Fail forward is usually our solution. Do it often, because then you don't have too many old changes piled up. Deploying should be comfortable for everybody, and the deltas will be small, so a big failure is less likely. If you're uncertain about something and it's been lying there, get it out, see if something hits it, and fix it quickly instead.

Bart Farrell: Missing readiness checks usually show up through something concrete. Traffic reaches a pod too early, auto-scaling behaves strangely, or users report errors. If a team wanted to catch this before users do, where would you have them look first?

Konrad Eriksson: That's a good question. It depends on how stable your software applications are. If they're stable, you can say: okay, we missed it this time; we'll add it for next time. But if you have a fast-changing environment where the checks, the thresholds, or the time things take change a lot, try to have sane defaults in your monitoring. Have metrics for those things, monitor them, and if they stick out of the ordinary, alert the responsible people so they can do something about it for next time.
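
To make the "alert when a metric sticks out of the ordinary" idea concrete, here is a minimal sketch in Python. It is not something described in the interview: the function name `sticks_out`, the three-standard-deviation default, the minimum sample count, and the pod startup times are all hypothetical values chosen for illustration. It assumes you already collect a recent history of some readiness-related metric.

```python
from statistics import mean, stdev


def sticks_out(history, latest, sigmas=3.0, min_samples=5):
    """Return True when `latest` deviates from the recent baseline
    by more than `sigmas` sample standard deviations.

    All thresholds here are illustrative defaults, not recommendations.
    """
    if len(history) < min_samples:
        # Too little data to form a baseline, so do not alert yet.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) > sigmas * spread


# Hypothetical pod startup times in seconds from recent deployments.
startup_seconds = [12.0, 11.5, 12.4, 11.9, 12.1, 12.2]

print(sticks_out(startup_seconds, 12.3))  # within the usual range
print(sticks_out(startup_seconds, 30.0))  # far outside the usual range
```

A baseline derived from recent history is one way to get a default that keeps working when timings shift between releases, whereas a fixed threshold would need manual retuning. In practice this kind of rule would more likely live in the alerting system than in application code.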

Bart Farrell: Production readiness reviews can happen before launch, after incidents, during audits, or not formally at all. What would you put in place so Kubernetes readiness gets reviewed before it becomes urgent?

Konrad Eriksson: It's hard. If you do it in environments that are not production, you usually don't have the same workloads, you might not have the same amount of resources, and things may start differently in production. So running the checks as some sort of pre-check in early environments doesn't necessarily help, and it might give you a false sense of security. I would take the same approach as before: get the changes out, see how they behave, and then tweak them. Instead of trying to do too much beforehand, get quick feedback once you deploy new versions or changes, and have a quick loop for fixing problems and getting the fixes out. I'm more a proponent of fail forward, basically: not doing many rollbacks but rolling forward. When you deploy, be ready to change or tweak something and deploy again.
