Kubernetes Resource Optimization with Humans in the Loop
Mar 3, 2026
Most teams know their Kubernetes workloads are poorly sized — but tuning CPU, memory, and JVM settings across dozens of services doesn't scale when your platform team lacks the context of what each service actually needs.
Ray Chen, Head of SRE at Trumid, shares how his team moved from manual Grafana-based tuning to automated resource optimization with humans in the loop — keeping engineers focused on business value instead of resource toil.
In this interview:
Why Trumid doesn't set CPU limits on most workloads and monitors SLOs for trading APIs as the real signal for right-sizing
How they integrated StormForge into their CI/CD pipeline with read-only automation that proposes changes via PRs — letting service owners approve instead of blindly applying recommendations
The key lesson from trying homegrown scripts and cost dashboards: the missing piece wasn't visibility, it was trust in the automation combined with business context
The thread that ties it together: automation alone isn't enough — teams adopt optimization when they retain control over their decisions.
Transcription
Bart Farrell: Ray, welcome to KubeFM. For folks who don't know you, can you tell us about who you are, your role, and where you work?
Ray Chen: My name is Ray Chen. I work for a fintech company here in New York called Trumid. We are a corporate bond trading platform. I've been here for coming up on 11 years, and I've played a lot of very different roles from a technical perspective, starting out as an engineer on our product side, moving to the platform engineering team, and then, most recently, leading our data and intelligence organization. For this conversation and for your audience, I think what's going to be most interesting is my experience on the platform engineering team, where I was responsible for everything from cloud infrastructure and engineering to Kubernetes, CI/CD, packaging, and deployments. I think that's basically it. Oh, observability. I'm sure a lot more. Everything that you need to run a platform, that was my responsibility.
Bart Farrell: All right. It's kind of the one-stop shop. Good. And what are three Kubernetes emerging tools that you are keeping an eye on?
Ray Chen: I think the first one is Karpenter. I like anything that lets teams think in terms of workloads instead of individual nodes, and Karpenter keeps getting better at that abstraction. The second is GKE Autopilot, which takes Karpenter to the next level: we don't have to worry about nodes at all. So even when I was responsible for cloud engineering and for our Kubernetes clusters, having something like GKE Autopilot allowed us to not have to worry about provisioning, whether we had the right-sized nodes, and things like that. And the third one, of course, is StormForge. We've been using it for a couple of years now. I still think of it as emerging because the space is still quite young. They've been a great partner of ours. They've continued to take feedback, improve the product, and help us really adopt it throughout our pipeline.
Bart Farrell: And Ray, we did a survey that found that 57% of teams want continuous optimization, but 56% still manage resources manually with kubectl and dashboards. How do you approach this at Trumid? Is it automated, manual, or somewhere in between?
Ray Chen: It's mostly automated, but with humans in the loop. As I said before, we try to do everything we can to reduce toil and friction and allow our engineers to focus on the value that is truly and uniquely Trumid's. Having to tune your resources, figuring out whether you have the right amount of CPU or the right amount of memory, that's toil. That's something we should be able to automate. And so we have done quite a lot to integrate StormForge into our pipelines. But with anything that's automated, especially on production workloads, you still want to maintain that human in the loop, because things can happen. Everybody knows what it means to write buggy software, even if you're a mature platform like StormForge. Accidents can happen; anything unpredictable could happen. So instead of just letting any automated process go and tweak our resources, we have a check. We have some of our own controls in place that say, okay, these are the recommended values that we want to apply to our production environment for CPU, memory, replicas, et cetera. And somebody who is intimately familiar with those services and that space can say, oh yes, this makes sense; this reduction or this increase is totally in line with my expectations. Go ahead, approve the PR, and then that gets shipped out and applied on the next release.
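As a rough sketch of what this human-in-the-loop step looks like (the data shapes and formatting here are hypothetical, not StormForge's actual output), the automation renders recommended values into a PR description for a service owner to approve, rather than applying them to the cluster directly:

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    """A proposed resource change for one workload (illustrative only)."""
    service: str
    resource: str   # e.g. "cpu_request_millicores" or "memory_request_mib"
    current: float
    proposed: float


def render_pr_body(recs: list[Recommendation]) -> str:
    """Format recommendations as a PR description for human review.

    Nothing touches the cluster here; the automation stays read-only,
    and the change only ships once somebody with context approves it.
    """
    lines = ["Proposed resource changes (please review before merging):", ""]
    for r in recs:
        pct = 100.0 * (r.proposed - r.current) / r.current
        lines.append(
            f"- {r.service}/{r.resource}: {r.current:g} -> {r.proposed:g} ({pct:+.0f}%)"
        )
    return "\n".join(lines)
```

A reviewer seeing a 20% CPU reduction on a service they own can then judge whether it matches their expectations before merging.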
Bart Farrell: CPU throttling and OOM kills were the number one pain point, affecting 45% of teams. In financial trading where latency matters, how do you deal with this?
Ray Chen: So we think about it in two layers, right? There's the infrastructure layer, where we don't set CPU limits on most workloads. We monitor things like CPU as a percentage of requests, memory usage and growth, and JVM heap, and we alert well before we near any out-of-memory errors or sustained throttling. Basically, the idea here is to be proactive: monitor your system so that you know if it is approaching something unexpected. The second layer is the business layer, where we watch SLOs for key trading APIs, things like latency, error rates, and availability. If those move in a direction that we're not comfortable with, we'll overspend. We'll size up: more CPU, more memory, more replicas, because it's very important for us to deliver a consistent experience to our clients.
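A minimal sketch of that first, infrastructure-layer check, alerting when usage nears requests or limits well before throttling or OOM kills set in (the threshold values are illustrative assumptions, not Trumid's actual settings):

```python
def should_alert(cpu_used_millicores: float, cpu_request_millicores: float,
                 mem_used_mib: float, mem_limit_mib: float,
                 cpu_ratio_threshold: float = 0.85,
                 mem_ratio_threshold: float = 0.80) -> list[str]:
    """Return warnings when usage approaches requests or limits.

    The point is to fire well before an OOM kill or sustained CPU
    throttling actually happens, so the team can react proactively.
    Thresholds here are made up for illustration.
    """
    warnings = []
    if cpu_used_millicores / cpu_request_millicores > cpu_ratio_threshold:
        warnings.append("cpu-near-request")
    if mem_used_mib / mem_limit_mib > mem_ratio_threshold:
        warnings.append("memory-near-limit")
    return warnings
```

In practice this kind of rule would live in an alerting system rather than application code, but the logic is the same: ratio of usage to request, checked continuously.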
Bart Farrell: 79% of teams rely on production metrics to set resource values, but many admit it's basically look at Grafana and add 20%. What's your process for turning metrics into actual requests and limits?
Ray Chen: So we let StormForge do most of the hard work for us. It continuously monitors our workloads and turns those metrics into actual recommendations, looking at both our container CPU and memory usage as well as our JVM settings. We have a large percentage of our workloads running on the JVM, and if you treat container memory in isolation and ignore JVM memory, you can run into other issues, like unexpected OOM kills. So I think the main thing here is that we don't have to focus so much on individual settings for every single service. Instead, we focus more on what our goals are for an environment. For instance, in production we tune more towards reliability and performance, which means we're okay spending a bit more money. But in our non-prod environments, our test environments, we don't need all this extra headroom; if our P95 times are a little slower than expected, that's okay if it comes with significant cost savings. So we'll do things like that, and StormForge makes that super easy. You basically set a config, or if you prefer the UI, you can just pull a slider. Pretty straightforward.
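The per-environment goal idea can be sketched as a simple headroom policy over observed usage (a toy model for illustration, not StormForge's actual algorithm, and the percentages are assumptions):

```python
# Per-environment headroom over observed peak usage: production buys
# reliability and performance with extra margin; non-prod trades a bit
# of P95 latency for cost savings. Percentages are illustrative.
HEADROOM = {"production": 0.40, "staging": 0.15, "test": 0.05}


def recommend_request(observed_peak: float, environment: str) -> float:
    """Recommend a resource request as observed peak plus env-specific headroom.

    The same observed usage yields a larger request in production than
    in test, encoding the environment's goal rather than a per-service
    hand-tuned value.
    """
    return observed_peak * (1.0 + HEADROOM[environment])
```

The appeal of this framing is that engineers set one policy knob per environment instead of tuning every service by hand.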
Bart Farrell: Only 32% of the people in the survey use VPA and HPA, and just 6% use commercial tools. Most still do it manually. What's holding back adoption of automation in your view?
Ray Chen: I think most teams don't lack tools; everybody has pretty much ready access to a VPA or HPA. I think what they lack is trust. The way that we've approached it is to keep automation read-only in the cluster: have it propose changes, open a PR, and then let somebody who understands the service, the use case, and the business context behind it step in and say, that looks right. We had a ramp-up in volume, so we expect a 10% increase in CPU and memory; or, this particular service has been superseded by a new version, its CPU and memory requirements have dropped off, and we can see that a 20, 30, 40, 50% reduction is acceptable. So it's this combination of automation plus guardrails, or human in the loop, which makes it easy for our teams to adopt this type of automation.
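One way to picture the guardrail side of this (purely illustrative, not a StormForge feature) is a triage function that flags large swings, like the 50% reduction after a service is superseded, for closer human review, while letting routine drift through quickly:

```python
def review_priority(current: float, proposed: float,
                    routine_pct: float = 10.0) -> str:
    """Classify a proposed resource change to help reviewers focus.

    Small drifts are routine; large swings deserve a closer look with
    business context (volume ramp-up, service superseded, etc.).
    The 10% cut-off is an assumption for illustration.
    """
    change_pct = abs(proposed - current) / current * 100.0
    return "routine" if change_pct <= routine_pct else "needs-context"
```

Either way a human still approves the PR; the triage just signals which changes need a real explanation rather than a quick glance.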
Bart Farrell: There are plenty of Kubernetes resource management tools in this space, among them Kubecost, Cast AI, Goldilocks, etc. But at Trumid, you decided to go with StormForge. What won out in your comparison? And what was the thinking process that led to that decision?
Ray Chen: That's a good question. We were looking for more than just a cost dashboard. We actually spent a couple of years just using cost dashboards, putting together our own Grafana dashboards, and trying to understand which services actually needed to be tuned. This was during a time when we wanted our product engineers to focus a lot more on client deliverables, so the responsibility fell on my team, the platform engineering team, to make sure that workloads were right-sized. It quickly became clear that this wasn't a scalable solution because, one, there were many services, and two, my team just didn't have the context for what those services were. Was it necessary for a service to be over-provisioned by 50% or 100%? Maybe not. We had to go and continuously talk to service owners, so it felt very much like we were just a middleman, and we really wanted to move that responsibility back to the teams. Using a product like StormForge allowed us to have a service and a process that would continually monitor and right-size CPU, memory, and JVM settings per service while letting teams stay in control. And even though we had pretty good cost visibility, and even tried our own homegrown scripts (at some point I tried seeing if I could write a script that effectively mimicked what StormForge did), they all ended up in the same place: requiring humans to validate and verify against Grafana and then just add a little bit of extra headroom. So it wasn't sustainable. It wasn't something that I wanted to continue supporting, and it wasn't what I would call one of our core competencies. And StormForge was willing to partner with us to help us build a solution that fit the way our teams operated. So it wasn't just, hey, here's our product, go ahead, run it and pay us the money, thank you very much.
They actually continued to work with us and support us, to make sure that we could take their product and make it a success in our production and non-production environments. We also needed a solution that considered the JVM as well as the containers and gave us control over both. As I mentioned, it's important to not just reduce container memory but also proportionately reduce JVM memory settings; if you do one without the other, unexpected or bad things will happen. And StormForge made it relatively straightforward to do that, as well as to tag the appropriate teams when their services needed adjusting.
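The coupling between container memory and JVM heap can be illustrated with a simple proportional rule (a sketch under assumed numbers; real JVM sizing also has to account for metaspace, thread stacks, and other off-heap memory):

```python
def jvm_max_heap_mib(container_mem_mib: int, heap_fraction: float = 0.70) -> int:
    """Size the JVM max heap as a fraction of the container memory limit.

    Shrinking the container without shrinking the heap invites OOM kills;
    keeping the two proportional avoids that failure mode. The 70%
    fraction is an assumption for illustration, leaving room for
    metaspace, thread stacks, and other off-heap memory.
    """
    return int(container_mem_mib * heap_fraction)


def jvm_flags(container_mem_mib: int) -> str:
    """Render the -Xmx flag for a given container memory limit."""
    return f"-Xmx{jvm_max_heap_mib(container_mem_mib)}m"
```

So if a recommendation cuts a container from 4 GiB to 2 GiB, the heap flag is recomputed from the new limit rather than left at its old value.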
Bart Farrell: Ray, what's next for you?
Ray Chen: Like I said, I moved that responsibility from platform engineering to data and intelligence. So a lot of what I'm doing these days is still about reducing toil, but now with the help of AI, trying to get AI permeated all throughout our organization and using it to augment the capabilities of our engineers as well as everyone else, the non-engineers, here at Trumid.
Bart Farrell: Well, Ray, thank you so much for joining us and sharing your experience with us today on the topic of resource optimization and the decision-making processes used at Trumid. Wishing you the best of luck, and I hope our paths cross again soon. Take care. Thank you.
