Bart Farrell: In this episode of KubeFM, I got a chance to speak to Brian Stack, who's a software engineer at Render, about what happens when Kubernetes scaling hits a dimension most teams never think about: namespaces. At Render's scale, hundreds of thousands of namespaces per cluster, DaemonSets like Calico and Vector were list-watching all namespaces on every node, multiplying memory usage, exhausting node resources, and putting sustained pressure on the API server during rollouts. We get into how that led to cascading risks like noisy restarts and control plane instability, how they profiled and traced the issue down to specific components, and how a combination of upstream changes and config fixes ultimately freed over 7 TiB of memory across their clusters. This episode of KubeFM is sponsored by LearnKube. Since 2017, LearnKube has helped engineers from all over the world level up their Kubernetes skills through instructor-led courses that are 60% practical and 40% theoretical. Students have access to the course materials for the rest of their lives, and courses are offered to groups and individuals, both in person and online. For more information about how you can level up your Kubernetes skills, go to learnkube.com. Now, let's get into the episode with Brian. Brian, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?
Brian Stack: Thanks. The one I'm most excited about personally, and the one we keep around in the toolbox even though we don't get to reach for it all the time yet, is Inspector Gadget. A lot of the time you have a debugging task that you could do on one node but need to do on many, and Inspector Gadget is an amazing way to do exactly that. Even if you're only working on one node, just having its eBPF-based tools ready for you in the toolbox is great. There's a similar project from Microsoft called Retina, which is, I think, a little more networking focused, and we try to use that as well. And then, I suppose it's not really a tool, but direct nftables support seems to be getting mainstreamed into Kubernetes itself and a lot of the surrounding tools like CNIs. For us in particular, that's pretty game-changing. We're really excited about reading and writing nftables directly rather than going through the iptables layer.
Bart Farrell: Very good. And for people who don't know you, Brian, can you tell us about what you do and where you work?
Brian Stack: I am a software engineer at a company called Render. I spend most of my time focusing on our networking. To explain Render briefly, we consider ourselves a modern cloud. Basically, we host code that you write, expose it to the web, support workflows, and all that kind of stuff. From a networking perspective, my work covers everything from proxying requests coming in from the internet to users' pods, which is ultimately what their services run as, to the other direction, requests going from their pods out to the internet, and then the full mesh of pods talking to each other.
Bart Farrell: And how did you get into cloud native?
Brian Stack: It's funny. I think my first job in this world was at Yelp, probably 15 years ago, or something terrifyingly long ago. And when I joined, what we were doing was picking up code and services that were running on bare metal, as we call it nowadays, and moving them to AWS. So my first job doing systems and infra operations was what ultimately amounted to replicating a bare metal setup in AWS, and then learning all the lessons that we all learned at that time along the way. I've spent the last 15 years learning how to be cloud native, I suppose.
Bart Farrell: And before that, what were you like? What kind of work were you doing?
Brian Stack: That was my first job out of college, really. Mostly I was on the infrastructure team there. That was everything from working on puppeting up instances to working closely with the team who would literally go into the cage and swap hard drives in and out. I've spent most of my career on infrastructure, all the way through working on custom CI systems. At Yelp, I worked a lot on Buildbot, a system that Chrome used. It's not super common for people to have interacted with it, but a lot of projects used it, and that's how we did CI at Yelp. Then I moved over to Mozilla to work with the Buildbot maintainer, who now works at Render as well, to replace it with another custom CI system called Taskcluster, which is open source and people can use, though mostly it's used by Mozilla. Now I do developer infrastructure at Render. It's not really CI, but Render as a company is developer infrastructure. So that's how I ended up here.
Bart Farrell: The Kubernetes cloud-native ecosystem, it moves very quickly. How do you stay up to date? What resources work best for you? Blogs, podcasts, videos, books?
Brian Stack: I'm reading the same websites as everybody else, probably: the orange one and other things like that. Because Render is such a Kubernetes-focused company, and I have a lot of peers here who I really respect, I rely on channels like Library, where people post things like that. So I have people who do that curation for me, in a sense. I'm all over the place.
Bart Farrell: Good. And if you had to go back in time and share one career tip with your younger self, what would it be?
Brian Stack: I hope this continues to be true, but Linux is always going to be there. Back in the day, you had your physical data center and it was Linux. You move to AWS and it's Linux. Now it's Kubernetes on AWS, bare metal, or Google, and it's still Linux under the hood. So getting good at Linux is a good idea. I looked at one of your previous episodes, and I loved the title: Kubernetes is just Linux. Eric's episode got a lot of traction on Reddit in particular, and I buy that in a lot of ways. Kubernetes is almost a state of mind, and what it's doing is configuring Linux at the end of the day.
Bart Farrell: Now, as part of our monthly content discovery, we found an article that you wrote titled "How We Found 7 TiB of Memory Just Sitting Around." We want to dig into this topic a little bit more with the following questions. Render is a modern cloud that lets developers deploy applications without worrying about infrastructure. You handle the servers, the scaling, the networking. Under the hood, that means running Kubernetes at significant scale to orchestrate all those customer workloads. Can you give us a sense of what Render's infrastructure looks like in terms of how many clusters, how many nodes, and how you've organized namespaces?
Brian Stack: To give you a sense of scale, I have the official number here: we have four and a half million developers on the platform. Think about that many accounts that we are separating into namespaces. That's our fundamental building block. When you are a customer, you're getting a namespace. We do that for a few reasons. One is operational convenience: it's easy to switch context into a namespace and see all the resources someone has. Also, Calico netpols or Kubernetes netpols in general apply to a namespace, or can easily apply to one. So we have many of those. We have double-digit clusters. We have Google clusters and AWS clusters, and I think we intend to have other kinds as time goes on. Each is hundreds of nodes at a minimum. Each cluster has hundreds of thousands of namespaces. Some of those namespaces might have a pod or two, especially because we offer a free tier, which I'm very proud of and which takes significant effort on our part. Larger customers have hundreds or thousands of pods, plus the associated services and deployments. The way we operate this is primarily by writing operators and DaemonSets. We run many popular open source DaemonSets on every node. We have some custom ones, and we have many operators to operate it all.
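For context on why the namespace makes such a convenient tenancy unit: NetworkPolicy is a namespace-scoped resource, so isolating a customer comes down to one object per namespace. Here is a minimal sketch in Go with client-go, using a hypothetical customer-123 namespace; this is illustrative, not Render's actual tooling:

```go
package main

import (
	"context"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// defaultDeny stamps a deny-all-ingress NetworkPolicy into a tenant
// namespace. Because the policy is namespace-scoped, per-tenant
// isolation is a one-object-per-namespace affair.
func defaultDeny(ctx context.Context, client kubernetes.Interface, ns string) error {
	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "default-deny-ingress", Namespace: ns},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // empty selector matches every pod in the namespace
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}
	_, err := client.NetworkingV1().NetworkPolicies(ns).Create(ctx, policy, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := defaultDeny(context.Background(), client, "customer-123"); err != nil {
		panic(err)
	}
}
```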
Bart Farrell: Hundreds of thousands of namespaces is quite unusual. Most teams talk about scaling in terms of pods or nodes, but you've described the namespace dimension as your hypercube of bad vibes. That's a t-shirt waiting to happen. What makes a larger number of namespaces particularly problematic? And why does that dimension hurt more than others?
Brian Stack: I think there are two answers. One is technical, and one is almost social. The social one is that very few places have this many namespaces. It is one dimension of the Kubernetes scaling envelope that very few people optimize for, because most people run Kubernetes clusters with 10 or 100 namespaces. So we get to pick a lot of low-hanging optimization fruit, because most people do not optimize this dimension. The technical side is that many DaemonSets might list-watch all pods on a node, which is a pretty common pattern. It is possible to filter your requests to pods on that node. Namespaces are different. As far as I know, namespaces do not have a pod or node to filter on. So when you list-watch namespaces, you list-watch all of the namespaces. That is fine for an operator where you have one or two instances, but it is dangerous when you have hundreds or thousands of nodes all doing that against one API server.
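To make that asymmetry concrete, here is a minimal client-go sketch; the NODE_NAME environment variable is a stand-in for however a DaemonSet pod learns which node it is on (typically the downward API):

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Pods CAN be scoped to the local node with a field selector, so each
	// DaemonSet pod only pays for its own slice of the cluster.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + os.Getenv("NODE_NAME"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("pods on this node: %d\n", len(pods.Items))

	// Namespaces have no node or pod dimension to filter on: every watcher
	// receives every namespace in the cluster.
	w, err := client.CoreV1().Namespaces().Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Println("namespace event:", event.Type)
	}
}
```

The pod list costs roughly one node's worth of objects per watcher; the namespace watch costs the whole cluster's namespaces per watcher, which is exactly the term that explodes when both node count and namespace count are large.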
Bart Farrell: And DaemonSets run a pod on every node in the cluster. So if you have hundreds of nodes and each DaemonSet pod is independently watching namespaces, the memory cost isn't just per pod, it's per pod times every node. How does this multiplication show up in your clusters?
Brian Stack: From the title of the article, one way it shows up is memory: when you take larger and larger chunks of memory from every node, it adds up quickly once you multiply it. Another way is that the API server has to handle these requests, and they are generally heavy-duty requests. If you have hundreds of nodes, each list-watching hundreds of thousands of namespaces, then an event like rolling out Vector can be very stressful for the API server, the etcd backing it, and everything around it. That is how the multiplication works. You can effectively DoS yourself by restarting Vector or something like that.
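To put illustrative numbers on that multiplication (round numbers, not Render's exact figures): at the roughly 4 GiB per pod mentioned later in this episode, a 500-node cluster is holding about 2 TiB of duplicated namespace data for one DaemonSet alone, and a rollout means 500 pods re-listing a collection of hundreds of thousands of objects against the same API server at roughly the same time.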
Bart Farrell: Calico handles networking and watches network policies, which you define per namespace. You work directly with the Calico maintainers to optimize how it handles the scale. What did that collaboration look like?
Brian Stack: We love the Calico maintainers. Calico is arguably one of my favorite open source projects. We have to run Calico on every node. Render is in an interesting position because we are growing quickly. This came up in the comments on the article, and it was interesting to see both sides. Some people said, wow, you let this problem get to this point, that's nearly malpractice. Other people said 7 TiB of memory is three servers that we run. What can happen is that if you are not spending every day looking at the amount of memory your DaemonSet is using, it can creep up over time. Calico memory crept up over time, and we got to the point where we decided it was not sustainable for us. It was less of a reliability concern and more of a cost concern, because every bit of a node that we use is node capacity that we cannot sell to people. We noticed this was a problem. We have a pretty extensive setup of the LGTM stack, with Pyroscope, Tempo, and all the Grafana pieces. Calico is written in Go, and Go has fantastic built-in support for profiling. The easiest first thing to do was point Pyroscope at Calico and see what was wrong. It was very obvious where the memory was being allocated. But Calico is a huge, complicated project, and we got to the point where we said, okay, we have diagnosed roughly what the problem is, but I do not think we can fix this ourselves. They have a fantastic Slack, and the maintainers are very active there. We showed up and said, hey, we have these profiles, and explained the problem. They were almost immediately interested, took the profiles, went back, and figured out where the issue was. It was not even a bug in particular; they figured out a way to optimize it by compressing all of these network policies in memory when they are unused. A node has every network policy, but maybe only one to ten are in use on it at a time, so the rest can be compressed. Then it became a back and forth: they made a patch, asked us to build a version, deploy it in staging, test it, and then we cycled back and forth passing profiles around. But really, 99% of the credit goes to the Calico maintainers. We just had a big cluster and were able to profile it well.
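For readers who want to reproduce the "point Pyroscope at it" step: in Go it is nearly free, because the standard library ships the profiling endpoints. A generic sketch of what a Go service exposes (not Calico's actual wiring, which has its own configuration):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose pprof on a side port. A continuous profiler like Pyroscope,
	// or a human with `go tool pprof`, can then pull /debug/pprof/heap
	// to see exactly where memory is being allocated.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```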
Bart Farrell: And to validate the Calico changes, you built a staging environment with hundreds of thousands of namespaces. And that's not a typical test setup. What did you learn from running at that scale? And what else did you discover?
Brian Stack: Scaling up to 100,000 namespaces was pretty easy in a way: write a little for loop and let kubectl get to work. Our production clusters have more namespaces than that, so we did not have Kubernetes worries necessarily. In the quietness of a staging cluster, where we were looking closely at everything, we noticed something funny: Vector's memory was scaling exactly proportionally to the number of namespaces we had created. Upon reflection, that makes sense, but at the time it was surprising. We thought, okay, we are getting all these memory wins from the Calico improvements. Can we take a little more time, tack it onto the end of this project, and see what we can find with Vector as well?
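A sketch of the kind of loop involved; Brian's version shelled out to kubectl, but the client-go equivalent looks like this (the loadtest- naming scheme is hypothetical):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the same kubeconfig kubectl would use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Create 100,000 empty namespaces to mimic production's shape.
	for i := 0; i < 100_000; i++ {
		ns := &corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("loadtest-%06d", i)},
		}
		if _, err := client.CoreV1().Namespaces().Create(context.Background(), ns, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```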
Bart Farrell: And Vector handles log collection, not networking. Was it watching that caused the same scaling problem?
Brian Stack: Vector did not have anything to do with the netpols themselves. It was literally just list-watching namespaces. The reason it watches namespaces is so you can enrich your log lines with labels from a namespace. For example, if your namespace has a label like this is user 123, you can add that as a field in your log lines. That is one of the purposes of Vector when you are using the Kubernetes log source. For most companies and most Kubernetes deployments, that probably does not move the needle at all, because you might have 10 namespaces and you just pull them in, and it lets you enrich logs with that valuable data.
Bart Farrell: So you have this feature that's useful for most users, enriching logs with namespace metadata. But at your scale, it's expensive. When you looked at your actual Vector config, what were you using those namespace labels for?
Brian Stack: It turned out that we were pretty lucky. We have approximately two kinds of namespaces: customer namespaces and namespaces for our own code. Unsurprisingly, those namespaces have a label that says whether they are user namespaces or not. That was the only thing we were using namespace labels for. All of the other enrichment was on the pod itself, through pod labels. This was encouraging, because we thought, and validated by talking to the Vector maintainers, who were fantastic and very supportive, that maybe we just did not need namespace labels.
Bart Farrell: That's a pretty neat workaround. Replace a label lookup with a string prefix check on the namespace name, but Vector didn't have a way to disable namespace watching entirely. You ended up contributing that feature upstream. What happened when you tested the change?
Brian Stack: I forgot to mention how we fixed that. It was easy to remove the namespace label, because our user namespaces also follow a static naming convention. We can just use the name of the namespace, which does not require the namespace object. We removed the label enrichment, but Vector was not smart enough to know that it no longer needed namespaces, so it would not automatically stop list-watching them. We looked at the code, and it was more straightforward than you might think. One of the biggest changes, if you are a Kubernetes shop that spends 99% of its time writing Go, is that Vector is written in Rust. But I love Rust. It is a beautiful language and is actually more readable in a lot of ways. It was fairly easy to find where this was happening, and it seemed likely that we could add a config flag to turn off namespace list-watching. That is basically what we did. Still, this is a system that we run at scale, but not one where we had the deepest knowledge. There are parts of our system where we have very deep knowledge, and others where we say, this tool fits this shape, let's deploy it, and it works. That was more of our Vector story up to this point. So we were nervous about making the change. What if something relied on namespace labels? What if somebody later re-added namespace label enrichment? That is why we wanted to talk to the maintainers, rather than just YOLOing it into production. They were very helpful and said, yes, this definitely makes sense to do. Making the change itself was fairly easy at a surface level. It was probably a day of poking around, getting a Rust build set up, and that kind of work. But it turned out that something in a different file was relying on namespace labels, and our naive approach had missed it. When we deployed this to staging, we got significantly reduced memory, but no logs were being emitted, which obviously defeats the point. This is where LLMs came in handy. They are a great lever when you are working in a codebase that is not familiar to you and that you do not spend your whole day thinking about. You can see it in the blog post. The prompt was not complicated: hey, I am not getting logs anymore, what is happening? It was able to fix that pretty easily.
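A sketch of the substitution being described, written in Go rather than Vector's Rust or VRL; the user- prefix and the label key in the comment are hypothetical stand-ins for Render's actual conventions:

```go
package main

import (
	"fmt"
	"strings"
)

// isUserNamespace replaces a namespace-label lookup (something like
// ns.Labels["render.com/user"], which requires list-watching namespace
// objects) with a pure string check on the namespace name, which every
// pod already carries.
func isUserNamespace(name string) bool {
	return strings.HasPrefix(name, "user-")
}

func main() {
	for _, ns := range []string{"user-123", "kube-system", "render-internal"} {
		fmt.Printf("%s -> user namespace: %v\n", ns, isUserNamespace(ns))
	}
}
```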
Bart Farrell: A 50% reduction is a massive win, and you could have stopped there. But your teammate Hieu looked at the numbers and something didn't add up. Vector was still using one GiB per pod, even without namespace data. What made him push further?
Brian Stack: Hieu is the beating heart of the infrastructure team here, I would say. It just did not make sense: where was this extra memory coming from? It sent me off on a deep adventure that turned out to have a rather surface-level solution at the end. One of the ways Go spoils you is its excellent profiling support, especially in a server environment. Rust also has a lot of profiling support, but it is harder to profile in a distributed environment. There is no profile endpoint that spins up as an HTTP server that Pyroscope can point at in quite the same way. We have people here who did research that involved Valgrind. I have never known how to pronounce it, but Valgrind is what I say. We were gearing up for heavy-duty Rust profiling work. Then, while doing one of the deploys to support this, I noticed that we actually had two places where we had been list-watching namespaces the whole time. If you have two Kubernetes log sources, and each one list-watches pods and namespaces, they both do this independently. Maybe Vector could be changed to share a common cache or something, but in our case it was much easier to disable it in both places. That brought the memory back down to negligible, so that was great.
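The "share a common cache" idea is exactly what client-go's shared informers give Go programs: ask the same factory twice and you get one underlying list-watch. A sketch for illustration (this is not how Vector is structured internally):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// Two "sources" asking for the namespace informer get the SAME object:
	// one list-watch against the API server and one in-memory cache, no
	// matter how many consumers hang event handlers off it.
	a := factory.Core().V1().Namespaces().Informer()
	b := factory.Core().V1().Namespaces().Informer()
	fmt.Println("deduplicated:", a == b) // true: the factory keys informers by resource type

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```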
Bart Farrell: You spent hours profiling, and you even asked the team if anyone had experience with Valgrind, which is a low-level memory debugging tool and quite deep in the weeds. Eventually, you found the answer. What were you missing?
Brian Stack: We were just missing that we were list-watching twice. As far as we can tell, if you define two Kubernetes sources, Vector creates two separate list-watches. It might not surprise you, but our Vector config is rather lengthy, and different teams have different perspectives and views into it. I was looking at the part I was most familiar with, but another part had a separate list-watch of all the same stuff. That is how we found that we were list-watching twice, and that both places needed this new flag we created to be set.
Bart Farrell: From nearly 4 GiB per pod down to tens of MiB, and 7 TiB freed across all your clusters. Beyond the raw memory savings, what changed operationally?
Brian Stack: This is the most important part in many ways. Using memory on a node is, in one sense, just a cost issue for us. We are willing to trade some cost efficiency for reliability, performance, and features. But the part I mentioned earlier, where restarting Vector really pummels the API server, had been the cause of multiple incidents for us in the past. You look at Vector and think, it is pretty safe, it is just forwarding logs. But if every node in your cluster is list-watching namespaces, and you have a large enough cluster with enough namespaces, then if Vector starts OOMing and crashing across much of the cluster, you can and will take down the API server. That is the unexpected way to take down the API server. The expected way we also ran into was literally doing a rollout of Vector. Over time, we had added a bunch of machinery to do a very measured, very slow Vector rollout so that we did not impact API server load. But that meant that if you had an issue with Vector and needed to fix it right away, it could take hours to roll out a new version. With this change in place, we can roll Vector like any other DaemonSet or process. It is lightweight and quick. Since then, we have had situations where that came in handy: we needed to quickly change how we were forwarding logs in the cluster, and it was done in maybe half an hour instead of taking a day.
Bart Farrell: So there were at least three moments where you could have declared victory. After the workaround, after the 50% drop, after the PR merged. Hieu's question pushed you past the second one. What kept you going instead of shipping the win?
Brian Stack: Render is pretty good at this, I think. I have really great teammates. I cannot say enough nice things about the people I work with. As a company, we are intent on moving quickly, and there are plenty of times where you should take the win and move on. But because there is a huge amount of surface area for a moderately sized team, we encourage people to dig a little deeper when they get the chance. We find that it nearly always pays dividends down the road. A lot of the techniques we used in this blog post have been followed up with subsequent work in other places. Taking the extra time to truly wrap your head around the problem helps a lot. Getting a project to 100% takes a little longer, but over time it actually makes you move much faster. At this point, we can mostly ignore Vector. If we had fixed it to 80%, we would still be dealing with issues all the time. That is the mindset.
Bart Farrell: You ended your article with a question, do you really need those namespace labels? For teams out there that are running large multi-tenant clusters, what should they be looking at in their own DaemonSets?
Brian Stack: I would say the first bit is to pay attention to your DaemonSets. It can be annoying to keep track of them all, even with all of the metrics you can emit and support these days. Watch for memory increases, and obviously watch for CPU increases. Almost more importantly, watch the API server. The API server emits a lot of fine-grained metrics, or at least we have fine-grained metrics; I think it is just the raw API server metrics we are looking at. The audit logs it emits are also very useful. Watch the API server to see which resources are being requested a lot, and whether that corresponds to an increase in CPU, memory, or something else. That is what you really want to keep your eye on, because when the API server starts to get squirrely, that is a bad day.
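As one concrete starting point (standard kube-apiserver metrics, not necessarily the exact dashboards Render uses): the stock apiserver_request_total counter is labeled by verb and resource, so a climbing list or watch count on the namespaces resource that lines up with a DaemonSet rollout is precisely the signature described in this episode.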
Bart Farrell: Brian, what's next for you?
Brian Stack: As I said earlier, I spend most of my time working on networking. We have a project I am working on right now that will deliver dedicated static IPs to customers who want that sort of thing. That is very useful for people who have to talk to a database and want to lock it down to IPs that they own. It will also help reduce our networking costs, which will hopefully let us offer networking bandwidth, the aspect we charge on, at a lower price.
Bart Farrell: If people want to get in touch with you, what's the best way to do that?
Brian Stack: I have an email at Render, brian@render.com. I have a Bluesky that is mostly personal. I don't talk about work too much there. But if you do want to talk to me there, I think I'm just imbstack.
Bart Farrell: Folks can find you there. Brian, thanks so much for sharing your time, knowledge, and experience with us today. I am sure a lot of people are going to benefit from this, and I look forward to hearing more about your work in the future. Take care.
Brian Stack: Thank you. It was great.