CoreDNS will fail you at scale (with default settings)
This episode is sponsored by Datadog — a single, unified platform for monitoring CoreDNS alongside the rest of your stack. Try it free for 14 days and get a free t-shirt.
In this KubeFM episode, Faris shares his experience managing CoreDNS and scaling a Kubernetes cluster with 900 nodes and 15,000 pods.
He shares the challenges and solutions encountered during an incident, providing valuable insights into maintaining a robust Kubernetes environment.
You will learn:
The importance of scaling the Kubernetes control plane for large clusters.
Strategies for optimizing CoreDNS to ensure efficient DNS resolution and prevent incidents.
The pros and cons of using VictoriaMetrics versus Prometheus for monitoring and observability.
Tips for maintaining a calm and effective team dynamic during high-stress situations.
Transcription
Bart: In this episode of KubeFM, I spoke to Faris about his experience wrestling with CoreDNS when he was working at a very large cell communications company. Scaling the Kubernetes control plane is crucial for managing large clusters. In addition, optimizing CoreDNS is essential for efficient DNS resolution and preventing incidents. Upgrading EKS clusters is also very important to stay up-to-date with the latest features and support. And of course, in all of this, we can't forget the critical element of good communication in teams, keeping a cool head when things go wrong, because they will, and trying to establish a blameless culture. A healthy CoreDNS service is critical to the performance of your Kubernetes applications. This episode is sponsored by Datadog, which provides a single platform for monitoring CoreDNS alongside the rest of your stack, whether it's your applications, cloud services, or the infrastructure that powers it all. Check out the link in the description below. All right, Faris, really quick: which three emerging Kubernetes tools are catching your attention?
Faris: The very first thing is Karpenter with a K. It's a scheduler that helps schedule pods of different sizes onto different-sized nodes. So that's one thing I'd be looking at. The second thing is Cluster API. It's an emerging tool that helps deploy various resources, including VMs. You can envision it as a platform for deployment; you can put a lot of things into it. It's emerging, and many use cases are developing with it. The third thing is Cilium. I haven't tried it yet, but it eliminates many issues like the CoreDNS problems we will discuss later today. Many things are changing drastically with eBPF. These are the things I've been thinking about.
Bart: To get a bit of background information about you, can you tell us more about who you are, what you do, and where you work?
Faris: My career started as a WSPA support person. I worked there for around two years and then moved on to building servers. We started building infrastructure and providing it to enterprise customers. Later, I moved on to Kubernetes. That's when Kubernetes and all those cloud-native technologies started emerging. Basically, we were migrating a middleware shop to a cloud-native shop. We were building Kubernetes the hard way, from scratch. We didn't have EKS, AKS, or similar services in those days; we were building things on bare metal. I later moved on to a project where we were building observability, mostly metrics and logging. Most Kubernetes clusters come with Prometheus, Fluentd, and Logstash, which are sufficient for small clusters. But when you grow to scale, it becomes a big problem. We had been working on that for the last two and a half years. So, basically, I am a middleware engineer turned DevOps engineer, and then observability engineer. That's about me.
Bart: The Kubernetes ecosystem moves very quickly. How do you stay up-to-date with all the changes? What works best for you? Blogs, videos, podcasts, what's best?
Faris: It's actually a mix of everything. I follow a lot of people on LinkedIn who share amazing content, like Learnk8s and Mutha Nagavamsi. If something piques my curiosity, I'll just go and check it out. If the article is good, then I'll dig into it on YouTube and look for any videos or keynote speeches. We just go and explore stuff to keep ourselves ahead of the curve.
Bart: Fantastic. And Farris, if you could go back in time and give your younger self one piece of career advice, one tip, what would it be?
Faris: I would tell myself to learn coding earlier than I did. I only started coding five years into my career. If I had the chance to meet my younger self, I'd ask him to start right away.
Bart: Right away, when the career starts, rather than only after the work you were doing in those first five years. Like you said, it's never too early; I think a lot of people would agree with that. All right, so as part of our monthly content discovery, we found an article you wrote about how the default settings for CoreDNS are going to fail you, and about scaling Kubernetes, based on your experience with a real incident. So before we discuss the main story, can we cover some background? What was the setup of the Kubernetes clusters that you maintained? Can you share that?
Faris: Sure. Our Kubernetes cluster is actually a managed one. We were using EKS, which is a managed service from AWS. We had over 900 worker nodes and around 15,000 pods running on it. We were hosting an observability platform there, based on VictoriaMetrics. It wasn't just stock VictoriaMetrics; it had some minor tweaks. It was huge for us, the biggest it could get. Initially, we were running Thanos for the first two years of the observability program in our project. That setup went up to 500 nodes. VictoriaMetrics touched 1000 nodes, and then we consolidated it back to 900 nodes. So, that's the environment we had at the time.
Bart: Regarding monitoring observability, Prometheus is probably the name that's on everyone's lips. Yet the setup you just described uses VictoriaMetrics. Do you have any insights you could share with us on the differences you found between the two? Are there pros and cons for both? What's your take on that?
Faris: Sure. If you're starting with telemetry and want metrics and dashboards that show what's happening in your applications, Prometheus is a very good place to start. You can use the stock Helm charts to install Prometheus, Grafana, and everything, and you get fantastic dashboards out of the box. As you scale to handling terabytes and petabytes, you'll need to scale your observability platform as well. VictoriaMetrics wasn't our first choice. We initially went from Prometheus to Thanos, which is an extension of Prometheus. It worked well for us for some time, but our infrastructure grew so large that it couldn't keep up. We had issues with Thanos restarting frequently and not performing well at our scale. We evaluated many other options, and VictoriaMetrics caught our attention, so we switched to that.

VictoriaMetrics hosts everything on disk instead of using S3 or other storage solutions, which was one consideration. We also had recording rules in Thanos to pre-compute heavy metrics, which gave us faster response times in dashboards, but those tasks took a heavy toll on our clusters. When we migrated to VictoriaMetrics, the response times were much faster, and we could reduce our costs. VictoriaMetrics stored data on disk, whereas Thanos stored it in S3, so the money we had been spending on Thanos went instead into the different components of VictoriaMetrics. Our goal was to make our platform stable and performant at scale. That's why we chose VictoriaMetrics. It's not that Thanos wasn't performing well; it just wasn't performing well at our scale. We also evaluated Cortex and M3DB; another team was handling that for us.

VictoriaMetrics worked well for us, though we had to make some adjustments. We initially tried the remote write option, where we could run an agent and write data to the cloud. We had many data centers and were exporting data to AWS. All data had to go through an ELB, and we had to pay for all the ingress data into our AWS account. This created huge costs, so we had to make some improvements. We moved the data into AWS using a different method, which I can't reveal due to company policy. We sent the data in, started consuming it, and pushed it to VictoriaMetrics. VictoriaMetrics has many components. If you have a large infrastructure and a significant budget, VictoriaMetrics might be helpful. For a normal four-node cluster and a stable monitoring system, a standard Prometheus and Grafana setup would be sufficient. For scaling, consider Thanos and VictoriaMetrics. That's what I'd suggest.
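As a rough illustration of the remote-write approach Faris mentions (an agent in each data center shipping metrics to a central VictoriaMetrics), here is a minimal sketch of running vmagent with a remote-write target. The namespace, image tag, endpoint URL, and ConfigMap name are placeholders for illustration, not details from Faris's setup.

```yaml
# Illustrative vmagent Deployment: scrape metrics locally and remote-write them
# to a central VictoriaMetrics endpoint. All names and URLs are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmagent
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      containers:
        - name: vmagent
          image: victoriametrics/vmagent:v1.93.0   # pick a current release in practice
          args:
            - -promscrape.config=/etc/vmagent/scrape.yaml         # Prometheus-compatible scrape config
            - -remoteWrite.url=https://vm.example.com/api/v1/write # placeholder central endpoint
            - -remoteWrite.tmpDataPath=/vmagent-buffer             # on-disk buffer if the link drops
          volumeMounts:
            - name: config
              mountPath: /etc/vmagent
            - name: buffer
              mountPath: /vmagent-buffer
      volumes:
        - name: config
          configMap:
            name: vmagent-scrape-config   # hypothetical ConfigMap holding scrape.yaml
        - name: buffer
          emptyDir: {}
```

In a setup like the one Faris describes, the cost question is less about the agent itself and more about where the written data crosses load balancers and account boundaries.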
Bart: Got it. Regarding VictoriaMetrics, in terms of alerting, is that provided by VictoriaMetrics, or do you and the team use other products? How do you usually set up the alerts on the cluster?
Faris: We already had an Alertmanager in place from when we were using Thanos and Prometheus, so we just kept using the same thing. VictoriaMetrics has a component called VMAlert that takes care of alerting. We have Prometheus-compatible alerts that run against the VictoriaMetrics clusters. It's a different component, but it does the same thing as with Prometheus, and we use the same Alertmanager setup as well.
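To make this concrete, here is a minimal sketch of a Prometheus-compatible rule file that a component like VMAlert could evaluate against a VictoriaMetrics cluster and forward to an existing Alertmanager. The group name, metric, and thresholds are illustrative assumptions, not values from Faris's setup.

```yaml
# Hypothetical Prometheus-compatible rule group; VMAlert loads rule files in this
# format and sends firing alerts to Alertmanager, just as Prometheus would.
groups:
  - name: ingestion-health
    rules:
      - alert: NoDataIngestion
        # Assumes a counter exposed by the ingestion path; the metric name is illustrative.
        expr: sum(rate(vm_rows_inserted_total[5m])) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No samples have been ingested for 5 minutes"
```

Keeping the rules in this format is what lets a team swap Thanos or Prometheus for VictoriaMetrics without rewriting its alerting.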
Bart: A robust mechanism for observing your cluster can help identify and troubleshoot issues. You probably saw this firsthand when you experienced the incident, which we'll get to in a minute. Can you share what happened on that one fine day?
Faris: We were three people in our team that afternoon when the alerts started coming in. There was no data ingestion happening. Since the platform feeds data to many downstream applications, a lot of people were consuming our APIs along with the dashboards, and we were getting a lot of reports that our API was not performing. We suddenly went into war room mode and started checking everything. Grafana was down with a 500 error. Our API dashboard, which we have in ELK, monitors all the queries going into the VM API. We were checking that, and the query count was actually zero. Something was happening. We logged into the EKS cluster and could see a lot of pods going into CrashLoopBackOff. Some components connect to EBS volumes for persisting data on disk, and some connect to RDS; Grafana connects to RDS. That was also not working for us. We noticed the node count, which had been around 900-plus worker nodes, was dropping drastically. As time passed, it kept going down and reached around 500. Most of the pods that ingest data in that cluster are controlled by HPA, and those pod numbers were usually around 20 for each site. Those numbers were also coming down to three or four, which was alarming.

Something was happening, but we couldn't pin anything down. Nothing was obvious, so we thought something was happening with AWS services. We checked with AWS, and there was no service outage. I had two other senior engineers with me, and we split up to check everything. One was checking the cluster autoscaler pods. We had a cluster autoscaler in place that adds nodes as needed. It wasn't adding anything; it was reducing the number of nodes, and that had implications. It was connected to the Metrics Server, which reports the actual CPU and memory usage. It was like a domino effect, reducing the number of worker nodes and pods, and we couldn't do anything. We tried adding more nodes manually, but even those were going down. We checked all other avenues. We asked AWS to create a new node group, since we couldn't do it ourselves because support for that particular EKS version had recently expired, and they said they couldn't help with this particular case. We pegged that node group at a particular number of nodes and started troubleshooting the rest. For RDS, we checked whether we could connect to it from outside EKS, and that worked, but not from inside EKS. Another colleague checked the EBS volumes, and they were not reachable. It sounded like a connectivity problem, but it didn't strike us initially.

We finally found out it was because of CoreDNS, and it was just a fluke that we found it. CoreDNS was going into crash-loop, and its restart count kept increasing. We had two CoreDNS pods running in kube-system, which come with the default add-ons when you install EKS. We checked the logs and found it was getting overwhelmed with all the requests. We thought about increasing its memory and CPU. You can do that by editing the deployment, but that would again break the connectivity between the pods while it rolled out. So we decided to increase the number of pods instead; just increasing the replica count would give us more capacity to serve connections without disrupting other pods. We initially increased it to three. The services started working fine, but then things started deteriorating again, and the connectivity between the pods and to RDS and the storage volumes was impacted once more.
We increased it to five instead of three, and that sped up the recovery of the cluster. We had more capacity to serve connections among the pods and to services outside the pods, and the whole system started recovering.
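As a rough sketch of the stopgap Faris describes, the default CoreDNS Deployment in kube-system can be scaled with `kubectl -n kube-system scale deployment coredns --replicas=5`, or declaratively with a patch like the one below. The replica count and resource values are illustrative assumptions, not the exact numbers from the incident.

```yaml
# Illustrative patch for the default CoreDNS Deployment that EKS installs in kube-system.
# Apply with: kubectl -n kube-system patch deployment coredns --patch-file coredns-scale.yaml
# (scaling replicas alone can also be done with:
#   kubectl -n kube-system scale deployment coredns --replicas=5)
spec:
  replicas: 5                  # up from the default two pods, to add DNS-serving capacity
  template:
    spec:
      containers:
        - name: coredns
          resources:           # raising these triggers a rollout, so during an incident
            requests:          # bumping replicas first is the less disruptive move
              cpu: 200m        # illustrative values, not the EKS defaults
              memory: 256Mi
            limits:
              memory: 256Mi
```

This mirrors the reasoning in the incident: new replicas come up alongside the pods that are still serving, whereas editing resources restarts the pods you are depending on.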
Bart: One thing, Faris, just for our guests out there, people listening to this: I imagine this was a pretty stressful situation to be in, in terms of the panic and even chaos going on at that point. A certain amount of stress can help us focus, but after a certain point, it just becomes overwhelming. What was your state of mind and thought process, both yours as an individual and your team's? What was that like, and how did you manage to get through all this?
Faris: Our team is actually a pretty cool one. To be frank, we just split the responsibility among ourselves: you go and check this, you go and check that, and we each went off and checked things. Our director is a cool guy, and he said, "I'll be on the call. I'll be there to support you. Just get the system up." That was the freedom he gave me. I have been in other situations where they put you on the spot and ask you to fix it right away. That itself brings a lot of stress, and you can't focus. Even the normal commands you run end up full of typos, and it backfires all the time. Handling things calmly makes the recovery faster. That's what I'd say.
Bart: With that being said, having this cool team of very sharp individuals that handled this, what was the eventual fix? What was the solution that you agreed on? How did that come about?
Faris: The initial fix for this particular issue was increasing the number of CoreDNS pods. That is a stopgap solution we could apply right away. The bigger idea is that when you are scaling to a thousand-plus nodes, you have to start considering the cluster components as well. Even though EKS is managed, we should think about how control plane and system components like CoreDNS, metrics-server, and others are scaling. You should monitor them, and having an HPA would help. These are very basic steps: increasing the replicas of the CoreDNS pods with respect to your cluster size, and adding an HPA based on how your cluster is growing. Those are the stopgap measures. For a more permanent solution, you can consider the cluster-proportional-autoscaler and a node-local DNS cache. There are other solutions out there too. There is a good article on the CoreDNS GitHub page about how to scale and tune CoreDNS; go and read that, it will be very helpful. There is also guidance on the queries per second that different CoreDNS images and resource settings can handle, so you can tune it for your queries per second and choose the correct one for your needs. There are many solutions in the community you can use. Also, most people these days are running managed Kubernetes services. That's easy when you are at a scale of 100 or 200 nodes, but when you go beyond 1,000-plus nodes, you should start thinking about how to optimize your Kubernetes control plane components as well, even though it's managed. That's something you have to keep in mind.
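As a minimal sketch of the HPA stopgap Faris mentions, an autoscaler can be attached to the CoreDNS Deployment so replicas grow with load. The thresholds and replica bounds below are illustrative assumptions, not values from Faris's cluster.

```yaml
# Illustrative HorizontalPodAutoscaler for CoreDNS on a managed cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 3          # keep headroom above the default two replicas
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out before CoreDNS gets overwhelmed
```

Note that this relies on a healthy metrics-server, which, as the incident showed, is not guaranteed; the cluster-proportional-autoscaler Faris mentions scales on node and core counts instead.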
Bart: It seems that DNS is a common source of problems for many people. In terms of this particular case, what did you and your team take away from this? What were the lessons?
Faris: The very first thing is that we should start thinking about scaling Kubernetes on the control plane side. The second thing is that we were using an EKS cluster that had recently gone out of support, which played a major role when we sought AWS support and they said they couldn't help because it was out of support. We should keep updating our clusters regularly; that should be a priority for the next quarter. And the third point is that we had HPA and alerting for all the other components, but for the control plane we only looked when something went wrong. We never really bothered about the control plane stuff in the last three and a half years. Those are the key takeaways we were considering.
Bart: It's interesting that you mentioned upgrading EKS. In one of our previous podcast episodes, we interviewed a guy named Matt Duggan, and he talked about Kubernetes LTS and upgrades. He said that a lot of teams don't have the necessary skills, or sometimes the time, to keep up with Kubernetes updates, and are sometimes afraid to break stuff. What deterred you and your team from upgrading your EKS?
Faris: Actually, we were migrating a lot of stuff from one version of VictoriaMetrics to another, so we had a parallel cluster built up and were in the process of switching to that EKS cluster within a month. Like you said, it's a huge effort to migrate from one Kubernetes version to another. It's easier to handle when you are switching from one cluster to another: you build the new one, switch things over, and you're done. But that's not always an option when you have a budget in mind; keeping a parallel cluster around even for a week doubles your cost, and that could be thousands or hundreds of thousands of dollars, depending on your scale. We had the privilege of having a cluster built beforehand because of version upgrades for the application running in Kubernetes, so we were able to run and test it, and we just switched things over. It's a good approach: if you want to go without any issues, you can create a parallel cluster, divert the traffic, and you're done. But if you're going to do an in-place upgrade, like you said, that's a huge effort and requires a lot of work, and that effort could be spent on other things. That's what I'd say.
Bart: Are there any other tips or recommendations about CoreDNS scaling that you'd like to share with our audience?
Faris: CoreDNS, as I mentioned before, comes as an add-on with EKS clusters. By default it runs two replicas with a setting of around 200 MB of memory and a little CPU. Based on your queries-per-second calculation, you can increase the number of replicas, or add an HPA if you want it to be automated. You can also consider using different images of CoreDNS, which have predefined settings for a given number of queries per second. Additionally, having a cluster-proportional-autoscaler is recommended for large-scale deployments. It helps significantly: it automatically adds more pods based on the number of worker nodes or CPUs being added. This is very helpful, and you can also implement a node-local DNS cache. These are some considerations. I've also read that Cilium can be beneficial in this case, but I haven't tried it myself.
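To make the cluster-proportional-autoscaler suggestion concrete, here is a minimal sketch of its "linear" scaling parameters, which size CoreDNS by node and core counts rather than by observed load. The ConfigMap name and the specific ratios are assumptions for illustration, not values from Faris's setup.

```yaml
# Illustrative ConfigMap consumed by cluster-proportional-autoscaler
# (the controller watches this ConfigMap and resizes the CoreDNS Deployment).
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler        # hypothetical name; must match the autoscaler's --configmap flag
  namespace: kube-system
data:
  linear: |
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "preventSinglePointFailure": true
    }
```

With these example ratios, a 900-node cluster would get roughly ceil(900 / 16) = 57 CoreDNS replicas, so the numbers need tuning against the workload's actual DNS query rate rather than being copied as-is.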
Bart: Thinking about your journey, you mentioned having a good team and staying calm. Do you have any recommendations for others on how to not freak out, how to not panic, and how to focus on the important things? I think that's valuable, whether it's for a problem with DNS or other stressful situations in life. What recommendations or strategies would you share?
Faris: Do not panic. We all tend to, but we have to understand the situation and focus on the solution. Just write it down; that's what I do. The first two things I focus on are getting a ticket raised and noting what to check. It only takes a few seconds to write things down, and then you go ahead and work through them. Otherwise, you'll lose track of things. If you have good teammates around, delegate tasks to them, and they'll help you out. If you are fortunate enough to have blameless post-mortems and a blameless team, that helps a lot; I'd like to see more teams like that. It helps a lot to get things done faster in those kinds of situations.
Bart: What was the reaction from the community to your story? What did people say?
Faris: The thought I had while writing this was that I had a lot of things on my mind. I just wanted to write it out, and I wrote it in one day. I was making corrections here and there, and we published it. In a day or two, people started noticing it. A guy called me and asked how my cluster was going and how I was debugging stuff. A lot of conversations got started. I have met a lot of people after that. One guy told me, "We don't often see such a big issue at this scale." That was eye-opening for me. We have seen a lot of different issues, but I just didn't have time to share them with people. So I thought maybe sharing a lot would be helpful for others to learn, and we can learn from each other. That's what came to my mind, and I am really humbled by all the responses. Thanks for everything.
Bart: Great. We've noticed that you're quite a prolific writer. What drives you? It's a lot of work.
Faris: Thanks for that, first of all. I tend to forget things sometimes, so I started writing a few things down so I can revisit them and refresh my memory. It's like a scribble notebook for me. I just started doing it on Medium, and on the side, it became useful for other people. That motivated me to write more and share more things that would be helpful for people. So I write because it's helpful for me, and as a byproduct, it started being helpful for others. It's something like that. Yeah.
Bart: It's a win-win. It's great. What's next for you?
Faris: A lot of learnings are coming in. I am learning a lot of other stuff, Terraform, Terragrunt, and cloud orchestration. I'll keep writing and sharing things.
Bart: And what's the best way for people to get in touch with you?
Faris: You can get in touch with me on LinkedIn. My email ID is also available there. You can reach out to me through that.