98% faster data imports in deployment previews
Host:
- Bart Farrell
This episode is sponsored by Loft Labs — simplify Kubernetes with vCluster, the leading solution for Kubernetes multi-tenancy and cost savings.
Are you facing challenges with pre-production environments in Kubernetes?
This KubeFM episode shows how to implement efficient deployment previews and solve data seeding bottlenecks.
Nick Nikitas, Senior Platform Engineer at Blueground, shares how his team transformed their static pre-production environments into dynamic previews using Argo CD Application Sets, sync waves, and Velero.
He explains their journey from managing informal environment sharing between teams to implementing a scalable preview system that reduced data seeding time from 19 minutes to 25 seconds.
You will learn:
How to implement GitOps-based preview environments with Argo CD Application Sets and PR generators for automatic environment creation and cleanup.
How to control cloud costs with TTL-based termination and FIFO queues to manage the number of active preview environments.
How to optimize data seeding using Velero, AWS EBS snapshots, and Kubernetes PVC management to achieve near-instant environment creation.
Relevant links
Transcription
Bart: In this episode of KubeFM, our guest, Nick Nikitas, a senior platform engineer at Blueground, shares his story about the challenges he faced with static pre-production environments, how his team tackled performance bottlenecks, and why deployment previews have transformed the developer experience. We'll discuss questions like what led Blueground to rethink static pre-production environments, how they navigated the costs and complexity of multiple environments, and what unique solutions emerged for implementing Kubernetes-based previews using GitOps, Argo CD, and Velero. This episode is packed with lessons on balancing cloud costs, reducing friction in deployments, and crafting a scalable strategy for production-like previews. If you're looking to optimize your Kubernetes setup or just stay up to date with cloud-native tech, you'll definitely enjoy our conversation. This episode is sponsored by Loft Labs, the creators of vCluster. How are you managing multi-tenancy in Kubernetes? Namespaces can be insecure and lead to noisy neighbor issues, while provisioning separate clusters is expensive and an ops headache. vCluster solves this by running secure, isolated virtual clusters on your existing infrastructure, saving you time and money. Virtual clusters launch in seconds, offer better isolation, and scale effortlessly at a fraction of the cost. Try it for free at vCluster.com. Now, let's check out the episode. So, Nick, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?
Nick: One of the tools I keep an eye on is Karpenter, a Kubernetes-native node autoscaler developed by AWS. It offers advanced scaling capabilities by automatically launching the right compute resources for your workloads based on their specific requirements. I believe that after Karpenter's graduation from beta, its new features around disruption budgets and consolidation control will allow users to fine-tune the balance between cost and application availability.
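For context, here is a minimal sketch of a Karpenter NodePool showing the consolidation and disruption-budget settings mentioned above; the names and values are illustrative assumptions, not a recommended configuration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Let Karpenter choose between spot and on-demand capacity.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumes an EC2NodeClass named "default" exists
  disruption:
    # Consolidate nodes that are empty or underutilized to cut costs.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    # Disruption budget: never voluntarily disrupt more than 10% of nodes at once.
    budgets:
      - nodes: "10%"
```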
Another tool I'd like to mention is Traefik, an open-source cloud-native load balancer commonly used as an ingress controller in Kubernetes clusters. Traefik supports SSL termination and uses service discovery to dynamically configure routing, which simplifies traffic management across microservices.
Finally, I've been interested in Argo Rollouts, a Kubernetes controller that provides advanced deployment strategies for managing the rollout of applications in a more controlled and progressive manner. Since it supports sophisticated features like blue-green deployments and canary deployments, I believe it brings a lot of value to production environments, allowing for safer, more gradual releases with the ability to pause, monitor, and roll back changes based on real-time data.
Bart: Now, what do you do and who do you work for?
Nick: I'm a Senior Platform Engineer at Blueground. Blueground is a Greek company that specializes in fully furnished rental apartments for medium- to long-term stays with flexible lease terms, operating in numerous cities worldwide. Our platform team at Blueground manages the cloud infrastructure that allows development teams to build, deploy, and scale applications smoothly. It is responsible for the CI/CD pipelines and the monitoring and observability of the overall system, as well as for improving the developer experience by providing several internal automation tools.
Bart: And how did you get into Cloud Native?
Nick: During my studies for my Master of Science in Machine Learning and Data Science, I experimented with Docker and Kubernetes. I decided to implement a thesis that combined big data technologies like Apache Spark with several DevOps tools. I became increasingly interested in cloud technologies and the process of deploying and orchestrating distributed systems. This led me to dive deeper into cloud native systems, focusing on tools and technologies around containerization, infrastructure as code with tools like Terraform, and automation with CI/CD pipelines.
Bart: Now, the Kubernetes ecosystem moves very quickly. How do you stay up to date? What are your go-to resources?
Nick: Staying up to date in the Kubernetes cloud native ecosystem is crucial because it's constantly evolving with new tools. To stay current, I regularly visit the blogs of cloud providers like AWS. I also read the topics of conferences like KubeCon and listen to the presentations that interest me, to get a better understanding of possible tools I'd like to explore. Additionally, I've subscribed to several newsletters, such as Kube Weekly, to receive a short list of important news on a regular basis.
Bart: Now, if you could go back in time and share one career tip with your younger self, what would it be?
Nick: Well, I would tell my younger self to actively seek out mentors and communities, because I believe that learning from others' experiences and sharing knowledge in the community can accelerate growth and open up opportunities I wouldn't have found on my own.
Bart: Good advice. As part of our monthly content discovery, we found an article you wrote, titled "98% Faster Data Imports in Deployment Previews." In 2022, Blueground faced challenges with their static pre-production environments. Can you walk us through what those challenges were and why they became a problem?
Nick: Although we had one to two dedicated pre-production environments for each engineering team, each running on a Kubernetes cluster, we noticed there was informal sharing of environments between teams. This often led to miscommunication, especially when there was a need for ad hoc debugging. Also, developers within a team would start implementing several features at the same time, so a static environment would end up being tied to a single feature. This structure was not scalable at all.
Bart: What options did you consider to solve this, and how did you arrive at your final approach?
Nick: One way we considered tackling this problem was to spin up more fully fledged pre-production environments per team, but this would result in higher cloud costs, since we would have to allocate more resources that would be available 24/7. In addition, having these static environments shared among more teams would worsen the problem of teams not being able to work in a frictionless manner without blocking each other's feature development. We decided to implement our own flavor of deployment previews on Kubernetes, inspired by platforms like Vercel and Netlify. We made that decision because we believed that an ephemeral, self-serve environment would greatly improve the developer experience, as well as reduce maintenance and cloud costs.
Bart: Implementing deployment previews on Kubernetes sounds pretty complex. Can you tell us about the tools and technologies you considered and why you ultimately chose GitOps and the specific tools you did?
Nick: We realized that creating a custom solution for deployment previews would require a significant implementation effort. One possible solution was to utilize Terraform, with a Terraform apply and Terraform destroy on the fly. However, since we had already adopted Argo CD for a GitOps approach in our tool stack, we considered Argo CD Application Sets, particularly the pull request generator feature, which was well-suited for our case. An Argo CD Application Set is a Kubernetes Custom Resource Definition (CRD) that automatically generates Argo CD applications. In our stack, we think of an Argo CD application as a fully-fledged environment that hosts the whole product. The PR generator was also convenient, as it could automatically discover open pull requests within a repository with a certain label and generate an Argo CD application.
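For readers unfamiliar with Application Sets, here is a minimal sketch of a pull request generator; the organization, repository, label, and paths are hypothetical placeholders, not Blueground's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-environments
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: example-org          # hypothetical organization
          repo: example-app           # hypothetical repository
          labels:
            - preview                 # only PRs carrying this label get an environment
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 120      # how often open PRs are re-discovered
  template:
    metadata:
      name: 'preview-{{number}}'      # one Argo CD Application per matching PR
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/example-app.git
        targetRevision: '{{head_sha}}'
        path: deploy/preview
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-{{number}}'
      syncPolicy:
        automated:
          prune: true                 # closing or unlabeling the PR tears the environment down
        syncOptions:
          - CreateNamespace=true
```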
Bart: Now I do have to ask one question that's not on script. Did you ever evaluate Flux, and if so, what was the deciding factor that made you choose Argo?
Nick: Well, from the start, we had read and watched a lot of talks about Argo CD and how easy it is to integrate into your tool stack and work with infrastructure as code following GitOps principles. So we decided to just use Argo CD instead of Flux.
Bart: That's a great use of existing tools. How did you handle the actual deployment process and ensure everything was set up correctly?
Nick: Apart from the aforementioned Argo CD features, we also used Argo CD sync waves to deploy resources in a specific order. We wanted the preview environments to be as generic as possible. For example, for some environment variables of specific applications, we had to establish conventions based on namespaces, which made them pretty dynamic. The process of building container images and pushing them to AWS ECR is handled by GitHub Actions. After the pipeline completes, the developer waits until the Argo CD application is healthy. A comment is then added to the PR with information on the deployment result, which includes either the available endpoints of the services or a failure message.
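As a small aside, one common way to derive per-environment configuration from the namespace is the Kubernetes downward API; the manifest below is a hypothetical illustration of that idea, not Blueground's actual convention:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-service              # hypothetical workload
spec:
  containers:
    - name: app
      image: registry.example.com/example-service:latest   # placeholder image
      env:
        # The application can build hostnames, queue names, and similar values
        # from the namespace it runs in, keeping the manifests fully generic.
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
```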
Bart: You mentioned using Argo CD sync waves for your deployment approach. Can you elaborate on how this works for people who might not be familiar with it, and why it's beneficial for your preview environments?
Nick: Argo CD supports executing sync operations in steps. Within a sync phase, you can have one or more sync waves that ensure certain resources are healthy before subsequent resources are synced. This can be combined with resource hooks, such as pre-sync or post-sync hooks. The feature played a crucial role in our case, as we wanted to deploy our self-hosted data stores, wait until they are healthy, then seed them with anonymized production data, and finally deploy our services.
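A rough sketch of how such a seeding step can be expressed with sync waves and hooks; the resource names and image are made up for illustration, with the data stores assumed to carry sync-wave "0" and the services a later wave:

```yaml
# Runs in wave 1, i.e. only after the wave-0 resources (the data stores)
# have synced and reported healthy; the application Deployments follow in wave 2.
apiVersion: batch/v1
kind: Job
metadata:
  name: seed-data                                     # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: seed
          image: registry.example.com/seed-job:latest  # placeholder seeding image
          command: ["/bin/sh", "-c", "echo 'restore anonymized data here'"]
```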
Bart: Now, managing these preview environments must have been challenging. How did you handle environment termination to keep costs under control?
Nick: Managing preview environments was definitely a challenge, especially when it came to keeping costs under control. The key question was figuring out the right time to destroy a deployment preview. Initially, we relied on the Argo CD ApplicationSet controller, which would automatically terminate the preview environment when the associated pull request was either closed or unlabeled. However, this alone wasn't sufficient to address our concerns, particularly with long-running or forgotten PRs. We needed a more versatile termination strategy to further reduce costs and limit the number of long-running preview environments.
To address this, we introduced additional conditions: a TTL of 48 hours for each preview environment and a FIFO queue that enforced a global limit on the number of active preview environments at any given time. We set up a Jenkins pipeline job that runs at specific intervals, removing the label from the open pull request when either the TTL or the capacity criteria are met, which then triggers the preview environment's termination.
Bart: Sounds like you had a solid system in place. However, in 2023, you mentioned that you identified some performance issues. What were these issues and how did you discover them?
Nick: At the end of 2023, we conducted user interviews where our colleagues shared their biggest pain points regarding developer experience. We found that the time required to have an available preview environment had become a serious performance bottleneck, taking more than 20 minutes to complete. Some people told us they sometimes refrained from using previews when they were in a hurry. The main issue was the actual seeding of anonymized production data into our ephemeral environments using pg_restore commands. This simply didn't scale: the data import duration grew proportionally with the ever-increasing size of our databases.
Bart: Quite a substantial wait time. What was your initial approach to addressing this performance bottleneck?
Nick: We explored several potential solutions to optimize the data import process. As a first option, we considered using Kubernetes StatefulSets and the Container Storage Interface (CSI), specifically the CSI Snapshotter. Since we have self-hosted data stores in our pre-production environments, we use Persistent Volumes and Persistent Volume Claims to dynamically create cloud-provisioned volumes, which in our case are AWS EBS volumes, through the AWS EBS CSI Driver. The idea was to take a snapshot of a PV and, on every restore, simply refresh the data stores' PVCs with new data from that snapshot. However, we hit a limitation: volume snapshots are namespaced resources. This meant we couldn't restore a snapshot into a freshly created environment's namespace, which is exactly what our use case required.
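To make that limitation concrete, here is a hedged sketch using the CSI snapshot API with placeholder names: the VolumeSnapshot lives in a namespace, and a PVC can only reference a snapshot in its own namespace, so a PVC in a brand-new preview namespace has nothing to point at:

```yaml
# A VolumeSnapshot is a namespaced object...
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
  namespace: data-export-base            # hypothetical "base" namespace
spec:
  volumeSnapshotClassName: ebs-csi-snapclass
  source:
    persistentVolumeClaimName: postgres-data
---
# ...and a PVC restored from it references the snapshot by name only, so it
# must live in the same namespace as the snapshot. A PVC in a freshly created
# preview namespace cannot point at the snapshot above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: data-export-base
spec:
  storageClassName: ebs-csi
  dataSource:
    name: postgres-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```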
Bart: Given that limitation with PVC snapshotting using CSI Snapshotter, how did you eventually solve the problem?
Nick: We then learned more about Velero, which has the flexibility to back up any Kubernetes resource, store volume data as AWS EBS snapshots, and restore everything seamlessly into any chosen namespace. We thought it would be a perfect candidate, so we deployed Velero in our Kubernetes cluster using its Helm chart, together with the required plugins. We also used the AWS EBS CSI Driver to manage the lifecycle of the EBS volumes, as well as the CSI Snapshotter to handle the events around volume snapshots and volume snapshot contents.
Bart: That also sounds like a significant change to your infrastructure. Can you walk us through how this new system populates data in your preview environments?
Nick: We have a process in place that anonymizes data and exports it to AWS S3. We created a new, isolated namespace where a scheduled data export job runs, and an EBS snapshot of its PVCs and PVs is then taken using the Velero CLI. This namespace serves as the base for every volume restore operation. When a preview environment is spun up, a new ephemeral namespace is created, and we use Argo CD sync waves to create everything in an ordered manner. The first step is a Kubernetes Job that triggers the Velero restore operation using the snapshots generated from the backup volumes in our base namespace, provisioning the corresponding PVCs and PVs in the target ephemeral namespace. In the next wave, our data stores are provisioned, and once they are healthy, our applications are deployed.
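As a sketch of what that restore step can look like, here is a hypothetical Velero Restore object with placeholder backup and namespace names (the same operation can be expressed with `velero restore create --from-backup ... --namespace-mappings ...`):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: preview-1234-data              # hypothetical restore per preview environment
  namespace: velero                    # Velero's installation namespace
spec:
  backupName: data-export-latest       # placeholder name of the scheduled base-namespace backup
  includedNamespaces:
    - data-export-base                 # base namespace holding the seeded volumes
  namespaceMapping:
    data-export-base: preview-1234     # re-create the PVCs/PVs in the new preview namespace
  restorePVs: true                     # restore PersistentVolumes from the EBS snapshots
```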
Bart: That's an impressive optimization. What kind of improvement did you see in terms of performance?
Nick: With all these changes, the introduction of new tools allowed us to reduce the data import operation time from 19 minutes to a maximum of 25 seconds. This optimization had a huge impact on operational efficiency for our teams, as it dramatically increased their productivity and enhanced the developer experience. As a result, our teams now strongly prefer using our preview environments, as they can achieve much more in less time.
Bart: Those are fantastic results. Now, looking back on this journey, what would you say are the key takeaways for other teams looking to implement or optimize their own deployment preview systems?
Nick: I think the biggest takeaway from this journey is the importance of cross-domain collaboration, involving teams across different domains, such as engineering, DevOps, and product. This ensures that we uncover and address pain points that might not be visible in isolation. Feedback from users is also essential, as they often identify inefficiencies or areas for improvement that we may overlook from a technical perspective. Another key lesson is the value of leveraging existing tools rather than reinventing the wheel. For example, in our case, we utilized tools like Argo CD and Jenkins that we had already integrated into our stack to create a more robust system that had a huge impact on the overall development process for our teams.
Bart: We checked out your LinkedIn profile and it looks like you're quite a fan of Rust. Is that correct?
Nick: It's been really fun diving into a new language after a few years. Over the last few months, our platform team has been integrating more and more internal tools written in Rust. Rust has become a key part of our tech stack due to its performance benefits and memory safety guarantees. One of the most exciting projects we've worked on recently is Rust Witcher, a data anonymization tool we developed from scratch in Rust, which is now open source. If you're interested, we've published an article about it on the Blueground Engineering Blog, where we dive deeper into the technical details.
Bart: What's the best way for people to get in touch with you?
Nick: I use the LinkedIn platform very often. So feel free to get in touch with me there.
Bart: Nick, thank you very much for sharing your time and knowledge with us today. We look forward to crossing paths in the future. Cheers.
Nick: Thank you very much, Bart. It was my pleasure.
Bart: Take care. Cheers. Bye-bye.