The Making of Flux: The Origin


Host:

  • Bart Farrell

Guests:

  • Alexis Richardson
  • Andrew Martin
  • Chris Aniszczyk

Join the Flux maintainers and community at FluxCon, November 11th in Salt Lake City. Register here.

This episode unpacks the technical and governance milestones that secured Flux's place in the cloud-native ecosystem, from a 45-minute production outage that led to the birth of GitOps to the CNCF process that defines project maturity and the handover of stewardship after Weaveworks' closure.

You will learn:

  • How a single incident pushed Weaveworks to adopt Git as the source of truth, creating the foundation of GitOps.

  • How Flux sustained continuity after Weaveworks shut down through community governance.

  • Where Flux is heading next with security guidance, Flux v2, and an enterprise-ready roadmap.

Transcription

Bart (Host): In the world of cloud-native infrastructure, few stories capture open source resilience like Flux. Born as a Weaveworks side project to tame container chaos, it became the blueprint for GitOps, treating Git as the single source of truth for everything you run. Today on KubeFM, we trace Flux's arc from hackday code to CNCF graduation, through the shock of its founding company shutting down, and into a renaissance under new stewardship. This isn't just a technical tale. It's human. Maintainers catching CVEs on Christmas. Banks and telcos betting production on an open repo. A community deciding whether to watch a project die or fight for it. Guiding us in this first episode are three voices: Alexis Richardson, Weaveworks' co-founder and the man who coined GitOps; Chris Aniszczyk, the CNCF leader who marched Flux from sandbox to graduation; and Andrew Martin, the Control Plane CEO who stepped in when Flux needed a new home. We return to 2015, the Kubernetes wild west.

Alexis Richardson: We had been thinking about what we were trying to solve with Weaveworks. We believed that the mission of the company was to help developers build applications, and we could see that containers were going to be critical to doing that. We'd learned that VMs were not a good way to build applications in the cloud systematically and at scale, for various reasons, but with PaaS (platform as a service), Docker, and Kubernetes we could see a different path emerging around containers. The basic problem with containers, though, is that they are really operational tools; they're not necessarily developer friendly. And we could see Docker coming in and creating a developer wrapper around the container, which was very exciting indeed.

With Weaveworks, we wanted to fill the other gaps application developers needed filled in order to manage an application, and we thought the most important challenges people were having were, first of all, around networking; that led to the creation of WeaveNet. From there we wanted people to see how their application was constructed, so we built something called Weave Scope, a monitoring, management, and visualization tool. Then we wanted people to actually do things with their applications, so we tried various approaches to adding services to networks and application environments using Docker Swarm, some of the Mesosphere tools, and Kubernetes. And gradually we realized that we needed to run this as a SaaS in order to deliver the value of the functions we were building.

Running it as a SaaS meant that we had to deploy an actual containerized stack, and we were one of the very first companies to do this. I remember that Zalando in Europe had built one, and obviously there was GKE. Not many other people were building production container runtimes at that time. We had to choose which tools to use, started building our own, and soon found it was just way too complicated. So we decided, after some debate, to pick Kubernetes as the runtime we would run our own SaaS on, and the SaaS would provide the capabilities to application developers that I described: networking, visualization, management, and monitoring for these application tools. So you can sort of see we were feeling around here.
But while we did that, we had to deploy this Kubernetes tooling, and a tool was built for deploying Kubernetes and for deploying applications to Kubernetes, based on the principles we had laid out; that tool became Flux.

Bart: Turning Flux from prototype to powerhouse meant hardening, and that starts with scale. Alexis returns to recount how they dogfooded Flux across multiple clusters, and how GitOps emerged from that experience.

Alexis Richardson: We had got to the point where the team was successfully using an internal version of Flux as a deployment tool, in the way that Flux is now recognized, with this reconciliation-loop design. And I remember very vividly sitting in the office on a summer afternoon a bit like this one in London today. It was a very peaceful day. And I heard one person in the corner of the office say, "When I press this next key, I could probably wipe out the entire systems of our SaaS and everything else." I went into kind of Matrix-movie slow-motion mode, reaching across into the middle of the room to say no, but it was too late. I heard the click, and then about half a second later the very distinctive "oh." And then he said, "I wiped out our entire production systems."

What was interesting to me was that the team then sprang into action and, with a bit of Google DNA here, went straight into SRE crisis mode, with a squad dedicated to getting the site back up. They managed to get all of it up and running again in about 45 minutes, which I thought was pretty amazing, because it was quite a complex thing at the time. It was much more complicated than it needed to be; we had pulled things together to get Kubernetes working that we shouldn't have, there were bits and pieces of Amazon here and there, and it was just a mess. So I was amazed that it was back up in 45 minutes.

Afterwards I said to the team (and you've got to realize I know nothing about this, so 45 minutes seemed like a quick time to me): how did you do that? That sounds like quite an accomplishment, something we should be interested in. Is it significant? And they said, well, it is, because we're following this practice where everything we do for the site is sequentially deployed from one place. The place we're using is GitHub, but you could use other Git implementations or even other source control systems. It has the property that it's a persistent system of record, a source of truth. It understands sequences of things and versions, and we can also annotate changes on our commits with metadata, like "this is a patch I'm rolling out on Tuesday because I need to" or "this won't work, the bug needs to be fixed by next week," or other useful things that people can understand when they're redeploying the system.
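
The practice the team described, everything for the site deployed sequentially from one Git repository that acts as the system of record, can be sketched in a few lines. This is an illustrative sketch only, not Weaveworks' internal tooling or Flux itself; the repository URL, directory layout, and use of `kubectl apply` are assumptions made for the example.

```python
# Illustrative sketch: Git as the system of record for deployments.
# The repo URL, directory layout, and use of `kubectl apply` are assumptions,
# not a description of Weaveworks' internal tooling or of Flux.
import subprocess
import time

REPO_URL = "https://example.com/platform-config.git"  # hypothetical config repo
CLONE_DIR = "/tmp/platform-config"
MANIFEST_DIR = "deploy"                                # manifests live in the repo

def sync_repo() -> str:
    """Clone the repo if needed, pull the latest commits, and return HEAD."""
    try:
        subprocess.run(["git", "-C", CLONE_DIR, "pull", "--ff-only"], check=True)
    except subprocess.CalledProcessError:
        subprocess.run(["git", "clone", REPO_URL, CLONE_DIR], check=True)
    rev = subprocess.run(
        ["git", "-C", CLONE_DIR, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return rev.stdout.strip()

def apply_manifests() -> None:
    """Apply whatever the repo says; the cluster is never edited by hand."""
    subprocess.run(
        ["kubectl", "apply", "--recursive", "-f", f"{CLONE_DIR}/{MANIFEST_DIR}"],
        check=True,
    )

if __name__ == "__main__":
    while True:
        revision = sync_repo()
        print(f"deploying revision {revision}")  # the commit history is the audit trail
        apply_manifests()
        time.sleep(60)  # production only pulls from Git; nothing is pushed into it
```

The essential property is in the main loop: production only pulls from the repository and applies what it finds, so the sequence of commits doubles as the deployment record.
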
Bart: Deployment pain wasn't Weaveworks specific. Every ops team felt it. Flux escaped the lab, and the ecosystem took notice. The CNCF, the Cloud Native Computing Foundation, home of Kubernetes and Prometheus, saw potential. But graduation demands more than just clever code. Chris Aniszczyk, CTO of the CNCF at the Linux Foundation, explains what Flux had to prove to earn that elite badge.

Chris Aniszczyk: Yeah, sure. CNCF has a set of maturity levels for all of our projects, right? We have sandbox, the early, experimentation stage; then incubation, which is fairly stable, with multiple companies involved in the effort and a clear governance model; then graduation, which is basically "this is a project a company could bet its business on." It's a stable open source project that exhibits the qualities and ingredients of a successful open source project for the long term.

Flux went through that evolution over its history in CNCF. When it came in at incubation, around 2021, it was definitely Weaveworks heavy, which made sense since it was the origin company; they did the majority of the work, but there were plenty of contributions from other organizations. Then, when the project graduated in late 2022, it basically had to demonstrate that multiple companies were involved, with maintainers from at least two different companies. They did that. They had a security audit (I think they actually had two security audits at that point), so they demonstrated that they could adapt to security issues as they arise. Those are essentially the requirements, and it hit that level and formalized its graduation in late November 2022. Not many projects get to that level; only about 10% of CNCF projects sit at the graduated level, so it's a very unique place to be.

Bart: Security conquered, adoption exploded. Banks and telcos weren't just users; they were dependent. Alexis explains why declarative, pull-based delivery became a compliance superpower.

Alexis Richardson: One of the things that surprised me about Flux and GitOps is that regulated industries actually liked it in a way I wasn't anticipating. The reason they liked it is that in an early blog post we drew a chart showing the classic DevOps sideways figure of eight, the infinity symbol that normally represents the dev cycle and the ops cycle interlocking and flowing around each other continually. And I drew a dotted line down the middle and labeled it the idempotency barrier, because the idea was that if you are in production you can pull things from the design source of truth idempotently, but you can't trash the system by pushing things into it. Later on I had loads of conversations with people; Gartner started using that picture, and big banks like JP Morgan and Citigroup came to me and said, we really like this picture, it captures exactly the philosophy of dividing production from dev that we want in a world where we sometimes need to fix production. So that was great.

The only problem was that when we designed GitOps, we didn't stop to think about approvals, workflows, and things like that, so we were completely silent on that. Those are things people have built on top in enterprise products. For example, coming back to your earlier question about enterprise requirements, pre-deployment and post-deployment hooks in Flux are a nice way of introducing validation into what should be a continuous flow; you could even have a pre-deployment hook that blocked on somebody checking in Jira whether a ticket had been completed, if you really wanted to be hardcore. That would be one way to introduce approvals. What typically happens is that these workflows are sufficiently complex that people embed all of that in GUIs. Weaveworks had an enterprise GUI for such things, and you can get it in Harness and tools like that.
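
A hedged sketch of the approval idea Alexis mentions: a pre-deployment gate that blocks the flow until an external ticket is marked done. This is not a Flux API; the ticket endpoint, field names, and statuses are invented for illustration, and a real setup would wire such a check into whatever hook or pipeline mechanism is in use.

```python
# Illustrative approval gate in front of a GitOps apply step.
# NOT a Flux API: the ticket endpoint, field names, and statuses are invented.
# The point is only that a pull-based pipeline can pause on an external
# approval signal before reconciling a change.
import json
import sys
import urllib.request

TICKET_API = "https://tickets.example.com/api/issues/{ticket_id}"  # hypothetical

def ticket_is_done(ticket_id: str) -> bool:
    """Return True if the change ticket tied to this deployment is approved."""
    with urllib.request.urlopen(TICKET_API.format(ticket_id=ticket_id)) as resp:
        issue = json.load(resp)
    return issue.get("status") == "Done"  # assumed status field and value

def pre_deployment_gate(ticket_id: str) -> None:
    """Exit nonzero so the surrounding automation refuses to reconcile the change."""
    if ticket_is_done(ticket_id):
        print(f"ticket {ticket_id} approved; proceeding with deployment")
    else:
        print(f"ticket {ticket_id} not approved; blocking deployment")
        sys.exit(1)

if __name__ == "__main__":
    # The ticket id might come from a commit trailer or a pipeline variable.
    pre_deployment_gate(sys.argv[1])
```

The gate's only contract is its exit code: nonzero means the surrounding automation refuses to continue, which is why such checks are easy to bolt on in front of an otherwise continuous flow.
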
Bart: Delivery patterns keep evolving, from machine learning pipelines to air-gapped factories. Chris shows how Flux is stretching far beyond simple app deploys.

Chris Aniszczyk: GitOps, in some form, had existed for quite a while: you version declarative artifacts in a Git repo, and they get continuously reconciled toward the desired state. This kind of thing has been happening since the old-school configuration languages of the Puppet and Chef days; there were bits of GitOps in there. But I think what Flux really did, before anyone else, is push what are now known as the modern GitOps principles. Things have to be declarative. They have to be versioned and immutable, preferably in a Git repo. Things are pulled automatically: you have a system or an agent that checks for that desired state. And then things get reconciled, because there's always drift and things change. That's the whole GitOps idea, and Flux was one of the modern, very Kubernetes-native approaches that really helped push these principles a lot further across the ecosystem.

Obviously in CNCF we have a multitude of approaches to doing GitOps-style things. There's Flux and Argo, but we also look at things like Crossplane, which is very much a GitOps-style thing. We recently even accepted OpenTofu in CNCF, the Terraform fork; that's a whole other kind of GitOps-style system. So I view Flux as the father of modern GitOps, really showing how to do it in a very Kubernetes-native fashion. That's how I look at both Flux and the original principles of GitOps.
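
A compact sketch of the loop Chris describes, with the four principles (declarative, versioned and immutable, pulled automatically, continuously reconciled) marked in comments. The "repo" and "cluster" here are in-memory stand-ins so the example runs on its own; Flux's controllers implement this loop for real against Git and the Kubernetes API.

```python
# Sketch of the GitOps reconciliation loop. The "repo" and "cluster" are
# in-memory toys (assumptions for the example), not real data sources.
import time
from typing import Dict

GIT_REPO: Dict[str, dict] = {        # desired state, as if read from a Git revision
    "podinfo": {"image": "example.com/podinfo:6.7.0", "replicas": 2},
}
CLUSTER: Dict[str, dict] = {}        # live state, as if read from the Kubernetes API

def fetch_desired_state() -> Dict[str, dict]:
    """Declarative, versioned, immutable: in reality, check out a commit and parse manifests."""
    return {name: dict(spec) for name, spec in GIT_REPO.items()}

def reconcile() -> None:
    desired = fetch_desired_state()              # pulled by the agent, never pushed to it
    for name, spec in desired.items():
        if CLUSTER.get(name) != spec:            # drift between Git and the cluster
            print(f"reconciling {name}: {CLUSTER.get(name)} -> {spec}")
            CLUSTER[name] = spec                 # converge live state toward Git
    for name in [n for n in CLUSTER if n not in desired]:
        print(f"pruning {name}")                 # removed from Git, removed from the cluster
        del CLUSTER[name]

if __name__ == "__main__":
    for _ in range(3):                           # continuously reconciled (trimmed to 3 loops)
        reconcile()
        time.sleep(1)
```
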
Bart: Only around 10% of CNCF projects reach graduation, a milestone that signals technical maturity, community strength, and long-term sustainability. But for Flux, that milestone was just the beginning of a much bigger test. At some point, Weaveworks ran into financial difficulties, casting a shadow over the future of Flux. That's when Control Plane decided to step in to ensure the project's longevity and sustainability. Let's hear the story from Control Plane CEO Andy Martin.

Andrew Martin: Control Plane has been a huge fan of Weaveworks since day zero. They moved the needle. They launched the first cloud-native networking component with WeaveNet, which would bust through firewalls and made deployments very easy. They launched GitOps. They launched Firecracker-based, virtualization-driven ways of bootstrapping clusters. They launched kubeadm. They launched EKS-D. Control Plane is based in London just down the road, and some of the most incredible engineers were working there. And we supported that mission. We wrote a GitOps hardening white paper. We put assurance time into helping them get through Flux v2. Personally, I worked in TAG Security supporting Flux's graduation as a CNCF project. As a declarative paradigm, it reduces the attack surface for an application. The way that Kubernetes works, we pray: we position our hopes and dreams, give them to the orchestrator, and hope that magically they'll be reconciled. And the guarantee of Flux is that it takes all of the toil and effort out of that process.

We've always loved Flux, and when, very sadly, Weaveworks ran out of runway, it was beyond obvious to us that we should say: here's some financial support for the project; go ahead and make sure this is sustainable and safe, and that there's confidence in the community. Those conversations developed into: well, how sustainably can we build Flux? And the answer Stefan and I came to is that we can build an enterprise offering based on Control Plane's reputation working with highly regulated organizations, and we can sell support for Flux as a pure open-source upstream project. We're selling Control Plane's expertise, passion, and long-term commitment, along with the guarantee that deploying GitOps with Flux to best practice will be best-in-class, highly optimized, and the most secure way you can run a Kubernetes cluster.

Bart: Let's look at the transition from Weaveworks to Control Plane from the CNCF perspective, with Chris Aniszczyk.

Chris Aniszczyk: What happens in CNCF when a project goes through a turbulent time, or something happens with its supporting companies, is that CNCF is generally put in the middle as a mediator or facilitator to ensure that the project remains stable. When a company that employs the majority of maintainers has to wind down, we prefer to work with them and be as transparent as possible with the community. So we got a heads-up that this was happening. We usually make a call at our board or technical board level: hey, this is happening, we're going to do a call for maintainers; are there any companies or folks involved that want to step up and help out? We've done that plenty of times in CNCF's history. We did it for Spinnaker in the past, when CD was short-staffed, and Amazon and Red Hat stepped up. In this case, a similar thing happened. What was a little unique and different is that Control Plane stepped up very, very quickly in terms of hiring the actual maintainers of the project. That was different from previous situations, where a company just steps in and says, all right, we'll get some maintainers on the project; Control Plane came in and quickly hired the talented engineers, like Stefan, and the folks who helped drive the project. They were pretty quick about doing that. And by doing that, they gave the project stability, because at the end of the day, maintainership for any CNCF project is held by the maintainer, not the company. So if a maintainer goes to a new company or starts their own thing, the maintainership sticks with them.

Bart: Which begs the question: what's next? Andy closes us out with Control Plane's roadmap and how they plan to keep Flux both innovative and boring. Boring not in a bad way; what we mean by that is reliable.

Andrew Martin: There are fundamentally three work streams ahead. The first and most important is ensuring that the core maintainers have the ability to continue developing upstream Flux and supporting its roadmap based on community engagement. The CNCF Flux project remains the foundation of everything that we're doing, ensuring stability, innovation, and long-term viability for the project.
This means prioritizing contributions that improve end-user experience, security, performance, and usability, whilst keeping the project aligned with the ever-evolving needs of the cloud-native ecosystem. Control Plane is committed to fostering a strong, diverse community of contributors, enabling sustainable development and maintaining the CNCF governance model that keeps Flux an open-source and universally trusted project.

The second work stream began at a time when changing interest rates led to a shift in venture capital from investment in infrastructure to AI. Nevertheless, privately funded, we commenced development of the Flux operator to enhance the upstream project in alignment with our core open-source-first principles, and we released the operator under an AGPL license. It provides workflow capabilities and quality-of-life improvements for the Flux project, particularly for massive-scale hub-and-spoke or sharded deployments, but also robust, enhanced security controls, in line with the Kubernetes philosophy behind the admission controller, to support deployments of all sizes. Continuing to develop the Flux operator remains a key focus as we add value without polluting the core open-source Flux CD project itself.

And the third work stream is, of course, continuing to invest time, effort, and love into Control Plane Enterprise for Flux. This is our response to the needs of regulated industries that require Flux CD to operate at scale, with a hardened, enterprise-grade distribution that meets compliance and security requirements now and tomorrow. To support this, Control Plane Enterprise provides 24/7 global support, ensuring operational security and reliability for organizations that rely on Flux in business-critical environments every day.

Bart: A deployment script born from frustration became a CNCF graduated project that powers mission-critical infrastructure worldwide. Flux's journey proves that with resilient governance, shared ownership, and relentless community work, open source projects can outlast their creators and keep evolving. But this story is just getting started. In the next episode, we sit down with Michael Bridgen, who wrote Flux's first lines, and Stefan Prodan, the maintainer who rebuilt it into Flux v2. They'll reveal why Flux nearly merged with Argo, how the Flux operator tackles day-2 chaos, and what innovations are coming next. Until then, keep shipping and keep it declarative.