Surviving multi-tenancy in Kubernetes: lessons learned

Host:

  • Bart Farrell

Guest:

  • Artem Lajko

This episode is sponsored by Learnk8s — become an expert in Kubernetes

Is sharing a cluster with multiple tenants worth it?

Should you share or have a single dedicated cluster per team?

In this KubeFM episode, Artem revisits his journey into Kubernetes multi-tenancy and discusses how the landscape of (and opinions on) multi-tenancy have changed over the years.

Here's what you will learn:

  • The trade-offs of multi-tenancy and the tooling necessary to make it happen (e.g. vCluster, Argo CD, Kamaji, etc.).

  • The challenges of providing isolated monitoring and logging for tenants.

  • How to design and architect a platform on Kubernetes to optimise your developers' experience.

Bart: Multi-tenancy in Kubernetes refers to running applications from multiple tenants on the same cluster. In order to do so, engineers leverage tools like Argo CD or Flux, depending on your situation and preferences, vCluster and external DNS in order to be able to get that shared cluster with multiple tenants. We got a chance to speak to Artem, who works for the Hamburg Port Authority, and he shared with us his experience of what it was like doing this. And mind you, in a team that's quite small. Before we get into the episode, I'd like to thank our sponsor, Learnk8s. Learnk8s is a training organization that helps engineers level up in their Kubernetes career and their journey through courses that are either online or can be in person. Courses can either be public or private instruction, and they're 60% hands-on and 40% theoretical. Like I said, you can join from the comfort of your own home, or you can be in the courses in person, and you'll have access to the materials for the rest of your life. So you can keep revisiting that Kubernetes knowledge in case you decide that you want to brush up your skills. That being said, let's jump into the episode and hear about Artem's experience. Artem, welcome to the KubeFM podcast. Very nice to have you with us today. If you have a brand new Kubernetes cluster, which three tools would you install first?

Artem: I would install cert-manager first because I think it's essential for automating the management of TLS certificates. If you're using mTLS, it's crucial for securing communication. I love cert-manager, and when I deploy a new cluster, it's the first tool I install. Then I would deploy Argo CD for the declarative GitOps approach. From my point of view, a delivery tool like that is essential for platform engineers, both to deploy the platform context and to let the developer teams deliver the customer software. And the third tool, I think, would be Kyverno. As a policy engine, it lets you manage and enforce policies across the cluster without writing complex admission control webhooks, which is a nice advantage. So those are the three tools I would use.

Bart: All right, just out of curiosity, you know, every time a tool is chosen, there's maybe another alternative out there. Have you tried Flux when we're talking about GitOps? Or have you also tried OPA, Open Policy Agent, or OPA, depending on who you're talking to, on the policy side?

Artem: Yes, both. I tried Flux; when I started working with GitOps, it was my first tool, and then I switched to Argo CD. I used OPA two or three years ago because it's not limited to Kubernetes: you can also use it for your VMs and things like that. You write your Rego rules and then apply them. But on Kubernetes, it felt more like going "through the chest", as we say in Germany: back then, you wrote a rule and it could allow or deny, and that was about it. That was my experience with OPA. Yes, a lot of work has been contributed since, and it has become more like Kyverno, though not the same, so you can do a lot with it. I switched to Kyverno because it's easier for developers. I also work as a developer enabler for Kubernetes, and with Kyverno they only have to know Kubernetes manifests. Because of that, I would choose Kyverno over OPA as a policy engine today.
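
To make the point about "only having to know Kubernetes manifests" concrete, here is a minimal sketch of a Kyverno policy built as plain data in Python and printed as JSON (which is also valid YAML). The policy name, the `team` label, and the helper `require_label_policy` are hypothetical and were not mentioned in the episode; the field names follow the Kyverno v1 `ClusterPolicy` schema as commonly documented and should be checked against the installed Kyverno version.

```python
import json


def require_label_policy(label: str) -> dict:
    """Build a Kyverno ClusterPolicy that rejects Pods missing `label`.

    Hypothetical helper for illustration; not something from the episode.
    """
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": f"require-{label}-label"},
        "spec": {
            # Block non-compliant resources instead of only reporting them.
            "validationFailureAction": "Enforce",
            "rules": [
                {
                    "name": f"check-{label}-label",
                    "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                    "validate": {
                        "message": f"The label `{label}` is required.",
                        # "?*" is Kyverno's wildcard for "any non-empty value".
                        "pattern": {"metadata": {"labels": {label: "?*"}}},
                    },
                }
            ],
        },
    }


policy = require_label_policy("team")
print(json.dumps(policy, indent=2))
```

The whole rule is an ordinary Kubernetes-style manifest, with no separate policy language such as Rego involved, which is the trade-off Artem describes.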

Bart: And in terms of your work, can you just tell us a little bit more about who you are, what you do, where you're working?

Artem: Yes, who am I? My name is Artem Lajko. I'm from Germany, living in Dortmund. Most people don't know Dortmund, but maybe you've heard of it. We have a football team, Borussia Dortmund, BVB. Then we have companies like Adesso, founded in Dortmund. And we have a highly reputed technical university, TU Dortmund, which is also where I got my master's degree in computer science, with software development, infrastructure, and hardware engineering as topics. I started my cloud journey around 2016. We had a project for object detection with deep learning during my bachelor's degree, and that's where my cloud journey started. Most people ask me how statistics and Python are related to the cloud-native world and containers in 2016. The answer is that we had the challenge of attaching GPU cores to processes and providing them to each student in the research team. At the time, I think we used chroot, a Linux tool for isolating processes, and it worked semi-well. But Docker was emerging as a container engine back then, and we said, hey, let's write a Docker wrapper and use the Docker engine for the isolation. It worked very well for this approach. I thought, wow, this is really nice, so I looked deeper into Docker to see how I could use it for different topics and approaches. But I quickly ran into the limitations of Docker at the time, such as port management, kernel sharing, dependencies, and no horizontal scaling if you're working on a single host.
Because of that, I then did a deep dive into Docker Swarm, but Docker Swarm was a pain. I think it was 2018 when I looked for an orchestration solution, tried Docker Swarm, and realized that as a tool it was nice for the moment. Then I approached Kubernetes the hard way, building your own PKI and things like that. But it didn't end there for me with Kubernetes. I started to work with OpenShift 3.11 from Red Hat, which is based on Kubernetes, or is a downstream project of it, and I learned hands-on with that version. Then I also worked with OpenShift 4, which is a very different approach, but also nice to know. Then, I think in 2019 or 2020, I did a deep dive into the public cloud and used managed Kubernetes. It was really nice because the focus changes from "how do I run Kubernetes and rotate the nodes" to "what am I going to build". What I'm building now is more like a platform for the developers. You can call it an internal developer platform or platform engineering, however you like; different teams use different terms to describe the same things. This is where I started to build, with the developer in mind, what we know today as an internal developer platform, to act as an enabler of self-service for developers and to support them with new tools if I have to. I think it's a long introduction, but let me briefly summarize my career, because I think people also want to know how I learned these things and in which way. I have always preferred a hands-on approach and failed quite often, sometimes more often than I'd like in production environments. Some customers didn't think it was great at first, but they were super grateful afterwards because they became pioneers in these topics. At first they say, hey, new technology, I don't like it, you don't have much experience. But then they become pioneers and they like it.
There was no other way at the time: no proper documentation or courses, and hardly any experience available on the market. So there was nothing else but to try things out, and I saw the failures more as learning than as failure. I also used various learning platforms such as Udemy and KodeKloud, as well as books and blogs, to expand my knowledge or to understand why I had just deployed an anti-pattern. I personally see certifications as a snapshot. Although I have certifications like CKA, CKAD, Azure Solutions Architect Expert, and more, I would never call myself an expert. I rather use them as motivation to challenge myself and to think outside the box. So this is my way, and this is a short introduction to how I landed in the cloud-native world and how I learn. I think it gives people some insights.

Bart: Fantastic. No, definitely plenty of insights there. I don't know how you get the time for that, but we'll get to that at the end. Because it does require a significant amount of time and how you manage that. But last but not least, in terms of looking at your career path, if you could go back, let's say when you were talking about, let's say 2018, 2017, if you could go back and share one career tip with your previous self, what advice would that be?

Artem: I think I would start earlier with using managed cloud services like Azure Key Vault, to focus more on productive work and not on trying to keep alive and maintain all the services you need. I would do it earlier than I did, because I started the hard way, in on-premise data centers where you have to deploy everything self-hosted and maintain it. So this would be, I think, my tip to my younger self.

Bart: No, that's good. Now, we invited you today because you wrote an article about Kubernetes and multi-tenancy. Just, you know, before we dive too much into the details, can you explain what multi-tenancy is in Kubernetes and why you needed a multi-tenant cluster?

Artem: In my opinion, multi-tenancy in Kubernetes refers to the concept of running applications for multiple tenants (different teams, departments, even clients or customers) on the same Kubernetes cluster. But I think this definition fit very well two or three years ago. Today, I think multi-tenancy is more of an approach to provide dedicated clusters via a management cluster, where the control plane runs on the management cluster as a tenant, as with Kamaji or the Kubermatic Kubernetes Platform, I think it's called KKP, and you can bring your own node pool. So I think the multi-tenancy approach has changed between how we saw it two or three years ago and how you can establish it today. I think it's very important for teams: if you have limited resources, or you have sustainability goals coming from the top of your company, it may be the right solution for you. Or if, like us, you want to give newcomers a cold start, it may also be a good solution. We work ticket-based, and working ticket-based is a pain: you sometimes have to wait one or two weeks to get a new Kubernetes cluster on our on-premise infrastructure. So we use this approach to give newcomers the same experience as on a dedicated cluster, so they can start very early, and they can destroy it. So I think this may be a reason to use multi-tenancy.

Bart: And so when you were, you know, working with the Hamburg Port Authority, what were the tools that you used to share your cluster with several tenants?

Artem: We are using Argo CD HA, the high-availability version, to manage the platform context and tools like ExternalDNS and cert-manager on both of our infrastructures. We have two: on-premise, where we use Tanzu as the Kubernetes distro, and the cloud, where we use Azure Kubernetes Service, AKS. By adding clusters to it, Argo CD allows us to build a Kubernetes service catalog for the different clusters, because a tool like ExternalDNS works differently on-premise than in the cloud, and to deploy our platform context to different endpoints like Azure. So that is part of our tool stack, and we use an Argo CD core instance for every team. Every team gets an Argo CD core instance to deploy their own applications. We use this approach on the shared cluster and also on the dedicated clusters. Then we use things like ExternalDNS, cert-manager, and an ingress controller, everything you need to provide something like a web service that is reachable from outside. For RBAC, we use AD groups on-premise and Azure Active Directory groups in Azure. In the cloud, we use Terraform to provision the cloud infrastructure. On-premise, we do it the old way, with tickets, to get the Tanzu namespaces in which to deploy the guest clusters. This is our stack.

Bart: I mean, it's quite a stack. Can you describe the architecture of the shared cluster a little bit more in detail?

Artem: Yes, I can try. Our shared cluster runs on vSphere with Tanzu, which allows you to use node pools, like the other distros; I think every distro allows that. We use node pools with labels, and we use vCluster, which can be deployed on specific nodes via a node selector. That's nothing new in Kubernetes. We defined a default node pool for all the tools needed by the platform team, like the ingress controller, cert-manager, and ExternalDNS. And we created dedicated node pools for the developers, who are our customers. So every developer gets their own tenant with a dedicated node pool. This approach is no different from our dedicated clusters per project: every team gets an Argo CD core instance as an entry point, and they start to work like on a dedicated cluster. They only get the URL and the credentials for the Argo CD core instance, and they start to work. So we treat our Kubernetes clusters like a managed service in the cloud. The developers don't need access to them; they have it, but they don't need it. The workloads of the teams are isolated from each other. We don't use taints to separate our workloads on the shared cluster, because we trust Kubernetes: we have our default node pool, which is empty, and the Kubernetes scheduler should do the work. We also reduced the stack on the shared cluster. We don't provide the same stack there as on the dedicated clusters; we removed things like monitoring, because if you have to handle that part as well, you need a team that does nothing else but maintain the multi-tenancy approach, and we wanted to avoid that. And we only developed this approach because we work ticket-based, like I said, and it can take a long time to get Tanzu namespaces. We wanted to enable a faster cold start and provide a learning platform for our developers, especially for the newcomers.
Because of that, we don't need very hard tenant isolation: we use it as a learning platform and for non-critical workloads like documentation. If the documentation is down, it's not so nice, but it happens.
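
The node pool placement described above rests on the Kubernetes `nodeSelector` rule: a pod is only eligible for nodes whose labels contain every key/value pair of the selector. The following is a minimal Python sketch of that matching rule, not the actual scheduler; the node names and the `pool` label key are hypothetical and were not mentioned in the episode.

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """Return True if the node carries every key/value pair of the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical cluster: one default pool for platform tools, one pool per tenant.
nodes = {
    "node-default-1": {"pool": "default"},
    "node-team-a-1": {"pool": "team-a"},
    "node-team-a-2": {"pool": "team-a"},
    "node-team-b-1": {"pool": "team-b"},
}


def eligible_nodes(node_selector: dict) -> list:
    """List the nodes a workload with this selector could be scheduled on."""
    return sorted(
        name for name, labels in nodes.items()
        if matches_node_selector(labels, node_selector)
    )


# A vCluster pinned to team A's pool can only land on team A's nodes.
print(eligible_nodes({"pool": "team-a"}))

# A workload without any selector is eligible for every node.
print(eligible_nodes({}))
```

The second call shows the trade-off of skipping taints: a selector restricts where a tenant's workload may go, but nothing keeps an unpinned workload out of a pool. That fits the soft isolation Artem describes for a learning platform.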

Bart: So in terms of creating this multi-tenant setup, how long did it take from start to finish?

Artem: I think it took us four or five weeks, because we started with planning and proofs of concept. Before we started with vCluster, we wanted to do it the native way and build multi-tenancy with Argo CD alone. But that didn't work very well for several reasons, such as project separation. Applications have to be deployed in the Argo CD namespace; I know the Argo team is currently working on allowing applications to be deployed in other namespaces. The problem with Applications is that whoever deploys them can change the project to default or anything else, and you have to prevent that. So you can't provide the same approach as on our dedicated clusters. Tenants would only get a Git repository as the starting point where they can create Applications, and we didn't like that. Then we also tried Capsule from Clastix, the company behind Kamaji. But Capsule works very well with Flux and not with Argo CD. Then we landed on vCluster. I think it took us four or five weeks to build the first setup that was production-ready for us as a learning platform for non-critical workloads.

Bart: And for teams that might be out there in terms of understanding this process, you know, and the rollout of a multi-tenant cluster, distributing the time that it took for planning as well as creating a proof of concept. Are there any tips that you'd like to share that people might want to keep in mind knowing what you know now?

Artem: Yes, of course. If you are building a multi-tenancy approach, you have to think about everything as multi-tenant. If you are providing monitoring with Grafana or the kube-prometheus-stack, you also need an idea of how you map your AD groups or your AAD groups to it and how you separate tenants. For every tool where you say, hey, at this point I need harder isolation, you have to keep this in mind. That's why we reduced the stack. Otherwise, in my opinion, you end up with only disadvantages: maybe you reduce your hardware resources, but the engineering resources maybe go up two or three times. I don't know exactly, but I think it's not worth it then. So you have to ask yourself whether you need this hard isolation, or whether you should spend some money on dedicated clusters and say, hey, this is easier for us, we use one approach and don't combine the two. But it also depends on your team size and your skill level. There are a lot of topics and points you have to keep in mind.

Bart: With all the work that went into this, you know, the planning, the rollout, getting all the different stakeholders involved, serving these internal customers, was it worth it? It sounds like a fair amount of work.

Artem: I like to experiment and try new approaches. As a learning path, I would say yes. But if you try to use this approach for really hard isolated workloads, not soft multi-tenancy, I think you have to put more hours in. And there I think no, because, yes, we reduced our resource consumption, like storage; I think we saved approximately, I don't know, across the projects maybe 200 or 300 gigabytes. That's not much for most companies. But the engineering time is not reduced, it's increased, because we now have to maintain two approaches, and a new tool like vCluster also changes and also has CVEs. The developer experience was good with this approach. The developers said, hey, we don't notice a difference between the dedicated cluster and the shared cluster, you have done a good job. But I wouldn't recommend this approach if you have a small team like ours. We are only two people, and we have over 20 projects, I think 300 VMs, and over 20 clusters. If you try to maintain two approaches, it doesn't work very well, unless you like to work over 40 hours a week, like 60. Then you can make it work, but it's not healthy, definitely not.

Bart: And, you know, after all this work, how did your customers, the developers, perceive the new platform? What kind of feedback did they give? And, being a two-person team, how did you keep up with their requests once it had been delivered, in order to improve the platform?

Artem: They like it because it feels similar to the dedicated cluster. The only difference is the login process: they have to log in to the Kubernetes cluster and then again to the virtual cluster. But that's not strictly necessary. The requests we get are very easy to handle. We created templates that they can fork. Then they create a pull request, we approve it, and Argo CD does the work, because in the end it's all Kubernetes manifests. They do all the work themselves: if they need a new virtual cluster for a project, they create a ticket to get an AD group that we can map, create a pull request, we approve it, Argo CD does the work, and they get a new cluster. If they need something specific, like other tools, they have to deploy it themselves. We don't deploy things like RabbitMQ as a message broker for them. But if they need a new cluster, it's very easy: fork the template, open a pull request, we approve it, and thanks to GitOps and the declarative way, Argo CD does the work for us and keeps everything in sync.
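
As a rough illustration of the fork, pull request, and sync workflow described above, the sketch below renders an Argo CD `Application` manifest for a new tenant from a small Python template. The repository URL, the `tenants/<team>` path layout, the `tenants` project, and the namespace naming are all hypothetical placeholders, not details from the episode; only the `Application` field names follow the Argo CD schema.

```python
import json


def tenant_application(team: str, repo_url: str) -> dict:
    """Render an Argo CD Application for one tenant's virtual cluster.

    Hypothetical template: names, paths, and project are illustrative only.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"vcluster-{team}", "namespace": "argocd"},
        "spec": {
            "project": "tenants",
            "source": {
                "repoURL": repo_url,
                "path": f"tenants/{team}",
                "targetRevision": "main",
            },
            "destination": {
                # The in-cluster API server address, i.e. the shared host cluster.
                "server": "https://kubernetes.default.svc",
                "namespace": f"tenant-{team}",
            },
            "syncPolicy": {
                # Automated sync keeps the cluster aligned with Git after merge.
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true"],
            },
        },
    }


app = tenant_application("team-a", "https://git.example.com/platform/tenant-template.git")
print(json.dumps(app, indent=2))
```

In a setup like this, the pull request would add the rendered manifest (or the values that produce it) to the platform repository, and the platform team's approval is the only manual step before Argo CD reconciles it.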

Bart: Good. And, you know, a controversial question, but more so just to figure out, you know, what are the pros and cons as there are trade-offs with every decision that has to be made. Would it have been easier just to have separate clusters instead of having a shared single cluster?

Artem: I think for us, yes, because we are only two people and it's a lot of work. We underestimated it; it's really a lot of work to maintain a second approach. For us, yes, it would be easier to have only the dedicated cluster approach.

Bart: And in terms of the next steps in the future, what are the plans around extending this setup? Do you envision the multi-tenant cluster evolving? What are your thoughts on that?

Artem: We will shut down the approach, because we have an enterprise architect, and the enterprise architect said, hey, we use only one approach: the dedicated cluster approach. So we are shutting down all multi-tenancy, even if we waste resources. At the moment, it's more stable for us to have one approach that we maintain and that we use to scale with the projects.

Bart: Now, being a small team of two people, in terms of the knowledge that has to be acquired and shared, how do you keep up with all this? I know in the beginning you mentioned how you've learned Kubernetes the hard way, you've looked at blogs, things like that. But in the process of building this, were you going to the Kubernetes documentation? Were you using Slack? How did you learn and keep up with all this?

Artem: To keep up to date, I use blogs to see what's going on in the market and what different people and companies are building. In my opinion, this is the fastest way to stay updated, because publishing a blog post is much easier than publishing a book. Then I use conferences like KubeCon for exchange, also with my colleague, and to keep up to date about the trends: what other companies are building, how they failed, and what they learned. We try to avoid the same errors. This is how I learn and try to stay updated. But most of it happens through blogs and through exchange with experts.

Bart: That's one thing. Another thing that I did want to ask, though, is that you seem very efficient in terms of finding and consuming these resources. How do you organize your time? Is it like, all right, on Mondays, I'm going to go read a blog, on Tuesdays, I'm going to watch a talk that's on YouTube? How do you organize yourself in terms of structuring the acquisition of these topics?

Artem: It's very easy for me, because I have my working day, and my company also gives me time during working hours. If I need a certification or something like that, I say, hey, we don't have any urgent topics, so maybe I can take half a week to study for it. If I don't get the time from the company, I use my own time. I start working at seven and finish at 3 p.m., or sometimes 4 p.m. Then I take a rest, a power nap, then maybe I do some sports, and then I study from 7 p.m. to 9 p.m. This is how I work every day, one to two hours. Once you have a flow, it's nothing special for the brain, because it's like a workout for retaining new information. But you also have to sacrifice a lot of your private time. Sometimes my girlfriend says, "Hey, you, I'm also here."

Bart: That's important. That's very important. Yeah. It's very important. No, but I think there are different things there. One thing is, you know, that organizations respect their employees and give them the resources that they need to do their jobs properly. And one of those things being time and also support regarding certifications. I also wanted to, just a quick question. You mentioned sports. What kind of sports do you do?

Artem: I do things like Freeletics. [unclear] I don't do any extreme sports. I just do something to keep the body up to date, like Freeletics. I also use ZAP. Things like that.

Bart: All right, good. No, the main thing is finding something that works for you, just like whether it's learning, also having something related to exercise can also be good. Now, what's next for you? Can we expect, you know, further, you know, knowledge sharing around the hard work being done by a two-person team for the Hamburg Port Authority? What do you want to do next?

Artem: At the moment, I'm writing a book with a colleague from Italy who is now living in Switzerland. We are trying to share our experience with GitOps deployments for Kubernetes; I think the book is going to be called something like that, this is the title we have chosen. He has more of the developer's point of view, and I have more of the platform engineering view. I get a lot of requests on my blog posts about how I do things, and now we want to put it in a book. This is my next big topic, I think. I'm also going to KubeCon in Paris next year to exchange, stay updated, keep learning, and network with teams, because we are also searching for a managed Kubernetes solution like the Kubermatic Kubernetes Platform. The HPA is searching for a stable solution. At the moment, we are building a solution that we call a Kubernetes service catalog, but every open source tool has to be tested very well for integration and things like that. We also have some plans to maybe look for a provider that will handle the platform engineering part, like Giant Swarm, I don't know if you've heard of it. Adidas uses it because they don't have platform engineers. They have, I think, over 4,000 developers, and Giant Swarm provides the Kubernetes clusters they need. We may also go this way. Because of that, I also have deep insights into providers that use Cluster API in different ways, like Kamaji, KKP, or Giant Swarm.

Bart: And the book, I don't want to put too much pressure on you, but when will that be coming out?

Artem: I think in the middle of next year, May or June. It depends.

Bart: Okay, but still good. I mean, a tentative date. I don't want to hold you to it, but good just so that folks know that'll be coming out. And if people want to get in touch with you, what's the best way to do that?

Artem: LinkedIn, direct message over LinkedIn. It's going to be the shortest way and the easiest way.

Bart: That's how we met. So I'm a witness that I can say that it worked very well. Got a quick response. Anyway, Artem, thank you so much for your time and also for all the effort that you spend sharing knowledge. It's extremely helpful to have people like you in the ecosystem that are helping others level up. I look forward to your next steps when the book's coming out. Hopefully we can hear more about that and also cross paths in KubeCon in Paris. So in the meantime, take care and we'll see you soon.

Artem: Thank you for the invitation. I try to contribute my way to the community.

Bart: Keep up the great work. We'll see you soon. Cheers.

Artem: Cheers.