Network Policies are the wrong abstraction

Host:

  • Bart Farrell

Guest:

  • Ori Shoshan

Network Policy usage is inverted.

It's easier to list the services that you want to connect to, but Network Policy forces you to list all clients that can connect to your pod.

How would you even know that another team plans to connect to your apps?

But if Network Policy is not the right tool, then what should you use?

In this KubeFM podcast, you will explore:

  • How Network Policies are not as bad as you might think, but they are low-level APIs that are not always practical to use directly.

  • Intent-based Access Control (IBAC) as a higher-level abstraction to describe your network segmentation requirements.

  • How you can use IBAC to generate Network Policies, Istio Authorization Policies, AWS IAM policies and roles, and more.

Transcription

Bart: Network policies are not the right abstraction. What would be the right abstraction then? This is something you'll be hearing about in this episode of KubeFM. We'll be diving into this topic with Ori. We know that networking can be tricky and managing the correct network policies can also be tough. That's why Ori took the time to detail this in an article that we'll be speaking about in today's episode of KubeFM. KubeFM is a podcast that dives into the latest and greatest trends in Kubernetes, hearing from experts sharing their knowledge: content from engineers, for engineers. My name is Bart Farrell, and I'm the vivacious voice of KubeFM, where your cluster is always running smoothly, your latency is low, and there is absolutely no downtime. Let's take a look at the episode and see what Ori had to say. Ori, welcome to KubeFM. First things first, if you had a brand new Kubernetes cluster, which three tools would you install?

Ori: So, Otterize? No, I'm just kidding. I guess you'd have to go with cert-manager to manage external certificates, external-dns to manage the DNS records for those certificates and work with cert-manager to issue Let's Encrypt certificates, and the HAProxy Ingress Controller. Those are, I think, the bare necessities to get a cluster running with workloads that are internet accessible. And I actually see that the HAProxy Ingress Controller isn't the most popular; most guides would recommend Nginx. But it's sort of a personal bias that I have, an affinity to HAProxy, because I've used it multiple times even before cloud native and always had really good experiences. Actually, the first time I tried the Nginx Ingress Controller, I tried to have multiple Ingress resources pointing to the same hostname, and it didn't work out of the box. I had to reconfigure the controller, and I was kind of surprised, because that's something I thought was pretty basic. I'm not sure it was exactly that, I'm pulling it out of my memory now, but I was surprised it didn't just work like it did on HAProxy when I was just testing something out. So... we love HAProxy. And in Hebrew, "Ha" is like "the" in English, so HAProxy is literally The Proxy, which is also a great gag.

Bart: That's good. It's a nice language choice too. All right. So we have that cleared up. It's good to hear as well, because you mentioned even before, you know, you got into Kubernetes, thinking about your journey into cloud native. You also mentioned a little bit about Otterize, but tell us really quickly, who are you and what do you do?

Ori: Hi, I'm Ori, I'm the CTO at Otterize. What we do is make declarative zero-trust access really easy. Basically, each backend service declares its intentions, what it needs to access, other services, other databases, in a high-level Kubernetes resource, and Otterize then goes and figures out how to configure your infrastructure to make it work, whether that's configuring Kubernetes network policies, the service mesh, or even AWS IAM policies and roles, whatever you have. As an aside to that, I still get to do quite a bit of hands-on work, even though I'm the CTO, which is awesome.

Bart: That's not common. So good for you. Congratulations. You could write a book about how you're able to do that because a lot of people would like to know.

Ori: The secret is a great team. I always joke that my ultimate goal at Otterize is to be a completely useless CTO, because the company will be so self-sufficient that I'm not going to be necessary in any capacity.

Bart: Well, that's a really nice thing to say. And shout out to your team for giving you the capability to do the management side and also still be in touch with the technical side, because it's a common trade-off where people often feel like they have to sacrifice one or the other. Now, you've been doing programming for quite some time, and we can go as far back as we want. Apparently you were an MS-DOS user by the age of three. All right. So I'd like to know a little bit about your journey, how you got into cloud native.

Ori: So I guess saying I was an MS-DOS user is a bit of an exaggeration, because, you know, I knew how to type dir, find the executable for Command & Conquer, and run it. At the age of three, it's basically a magic incantation: you know which buttons to press in what sequence. So how did I get into cloud native? I think my first encounter with cloud native was actually not with Kubernetes. It was in 2016 at a company called GuardiCore, as in guard and core, which is now part of Akamai. I was developing on Marathon, which was running on Apache Mesos. It pretty much lost the container wars, but it's still out there. It was pretty cool back then, but honestly, compared to Kubernetes today, it was kind of clunky and things didn't work very smoothly. Just to run a service as a developer, you had to have all the ops know-how as well. Later that year, I started building for customers who were running on Kubernetes, but our stack was still on Marathon and not Kubernetes. So Kubernetes I really encountered in 2018 at Rookout, which is now part of Dynatrace. I was fortunate to join a team that was already on Kubernetes, so I had a working example of how things should be. I don't know if I'm going on at too much length, but I think the basic Kubernetes tutorials are quite lacking, because they tell you this is how you create a pod and this is how you create a deployment, but they don't tell you what an actual app deployment looks like, with all the different pieces, and what the common patterns are. It's like telling you, here's a pencil, here's an eraser, now go build an app. It's not showing you "this is how you write".

Bart: I like that. So there are certain assumptions built into that. I think that's something some people might encounter as well: all right, explain it to me as if I'm the three-year-old who's trying to get basic MS-DOS commands down so I can play Red Alert, right? Until we can break these things down into simple concepts that we can explain to our friends and our neighbors, maybe we need to rethink this a little bit. In your experience, though, you had tried out Apache Mesos in 2016. In terms of how you learned Kubernetes, what was your approach? What were some of the resources or techniques you used to get to the point where you were more comfortable?

Ori: So I think at the time there weren't many great guides, but it really helped that I had a working example. You know, we were using Helm. We had Helm charts and requirements between different Helm charts for different microservices at Rookout. I think the big epiphany that really helped me grok how Kubernetes works is understanding that Kubernetes YAMLs are just a serialization of a Go object. When you kubectl apply a YAML, it gets saved into the cluster, and there is a software component in the cluster that deserializes it and then just loops, checking: is the configuration in this object the actual reality? So all a Deployment YAML is, is a Deployment object that says you need to create five pods with this configuration, and then there's an infinite loop that checks: hey, do I have five pods with this configuration? No? Then I go and change things to make it five. And each of these components is really standalone, because they're all just looking at one resource and doing actions in terms of other resources or the system. Before that, it kind of seemed like the internet and my colleagues were basically saying, if you do this, magic is going to happen and a container is going to spin up. The problem with thinking things are magic, and maybe that's fine as a three-year-old trying to run Red Alert, is that when things don't work, you really have no idea where to start. That realization, which in my mind was really aligned with how things worked in Java in the old world, you know, you serialized objects into XMLs, beans, all that, and those became Java objects that you then worked with, turned what was a complicated distributed system in my mind into a series of really simple independent components. Which turned out to be really how it works, because that later helped me get into building Kubernetes controllers when I was a bit more advanced in my journey. When I got to that point, it suddenly all made sense. I also realized that all the new words, like deployment, pod, and so on, are just new words for existing concepts. When you're learning a programming language, one is the same as the next one; the only difference is the syntax. So here it was basically the same concepts, the same mechanisms, even with the serialized XMLs in Java and the serialized YAMLs. But the problem was that you're thrust into this world where they've basically changed all the words, and you have to first figure out the lingo before you can figure out how things work.
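
To make that mental model concrete, here is a minimal sketch (the names and image are illustrative): the Deployment below is nothing more than a serialized object declaring a desired state of five pods, and the deployment controller loops forever comparing that declared state to the pods that actually exist.

```yaml
# A Deployment is just a serialized desired state. A controller in the
# cluster keeps comparing spec (what you asked for) to reality and
# creates or removes pods until the two match.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                    # illustrative name
spec:
  replicas: 5                          # "I need five pods with this configuration"
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0   # illustrative image
```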

Bart: I think that segues nicely into the next question: if you had to go back and give advice to yourself as a Kubernetes beginner in 2018, what advice would that be? Is it, you know, in terms of the lingo, understanding that maybe it's going to take some time to get the lay of the land, get these key concepts down and then move forward, or perhaps something else?

Ori: The thing is, in 2018 I was into Kubernetes as a developer on a different team, making my first foray into another team that was working on Kubernetes at Rookout. But I guess my tip to my earlier self is not to rush into the first solution that comes to mind, but to take a moment to explain, even if it's just to yourself, what the root cause is of the problem or the thing you're trying to do. Sometimes it's a technical set of circumstances you hadn't considered coming together to create some sort of systemic problem, and if you don't fix that root cause, you're fixing a symptom, and there's going to be another symptom after that if you didn't fix the right problem. And sometimes the root cause is actually a people problem, and the solution is in addressing that. I mean, you can fix the symptom, right? But if it's really a people problem, then you've got to fix it for the people. Sometimes the solution isn't code, it could be an organizational solution, but the real kicker for me is when the solution to a people problem is a technical solution. It's great when you can change things so it's easier for everybody and makes more sense to everybody; that's when you can really solve a people problem. If you make the trivial choice the right choice, then you've arrived at a solution to a people problem. I always like to say, if you're trying to do something technical and you feel yourself struggling and searching for a way to do it, then you might not be doing it the way the author intended. But if it feels smooth and easy, then you've hit the jackpot: you're doing the use case that the API or library or whatever you're using was built to handle. And some people problems can be solved with really trivial technical solutions. Take, for example, Python, where the indentation, whether you use spaces or tabs and how many of them you have, is significant to the syntax. If you don't have a standard for what spacing you use in your organization, occasionally you're going to run into actual problems. For example, a merge conflict can end up making some parts of the file have tabs and other parts have spaces, and that can actually break the code from parsing. You can fix that with a people solution, you know, by having a code conventions document, deciding there are going to be only spaces and each tab is four spaces, and educating people that this is what you need to do. But occasionally people are still going to argue about it, or somebody is going to forget to configure their IDE, and it's going to create friction. If you fix it with a technical solution, like a linter that just makes it so, then most people really don't care all that much about whether to use tabs or spaces. They may have opinions if asked, but most engineers would be happy to just get their job done and never have merge conflicts about tabs. Never. So I think that's a good example of a small technical problem that can turn into a big people problem that constantly wears you down, where you have people in code reviews arguing about whether to use spaces or tabs, which is not what anyone wants to be spending their time on. It's a bit of a silly example, but I think it's illustrative.

Bart: Not at all. And being in the position that you have as a CTO, you have to think about what potential problems will arise based on the choice of one particular technology or another, and how you can avoid these sources of attrition that can really wear things down and sometimes be a make-or-break for whether people want to stay at a company. I don't think those issues should be underestimated. It's not that you have to be paranoid about this stuff all the time, but you have to know that the people problems can be as big as or bigger than the technical problems, and so in terms of the solutions available to you and how you go about it, that's definitely something to keep in mind. I also like what you said earlier: rushing into a solution, for whatever reason, is probably not the best idea. Consider how you can leverage people-based solutions, leverage a community, ask questions, go out there and see what people's experience has been like. When you brought it back to the beginning, about HAProxy versus Nginx, you had a particular experience that you can share with others: hey, this is what I went through. It doesn't mean it's going to be the same for everybody else, but active listening can't be recommended enough. Now, in terms of the article that brings us here today, you wrote a post about how network policies are not the right abstraction. I want to know a little bit about why you decided to write it. What got you to the point where you said: you know what, I've got to write this. It's not just a thought, it's not just a conversation, I'm going to put this into an article. What was the post about?

Ori: So my post is about how you might use network policies for zero trust or network segmentation, or adjacent concepts, and why that's hard to do with the API that network policies present, to the point of being almost impossible for all but the smallest organizations. What got me to write it is that I got to a point in my understanding of how people use network policies where I'd seen repeatedly that people who try to apply network policies for zero trust end up solving the same set of challenges, and they do it the hard way. They try to use them as prescribed, essentially, hit a bunch of snags, and then develop solutions to those problems. Or in some cases they even give up, because it turns out to be a people problem that's too big for that person to solve in that organization. And I realized that I had a unique viewpoint, from looking at many of those cases, that I could share with people and say: hey, this is what I've seen. I was also fortunate to have a great blog post by Monzo, a UK bank, that I could link to, which I think explains a lot of those problems really well. So I thought it would be useful to present it in a somewhat abstract manner: not saying, you know, this is how we use network policies, but rather, these are the problems and this is what an abstract solution would look like, to help people reason about what you need to be successful when you're trying to achieve zero trust.

Bart: Once upon a time, I tweeted asking, what's the hardest thing about learning Kubernetes? And a lot of people responded that networking was the hardest thing. I think that by far is the one that stood out the most. Can you just walk me through really quickly what are network policies and how did you become so passionate about something that a lot of people find really, really hard?

Ori: So let me take the last part of that first. I guess what I'm passionate about isn't network policies, but the fact that people find them hard. And I really think that if something's hard, it has to be changed. Going back to what I said: if you're using the API the way the author intended, things should be easy. If it's hard, it just means the API and the actual use case are a mismatch. That's why it's hard. Anyway, so network policies, what are they? The way I like to think about network policies is that they're sort of the equivalent of firewall rules, you know, legacy firewall rules, big Check Point firewalls, iptables rules from the old world, except in Kubernetes. They give you a way to control network access that maps almost one-to-one to firewall rules: they specify which traffic is allowed and block all the rest. So normally, when you have a pod that runs in Kubernetes and no network policies apply to it, all traffic is allowed. But once there's even just one network policy that applies to it, that is, one that explicitly allows access to some sort of destination, all other traffic that isn't explicitly allowed gets blocked. And yes, what you're inferring is correct: you can only allow traffic with network policies, you cannot deny traffic. So in order to deny traffic, you need to allow something else. And finally, there's one extra bit of complexity. Network policies can refer either to ingress traffic, that is, traffic incoming to a server that is listening for connections, or to egress traffic, that is, outgoing traffic from a client. So if my service is connecting to the internet or to another service in the cluster, that is egress. Restricting ingress, incoming traffic, is useful for zero trust or network segmentation, which is making sure that only intended incoming traffic to a server is allowed. And restricting egress traffic is useful if you want to apply a policy like: pods in this cluster cannot connect to the internet, or they can only connect to a specific set of third-party providers, or only specific pods can connect to the internet, which is a policy that companies often have.
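
As a rough illustration of the egress use case Ori mentions (all names, namespaces, and addresses here are made up), the policy below selects every pod in a namespace; once it exists, those pods can only reach the destinations explicitly allowed, and there is no syntax for writing an explicit "deny" rule.

```yaml
# Hypothetical example: once this policy applies to the pods, only the
# egress traffic allowed below is permitted; all other outgoing traffic
# from those pods is blocked. Network policies can only allow, never deny.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
  namespace: team-a                  # illustrative namespace
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:                            # allow in-cluster DNS lookups
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                            # allow one approved third-party range
        - ipBlock:
            cidr: 203.0.113.0/24     # illustrative CIDR
```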

Bart: All that sounds good. In the title itself, you say that they are the wrong abstraction. Why is that by comparison? What's the right abstraction?

Ori: I think it's the wrong abstraction because essentially it's at the wrong level of the stack. Network policy is not a bad API; it's just not really intended for zero trust. And if you try to do zero trust with it, you're trying to do it with a low-level API, and that's hard. Why is it hard? Let's say I want to secure communication between two pods. In order to protect the server pod and make sure only access from intended pods is allowed, I have to place an ingress network policy on the server. My use case is: I am the client, and I want to successfully connect to that server. But I have to go and apply an ingress network policy on the server, and also label the server pod and the client pod so that the network policy can refer to those pods. Essentially, my network policy will then say: the server pod with this label is going to allow access from the client pod with this label. So I've now had to configure a network policy for the server, label the server pod itself, and also label the client pod. That's three resources. This can be challenging to do. I think usually it's very challenging, because you need to coordinate the client and server teams: they need to agree on what the labels are, and they have to deploy in the right order. You have to deploy the client first, so it has the label before you deploy the network policy. If you don't do it in that order, the client in the previous version, before the label, is going to get blocked, and you're going to get production downtime. And if you end up having to roll one of them back, you'd better coordinate, because again, if you do it in the wrong order, you're going to end up with downtime. So I think that's a really high bar just to secure communication between two services, and it gets worse as you add more services: you're effectively adding more things to coordinate and more people in the loop.
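
For illustration, here is a sketch of the three pieces Ori describes, with made-up names and labels: the server team's pod, the client team's pod, and the ingress policy on the server that ties the two labels together. The client pod must carry the agreed label before the policy is applied, or its traffic gets blocked.

```yaml
# 1) The server pod, owned by the server team, carries a label...
apiVersion: v1
kind: Pod
metadata:
  name: ledger
  labels:
    app: ledger                  # label the policy will select
spec:
  containers:
    - name: ledger
      image: registry.example.com/ledger:1.0      # illustrative image
---
# 2) ...the client pod, owned by a different team, carries another label...
apiVersion: v1
kind: Pod
metadata:
  name: checkout
  labels:
    app: checkout                # label the policy will refer to
spec:
  containers:
    - name: checkout
      image: registry.example.com/checkout:1.0    # illustrative image
---
# 3) ...and the ingress NetworkPolicy on the server references both labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ledger-allow-checkout
spec:
  podSelector:
    matchLabels:
      app: ledger                # the server being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout      # the only client allowed in
```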

Bart: That being said, you did mention previously the... The issue around technical versus human problems and communication is often a major problem in different organizations. And it looks like the network policies, as you're talking about them, can be owned by multiple teams. So in order for that to get better, facilitating better communication between those teams, wouldn't that be the answer? I know that sounds maybe oversimplified, like just talk to each other. But how can you encourage that to avoid the kind of friction that might be there without that communication?

Ori: I think "just talk to each other" is a solution and it can work, but I think it's better to have a solution that... let me just switch for a moment to something else. You know, when you're deploying your service, you don't have to go to the ops team and tell them: hey, you're managing the infrastructure that's running these Kubernetes pods, can you please spin up a pod for me? Make sure to have the right infrastructure so I can spin up a pod. You just create a pod resource, or a deployment resource, which creates pod resources, and your pods spin up, because there's an API for that. You didn't have to go talk to anybody. And that's really aligned with what the organization is trying to do, right? Organizations today with a DevOps approach, and a lot of organizations are even investing in platform groups, their entire goal is to increase velocity and keep a good level of security by enabling teams, or even individual engineers, to move independently of each other. But what network policies create is a dependency between teams, even though every other part of the process, the pull request, the code review, the actual deployment of the service, is independent: the team can progress on its own. In this case, having to create a network policy on the server and having to label the pods is just an implementation detail. It's not the thing the different teams are trying to achieve. What they're trying to achieve is getting access securely. They don't care what the labels are, or what the network policy looks like exactly. But if you do make that something they need to coordinate, I promise, just like Python and the indentation, even though they don't really care about it, they're going to be bike-shedding endlessly about it: no, we use different conventions for pod labels, and we can't have that additional label so it works with your network policy, because we already have a label that names our service, so we want to use that one. Then you take that, which is already a complicated inter-team situation, and you add on top of it the inability to look at a network policy and say whether it's actually going to work, and the risk of making a mistake, which means production going down. That tends to turn things into a game of hot potato, where people either try to pass the problem over to somebody else or do really strange things, all to avoid changing the network policies: taking a service that can already connect to the right destination and bolting a weird third arm onto it so they don't have to touch the policy. Or you get really stringent and slow review processes that can grind things to a halt, which is also crap, and nobody wants that: the company doesn't, the devs don't, the platform team doesn't.

Bart: With that in mind, you know, thinking about the stakeholders, nobody wants to feel stupid. Nobody wants to feel like there's a, you know, that they're, that something's going completely over their heads. However, the fact that network policies can be difficult or seem frightening even to some people might exacerbate that issue around wanting to communicate or wanting to go further. I think the flip side, though, is it can be seen as an opportunity to help people level up. What's been your experience there in terms of like, okay, this is a learning opportunity so we can bring more stakeholders into the conversation, help them feel more confident and comfortable. What have been strategies or techniques that you've used there to make that situation improve rather than get worse?

Ori: So what I've seen happening is that in big organizations, often a large part of the engineering organization won't actually be writing Kubernetes YAMLs on their own, because it's perceived as complicated. So leveling up, you know, becoming a team that works more closely with the platform group, mostly means that you write Kubernetes configurations. But that's one side of it. Trying to write network policies is incredibly challenging: one person can look at a deployment spec and see, this is going to create this pod with this container, it's going to work. You can't get that with network policies. They're on a whole other level of complexity. So I think leveling people up would be easier with a resource that is more high-level, and what we've seen people do is actually build that on their own. They write a sort of abstraction that eventually creates network policies; some even do it with Terraform or a custom CI step to give the engineering teams more control. But they've essentially recognized that with raw network policies it's not going to work. It's not just that people are scared of Kubernetes configurations: with network policies you also need to know what your pods look like, what everybody else's pods related to the policy look like, the policy itself, and how it all gets deployed. That's a lot of information even for the platform team working on this, let alone somebody whose day-to-day is mostly writing code, where things get rolled out without hands-on activity most of the time. So yeah, I think any solution along the lines of "let's just let the dev teams write network policies" tends to fail, because the bar is just so high. And because people think about other Kubernetes configurations, they think, well, they'll just figure it out, they'll level up, they'll start writing Kubernetes configurations. But there are so many levels that aren't visible at the surface, and you have to get burned.

Bart: Sometimes you just got to get burned. But in terms of these complexities, one case study that you referenced is from Monzo. Can you talk about that more in detail and what you found to be most relevant in terms of that particular case study that highlights these problems around network policies?

Ori: Right. So Monzo is a good example. I mentioned them a bit earlier, but for those who don't know, I'll say a bit more about them. Monzo is a UK digital bank running on Kubernetes. At the time that they wrote the blog post, I mentioned they had 1,500 microservices. And the blog post was, the really excellent blog post was written by Jack Lehman, who was then the lead for the network isolation project as part of the platform and security team there. So their initial goal was to, uh, to secure the ledger service, which at a bank is the service that essentially has the API for transferring money. So you can make calls to it and say, transfer money from this account to that account. which I think it's obvious why they wanted to secure that. But the greater goal was to apply zero trust to everything. And I think they learned a lot of the hard lessons, a lot of these lessons the hard way, by trying something and seeing what doesn't work for their engineering teams. So they found that they needed a safe way to test whether network policies were going to work. And they didn't have one. And along the way, they tried many different solutions to get to that, some of which were infeasible. For example, they were using Calico as the network policy engine. And how Calico works is it creates IP tables rules on each node. So they try to log all... all traffic using IP tables, but that gets extremely excessive because it's basically on a per-packet basis. So you can't really do that. But they found a way to... And they had to build a bunch of custom stuff, basically. They found a way to filter that down to just what would have been blocked without actually enabling the network policies and then generating the correct... the correct access graph, I guess you could say, based off of that. So they found that rolling back services was really risky. So if you had a server network policy that allowed a bunch of different clients to access it, that was great. But if there was an unrelated problem on the server, and for some other reason they needed to roll it back, you could end up rolling back the network policy, blocking the clients, even though... the server could have still served the request, right? Because when the client was added to the network policy to allow access, say when a new service needed to call the ledger to transfer money, it's not because the transfer money API changed that day. There was just a new client. So there was no change to the server. So instead of the access controls being versioned with the client, they were versioned with the server, which made it really hard to know, like, can you roll something back? And during a production incident, you don't wanna be thinking, if I roll this change back, which is causing an incident, am I gonna be causing another incident because of network policies, which are all determined at runtime? That's so hard. So what they ended up with is a custom configuration format, which allows each client service to say, I need to access this service, that service, and so on. And then they compiled that into Kubernetes network policies at deploy time. And another advantage they had is because their tech stack is very homogeneous. So they have just one RPC library. So they were able to write a tool that passed all of their code and determined which services call which other services. And they had some exceptions. which they figured out using the IP tables trick I talked about earlier. But this entire thing I'm talking about now, just, you know... 
Starting in the rollout, getting to a point where something works and it scales, not in terms of performance only, which was also a challenge when they had a ton of network policies that was really low performance. So that's another technical challenge they needed to solve. But I meant scales in the organization. They had 500 engineers that needed to be able to deal with this without constantly going to the much more undersized platform team for help. Otherwise, it would have slowed everything down. So the challenge here is like, how do you get zero trust without making everything crash and burn or making everything go super slow? And I think it's a really good case study of what a successful story looks like and how much effort goes into it from the platform teams.

Bart: Like you said, you don't want to destroy everything in the name of security or bringing everything to a halt because of things like zero trust. And it sounds like overall with network policies, there can be a lot of obstacles or downsides or disadvantages. Ideally for you, what should network policies look like? What's your proposed solution?

Ori: So I wouldn't necessarily change network policies, because they're trying to be a low-level API and they're fine at that. It's not a bad API; it's just not for this use case. An ideal solution would, like in the Monzo example, allow clients to declare what they need: I need to access this service, or I intend to access this service. And this declaration has to be independent of other services, just the same way other declarations are: when I spin up a pod, I don't care if there are other pods, that's the control plane's problem. This means that one developer could write this declaration without having to coordinate with other teams or create multiple pull requests to other repositories and coordinate deploys. They just need to manage the one service they are changing now. One person, one service, one declaration: that's what this achieves. The second thing is that it must be possible to statically analyze the declaration of which access this service needs, without relying on any runtime information. When you look at a single service, a single developer looking at a single file must be able to determine, without any tooling, that the declared intentions are going to work in practice. If I look at a configuration file and it says, I am going to call the ledger service, I have to be able to deduce just from reading that file that I will be able to call the ledger service. I shouldn't have to go to another repository, or think: if it gets deployed in production it's going to look this way, and if it gets deployed in staging it's going to look that way. It should literally say "ledger service". And all of that can only work if services are referred to by a sort of universal identity: the identity you're thinking about when you're coding or designing the service, not at runtime. If you walk up to an engineer and ask which service they're working on, they're going to say "I'm working on the ledger service", and not "I'm working on ledger service-A2DF running on cluster pod-10 in namespace so-and-so", because you're thinking in terms of design time, not runtime, when you're developing the service. And that's also when you're writing the configuration, so the identity you use to declare the access has to be that design-time identity; the universal identity has to be like that. So what I'm envisioning is a file that says: I am the checkout service, and I am going to be calling the ledger service. And as a single developer, it's possible for me to read that and know that my checkout service, when deployed, is going to have access. Because that's all I care about as a developer. The organization cares about zero trust; I care about my service functioning.
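
A sketch of the kind of declaration Ori is envisioning, using the checkout and ledger services from his example. The shape loosely follows Otterize's ClientIntents resource, but the API version and field names here are illustrative rather than an exact schema; the point is that the client team owns this one file, and it only talks about what the client calls.

```yaml
# Illustrative client-side declaration: "I am the checkout service and I am
# going to call the ledger service." One developer, one service, one file;
# a tool can compile this into network policies or other access controls.
apiVersion: k8s.otterize.com/v1alpha3   # illustrative API version
kind: ClientIntents
metadata:
  name: checkout
spec:
  service:
    name: checkout          # the design-time identity of the client
  calls:
    - name: ledger          # the design-time identity of the server it calls
```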

Bart: Now, in terms of this solution being utilized, you did mention Calico previously, but are there other projects where we can see examples of this in action?

Ori: We haven't yet talked about Otterize at length. So there's the Otterize intents operator, which operates on client intents resources, which are essentially what I've described: you say, I am this service and I'm going to be calling these other services, and this is then converted into network policies or other kinds of access controls that you might have, even Istio authorization policies or AWS IAM policies. Then there's also the Otterize network mapper, which comes bundled with it and can auto-generate these client intents based on traffic. This is another thing that Monzo had to build. And when you pair them together, they can simulate the rollout for you, so they can tell you: you have all the declarations you need for zero trust, and it's safe to now block everything else, which is really hard to tell with network policies alone. There are other projects too. You might also consider Cilium or Calico network policies, or even Istio authorization policies, to achieve network segmentation and zero trust, and they bring additional capabilities that native Kubernetes network policies don't have. Native network policies are very basic: they allow you to restrict traffic on a pod-to-pod, IP, or port basis. A capability that Calico network policies bring is referring to Kubernetes services directly, by name, as opposed to referring to pods. So instead of saying "I am referring to the pod selector with these labels", you can say "here's a Kubernetes service name", which is closer to what you actually care about as a developer: you're not connecting to individual pods or their selector, you're not resolving which pods those are in your code, you're usually connecting to a DNS name that belongs to a Kubernetes service. Something else these other projects bring is support for enforcing access at the cluster level or at layer seven. While classic network policies only work pod-to-pod or at the port level, so layer four, these projects can take into account which resources, such as HTTP paths and methods or Kafka topics, are part of the access you're trying to allow. But I think there's an important distinction between these projects and Otterize. Cilium, Calico, and Istio all do other things besides zero trust and access controls. For example, Istio is a service mesh; it does a ton of other things. The way I think about Calico network policies, Cilium network policies, and Istio authorization policies is that they are an API to control the enforcement mechanism that each of these projects brings. Whereas Otterize does not have an enforcement mechanism, right? We're not a container networking layer, we don't replace Calico or Cilium, and we're not a service mesh, we don't replace Istio. We configure those policies for you. So we actually support Cilium and Istio and Calico, and configure them based on the client intents. The way to think about Otterize, as opposed to those, is that Otterize is a platform tool that enables automated and self-service rollout of zero trust using your existing platform. It's not a replacement; we don't want you to replace any of them. You probably have them because you have some other requirement that they're answering, but they can be unwieldy to use, which is, again, I think fine, because network policies and authorization policies achieve a bunch of other things and not just zero trust. So it's fine that they're not specialized for this use case.
And they often come with different tooling for rolling out zero trust. So the tooling for mapping your network, seeing what is going to get blocked, and simulating whether everything is going to be fine once you enable enforcement is not always as good as it should be, which again, I think is fine, because they try to do a lot of other stuff. Now, it's important to say I'm kind of bundling them together and generalizing as a result. So I want to say there are tools to make it easier to roll out these different kinds of policies, but there is a very different goal in mind. Otterize looks at how organizations work with access controls and brings together the management of multiple kinds of access control. We don't think just about an Istio service mesh or Calico network policies or even AWS IAM policies; we recognize that organizations may use all of those at once, because they have different needs for different teams and different products. What Istio or Calico or Cilium would have you do is make them the one true way for network communication and security: they want you to create a multi-cluster service mesh and so on, and in many cases that may be technically infeasible or just impossible for your organization. If you're a big bank that has bought a hundred other smaller banks, it's going to be exceedingly difficult to integrate all of their networks together into one big happy service mesh that's cross-cluster and cross-technology and cross-stack. So we're saying: you have your stack, use your stack. We're just going to make it easier with client intents, and we don't want to replace any of that. I think that approach is right for what Calico, Cilium, and Istio are trying to do, because they do all of these other things. But for Otterize it's not right, because we're trying to make it easier to work with what you have. So we don't attempt to tell you how to enforce; we just make it easier. Sorry, that was a bit of a long one, but I've been asked that question a lot of times.
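
To illustrate the layer-seven capabilities Ori mentions above, here is a hedged sketch of an Istio AuthorizationPolicy that only lets a checkout workload POST to a specific path on a ledger workload. Names, namespace, service account, and path are made up; the general shape follows Istio's documented resource.

```yaml
# Hypothetical layer-7 rule: only requests from the checkout service account
# may call POST /transfers on pods labelled app: ledger. The method and path
# matching is what layer 7 adds beyond plain pod/port rules.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ledger-allow-checkout
  namespace: payments                 # illustrative namespace
spec:
  selector:
    matchLabels:
      app: ledger
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/payments/sa/checkout   # illustrative identity
      to:
        - operation:
            methods: ["POST"]
            paths: ["/transfers"]
```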

Bart: So you thought about a lot and put together a very thorough, in-depth answer. Those changes that you just shared, should those eventually make it back to network policies in the Kubernetes core?

Ori: So I guess there are two sides to this. I think it would be wrong for the native network policies to change significantly. There is value in keeping backwards compatibility and all the capabilities that are closer to firewall rules, which have their users, users who are not doing zero trust, and that's fine. But I do think there is room for another API, a higher-level one, even if what it does is build on top of different kinds of network policies and mechanisms. And I actually hope to bring client intents, not necessarily this specific resource, but the approach behind client intents, intent-based access control, into upstream Kubernetes. We're actually exploring this direction as a Cloud Native Computing Foundation sandbox project and with partners at Microsoft Azure. So yes, I don't think we're there yet in terms of usability; we still have a bit of a way to go. But I do think the concept behind intent-based access control is a lot closer to how people actually work than network policies are. I mean, you know, this is KubeFM, so we're talking about network policies, but the same problems exist for many different kinds of policies in the world, so I think the concept is greater than that. And my vision, Otterize's vision, is for IBAC to be the way people do authorization for their backends, and not just for network policies. It should be as easy to get access to different cloud native resources, and even cloud-based resources, as it is to spin up a pod. And right now it honestly kind of sucks. There's a lot of different stuff, and it sucks; nobody wants to deal with it, and everybody has to deal with it on some level.

Bart: And it's good, because like you said previously, if something's that hard, it should be simplified. There's got to be a way to make this easier. So I think it's a noble fight, one worth fighting. In terms of the reaction people have had after you published this article, what's the response been like?

Ori: So there were the people who inspired it, who I knew were going to react very positively. But honestly, I was kind of surprised by the amount of reactions we got and even the way they were delivered. There were people who walked up to our booth at KubeCon EU in May just to say that they read the blog post. And there's one really hardcore user that I can try to quote off the top of my head. The thing that we nailed is that the direction in network policies is inverted. It's so much easier to say "I am going to call these services" than, as a server, "here's the list of all the services that will call me", because you only know what you know. How would you know that another team is now going to call this server? It's a lot harder. And I think that really boils down the difference in approach between IBAC and more classic policies, which often take the server's point of view, and why I think this has potential as an open standard for authorization, just conceptually.

Bart: Like you said, you're certainly getting people's attention, to the point that they're coming up to your booth at KubeCon to talk to you about it directly. One of the things you mentioned earlier in the podcast was technical versus people-based problems, and at the beginning of this year you gave a talk about people-oriented programming, the idea that we should help people do the right thing when it comes to writing code. It seems, if I'm right here, that this also plays a role in how you're approaching network policies. You seem very motivated to help people, guiding them into doing the right thing. Is this just in the technical world, or is this also something you do in your life in general?

Ori: I try to apply that approach everywhere, but I think "the right thing" is very highly context dependent. With network policies, I feel like I have a good grasp of what I perceive to be the right thing, which I parse as: what is easier for most people to do and most effective in achieving their goals. Applied more broadly, it's not as easy as in the tech world, where things are well-defined by APIs. But yeah, I have learned to think about the world in terms of systems thinking, and I guess the funny way to think about that is the butterfly effect. People will often take the path of least resistance, the easier choice, when they're presented with one, and it's often because they don't know at the time what the consequences of that are. So with network policies, saying "whatever, let's just let the dev teams do it" results from honestly thinking that this is the right thing to do in your situation, but that often has repercussions later down the road. It's possible to try to teach people about the effects of doing so; my blog post about network policies is essentially a way to say, hey, network policies are quite hard and you should think about them that way. But that's like the linter example from earlier with the spaces and tabs in Python: it's like telling people, hey, why don't you stick to just spaces, because who cares? But people are going to care and are going to have opinions, and it's not so simple. I think what's better is to give them a tool that they perceive as the path of least resistance and that also solves those problems, so they either never encounter those problems or they encounter them and think, "oh wow, OK, I didn't realize that; here's a use case that works for me, and if I had done it with network policies it would probably have been hard." Maybe to take a real-life example: if it's easy for people to recycle, because you put recycling bins strategically next to vending machines and that sort of thing, then they will recycle. But if you make them go to a special place to deposit plastic bottles, then fewer people are going to do it, only the people who are motivated or have been educated well enough. You don't have to be right, you just have to make things easier. Everybody wants to recycle, everybody wants to be a good person and do the right thing, but it's just not always so easy; people have busy lives. So it's on the people who decide where to put the bins to make it easier to recycle.

Bart: No, no, it's a good answer. I think, you know, anticipating that resistance, anticipating those difficulties, understanding motivations, you know, the line from John Lennon that, you know, life is what happens when you're busy making other plans. And so as much as we might want to be doing one thing, if something's getting in our way, then that idea might go out the window. It sounds like you're thinking a lot about a lot of different things. So I want to know, what's next for you? Do you imagine writing another article that might get enough attention for people to come introduce themselves at your stand at KubeCon? What's next for you?

Ori: You know, I'm hopeful, because I think this article actually helped a bunch of people. I think it helped them reason about how to do things and how to make life easier for them, which is worth it all on its own. I'm going to continue on my mission as part of Otterize to make access control really easy, because it doesn't have to suck; it only sucks because it's so fragmented. If you look at Android or iOS app permissions, there is one way to specify permissions, and you say: I, as the app, need to access these things. And because it's so standardized and declarative, the user gets a nice pop-up that says, the app wants to do this thing, do you allow it? And as a developer, you know that it's going to work: if they press yes, your app is going to work. There's no reason we shouldn't have the same experience in cloud native. When I deploy an app and I'm accessing services on AWS, my database, other services in Kubernetes, and services in another cluster, with a thousand different mechanisms and OpenID Connect, there's no reason I shouldn't be able to say "I want to access this thing" and know that it's going to work. And the world today, with security breaches every couple of days and lots of identity theft and awful things, really needs it. There's no reason for it to be impossibly hard. We sort of just accept that this is how things are, but it doesn't have to be that way. It's not that way with Android, even though they have a thousand different resources you can access, the GPS, fine-grained location, and so on. Yeah.

Bart: So you're busy. That's right. And we can expect more. I think it's helpful, and it's good to see that by giving to the community, you get a lot back. And to also know that this is how we found out about you: because of this article that you wrote. It's also really good to know that someone else who was previously on our podcast, an amazing guest, Adriana, is going to have you on her podcast, Geeking Out, which I highly recommend to everyone. So I'll be looking forward to hearing that when it comes out. For people that want to get in touch with you, whether it's love or hate regarding your position on network policies, what's the best way to do it?

Ori: Probably on LinkedIn or on the Otterize community Slack, which is a good way to ensure I get a phone notification. Both love and hate, or just any request or advice, are welcome. Honestly, I enjoy the hate a little bit more because it tends to be more colorful.

Bart: Keeps things spicy.

Ori: Yeah. Like my love for hamburgers, right? And you can also reach me via email at ori@otterize.com. But honestly, I'll be quicker to reply on LinkedIn or Slack, because I get a ton of emails, so I may not get notified for every one of them. But I'm happy to just talk about anything, and especially if I can help anyone, not just with access control or zero trust, even though I'm very passionate about those. I'm happy to help with other stuff.

Bart: Like hamburgers. Yeah, I like hamburgers. And I will be checking out the Red Alert resources that you shared, as a die-hard Command & Conquer Red Alert fan, and I'll be asking you for help in case I need it. That being said, Ori, it was a wonderful conversation. I look forward to hearing future podcasts of yours, as well as checking out new articles about this topic that you're so passionate about. The topic I feel you're really passionate about is helping people, making their lives easier, reducing stress and friction with things that shouldn't have to be that hard, and I really appreciate your effort in doing so. So thank you.

Ori: Thank you, Bart. Thank you for having me.

Bart: Pleasure. Take care.

Ori: Bye-bye.
