Bart Farrell: Who are you? What's your role? And where do you work?
Elamaran Shanmugam: Okay, my name is Elamaran Shanmugam. I go by Ela. I'm a Senior Specialist Solutions Architect for the Containers domain, specifically Kubernetes, supporting global financial services customers at AWS. And when I say global financial services customers, I mean the large banks, insurance companies, capital markets, and payment processors like Visa, MasterCard, and Amex. I'm a public speaker, and I'm also a blogger. And I recently published a book on real-world ML systems on Kubernetes that is available on Amazon Kindle.
Bart Farrell: And what are three emerging Kubernetes tools that you're keeping an eye on?
Elamaran Shanmugam: Obviously, there are multiple tools in the CNCF ecosystem that are emerging, but there are three that I really like. The first is Kagent. Kagent is basically a declarative way to create agents on Kubernetes, making agents first-class citizens on Kubernetes. The second tool is Agent Gateway. Think of Agent Gateway as an API gateway for agents, helping us secure agent-to-agent and agent-to-tool communication through protocols like MCP and A2A. The third tool is Dapr, a CNCF graduated project. It's a pretty famous project. Think of Dapr as tooling for microservices, with use cases like state management and so on. With AI agents, it becomes a building block because of use cases like memory management and securing agent communication.
Bart Farrell: As Kubernetes environments scale, what are the biggest blind spots platform teams face when trying to understand network traffic between services?
Elamaran Shanmugam: Obviously, I can talk about a few blind spots, but the biggest one I can start with is infrastructure awareness versus Kubernetes context awareness. When I talk about infrastructure awareness: organizations use tools like VPC Flow Logs for network monitoring, which give them information about IPs and so on. But when we talk about Kubernetes, it's about pods, namespaces, services, and so on. So it is very difficult to bridge the gap between what you see at the infrastructure level and at the Kubernetes level. The second important gap is east-west traffic. As the number of pods explodes, you're going to have hundreds or thousands of pods running in a single Kubernetes cluster, and how those pods communicate, that east-west traffic, is another big blind spot. The third blind spot is external traffic: say a pod is trying to communicate with an external service or an AWS service; that level of traceability is not there today. And that is where the container network observability feature we launched at re:Invent for Amazon EKS is going to be a killer feature, because the moment you start using it, you get a service map in your console, which really clears up these blind spots. When you're running hundreds or thousands of pods, the east-west traffic is clearly visible in the service map for your Kubernetes cluster. Likewise, if your pod is communicating with an external service or an AWS service, you can see that trace. And most importantly, it closes the gap between the infrastructure context and the Kubernetes context. That is why I highly recommend using container network observability on Amazon EKS to solve these blind spots.
Bart Farrell: When an application team reports intermittent timeouts in Kubernetes, how do platform engineers typically determine whether the issue is the application, the cluster network, or an external dependency?
Elamaran Shanmugam: Obviously, it's a very painful problem, and it's very difficult to isolate where the error is. App teams are going to say it's a networking issue. The networking team is going to say it's an app issue. And the platform team sitting in between, trying to prove or disprove each side, is going to have a really challenging time. Generally speaking, here is how organizations deal with these issues today: they start with application-level metrics, look at the errors and timeouts, and try to figure out whether it's an application-level problem. Then they go into infrastructure metrics like memory and CPU and drill down into whether it's an infrastructure problem. And then they go into networking and so on. But remember, by the time you actually troubleshoot the issue at the pod level, the pod might be gone, because pods are ephemeral, which means you may ultimately not be able to troubleshoot the issue at all. With the enhanced metrics provided by the latest container network observability feature in Amazon EKS, you get flow-level metrics. You can see retransmissions, you can see retransmission timeouts, and you can clearly pinpoint that it's a networking issue and go to the networking team and say, hey guys, this is a networking issue, because retransmissions or retransmission timeouts are happening. These metrics can also be exported to Prometheus and visualized in Grafana. And then we also publish system metrics. When I say system metrics: at the EC2 ENI level, it is generally very difficult to pinpoint a problem, because we don't know the retransmissions or the bandwidth being used at the ENI level. With system metrics, you can pinpoint it at that level and solve, in a matter of minutes, a problem that has been there for hours.
So that is why it is very important to take a look at features like container network observability on Amazon EKS to really solve these kinds of problems that get pinpointed as networking or app team issues.
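The triage logic described above can be sketched as a small decision function over flow-level metrics. The record fields and the threshold here are hypothetical illustrations, not the actual metric names or values EKS emits:

```python
# Minimal sketch: given flow-level metrics for a pod's connections, decide
# whether intermittent timeouts look like a network problem or an
# application problem. Field names and threshold are illustrative.

def classify_timeouts(flows, retransmit_ratio_threshold=0.02):
    """Return 'network' if retransmissions or timeouts dominate, else 'application'."""
    total_segments = sum(f["segments_sent"] for f in flows)
    retransmits = sum(f["retransmissions"] for f in flows)
    rto_events = sum(f["retransmission_timeouts"] for f in flows)
    if total_segments == 0:
        return "unknown"
    # A high retransmission ratio, or any hard retransmission timeouts,
    # points at the network path rather than the application code.
    if retransmits / total_segments > retransmit_ratio_threshold or rto_events > 0:
        return "network"
    return "application"

flows = [
    {"segments_sent": 10_000, "retransmissions": 450, "retransmission_timeouts": 3},
    {"segments_sent": 8_000, "retransmissions": 120, "retransmission_timeouts": 0},
]
print(classify_timeouts(flows))  # → network
```

In practice this kind of rule would live in an alerting query over the exported Prometheus metrics rather than in application code, but the decision shape is the same.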
Bart Farrell: In microservice architectures, how important is it for platform teams to visualize service-to-service communication inside a cluster?
Elamaran Shanmugam: Oh, it is absolutely important. It's not even a nice-to-have; it's a must-have. Here's the reason. Take a monolithic example: in a monolithic world, your code base is your call graph. Your code base has everything, right down to how each and every boundary of your application is called, which means your call graph is pretty clear from a monolithic application standpoint. The moment you move to microservices, each and every fragment runs as a separate microservice, which means the flow of service-to-service communication is your call graph. That brings in three problems. One is troubleshooting: if one service calls another service and there is a problem, how do you troubleshoot it? The second is scaling: if one microservice is being called many more times than the others, you need a mechanism to know which one it is so you can scale that microservice accordingly. The third important problem is security: you don't know whether a particular microservice should be allowed to call another microservice. Considering these three problems, it is very important to know the service map of the flows. And that is where the recent container network observability feature in Amazon EKS gives you a complete service map in your Amazon EKS console. With this feature, you can see whether a service can call another service and trace it out. You can understand which microservice is being called most often, so you can scale it accordingly. And you can troubleshoot a problem in service-to-service communication: where the timeout is and where the retransmissions are happening.
Obviously, these metrics are also available in the OpenMetrics format, so they can be pushed into Prometheus and visualized in Grafana. So as I said, visualizing service-to-service communication is very important for organizations, for exactly these reasons.
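The "flow of communication is your call graph" idea can be made concrete with a small sketch: treat every observed source-to-destination flow as an edge, then count inbound calls per service to spot the hot one that needs scaling. The flow tuples below are illustrative, not the actual EKS flow schema:

```python
# Sketch of deriving a service map from flow data: each observed
# source -> destination flow becomes an edge in the call graph, and
# inbound-call counts per service reveal scaling hot spots.
from collections import defaultdict

def build_call_graph(flows):
    """Return (edges, inbound): who calls whom, and inbound call counts."""
    edges = defaultdict(set)    # service -> set of services it calls
    inbound = defaultdict(int)  # service -> number of observed inbound calls
    for src, dst in flows:
        edges[src].add(dst)
        inbound[dst] += 1
    return edges, inbound

flows = [
    ("frontend", "cart"), ("frontend", "catalog"),
    ("cart", "payments"), ("checkout", "payments"), ("frontend", "payments"),
]
edges, inbound = build_call_graph(flows)
print(max(inbound, key=inbound.get))  # → payments
```

The same edge set also answers the security question: any edge in the observed graph that isn't in your intended call graph is a policy violation candidate.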
Bart Farrell: Many teams collect infrastructure metrics, but they still struggle to answer simple questions like which pods are the biggest talkers in the cluster. Why has this historically been hard in Kubernetes?
Elamaran Shanmugam: Obviously, it's an abstraction problem, because Kubernetes abstracts the infrastructure, which means the infrastructure and the applications are speaking two different languages. When you talk about infrastructure, it's all about IPs and ENIs. But on the Kubernetes side, you really haven't had a mechanism to see how pod-to-pod communication is happening and so on. And that is the real problem. With the latest container network observability feature, you get those enriched metrics, and you get a network flow monitor agent running on each and every node. That agent gives you the information about pod-to-pod communication: you will really understand which pods are being called, or reached, most frequently. With that, those enriched metrics land in your EKS environment to give you these details, and since they are OpenMetrics, you can visualize them in Grafana as well. So with these recent Amazon EKS features, you can really tell which pod is the biggest talker, and that is going to help platform teams troubleshoot those issues.
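Once the agent has done the hard part, attributing flow bytes back to pods instead of bare IPs, the "biggest talkers" question itself is a simple aggregation. A rough sketch, with made-up field names:

```python
# Sketch: rank pods by total bytes sent, given flow records that have
# already been correlated from IPs back to pod names (the correlation
# the per-node flow agent provides). Field names are illustrative.
from collections import Counter

def top_talkers(flow_records, n=3):
    """Return the n pods with the highest total bytes sent."""
    bytes_by_pod = Counter()
    for rec in flow_records:
        bytes_by_pod[rec["src_pod"]] += rec["bytes"]
    return bytes_by_pod.most_common(n)

records = [
    {"src_pod": "orders-7f9c", "bytes": 5_000_000},
    {"src_pod": "frontend-2b1a", "bytes": 1_200_000},
    {"src_pod": "orders-7f9c", "bytes": 3_000_000},
]
print(top_talkers(records, n=2))  # → [('orders-7f9c', 8000000), ('frontend-2b1a', 1200000)]
```

The historical difficulty is exactly that the `src_pod` column didn't exist: raw VPC Flow Logs only gave you IPs, and pod IPs are ephemeral.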
Bart Farrell: Kubernetes turned 10 years old a while back. What should we expect in the next 10 years to come?
Elamaran Shanmugam: Kubernetes spent its first 10 years running workloads and containers. I think in the next 10 years, Kubernetes will be running everything, especially AI. With the emergence of model inferencing and agentic AI, and all the tools I talked about like Kagent and Agent Gateway, Kubernetes is going to be the de facto standard for agentic AI and AI/ML workloads. And things like observability will no longer be nice-to-haves; they're going to be must-haves. So what I see in the next 10 years is Kubernetes becoming the place that runs everything: AI and ML workloads, traditional workloads, and so on.
Bart Farrell: What's next for you?
Elamaran Shanmugam: That's a great question. What is next for me is that I want to spend more time with platform teams, especially on platform engineering at the intersection of Kubernetes and AI. As the Kubernetes ecosystem matures, it is important that we learn from our customers, that I learn from the Kubernetes ecosystem, and that we build reusable patterns for running effective platform engineering and agent platforms on Kubernetes, to make customers' lives easier when running agent platforms and traditional workloads on Kubernetes. The ultimate goal here is very simple: as the CNCF ecosystem becomes pretty complex with lots and lots of tools, the idea is to make platforms easier for our developers to run agents or traditional workloads on, and to improve developer productivity. That is what I'm going to be focused on for the next few years.
Bart Farrell: And how can people get in touch with you?
Elamaran Shanmugam: Obviously, you can reach out to me on LinkedIn; I'm just a message away there. I'm also available here until Thursday, and I have a green tag, which means you can come and talk to me with no hesitation. I'm more than happy to talk.