The Double-Edged Sword of AI-Assisted Kubernetes Operations
Oct 21, 2025
Host:
- Bart Farrell
This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io
Mai Nishitani, Director of Enterprise Architecture at NTT Data and AWS Community Builder, demonstrates how Model Context Protocol (MCP) enables Claude to directly interact with Kubernetes clusters through natural language commands.
You will learn:
How MCP servers work and why they're significant for standardizing AI integration with DevOps tools, moving beyond custom integrations to a universal protocol
The practical capabilities and critical limitations of AI in Kubernetes operations
Why fundamental troubleshooting skills matter more than ever as AI abstractions can fail in unexpected ways, especially during crisis scenarios and complex system failures
How DevOps roles are evolving from manual administration toward strategic architecture and orchestration
Transcription
Bart: What happens when you hand an AI assistant the keys to your Kubernetes clusters? In this episode of KubeFM, Mai Nishitani, Director of Enterprise Architecture at NTT Data and AWS Community Builder, walks us through connecting Claude to Amazon EKS with the Model Context Protocol (MCP). We'll explore what AI can and can't do in Kubernetes operations, why troubleshooting fundamentals matter more than ever, and how DevOps roles are evolving toward strategic architecture and orchestration.
Special thanks to Testkube for sponsoring today's episode. Need to run tests in air-gapped environments? Testkube works completely offline with your private registries and restricted infrastructure. Whether you're in government, healthcare, or finance, you can orchestrate all your testing tools (performance, API, and browser tests) without any external dependencies. Certificate-based auth, private npm registries, enterprise OAuth—it's all supported. Your compliance requirements are finally met. Learn more at testkube.io.
Now, let's get into the episode. Great to have you with us, Mai. What are three emerging Kubernetes tools that you're keeping an eye on?
Mai: Thank you for having me on this podcast. Crossplane is not really an emerging tool, but I've seen many customers focus on it. Managing cloud infrastructure through Kubernetes APIs is really elegant, instead of looking at Terraform files and kubectl commands separately. You can define your entire stack as Kubernetes resources. Coming from the AWS world, if you wanted an Amazon RDS instance, you can apply a YAML manifest. It's like infrastructure as code, native to Kubernetes, and works well with GitOps.
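As a rough sketch of what that looks like (the exact API group and fields depend on which Crossplane AWS provider and version you install, so treat the names below as illustrative):

```bash
# Hypothetical Crossplane manifest: an RDS instance declared as a Kubernetes
# resource, following the shape of the classic provider-aws RDSInstance type.
kubectl apply -f - <<EOF
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: example-postgres
spec:
  forProvider:
    region: ap-southeast-2
    dbInstanceClass: db.t3.micro
    engine: postgres
    allocatedStorage: 20
    masterUsername: adminuser
    skipFinalSnapshotBeforeDeletion: true
  writeConnectionSecretToRef:   # Crossplane writes the DB endpoint/credentials here
    name: example-postgres-conn
    namespace: crossplane-system
EOF
```

Once applied, the database shows up in `kubectl get` like any other resource, which is what makes the GitOps story work.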
The second tooling I like is Argo CD. It plays nicely with services from AWS, including Amazon EKS. Think of it as a deployment tool that watches your git repos and automatically syncs whenever there are changes in your infrastructure's code back to your clusters. The beauty of Argo CD is that it flips the traditional CI/CD model on its head: instead of the CI system pushing changes to Kubernetes, Argo CD runs inside your cluster and pulls changes from git. It always matches what's in your repo, so you don't need to wait around.
The UI is also usable for Argo CD, which is rare for Kubernetes. You can see your entire application topology, roll back deployments with a single click, and debug synchronization issues without using many kubectl commands.
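For illustration, a minimal Argo CD Application that implements this pull model (the repo URL and paths are placeholders):

```bash
# A minimal Argo CD Application: watch a git repo and keep the cluster in sync.
kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app-manifests.git  # placeholder repo
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc   # deploy into the same cluster
    namespace: my-app
  syncPolicy:
    automated:
      prune: true     # remove resources that were deleted from git
      selfHeal: true  # revert manual drift back to the git state
EOF
```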
Lastly, not really a tool but more of a framework, SPIFFE/SPIRE has come up quite a bit recently. SPIFFE stands for Secure Production Identity Framework for Everyone, with SPIRE, the SPIFFE Runtime Environment, as its production-ready implementation. It's like having a passport for your workloads. Instead of traveling between countries, your services move between different clouds and clusters. Just as a passport proves who you are regardless of which country you're in, SPIFFE does the same for workloads.
With Kubernetes, service-to-service authentication has been pretty messy. You have service accounts for in-cluster stuff, IAM roles for AWS (like IRSA or Amazon EKS pod identity), and potentially custom JWT tokens for external APIs. It's like having different forms of ID for each border crossing. SPIRE is the implementation part—the "passport office" that issues and manages these identities. Having a passport system that works in every "country" so that your services can securely communicate is very useful.
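To make the passport analogy concrete, here is a sketch of issuing one with SPIRE's CLI, assuming a quickstart-style deployment in a spire namespace (the trust domain and workload names are placeholders):

```bash
# Register a workload with the SPIRE server: any pod running as service account
# "frontend" in namespace "default" gets the SPIFFE ID below as its identity,
# with no secrets baked into the pod itself.
kubectl exec -n spire spire-server-0 -- \
  /opt/spire/bin/spire-server entry create \
    -parentID spiffe://example.org/ns/spire/sa/spire-agent \
    -spiffeID spiffe://example.org/ns/default/sa/frontend \
    -selector k8s:ns:default \
    -selector k8s:sa:frontend
```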
Bart: And Mai, for people who don't know you, can you give us a quick introduction about who you are, what you do, and where you work?
Mai: I'm a Director of Enterprise Architecture at NTT Data, a former AWS Senior Solutions Architect, and a current AWS Community Builder in containers. I hold all 14 AWS certifications and am a Golden Jacket holder. Back at AWS, I used to look after 200-plus services. Now, I manage 200-plus vendors across cloud, networks, security, traditional data centers, hybrid cloud, data, and AI as a full-stack provider, which keeps me on my toes every day.
How did I get into cloud native? My AWS journey started 11 years ago, when I was a systems administrator looking after VMware. That's when my love affair with AWS started—and where I met my current romantic partner. He was an application vendor who hosted applications in AWS that were quite monolithic at the time.
Our CIO had just returned from a conference and was told that cloud was insecure, so he instructed everyone to shut down anything related to cloud. The application vendor came into a meeting, flustered, and said he needed to move everything from the cloud to on-premises. As the VMware administrator, I was tasked with setting up a LAMP stack. From there, I learned all about AWS and became curious about why the CIO was so scared of the platform. I decided to get certified with the Solutions Architect Associate certification, which progressed my career into cloud.
Bart: Very good. And what were you before cloud native?
Mai: Before cloud native, I was in infrastructure during the time when IBM was the number one largest tech company across the world. I was in tech support, where I learned all about troubleshooting. I progressed from tech support to systems administrator, desktop support, solutions architect, cloud engineer, and then enterprise architect. I spent quite some time in the infrastructure space.
What do I miss from back in the day? Tech is now a commodity. Everyone can be an expert using ChatGPT or other tools to educate themselves, which is a good thing. But on the flip side, I do miss being that anomaly, being that nerd. Back then, it was quite uncool to be techie. Now, we're the cool kids, which feels strange.
I've found that some tech concepts repeat, similar to fashion trends. Take jeans, for example: trends come and go. Right now you have baggy, wide-legged jeans, and soon enough, skinny jeans will probably be back in fashion. It's the same with tech. Now everyone's talking about agentic AI and AI everything, but if you rewind 11 years to the Gartner hype cycle of the day, natural-language question answering was peak technology. Now it's just rebadged as agentic AI and generative AI, which is easier for people to consume. It's the same concept, just rebranded.
One thing is constant: infrastructure runs these AI services. If you have infrastructure, there are going to be upgrades, as people in the audience know. With upgrades come deprecations and new features. Things are going to fail, so you're always going to need someone that looks after the infrastructure.
Bart: The Kubernetes ecosystem moves very quickly. How do you stay up to date? What resources work best for you?
Mai: I go to a lot of events, including some CNCF ones; I'll be at KubeCon in Sydney in mid-September. I've read many books, such as Gregor Hohpe's "Platform Strategy", which will be relevant for this audience. He dives into what a typical enterprise platform strategy might look like and how to build it out.
Being part of a community really helps. I'm part of the AWS Community Builders Program and have attended AWS Community Day. Shout out to Alan, Dimitro, Stephen, and Lucy, and the AWS User Group. Definitely check out a user group near you.
Podcasts like this one, and conversations with DevOps engineers, are also valuable. Shout out to the two Olgas I know in the DevOps space.
Bart: Very good. As a bonus, you mentioned that you've also hosted your own podcast. Can you tell me about that?
Mai: The AWS SheBuilds Tech Skills podcast focuses on women who are doing really cool things in tech, especially in AWS. The podcast is still ongoing with our North American counterparts. Please check it out if you want to see some awesome women in tech.
Bart: Very good. And a special shout out to one person who I imagine we have in common, Farrah Campbell, who's very active in the AWS space. She's a wonderful person doing lots of great community work. If you're interested in this kind of stuff, I highly recommend following her work. Last but not least, before we dig into the crux of the matter with what you wrote about, Mai, if you had to go back and give yourself one piece of career advice, what would it be?
Mai: So this applies from back in the day. At the moment, I know the job market is a little tight, but if I could travel back in time, I would say don't stay in one job because it just gives you a paycheck. I should have challenged myself and taken more risks while I could. It's okay to fail. I should have created a startup or several startups before the dot-com bust. Yes, I am that old. Currently, we've got the AI boom, so why not get on top of it, obviously, if you don't have a mortgage and don't have any dependents? Definitely take more risks.
Bart: If you do decide to create a startup, we'll be making lots of noise about it. Now, to dig into what we want to discuss in depth today, you wrote an article that we found in our monthly content discovery at Learnk8s called "Scaling and Troubleshooting Amazon EKS Just Got Easier: MCP on Anthropic Claude". We want to dive into this more. You've been in systems administration for over 15 years, starting with tools like Nagios. How has the landscape of infrastructure management fundamentally changed from those early days to today's Kubernetes-centric world?
Mai: Starting back in the Nagios days, I had to compile that from source, and it wasn't easy. Plus, it was running on a VM, so it wasn't highly available. There was nothing to check the checker if that server failed. I was just focused on getting Nagios running. It was very manual at the time.
Going back further: does anyone remember Patch Tuesday? The US Tuesday would land on a Wednesday or Thursday in Australian time. We used to have pizzas so sysadmins could stay up all night patching servers, click-opsing into the early morning because there was no automation.
Now, there's a lot of tooling available in AWS, like Systems Manager and Patch Manager, that helps automate where it makes sense.
One thing to note: I can't finish "The Phoenix Project" because it gives me PTSD. Spoiler alert: all on-prem services break in that book, which matches my experience. Being on call and having a SAN fail in the middle of the night, I couldn't remote in. I had to come in and restart it. It still didn't restore itself because of a firmware update issue. There wasn't any documentation around this cryptic error, and we were heavily reliant on the vendor to fix things.
Stack Overflow questions were always closed as duplicates, or had accepted answers without an actual resolution, which was frustrating. Breaking things in production (not on purpose) and fixing them under time pressure without documentation—if you haven't experienced those things, you really haven't lived as a techie.
Bart: You mentioned that microservices were supposed to make life easier for DevOps engineers, but they've introduced their own complexities. What's the reality of managing Kubernetes at enterprise scale versus the promise?
Mai: If you're running Kubernetes personally in your home environment, that's definitely no comparison to enterprise-grade critical workloads. Version upgrades can be cumbersome because you need to know not only what's new but what's going to be deprecated, as you'll have to change your application code.
What I've found in large organizations is that even though they say DevOps, in practice there are separate DevOps teams and application developer teams. Unless an organization is fairly advanced in this space, these teams can be quite siloed—and this includes the traditional infrastructure and security teams.
In the enterprise scenario, getting approval from all these different teams and getting everyone to plan ahead and know exactly what to do to prepare for an upgrade is where it falls short and where many customers struggle. The coordination and planning is the difficult part. The tech is the easy part.
Bart: That's an answer we get quite a bit. When we spoke to Bob Wise, who's currently at NVIDIA but was OG Kubernetes, about the biggest challenges around Kubernetes, he echoed exactly what you just said: As much as we're talking about technical challenges, the human part is actually the most complex thing to coordinate—getting the right people to be doing the right things, focusing on real priorities, and identifying which technologies are truly responding to the problems faced by a particular organization. You described a real crisis scenario where a team was frantically trying to upgrade Kubernetes versions before a forced production rollout. It sounds like a nightmare scenario many folks have faced. What made that situation so challenging?
Mai: If I rewind back to that time, an organization had outsourced management of their Kubernetes clusters to a third party under a managed services agreement. The Kubernetes version their product ran on was going end of life. There were numerous notifications to the account owner's email, but they were ignored due to alert fatigue or a lack of proper handover from the customer to the new managed services provider.
The customer called us two weeks out from the end-of-life date. We were able to give them an extension of one more week. However, we couldn't keep extending this customer as an anomaly, because leaving them on an older version than everyone else across the globe would have been a security risk.
The customer realized very late that their application code needed to be updated with the correct software dependencies, which obviously didn't include deprecated components. Luckily, they had a non-production environment, so we could test with the provider helping the customer. However, the provider didn't have the context behind the workloads, so we definitely needed to have the customer involved. It was a team effort to manage the upgrade across all different layers.
What would have helped the team was better detection—using tools like Pluto to detect Kubernetes API versions that could be deprecated. It's crucial to refer back to the Kubernetes API reference and release notes before the end-of-life date. The team also needed to examine cluster add-ons and their compatibility, including ingress controllers, monitoring solutions, and storage provisioners, to ensure they work with the new Kubernetes version.
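As a quick illustration of the kind of detection Mai mentions, Fairwinds' Pluto can scan both static manifests and live Helm releases against a target Kubernetes version (flags may vary slightly between Pluto releases):

```bash
# Flag manifests that use APIs deprecated or removed by the version you're moving to
pluto detect-files -d ./manifests --target-versions k8s=v1.29.0

# Check what Helm has actually deployed into the current cluster
pluto detect-helm --target-versions k8s=v1.29.0
```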
Bart: This brings us to the elephant in the room: Kubernetes administration is fundamentally changing with AI assistance. You've been experimenting with MCP servers connected to EKS. Do you think this evolution is for the better?
Mai: From the AI assistance point of view, you can learn a skill really quickly. This can supercharge your training and enablement from an individual perspective. Not long ago, we were looking at Google, Stack Overflow, Medium, and Reddit, often encountering incorrect or missing answers.
As with DevOps, everything can be automated up to a point. I've seen DevOps engineers automate themselves out of a job, but you still need to steer the systems. For example, I was playing around with Claude to build a web application hosted on Amazon S3. Claude suggested creating a public S3 bucket, which was acceptable in the early days. However, the current best practice is to host it in a private S3 bucket with AWS Amplify.
I realized the information was outdated, so I verified it. If something doesn't seem right, do your research and refer to vendor documentation. Checking the CNCF docs or vendor Git repositories is a good way to ensure you have the most up-to-date information.
Bart: Now let's talk about Amazon EKS Auto Mode, which you describe as an improved Fargate. How does this fit into the broader trend of abstracting away infrastructure complexity?
Mai: EKS Auto Mode is great because it learns from Fargate's limitations. Fargate was groundbreaking with serverless containers in theory, but in practice, especially for EKS, it had significant drawbacks: you couldn't use certain storage types, networking was complex, and you couldn't run DaemonSets or GPU workloads.
EKS Auto Mode takes a different approach. Instead of hiding the infrastructure, it manages the repetitive tasks so you don't have to. You still get EC2 instances, and AWS handles cluster upgrades, security patches, and bakes in best practices automatically.
It runs out of the box with several key components:
Karpenter for compute autoscaling
AWS Load Balancer Controller to expose Kubernetes services via Elastic Load Balancer
EBS CSI to let pods use EBS volumes as persistent storage
VPC CNI to give pods their own VPC IP address for direct communication with AWS services
EKS Pod Identity Agent to handle AWS credentials securely for pods
EKS Auto Mode essentially simplifies day two operations for Kubernetes clusters.
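For reference, enabling this is roughly a one-flag affair. Here's a sketch using eksctl, assuming a recent release with Auto Mode support (the autoModeConfig schema may differ between eksctl versions):

```bash
# Create an EKS cluster with Auto Mode: AWS manages compute (Karpenter-based),
# block storage, and load balancing on your behalf.
eksctl create cluster -f - <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: auto-mode-demo
  region: ap-southeast-2
autoModeConfig:
  enabled: true
EOF
```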
Bart: You discovered an MCP (Model Context Protocol) server that can connect Claude to Kubernetes clusters. For folks unfamiliar with MCP, can you explain what this protocol is and why it's significant for DevOps workflows?
Mai: MCP, which stands for Model Context Protocol, is basically like a USB for AI tools. Before USB-C, every device had its own unique charging cable. I still have a drawer full of cables in case I need them, and I'm sure it's the same for everyone.
Integrating AI with different tools used to require building custom integrations. If you wanted it to work with your monitoring system, that was one custom integration. If you wanted it to integrate with your CI/CD pipeline, you'd need to create another. It was quite messy to build and maintain these one-off custom integrations.
MCP standardizes how AI communicates with external tools and data sources. I found an MCP server for Kubernetes that works with Claude. Now, Claude can run kubectl on your behalf, checking pod status, examining logs, and helping with troubleshooting through a standardized interface.
What's really great for DevOps is that it creates a consistent way for AI to plug into your entire tool chain, including monitoring, deployment, and infrastructure. It's almost like moving from the early days of custom APIs to having a universal protocol that makes AI integration much more practical and easy.
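As an example of how little wiring this takes, here's roughly what registering a Kubernetes MCP server with Claude Desktop looks like on macOS. The mcp-server-kubernetes package is one community server among several; treat the name as illustrative:

```bash
# Point Claude Desktop at a Kubernetes MCP server. The server talks to the
# cluster via your local kubeconfig's current context.
# Note: this overwrites the file; merge by hand if you already have one.
cat > "$HOME/Library/Application Support/Claude/claude_desktop_config.json" <<'EOF'
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "mcp-server-kubernetes"]
    }
  }
}
EOF
```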
Bart: Walking through your demo, you were able to ask Claude about your EKS setup in natural language, and it provided detailed configuration information. How accurate and useful was this compared to running `kubectl` commands manually?
Mai: Instead of me having to remember whether it's `kubectl get nodes` or `kubectl get pods` and all the switches, I could just ask Claude to tell me about my EKS cluster in detail. Claude ran all the right commands behind the scenes and automatically checked my nodes, namespaces, deployments, services, and storage classes. It even spotted the game I deployed from the EKS Auto Mode tutorial.
Gathering that kind of information would normally take at least six or seven different `kubectl` commands, but Claude was able to do it in one hit. The accuracy was almost spot on because it was running `kubectl` under the hood, saving me from parsing through lots of YAML output. It gave me a summary of what was running where, and even picked up on the Bottlerocket operating system and identified node pools as part of the managed configuration.
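For a sense of what that one question replaces, the fan-out is roughly this set of commands:

```bash
kubectl get nodes -o wide                  # node status, instance types, OS image
kubectl get namespaces
kubectl get deployments --all-namespaces
kubectl get services --all-namespaces
kubectl get storageclass
kubectl get pods --all-namespaces          # plus follow-up describes/logs as needed
```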
I wanted to make sure I could test Karpenter, so I said let's scale to a third node. Claude figured it out and created a resource-intensive deployment to trigger the autoscaler. It didn't blindly increase the replica counts but understood that the existing pods were too small to force a new node. So it created something that would actually need more resources.
When I said we don't need the resource-hungry deployment anymore, Claude read between the lines, automatically deleted the deployment, and monitored as Karpenter deprovisioned the extra node. I'll call it operational intuition, not just randomly executing commands.
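A deployment along these lines would reproduce the trick Claude used: pods whose resource requests exceed spare capacity, forcing Karpenter to provision a node (sizes are illustrative):

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing, but reserves resources
        resources:
          requests:
            cpu: "1"      # large requests leave the scheduler no room on
            memory: 1Gi   # existing nodes, so Karpenter adds one
EOF

# And the cleanup step Claude inferred, after which Karpenter deprovisions the node:
kubectl delete deployment scale-test
```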
Bart: Now, one interesting moment was when Claude seemed unsure about whether you were running Auto Mode or EC2 instances. How do you handle these AI uncertainty moments in production scenarios?
Mai: That's a perfect example of why you can't just blindly trust AI outputs in production. Claude was trying to help me troubleshoot and made an assumption that wasn't quite right. AI is great at guessing and generating information, but it doesn't actually know your specific environment unless you provide that information. You're working off probability and training data, not real-time cluster state. That uncertainty was actually valuable information.
In production scenarios, I treat AI suggestions like recommendations from a really smart colleague. They don't have access to my information, but I will still take their guidance, understanding they might not know the full context. It's almost like the zero trust principle: never trust, always verify.
Being able to fall back to raw kubectl commands, and having that fundamental understanding, is definitely good as well. I would use AI as an assistant rather than a replacement for your knowledge. Being a 10x engineer is about using AI to increase your productivity rather than replacing your critical thinking skills.
Bart: It's a very nice assessment because, despite the uncertainty and feeling threatened about AI, it's actually complementary and just helps you be more productive. You tested Claude's ability to scale nodes and trigger Karpenter. The fact that it handled this seamlessly and scaled back down gracefully suggests AI understands not just commands, but operational best practices. What are the implications of that?
Mai: At the time, it really surprised me. Claude not only ran `kubectl` commands but also intuitively understood the operational flow. It checked the current node utilization first and created a workload that triggered Karpenter's scaling logic. Importantly, it remembered to scale back down.
Scaling back down is crucial. I've encountered customers who left non-prod environments running—not just Kubernetes environments, but also non-prod servers—which ends up costing quite a bit. It's great that Claude automatically thought, "We've proven this works, so let's scale down to avoid wasting money."
This goes beyond a typical Kubernetes tutorial; it's a collection of operational wisdom. AI seems to be absorbing best practices from the collective knowledge of thousands of engineers. It's not just executing commands but learning the why behind them.
Imagine onboarding junior engineers where AI helps not just by running commands, but by teaching proper operational hygiene and preventing 2am production incidents where someone runs a command without thinking.
However, there's a concern: AI can encode both good and bad operational practices. It will depend heavily on the human running it, so there always needs to be a human in the loop.
Bart: Now, this raises a very provocative question that you posed. This can be a bit controversial. I'm sure we'll get some really nice material out of this for a clip, which we will provide in its full entirety—not soundbites that are misleading. But will Kubernetes certifications like CKA and CKAD become obsolete if AI can handle most administrative tasks?
Mai: It's an interesting question, because if Claude can use `kubectl` better than most engineers with a CKA, what's the meaning of a certification? I think the value is shifting from remembering the exact syntax for a complex `kubectl` command to understanding why you're running that command and what could potentially go wrong.
AI might be able to execute a rolling update, but does it understand the business impact if that update breaks your payment system? I think certifications would focus more on strategic aspects: architectural decisions, security trade-offs, understanding failure modes—things that require real context around your business and systems that AI doesn't have.
With CKA right now, it might be fixing a broken cluster under time pressure, which can get quite difficult. But I think it should move to designing a Kubernetes architecture that meets business requirements, compliance, performance, and cost constraints—similar to the Well-Architected Framework from an AWS perspective. AI could definitely help with implementing that design, but it won't be able to make those critical judgment calls.
Some human skills that would help include understanding organizational dynamics, negotiating, translating business needs into technical requirements, and even pushing back to say, "Kubernetes does not solve everything." Maybe you can look at a serverless solution here—and I'll probably get flak for saying serverless in this podcast. Definitely, we need that element of human decision-making.
Bart: Great points. And for all the serverless haters, we'll see you on Reddit. Now, this brings up a concerning trend that we've noticed at Kube Careers. Kube Careers is a site that we run where we aggregate different jobs in the Kubernetes ecosystem to make it easier for people to find companies that are hiring. But one of the things we're facing is that we're struggling to find junior Kubernetes roles to list. Across the industry, companies seem to be hiring fewer junior engineers in general, partly due to AI capabilities. You've lived through the evolution from manual administration to AI-assisted operations. What worries you about this trend? And what happens in 10 years when we potentially have a generation that never learned the fundamentals?
Mai: We might be creating a massive skills gap where we have senior engineers who learned on bare metal and junior engineers who only know how to prompt AI, with nobody in the middle who understands both worlds. I lived through the transition from manually partitioning disks and click ops to compiling from source and deploying via infrastructure as code. These days, abstraction is good—I don't miss dealing with SANs and bare metal hardware. But here's the thing: when those abstractions break, you need someone who understands what happens under the hood. AI is just another layer of abstraction, so it will break in weird ways, and everything fails all the time.
What worries me is losing the struggling-with-fundamentals phase that builds troubleshooting muscle. If you've ever suffered the fate of firewall rules blocking traffic between source and destination, you had to work through the full OSI stack to see where the problem was. I still believe troubleshooting is the key to success in your tech career. Even in the cloud world, IAM policies still trip me up: how do you scope down permissions to least privilege without granting admin access to everything?
That suffering actually taught me about failure modes that don't show up even when googling. If junior engineers don't go through that process, they won't know where to start. Back in the googling days, a combination of keywords would eventually lead you to a potential solution, but sometimes documentation might not help. Ten years from now, we might face a scenario where all the senior engineers are retiring, leaving us with people who can prompt AI well but might panic when they need to SSH into a node and debug something AI doesn't understand. We need to start educating folks early.
Bart: Great points. Building on that, when you were troubleshooting that Kubernetes upgrade crisis with your team, it was human understanding of the underlying systems that ultimately saved the day. If future engineers primarily interact with Kubernetes through AI interfaces, how will they handle scenarios where AI doesn't have the answer?
Mai: Just randomly running kubectl commands without understanding the workload is a nightmare scenario. What saved the day wasn't prompting for kubectl commands, but collaborating with engineers from the customer end, the managed services provider, and the vendor side who understood the interactions between applications, network policies, and storage configurations, and how different Kubernetes versions change those behaviors.
This is why I worry about the generation of black box engineers. If your primary interaction with Kubernetes is asking an AI to run commands for you, what happens in that weird edge case when the application is failing in a way that doesn't match an AI output? The symptoms are so specific to your environment, and every customer environment is different.
Another issue is that AI can make day-to-day operations smoother, but it can be completely off the mark when it comes to the latest feature additions and deprecations. You need to refer to CNCF documentation and the latest release notes, not Claude or any other AI tool. The human element is super important: always verify.
Bart: You've painted a picture of using this technology during those dreaded 2am troubleshooting sessions when your brain isn't working optimally. How could AI assistance transform crisis management in DevOps?
Mai: At 2am when you're troubleshooting, having an AI tool help you is a game changer. When you're half awake and production is down, you're staring at logs—lines and lines of logs—and the coffee isn't kicking in anymore. This is where having AI as your expert gets really interesting.
It doesn't get tired, doesn't panic, and doesn't skip the obvious stuff. How many times have we spent an hour debugging and going through logs, thinking of weird complex edge cases, when the actual solution was staring us right in the face?
I would picture AI as a methodical troubleshooting partner. When you're seeing 500 errors, it might ask you to check the ingress controller logs first and then work your way back through the stack. It gives you a systematic approach when your brain isn't functioning.
AI could also help with the communication side. When an incident occurs, you need to write a post-incident report (PIR), which is the last thing you want to do before going to bed. Having an AI tool help draft the post-mortem while the details are fresh would definitely help you get to bed early and send an update to the necessary stakeholders.
Bart: From a practical standpoint, what are the current limitations of using AI for Kubernetes operations, and what should teams be cautious about when implementing these tools?
Mai: AI for Kubernetes operations is amazing, but not ready to be deployed in production without guardrails. AI hasn't got real-time awareness of your cluster state. It might suggest scaling up nodes when you're already at capacity or restart a pod that's currently handling critical traffic. It works off patterns and documentation, providing a guesstimate as to what would be best for your clusters.
Human judgment is super important, especially on the permission side. If you're giving AI access to kubectl, you need to be thoughtful about RBAC. I've seen demos where people give it cluster admin just to make it work, which is not best practice. You need to think about scoping down those permissions—perhaps give it read-only access for most things and write access to specific namespaces or resource types.
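A minimal sketch of that scoping, assuming the AI tooling runs under its own service account (names are placeholders): read-only on the common resources, and nothing else until you deliberately widen it.

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-assistant
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-read-only
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "pods/log", "events", "services", "nodes", "deployments", "jobs"]
  verbs: ["get", "list", "watch"]   # deliberately no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ai-read-only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ai-read-only
subjects:
- kind: ServiceAccount
  name: ai-assistant
  namespace: ops
EOF
```

Write access, where genuinely needed, can then be granted through a namespaced Role for specific resource types, matching the approach Mai describes.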
Version compatibility is also crucial. A command that worked in Kubernetes version 1.20 might break in 1.28. AI is not great at tracking which features were deprecated and when, and the newest releases might not be in its training data unless you provide that information.
Once again, this comes back to human fundamentals and troubleshooting.
Bart: Now, looking at the broader intersection of DevOps and AI beyond just Kubernetes, where else do you see AI making the biggest impact in infrastructure operations?
Mai: Some of the places where AI is starting to be productive are around monitoring and incident response. Instead of setting up alerts that lead to alert fatigue, AI can sift through data and understand normal patterns for specific workloads, then flag genuine anomalies. Security scanning is another significant area. It's not just for CVEs; AI can spot suspicious patterns within infrastructure configs that a human might miss, such as a service account with more permissions than it uses, or a network policy change that opens a path to a database.
Infrastructure as code is getting more sophisticated, so AI can review Terraform and not just check syntax but understand architectural anti-patterns. It might identify a single point of failure or highlight a setup that will be quite expensive at scale.
Coming back to operational routines, AI can help eliminate boring tasks like log rotation, certificate renewals, and basic health checks—things that often consume engineers' time. This frees humans to focus on more creative and strategic work, including architectural design and solving actual business problems instead of babysitting routine jobs.
Bart: For DevOps teams and platform engineers who want to start experimenting with AI-assisted operations, what would you recommend as first steps?
Mai: I would always recommend starting small with guardrails and not jumping straight into having AI manage your production clusters, even though I've got a demo that looks cool with Claude. Start with non-production environments first, which should be obvious.
Low-risk applications could include helping write and review infrastructure as code, generating Terraform modules or Kubernetes manifests. Always review these like you would any pull request. It's a great way to see how AI thinks about infrastructure patterns and verify the output.
Documentation is an excellent starting point. AI is fantastic at improving existing runbooks or generating documentation based on your current configuration. I would also suggest Amazon's spec-driven code generation tool called Kiro, which helps generate user stories for specific projects based on your prompt. I've been using Kiro, and it's been fantastic in establishing a baseline for gathering requirements from different project stakeholders.
I believe the technical side is the easy part, but gathering requirements to start a project is difficult. For anything touching production, make sure to start in read-only mode. AI can help by interpreting logs, looking at metrics, and suggesting troubleshooting approaches, particularly in operational excellence.
Bart: Now, as someone who's been through the evolution from manual system administration to AI-assisted operations, what's your vision for the future of DevOps? Will the role fundamentally change, and how do we ensure we're preparing the next generation properly?
Mai: I think we're heading towards a world where DevOps engineers become more like architects and strategists. That is why I'm investing heavily in the architecture track. I feel that we're going to move towards being orchestrators of AI agents. The really boring stuff like patching systems, deployments, and basic monitoring may be automated away by AI, which is great, because who enjoys staying up for a routine deployment at 2am?
This means the human role shifts upwards. We're going to be the people designing systems, making architectural decisions, and understanding business requirements to help translate those into infrastructure strategies. It's about moving from configuring a load balancer to defining a disaster recovery strategy for a multi-region outage across multiple clouds.
The challenge is to not lose the foundational knowledge that makes higher-level decisions possible. You can't design a resilient system if you don't understand how different components fit together. As Andy Jassy likes to say, there's no compression algorithm for experience.
I would say a future DevOps engineer is someone who can rapidly prototype infrastructure with AI, but also be able to dive deep when AI suggests something is wrong. They're comfortable asking, "Claude, can you help me implement this architecture?" while having designed it with an understanding of the underlying trade-offs.
Bart: Now, Mai, what's next for you?
Mai: I want to get more comfortable having strategic discussions with non-technical folks like the chief financial officer, CEO, and COO. Being able to translate complex technical ideas into simpler language that non-technical people understand is important. Explaining concepts as if you're talking to a five-year-old helps in understanding the concept yourself.
I've been heavily investing in the CNCF side of things, particularly in FinOps, and I'm working towards becoming a FinOps professional. I also continue to read books. Gregor Hohpe's "The Software Architect Elevator" is really useful for anyone wanting to understand how to communicate with different audiences, from technical engineering teams to non-technical executives.
Bart: Mai, if people want to get in touch with you, whether about your work at NTT Data, the AWS SheBuilds Tech Skills podcast, or the AWS Community Builders program, what's the best way to reach you?
Mai: You can reach out to me via LinkedIn. I'll be posting content mostly about containers and AI, sometimes drones, and about events that I'm attending. If you see me out there, come and say hi. Thank you very much for having me, Bart.
Bart: Wonderful to have you, Mai. I look forward to crossing paths with you in the future, whether online or in person. Keep up the amazing work. We'll be in touch. Take care.