Platform engineering and the evolution of enterprise Kubernetes
This interview explores how large financial institutions manage Kubernetes at scale, covering everything from cost optimization to the future of platform engineering.
In this interview, Sai Sandeep Ogety, Director of Cloud & DevOps Engineering at Fidelity Investments, discusses:
Managing hybrid cloud infrastructure at enterprise scale, combining on-premises and public cloud while maintaining consistent operations and addressing compliance requirements
Platform engineering challenges, including cost optimization with tools like KubeCost, bridging the developer knowledge gap, and managing stateful workloads with Kubernetes Operators
The future of Kubernetes operations with AI/ML integration, focusing on cluster optimization through multi-tenancy and predictive maintenance using tools like K8sGPT
Transcription
Bart: Can you explain a little bit about who you are, your role, and where you work?
Sandeep: I'm Sai Sandeep Ogety, a cloud and DevOps expert with over 12 years of experience in the IT industry. I specialize in managing cloud platforms such as AWS, Azure, and GCP, and have a strong background in CI/CD automation and cloud security. Currently, I'm a technical lead for cloud and DevOps engineering at Fidelity Investments, where I lead initiatives to build scalable and secure cloud infrastructure solutions using building blocks like Kubernetes Operators, StatefulSets, and Persistent Volumes and Persistent Volume Claims. I'm also the founder and organizer of the Research Triangle meetup group, a community of over 3,300 members in the Raleigh-Durham area, where we discuss topics like multi-tenancy in Kubernetes, Network Policies, and Resource Requests and Limits, and tools such as Lens, KubeCost, Prometheus, Datadog, Splunk, and the Elasticsearch/ELK Stack.
Bart: What are three emerging Kubernetes tools that you're keeping an eye on?
Sandeep: I'm keeping an eye on Lens and OpenTelemetry, as they both relate closely to platform engineering. The third area I'm interested in is tooling that automates cluster access: connecting to clusters can be challenging since we deal with many clusters from providers like Google, Amazon, and Microsoft Azure. A tool that allowed immediate connection to any cluster with a single command would simplify my day-to-day tasks.
Bart: We've been discussing platform engineering for a while, and it has gained significant traction in the space. In terms of working at a large-scale financial institution with different clouds, what are some of the challenges that platform engineers face on a day-to-day basis?
Sandeep: Cost optimization is the big one; I'll give an example of why it has become such a pressing topic for platform engineers. Bridging the knowledge gap for developers who aren't Kubernetes experts is crucial for organizations aiming to optimize their pod-level configurations effectively.

One effective strategy is to establish a centralized team of Kubernetes specialists. This dedicated group can manage the configuration and operation of Kubernetes clusters, providing support and acting as internal consultants for development teams. By centralizing expertise, organizations reduce the need for every developer to become a Kubernetes expert, letting developers focus on application development while relying on the specialized team for Kubernetes-related tasks.

Another approach is cross-functional collaboration: encouraging knowledge sharing between traditional developers and operations professionals. This creates a more holistic understanding of the systems in place and builds teams that can communicate complex technical concepts across departments, promoting a DevOps culture and breaking down silos through open communication and shared responsibilities.

Additionally, implementing tools that provide best practices, automation, and safeguards can help bridge the skill gap. These tools let teams learn from manageable cases while reducing the operational risk associated with a lack of expertise.

By adopting these strategies, organizations can bridge the knowledge gap, enabling developers to handle pod-level optimizations confidently and improve cost-effectiveness, while fostering a culture of continuous learning and collaboration that drives innovation and operational excellence.
Bart: Staying on the subject of clouds: in different conversations over the last few years, we've seen or heard that people are moving away from the cloud. Is that something that's happening at Fidelity, with more workloads being run on-prem?
Sandeep: No, actually, because we have a pretty solid footprint both on-premises and in the cloud. We built an internal tool that lets us spin up on-premises clouds. A lot of compliance-driven applications have to stay on-premises, but we still treat them as cloud; the only difference is that we manage our own cloud rather than relying on a public Cloud Service Provider (CSP) such as AWS or Azure. The good thing, in my experience at Fidelity, is that the roles and responsibilities are converging whether you work on-premises or as a cloud engineer. We are building a hybrid cloud that serves traditional operations teams, newer cloud engineers, and emerging-technologies teams alike. This convergence also relates to what you mentioned earlier about OpenTelemetry.
Bart: In your experience, we've seen a lot of tools coming out in the observability space, such as OpenTelemetry, Splunk, Elasticsearch, the ELK Stack, Prometheus, and Datadog. It's no secret that for Kubernetes there are tons of tools, sometimes even tools for tools and dashboards for dashboards. What have you done in your work to avoid tool sprawl?
Sandeep: I would say that everybody in the industry is moving towards open source technologies and tools, and we are cutting down on a lot of tools. In my experience across various organizations, including fintech, retail, and telecom, each team operates as a business unit with its own toolset. For instance, one organization may use Splunk, while another uses Elasticsearch or an ELK Stack.

With the adoption of open source, organizations are building their own instances and feeding metrics from Prometheus and other sources into a telemetry pipeline. Many are adopting OpenTelemetry these days because it provides a comprehensive monitoring stack. Speaking with industry leaders through community groups, they often mention that they are moving towards open source; people want to try these technologies and explore various scenarios.

The main concern comes from leadership. When an organization decides to use open source tools, leaders question the Service Level Agreements (SLAs) and Service Level Objectives (SLOs): what happens if a feature working in production goes down, and who is responsible? Leaders are taking a keen interest in addressing these concerns, but apart from that, everybody is trying to use open source tools, including OpenTelemetry.
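To give a flavor of what that adoption can look like in practice, here is a minimal OpenTelemetry Collector configuration sketch: it accepts OTLP data from instrumented services and exposes the resulting metrics for Prometheus to scrape. The endpoints and pipeline are illustrative, not a description of any specific production setup.

```yaml
# Minimal OpenTelemetry Collector config (illustrative):
# receive OTLP from instrumented apps, batch, expose for Prometheus.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}                  # batch telemetry to reduce export overhead
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes this endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```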
Bart: Great answers so far. Now let's get into the three questions we spoke about earlier. On stateful applications and databases: one of our guests, Paul, argued that we shouldn't store state in Kubernetes, because as long as you can treat Kubernetes as stateless, it's a nice abstraction and assumption for day-to-day management. How do you manage StatefulSets in your Kubernetes clusters, considering the use of Persistent Volumes and Persistent Volume Claims?
Sandeep: Managing stateful applications in Kubernetes requires a strategic approach to ensure data persistence and reliability. Kubernetes was traditionally designed for stateless workloads, but it has evolved to support stateful applications through features like StatefulSets, Persistent Volumes (PVs), and Persistent Volume Claims (PVCs).

StatefulSets are essential for deploying stateful applications because they give each pod a unique identity and stable storage, keeping data consistent even as pods are rescheduled. This is particularly important for applications like databases, where data persistence is crucial. PVs and PVCs provide dynamic storage management, allowing applications to maintain state across restarts: by defining PVC templates within a StatefulSet, each pod gets its own storage that survives pod rescheduling or restarts.

Implementing headless services alongside StatefulSets provides stable network identities, giving each pod a consistent DNS endpoint. This setup is particularly beneficial for distributed databases and other stateful services that rely on stable network identities.

Additionally, best practices such as separating stateful components from stateless ones, choosing storage solutions appropriate to the specific use case, and implementing backup and disaster recovery are crucial for maintaining stateful applications in Kubernetes. By combining these features and practices, stateful applications can be managed effectively.
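As a concrete sketch of the pattern described above, here is a minimal StatefulSet with a headless Service and per-pod storage via `volumeClaimTemplates`; the names, image, and sizes are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None          # headless: each pod gets a stable DNS name
  selector:
    app: postgres
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres    # ties pod identities to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:    # one PVC per pod, retained across rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Pods come up as postgres-0, postgres-1, and postgres-2, each reachable at a stable DNS name (e.g. postgres-0.postgres) and each bound to its own PVC, so data follows the pod identity across rescheduling.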
Bart: I also have a decent background in running stateful workloads on Kubernetes, from managing a community focused on that topic. What has been your experience with Kubernetes Operators, particularly for running stateful workloads, databases, and so on?
Sandeep: Kubernetes Operators are essential for managing heavy workloads in any organization that runs those workloads in production. Operators can be customized to encode the desired behavior, and in my experience, every organization or team builds its own operators for specific use cases. For example, if you want to schedule jobs, run batch jobs, or need features that aren't supported by the community, AWS, or Kubernetes itself, you can build your own operator. Operators are key for teams managing their workloads effectively, and they let teams customize their applications for scalability. Although some teams moved away from building their own operators due to technology limitations, I've seen a recent trend of teams learning to build their own again, and I'm mentoring teams within our organization to write operators for specific use cases.
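To illustrate the kind of custom API an in-house operator might expose for the batch-scheduling use case mentioned above, here is a hypothetical CustomResourceDefinition; the operator's controller would then watch `BatchJob` objects and reconcile them. The group and field names are invented for this example.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: batchjobs.example.com     # hypothetical group and resource name
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: batchjobs
    singular: batchjob
    kind: BatchJob
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:         # cron-style schedule the controller acts on
                  type: string
                maxRetries:       # hypothetical retry policy field
                  type: integer
```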
Bart: On the subject of platform engineering, specifically cost optimization, one of our guests, Kensei, suggested that developers should ideally handle pod-level optimization, such as setting Resource Requests and Limits. How can organizations bridge the knowledge gap for developers who aren't Kubernetes experts?
Sandeep: Cost optimization at the pod level means minimizing the cost of running individual Kubernetes pods by carefully managing their resource allocation. This is primarily achieved by adjusting the CPU cores and memory assigned to each pod based on actual usage, ensuring you only pay for the resources your application truly needs. Techniques like autoscaling and Resource Requests and Limits help achieve this, and setting the right limits on an application's deployment is crucial.

It's essential to keep a close eye on the cost of running your clusters and identify any pods or instances that aren't using their allocated resources efficiently. For instance, if an application is allocated eight vCPUs and 20 gigabytes of memory but only uses 20% of that, you should right-size the requests and limits to avoid paying for idle capacity.

Another key lever is spot instances, which take advantage of cloud providers' discounted spare capacity, letting you pay a lower price for instances that only run when needed.

Monitoring and analyzing your environment with tools like KubeCost, Datadog, or Apptio also helps: these tools take a holistic approach to cost optimization and can identify idle resources and opportunities to resize clusters. By using them and consistently monitoring costs, you build a robust understanding of where to cut. In our organization, these strategies have cut costs by up to 30 percent, and it's an ongoing, monthly effort to keep costs stable and trending down.
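To make the right-sizing example concrete: if a pod is allocated eight vCPUs and 20 GiB but observed usage sits around 20%, the requests and limits could be reset to something like the sketch below. The application name, image, and exact numbers are hypothetical.

```yaml
# Hypothetical right-sizing: observed usage ~20% of 8 vCPU / 20 GiB.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api           # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: example.com/payments-api:1.4.2   # placeholder image
          resources:
            requests:          # sized from observed usage, with headroom
              cpu: "1600m"
              memory: 4Gi
            limits:            # ceiling well below the original 8 vCPU / 20 GiB
              cpu: "2"
              memory: 6Gi
```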
Bart: On the subject of observability and monitoring, our guest explained that while monitoring deals with problems that we can anticipate, for example, a disk running out of space, observability goes beyond that and addresses questions you didn't even know you needed to ask. Does this statement match your experience in adopting observability in your stack, perhaps using tools like Splunk or Elasticsearch as part of an ELK Stack?
Sandeep: In my experience, adopting observability within our technology stack has significantly enhanced our ability to understand and manage complex systems. While traditional monitoring focuses on predefined metrics and alerts to identify known issues, observability provides a comprehensive view by collecting and analyzing telemetry data, such as logs, metrics, and traces, to offer insight into the system's internal state. This enables us to proactively identify and address unforeseen issues, improving system reliability and performance.

Implementing observability has also allowed us to detect anomalies and understand root causes more effectively. For instance, by analyzing distributed traces, we have been able to pinpoint latency issues across microservices running in our clusters, leading to more effective troubleshooting and to cost optimization using tools like KubeCost. Moreover, observability has made system management more proactive: by continuously analyzing telemetry data from tools like Prometheus, Datadog, Splunk, Elasticsearch, and the ELK Stack, we can anticipate potential problems before they occur, allowing for timely interventions and maintaining a seamless user experience. This is a significant advance over traditional monitoring, which often reacts to issues after they occur.

Another example is with artificial intelligence and machine learning operations, where you can build a bot and deploy it as a DaemonSet so an instance runs on every node, analyzing your applications to predict potential downtime, using tools like K8sGPT and guidance from the CNCF AI Working Group. It can alert engineers before the downtime occurs, and I have seen leaders run this as a proof of concept with pretty good results. I'm amazed to see how AI and MLOps are merging into robust tools for platform, infrastructure, and DevOps engineers, helping their day-to-day work by analyzing metrics and historical data to predict issues and recommend actions.
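As a sketch of the node-level analysis bot described above, the snippet below deploys a hypothetical agent as a DaemonSet so one instance runs on every node. The image, namespace, and name are placeholders; this is not the actual deployment mechanism of K8sGPT itself.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-analyzer                # hypothetical analysis agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: node-analyzer
  template:
    metadata:
      labels:
        app: node-analyzer
    spec:
      serviceAccountName: node-analyzer   # needs RBAC to read metrics/events
      containers:
        - name: analyzer
          image: example.com/node-analyzer:0.1.0   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
```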
Bart: Looking at the next 10 years of Kubernetes, since we celebrated 10 years of Kubernetes in 2024: it's expected that more AI (Artificial Intelligence) and ML (Machine Learning) will become a regular part of people's lives. We see tools like K8sGPT, and we now have the CNCF AI Working Group. What else is expected to happen in the next decade of Kubernetes?
Sandeep: One of the biggest topics everybody is looking at is how to manage clusters across multiple environments. For example, an organization with 2,000 clusters, running every kind of workload on Kubernetes, Windows and Linux alike, may be trying to reduce that number to 400. The cost of managing these clusters on GCP, Azure, or AWS is at a peak, particularly for the management nodes. To address this, organizations are exploring multi-tenancy in Kubernetes, which isolates environments within a cluster using Kubernetes primitives instead of creating new clusters. Multi-tenancy is expected to become increasingly important as organizations rearchitect their clusters and reduce costs.
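A minimal sketch of that namespace-based multi-tenancy, assuming one namespace per tenant: a ResourceQuota caps each tenant's consumption, and a NetworkPolicy keeps traffic inside the namespace. Tenant names and numbers are illustrative.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments          # one namespace per tenant (illustrative)
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "40"         # cap the tenant's aggregate CPU requests
    requests.memory: 80Gi
    pods: "200"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-tenant
  namespace: team-payments
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: {}      # allow ingress only from the same namespace
```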
Another emerging trend is the use of AI and ML to improve day-to-day operations. With the advancement of GPTs, organizations are building their own, such as K8sGPT for Kubernetes. By feeding historical data into these GPTs, they can become a go-to tool for troubleshooting. Open-source communities, including Kubernetes, are likely to develop extended GPTs to help developers, providing a reliable source of information instead of searching multiple data sources.
Finally, cost optimization is becoming a key focus area. Now that many organizations have achieved containerization, they are looking to optimize their application design to cut costs. This is an emerging trend, with organizations taking steps to reduce their expenses and improve efficiency, potentially using tools like KubeCost or Apptio.
Bart: When we ask people about their least favorite Kubernetes features, one thing that gets mentioned is overall complexity, which came up earlier in our conversation about the knowledge gap. Another, in a more technical sense, is Network Policy management, which comes up a lot as particularly painful for many folks, although some improvements are being made, such as new APIs related to the Gateway API. In your experience, what would you say is your least favorite Kubernetes feature, the one you would like to see improved the most?
Sandeep: I would say that, especially when integrating with multiple clouds, there is a lack of features for seamless interaction between different cloud service providers. For instance, consider a real example from my experience: suppose you have a Kubernetes cluster running in AWS and you find that storage is cheaper in Azure than in AWS S3. It would be beneficial to have a feature that lets you integrate with a data source sitting in Azure, potentially through Persistent Volumes and Persistent Volume Claims. Currently, this requires a lot of tweaking. Networking is also crucial when communicating across clouds, given complex architectures that span on-premises to cloud, cloud to on-premises, and cloud to cloud. Many organizations are moving towards cloud-to-cloud connections, and I believe Kubernetes features should evolve to be cloud-agnostic, enabling communication with different cloud service providers; the Gateway API could facilitate this. I have seen many use cases where people run multi-cloud and hybrid environments, such as on-premises plus cloud, which also relates to multi-tenancy in Kubernetes. Regardless of where your cluster sits, it should be able to interact with different cloud service providers, and that's a development I would like to see in the future.
Bart: If people want to get in touch with you, what's the best way to do it?
Sandeep: I'm active online and manage several meetup groups on LinkedIn. People can sign up for my meetup group, and I can share the link. I host sessions on emerging topics at least every two weeks, and interested folks can join, introduce themselves, and discuss with me. I'm also available on LinkedIn, and people can reach out anytime.
Bart: Perfect. Thank you very much for joining. I look forward to speaking with you in the future. Take care, thank you.