From ECS to Kubernetes: A Real Migration Story
Feb 24, 2026
Guest:
- Radosław Miernik
This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
Migrating from ECS to Kubernetes sounds straightforward — until you hit spot capacity failures, firewall rules silently dropping traffic, and memory metrics that lie to your autoscaler.
Radosław Miernik, Head of Engineering at aleno, walks through a real production migration: what broke, what they missed, and the fixes that made it work.
In this interview:
Running Flux and Argo CD together — Flux for the infra team, Argo CD's UI for developers who don't want to touch YAML
How the wrong memory metric caused OOM errors, and why switching to jemalloc cut memory usage by 20%
Splitting WebSocket and API containers into separate deployments with independent autoscaling
Four months of migration, over 100 configuration changes in the first month, and a concrete breakdown of what platform work looks like when you can't afford downtime.
Relevant links
Transcription
Bart Farrell: In this episode of KubeFM, we're joined by Radek, who's the head of engineering at aleno. This episode is a deep dive into a real migration from AWS ECS Fargate to Kubernetes: what broke, why it broke, and the concrete fixes that made the platform stable. We start with the constraints that forced the change: spot capacity failures in ECS, and an unexpected entry point into Kubernetes through CI — scaling self-hosted GitHub Actions runners with the GitHub ARC operator. Radek then walks through the production setup: multi-environment clusters, a full observability stack, and a GitOps workflow using Flux for infrastructure changes and Argo CD's UI to make deployments accessible to developers. We also cover production-only edge cases: DNS cutovers, autoscaling limits, Nginx rules dropping real traffic, VPC peering gaps, and the memory realities of running Meteor.js and WebSockets on Kubernetes, including allocator changes that reduced memory usage by around 20%. If you're migrating from ECS to Kubernetes, this is a grounded look at what platform work actually looks like in production. This episode is sponsored by LearnKube. Since 2017, LearnKube has helped Kubernetes engineers from all over the world level up through Kubernetes courses. Courses are instructor-led and are 60% practical and 40% theoretical. Students have access to the course materials for the rest of their lives. Courses are given in person and online, to groups as well as individuals. For more information about how you can level up, go to learnkube.com. Now, let's get into the episode. Radek, welcome to KubeFM. What three emerging Kubernetes tools are you keeping an eye on?
Radosław Miernik: To be honest, there's not really that much. I'm looking out for improvements to the tools that we are currently using. So I'm looking at the new things coming to Argo — I saw a couple of things regarding workflows, and a couple of things regarding Karpenter. And besides that, nothing really.
Bart Farrell: Fair enough. And for people who don't know you, can you tell us a little bit more about what you do and where you work?
Radosław Miernik: So I'm the head of engineering at aleno, and we are creating software for restaurants. That's my day-to-day job. I also work at the university part-time because I'm finishing my PhD, which takes up a lot of time as well. My day-to-day is mostly about building the application as it goes: creating features, fixing bugs, all that stuff. And there is some infrastructure here and there, but it's mostly about optimizing costs, and also making sure the new features we release keep working well on a daily basis.
Bart Farrell: Okay. Since you did mention that you're studying for your PhD, as you said, that does take up a lot of time. But if you can just summarize in a short period of time, what are you getting your PhD in?
Radosław Miernik: So this is mostly about games — games not as in computer games, but game theory and simulation. You can read more on my blog if you want to. I don't want to dive very much into academics here.
Bart Farrell: Okay, fair enough. And how did you get into Cloud Native?
Radosław Miernik: So how it worked for me is that when I started working as a software developer in general, we were working with what we would today call self-hosting tools. We just had a whole application made up of 99% scripts that were copy-pasting source code, and then running mostly PM2, for years. Then the era of cloud hosting started, with all sorts of Heroku and things like that. And at some point, as we were growing as a company, we needed something more reliable, more scalable, more suited to our needs. We started working with Docker, and it was ECS for a long time, and then on someone else's Kubernetes.
Bart Farrell: Okay. And what were you before Cloud Native?
Radosław Miernik: So before, as I said, mostly PM2 and very homemade scripts here and there. It was Ansible for some time — and I don't know if that counts as cloud native, or if that's the right way to call it. But mostly PM2 and then Docker.
Bart Farrell: Okay. And, you know, the Kubernetes cloud-native ecosystem moves very quickly. What works best for you to stay up to date in terms of resources?
Radosław Miernik: So what I do is I have a very long list of subreddits subscribed via RSS, and I read them weekly. Plus, I have a very friendly and very up-to-date DevOps engineer at hand, and he keeps up with all of those things and pings me: hey, something new came up, something new is there. And if you really want the exec summary, I recommend signing up for the newsletters — or also via RSS — of the big platforms you're using. So, for example, AWS releases. This is quite nice because the title often indicates enough whether you want to read it or not.
Bart Farrell: If you could go back in time and share one career tip with your younger self, what would it be?
Radosław Miernik: First of all, don't hesitate to ask tough questions, even of people you are afraid of. So for example, if you are talking with your boss and you are having some problems, just do it. Or if you're stuck on some technical issue, just go to the person who may have the answer, even if you think they don't like you, and ask. Because it may be the case that it's not only you having the problem; it can be someone else having the problem as well. And if no one in your area knows, then file an issue on GitHub, or try Reddit or something else, and try to pick someone else's brain.
Bart Farrell: As part of our monthly content discovery, we found an article that you wrote titled Karpenter at Beekeeper by LumApps. Fun stories. So your team recently completed a migration from AWS ECS to Kubernetes. Before we get into the journey, what did your infrastructure look like before? And why had you originally chosen ECS with Fargate?
Radosław Miernik: So for us, ECS was a very easy way of moving our application from some proprietary platform to something less pinned to a single vendor. What we did is we started with creating a Docker image for our application, and we went for something that could take the Docker image for us and handle the autoscaling itself. ECS was there, and it worked for many years across the different projects we were working on — different people on the team were working on different projects in the past, obviously — and it served us quite well for a long time. So we created a couple of clusters: one cluster for production, a couple of clusters for the test servers, and separate clusters or separate tasks for sidecars. For example, we had some API gateways, some containers that synchronize data between databases, things like that. And then the application itself, of course — we configured the autoscaling rules, the networks, the basic stuff — and rolled it out. And it worked pretty well for quite a while. So we let it be, and we didn't touch it for, I think, at least a couple of years.
Bart Farrell: Spot instances saving 65%. Sounds great on paper, but I understand things weren't running smoothly. What started breaking down?
Radosław Miernik: So it did save a lot of money. If we looked into AWS pricing or the billing summary, it really showed that we, on average, saved 60 to 65% — sometimes 40%; it really depends on the time of the year. But the problem was that at some point, when we reached 20 or even more instances or containers that we wanted to run in production, we got an error from AWS. What it meant was: we don't have this capacity for you right now. It worked well, it was cheap, but there was no option to say, for example: okay, I want to use spot instances if they are there, but if they are not, I want to use something else. What you can do is configure a fixed number of non-spot instances and a fixed number of spot instances, plus autoscaling on both if you want to, but there's no built-in fallback mechanism. And that was really a deal-breaker for us.
Bart Farrell: Now, weekly outages despite trying everything within ECS, but production reliability wasn't actually your first reason for exploring Kubernetes. What initially pushed you in that direction?
Radosław Miernik: So we started with Kubernetes mostly because we were using GitHub Actions — and we still are to this day; I highly recommend it. We're using it for a lot of things, but most importantly, building the application Docker image, running all sorts of tests — end-to-end tests, unit tests, integration tests — and also some other important things. But the problem was with the end-to-end tests: when we hit a certain level of complexity — our end-to-end tests start the application, a Redis cluster, Elasticsearch, some sidecar tools next to that, and also the browser itself to run the tests — we ran out of resources. So what we had to do is switch to GitHub's larger hosted runners, and we did. And it was fine, as in it was working, but it became really expensive at some point. Mostly because the tests were often lagging — it felt like the containers that were running our end-to-end tests were very different in terms of resources. Sometimes you got a very fast CPU, sometimes a really slow one, and sometimes the end-to-end tests took 20 minutes, sometimes 60 minutes. Because of that, we really had to do something about it. On the other hand, when we tried to self-host the Actions runners, it worked really well, because we could select the node type, the CPU, and so on. But the problem with the self-hosted ones was that they were not autoscaling. So, for example, we had to pay for the weekends or the nights when nobody was working and no tests were running. So what we did is we set up GitHub ARC — I don't remember the full name, but it's an operator that self-hosts the GitHub Actions runners and then scales them depending on the number of tasks you have queued in your GitHub Actions.
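The setup Radek describes — GitHub ARC (the Actions Runner Controller) scaling self-hosted runners with the job queue — is typically configured through a runner scale set. A minimal sketch of a Helm values file for the `gha-runner-scale-set` chart; the organization URL, secret name, and resource sizes are illustrative placeholders, not aleno's actual values:

```yaml
# values.yaml for ARC's gha-runner-scale-set Helm chart (illustrative).
githubConfigUrl: "https://github.com/example-org"
githubConfigSecret: github-arc-secret   # PAT or GitHub App credentials
minRunners: 0                           # scale to zero on nights and weekends
maxRunners: 10                          # cap on concurrent CI jobs
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"                    # size this for your e2e stack
            memory: 8Gi
```

With `minRunners: 0`, idle weekends cost nothing; the controller only spins runners up while jobs are queued.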
Bart Farrell: So CICD runners were your way into the Kubernetes ecosystem. And once you saw the potential, what did the full production cluster architecture end up looking like?
Radosław Miernik: So what we did at first was the entire observability setup. We set up the basic tools like Grafana and Prometheus — Loki at some point as well — wired it up with the Nginx gateway and all of the tools around it, and then started with the basics. We started with the sidecars first — Redis for cache, some of the sidecars I mentioned earlier for connecting the data here and there — and then the actual Docker container with the application itself. And we did it in a way where we can easily replicate our environments, because at the moment we have six testing environments, plus staging, plus some more. We wanted to be able to play with the configuration between the environments: for example, some of them have more CPU, some have more RAM, some have a different autoscaling configuration. Based on that, we had our basis for the application to work. At some point, we tried production, and it was as easy as creating another environment for it. But once we did that, there was also a lot of tooling needed to make it happen. For example, we have Argo CD for the deployment of the application, and we have Flux for GitOps of the configuration — if you want to update Grafana or Prometheus, you also have to do it somehow, so we have Flux for that. And there are a lot of different tools in there: a Docker image cache, things like that.
Bart Farrell: You're running, and I would say this is probably the first time we've seen this on this podcast, you're running both Argo CD and Flux for GitOps, which might seem redundant and certainly a way to spark a lot of debates about preferences in the audience. But what's the reasoning behind using both?
Radosław Miernik: So the thing is that our team has a couple of people who are more versed in infrastructure in general. They're still sometimes afraid to touch certain things, like certain versions, but they are aware of the consequences and they know what to do if something goes wrong. On the other hand, we have people who are just working on the application, and they want to, for example, release a new version or change some configuration of the application. Those people have different needs. So for the first group, we have Flux. Flux is as easy as commit, push, and it works. Argo is the same, but in Argo you still have the UI. And in the UI, developers log in — actual developers, even front-end developers or full-stack developers — they go into Argo, they can see the configuration, they can see the phase of the rollout — for example, how many containers are there, and on which nodes — to debug things if needed. They can also go into the configuration of the service and change environment variables, for example. They can do it by themselves, without looking into the code, without understanding all of the YAML files underneath. They can do it in the UI, and it works for them.
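A hypothetical sketch of how such a split can look — a Flux `HelmRelease` for an infrastructure component next to an Argo CD `Application` for the app itself; the repository URL, chart version, and names are placeholders:

```yaml
# Flux side: the infra team upgrades Grafana by committing a version bump.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana
  namespace: monitoring
spec:
  interval: 10m
  chart:
    spec:
      chart: grafana
      version: "8.x"
      sourceRef:
        kind: HelmRepository
        name: grafana
---
# Argo CD side: the application, visible and editable through the UI.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: webapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/webapp-config
    path: environments/staging
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: webapp
  syncPolicy:
    automated:
      prune: true
```

Both tools reconcile from Git; the difference in practice is the audience — Flux stays commit-driven for the infra team, while Argo CD's UI gives developers rollout status and sync controls without touching the YAML.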
Bart Farrell: With this tooling in place, you had seven non-production environments plus production to migrate. How did you approach the rollout?
Radosław Miernik: So for the tests, it was easy, because we could switch off a test server, start it anew on Kubernetes, switch it on, switch it off on ECS, and then switch the DNS. It was easy because we don't really care about downtime on the test environments. But when we wanted to go for production, what we did is we set up both of them together, switched the DNS, and then in a way forced the old production to scale down instead of just turning it off. So we didn't remove the cluster; we scaled it down more and more, so the customers — the clients, the users — were reconnected to the Kubernetes cluster. And it worked quite well, until it didn't, because we made some typos and had some issues with the configuration. We had copied a test environment, and some of the environment variables were incorrect, and some of the autoscaling rules were incorrect — for example, the maximum number of containers was too low. Some of the configuration was also copied incorrectly: we made some typos in an environment variable name, which resulted in some containers doing double work, because they were configured as both a worker and an API gateway, for example. So there was a lot of hassle, but I would say that's kind of expected if you switch between platforms that different from each other. One more thing that became an issue, a couple of hours after we switched everything, was the firewall. On ECS we used the default — we didn't set up anything additionally — and here we went with, I think, ModSecurity in Nginx, and it was working quite well, it was really nice. But we went with the strictest possible settings, which turned out to cut off 5 to 10% of production traffic. Not all production traffic, but some of it. We didn't really see it, because we assumed there's always some noise.
We assumed we always have some requests that the firewall rejected — requests we didn't see before, because we didn't get those metrics. Now we saw them, so it seemed fine. But then it turned out that this was a third party failing to call us properly, because they had some obscure HTTP headers that ModSecurity rejected. And a couple of weeks later, we also realized that we had forgotten about the VPC peering configuration with our database. So we paid a couple of hundred extra for the traffic. But that's the price.
Bart Farrell: You know, non-production went smoothly. But what happened when you switched production?
Radosław Miernik: As I said, all of those things — configuration, performance, VPC — hit us at different rates. Overall, it took us a couple of weeks to finish and ease it off. But the actual switch took just 5 to 10 minutes to move all of the connected users from the ECS cluster to the Kubernetes cluster. Something I didn't say earlier that is important here is that we have a very strict distinction between containers. We have some containers that serve the API requests — GraphQL, REST API, that kind of stuff. Then we have some containers that run the cron jobs, and those are easy because they don't really get any incoming requests; they just talk to the database, execute some third-party calls, things like that. And then we also have WebSocket containers. Those were the hardest to migrate, because those are stateful. So whenever we scale down and people reconnect, it incurs additional cost on the containers that get the traffic, because they have to initialize the connection and fetch some data into memory — and those were impacted the most. From the API perspective, you just got a separate ECS cluster that happens to be on Kubernetes, so that was easy. But for the WebSocket ones, people were reconnected over and over for some time.
Bart Farrell: In ECS Fargate, you just specify CPU and memory abstractly. On Kubernetes, how did you figure out the instance types, given the performance problems you'd seen?
Radosław Miernik: So at first, with ECS, what we did is we tailored it to whatever we needed. For example, we saw that we never go to full CPU, so we went for, say, 0.9 CPU; and we saw that we never went over three gigabytes of RAM, so we went for three gigabytes of RAM instead of the maximum four-to-one ratio. And it didn't work, because it also affected the availability of the resources. What we saw is that if you maxed out the RAM given to a single CPU, it was much better in terms of availability of the resources from AWS's perspective. So at first we maxed out the ratio — we went for one CPU to four gigabytes of RAM — and it was fine, and then we copied it to Kubernetes. So we said: okay, a thousand millicores of CPU and four gigabytes of memory — 4000 Mi, or 4Gi, I don't remember which one it is in Kubernetes. That was our entry point. That's how we started with scaling everything, and then we were playing with: okay, do we want the burstable instances, do we want to go with the C-type instances, the M-type instances, all that stuff. It took us some time to play with it, and where we ended up is: we are using Karpenter now, so it's setting up the nodes automatically for us, and it's handling all of the mangling of what kind of instance it actually is. What was important for us is that we saw we don't really have to specify the class — the instance class or instance type, the M, R, C, whatever. What we do is require the Nitro CPU type. This is slightly broader — it's a certain generation of CPUs — and it's enough for us, because they are comparable in performance. So we don't really have to go for exactly a C7g or whatever instance; we go for a certain type, because we have certain RAM-to-CPU requirements, and then the Nitro requirement, and everything else is handled by Karpenter.
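Karpenter expresses constraints like these as requirements on a NodePool. A hedged sketch of the approach Radek describes — pinning a hypervisor/CPU generation rather than an exact instance type; the specific values here are illustrative, not aleno's configuration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # spot first, with on-demand fallback
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]               # only Nitro-based instances
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]                   # recent generations, comparable CPUs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

Leaving the instance family unconstrained gives Karpenter a larger pool to provision from, which also helps with the spot capacity problem that started the whole migration.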
Bart Farrell: Beyond the cutover issues, you made over a hundred configuration changes in the first month. What were the most critical things that you missed?
Radosław Miernik: So the security config, of course, environment variables, autoscaling. And autoscaling was haunting us for weeks, so we really tweaked it almost daily. Then there were also a couple of things regarding networking — for example, the maximum header size in HTTP requests. That popped up at a surprisingly late stage. What also changed quite often was what was configured in the service chart and what was configurable from the outside. For example, at the beginning, when we were moving to Kubernetes, we allowed all the environment variables to be set externally so we could configure them more easily. But at some point we tried to, for example, move them into the application chart, so it was impossible to start the application incorrectly, things like that. Or configuring external secrets, so we don't have to provide them as environment variables but can, for example, set them in AWS SSM and then reference them in the image or the service definition. So it was quite a lot of nitpicking that we wanted to do to make it look right and feel right. At the same time, it was a lot of work on the actual traffic shaping: adjusting the RAM, adjusting the CPU, adjusting the autoscaling, and then also things regarding Karpenter, like disallowing it from scaling down during peak hours, or from handling drifted resources during, for example, late evenings, because we saw that response times were peaking when it was consolidating nodes too aggressively. So a lot of tweaking of all the knobs we got — because previously we didn't get any, and now we had a lot of configuration to review at some point.
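The external-secrets pattern Radek mentions is commonly done with the External Secrets Operator. A minimal sketch, assuming a `ClusterSecretStore` already configured for AWS SSM Parameter Store; the store name, parameter path, and keys are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: webapp-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-parameter-store      # assumed to point at AWS SSM
  target:
    name: webapp-env               # the Kubernetes Secret that gets created
  data:
    - secretKey: DATABASE_URL      # key inside the generated Secret
      remoteRef:
        key: /webapp/production/database-url   # SSM parameter path
```

The pod then references `webapp-env` via `envFrom`, so the actual value never appears in the chart or in Git.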
Bart Farrell: And I know you mentioned autoscaling in here, but to take this a little bit further. One thing that stands out is using the wrong memory metric for autoscaling. What went wrong there?
Radosław Miernik: So the biggest problem was that we went with the default. We had the HPA configured with both RAM and CPU: the WebSocket containers were mostly memory-bound, so they scaled with the traffic, and the API ones were mostly CPU-bound, so they scaled with the number of requests. It may seem the same, but for the API it's about the number of requests, and for WebSockets it's more about the number of sessions. The problem was that the API doesn't really cause any issues with memory: you get the request, handle it, and get rid of it. And as we are using Node.js — a garbage-collected language, of course — it wasn't really a problem, because memory was allocated and deallocated all the time. But with the WebSockets, the memory usually accumulates because of the internal caches, but also disk caches. And those caches are technically evictable. We are using Linux as the operating system in those Docker images, so we were able to, for example, shell into the container — into the pod — and evict the caches, and we'd see that the memory dropped on the instances, and the HPA saw it, and that was nice. But the problem was that our application was scaled based on the memory it was using, except for the evictable caches — so it hit out-of-memory errors whenever it was close to the limit. It was also harder to debug, because it was much more visible when the nodes were full. If the nodes weren't full, there was some memory headroom you could go over, and it wasn't really killing anything. In the end, we had to switch — I don't remember the names — from the plain memory metric to the working set, or something like that. You can figure it out; it's quite a popular answer on Stack Overflow regarding those settings. Plus, you have it in the blog post if you want.
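For reference — not necessarily aleno's exact manifest — a memory-based HPA with the standard resource metric looks like the fragment below. The `memory` resource metric served by metrics-server is derived from `container_memory_working_set_bytes` (usage minus evictable page cache), which is also what the kernel's OOM accounting cares about:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: websocket
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: websocket
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Scales on working-set memory, not raw usage including caches.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75   # relative to the pod's memory request
```

Scaling on a metric that counts evictable caches differently from the OOM killer makes replica counts drift away from real memory pressure, which matches the symptom described above.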
Bart Farrell: Now, these are some pretty healthy infrastructure lessons. Your application runs on Meteor.js, which has specific characteristics. What Meteor specific considerations came into play?
Radosław Miernik: So the point is that a lot of traffic is wired through the WebSockets. The whole point of Meteor is that you're trying to reduce the amount of data sent over the wire. In normal applications, when you're thinking about real-time, you are either polling the server, or the server sends the updates to the client constantly. Here, the methodology is slightly different: the server keeps an image of what the client knows, and then it only sends the differences. And to be able to send the differences, the server has to know exactly what every client knows. So it keeps a lot of memory on the server to be able to calculate those differences against the dataset or the database, for example. This meant that all of those WebSocket containers were memory-bound, as I said. Also, the default in the Meteor community is that you try to have sticky sessions, because in case of disconnecting and reconnecting, you want to be reconnected to the same container as before — there's a high chance it still has the cache that was serving you. This was a problem for us whenever we scaled down, scaled up, or rolled out a new version, because then people were not spread out evenly. So we disabled the sticky sessions, and we also disabled the SockJS emulation, because we saw that raw WebSocket connections work better for us. SockJS is a great library — it served us for many, many years — but nowadays most browsers support WebSockets out of the box, so you can use them directly.
Bart Farrell: You also separated WebSocket containers from API containers entirely. Why was that?
Radosław Miernik: So because they had different bounds — one CPU-bound, the other memory-bound — we saw that it doesn't really make sense for us to autoscale all of them at once. Our API is mostly used by us, but it's also used by third parties, and that's a big chunk of our traffic, and they have different traffic characteristics than us. Our users wake up in the morning, work through the day, and wind down in the evening, but our third parties, for example, mostly synchronize in the night. So we split those into two separate deployments — the same code, the same container image, just different configuration — and on the gateway level, on the Nginx level, they have different traffic redirected to them. Certain paths — for example, slash API — all go to the API containers, and, for example, slash WebSockets goes to the WebSocket containers. Those are handled by two separate groups: they have separate autoscaling rules, different memory and CPU limits, different environment variables, all that stuff.
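That path-based split can be sketched with a standard Nginx Ingress; the host name, paths, and service names are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    # Keep long-lived WebSocket connections from being timed out by the proxy.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api          # CPU-bound deployment, scales on requests
                port:
                  number: 80
          - path: /websocket
            pathType: Prefix
            backend:
              service:
                name: websocket    # memory-bound deployment, scales on sessions
                port:
                  number: 80
```

Same image behind both services — only the environment variables, resource limits, and HPA rules differ.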
Bart Farrell: With WebSocket containers being memory bound, you ran into an interesting memory management challenge. What happened and how did you solve it?
Radosław Miernik: So if you have an application, there's a high chance there is something leaking. The funny thing is that on ECS it was never a problem, because when we were using ECS with spot instances, our containers were quite often rotated by ECS itself — they never lived for more than a day, or maybe three days. Then, when we moved to Kubernetes — even though we were using Karpenter to still use spot instances, for example — we didn't really get rerolled that often. That was because the nodes we were using were not, I think, that high in demand. You can see it on the AWS website: for each instance type, you can see the probability of being evicted within a certain time frame. It also depends very strongly on the region; in some regions, some instance types are not that highly requested. And when we switched to Kubernetes, our containers — with the same configuration and, to some extent, the same instances underneath — lived for weeks. This was a problem, because even though we released quite often, weekends happened. During a weekend, the application would run for four or five days, and the memory was leaking somewhere. We didn't really have that much time to work on it, so what we looked into was a different memory allocator. We switched to jemalloc — however you pronounce it — and it really helped a lot. It allowed us not only to get rid of the visible leak — right now it can run for a couple of weeks without getting restarted — but it also improved the autoscaling, because the default glibc malloc that we had there wasn't really that keen on giving the memory back. So the horizontal pod autoscaler wasn't really that keen to scale down in the evening, and we had to scale down forcefully, or on CPU, or time-based, which wasn't ideal.
So we switched to a different memory allocator — jemalloc. It is still jemalloc, and we are considering mimalloc right now, or maybe tcmalloc, we'll see about that. Now the memory graph looks the way you expect it to: an actual slope over the day, not just constant growth.
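Swapping the allocator for a Node.js container is typically done with `LD_PRELOAD` rather than recompiling. A sketch for a Debian-based image — the library path varies by distribution and architecture, so treat it as an assumption to verify:

```dockerfile
FROM node:20-bookworm-slim

# jemalloc from the distro; path below is for x86_64 Debian bookworm.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libjemalloc2 \
    && rm -rf /var/lib/apt/lists/*

# Preload jemalloc so Node's native allocations bypass glibc malloc.
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

WORKDIR /app
COPY . .
CMD ["node", "main.js"]
```

Because jemalloc returns freed pages to the OS more eagerly than glibc malloc, the container's working set tracks actual demand — which is what lets the HPA scale down in the evening.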
Bart Farrell: A 20% memory reduction from changing allocators. So looking at the full journey, how long did this take, and how did the phases break down?
Radosław Miernik: So I would say that the pre-application phase — the CI runners, the GitHub ARC operator — took roughly a month, maybe slightly more, because it was all of the initial setup: creating the EKS cluster, configuration, networking, Grafana, Prometheus, all that stuff. That took, I think, the longest hour-wise. Then the non-production environments were slightly faster, but it was still roughly a month, because we had a lot of experimentation, some performance testing and configuration tests, and we also had to take care of the domains and all that. Then maybe a month as well for production — the switch itself, as I said, took one morning and it was done, but then there were those tweaks regarding security, ModSecurity, networking, configuration, autoscaling, all that stuff. And then months more of tweaking — we tweak it to this day, that is true, we really do. Especially whenever we update some tools — Grafana or something else — we have to tweak those settings, because, for example, a different version of a certain tool reserves a different amount of CPU by default, and because we had all of the nodes maxed out, suddenly we have a lot of empty space there. So then we have to tweak it by reducing the requested CPU by 20 millicores or something like that, because we are trying to max out the instances on the memory side as well. So that's a lot of time if you look at it — that's months, really months — but at the same time, except for the beginning, it was an hour a day, maybe two hours every two days. It wasn't four months of every-single-day work by a single person.
Bart Farrell: Four months is a significant investment. Obviously not as much as a PhD, but still it's a good amount of time for the people involved. What were the concrete results?
Radosław Miernik: Overall, I'm really happy with the fact that it is cheaper. It was cheaper back then; obviously we have scaled up since, so it is no longer cheaper in absolute terms, but it would be even more expensive on ECS. I would say the cost reduction was really visible, and it was mostly not because spot instances are cheaper in raw terms than ECS, but because we were able to overlap some of them. For example, we can have a single instance with four CPUs and, instead of hosting four application instances on it, host five. Each can be assigned a lower number of CPUs, like 0.5 or 0.6, but if they need to spike to 1.5 or even 2 CPUs, they can do it without interrupting anyone else. That works well for us because our traffic is very unpredictable, and Karpenter is able to squish all of it into nodes. I really like the distinction between requests and limits in Kubernetes, and we depend on it quite heavily. That's what allows us to share the leftover CPU.
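The requests-versus-limits overcommit described here maps directly onto a deployment's resource block. A sketch of the pattern — all names, images, and values are illustrative, not aleno's actual configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app          # placeholder name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:latest   # placeholder image
          resources:
            requests:
              cpu: 500m        # what the scheduler reserves: 5 pods fit on a 4-CPU node
              memory: 512Mi
            limits:
              cpu: "2"         # burst ceiling: idle CPU from neighbors can be borrowed
              memory: 1Gi
```

With requests of 500m, five replicas reserve only 2.5 of the node's 4 CPUs, yet any single pod can still burst to 2 CPUs when traffic spikes, which is what makes the bin-packing cheaper than one-container-per-CPU on ECS.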
Bart Farrell: For teams that are considering a similar migration, what's your biggest piece of advice?
Radosław Miernik: Take your time. Really take your time. I know it's really promising, and I know it is also costly to have both of them running at the same time; from our perspective, it meant doubling the costs for two months because we had the old setup running in the background. So it is a cost, but I really wouldn't rush it. And if you have any sort of tests — end-to-end tests, integration tests, load tests — run everything you have, because there can be things that you missed, or the performance characteristics can be different. Maybe you don't need that many containers, or you need more because you're using a different CPU type; for example, you're moving from a burstable CPU type to something else. So take your time and test it out. Also, while doing the actual switch, it's much safer not to scale down the previous version that you had, whether it was ECS, Fargate, or anything else. Let it be, accept the cost, bite the bullet, let it live for a day, maybe even a week, and move the users to the new one slowly. They can take their time being switched to the new instance, and then you have more time to fix issues only for those who switched instead of for all of them at once. It's also much easier to deal with five customers calling you rather than 500 customers calling you because you made some infrastructure change.
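The gradual user migration described above is often implemented with weighted DNS, or at the application edge with sticky per-user bucketing so the same customer always lands on the same backend. A hypothetical sketch of the bucketing idea — the function name and backend labels are made up for illustration:

```python
import hashlib

def pick_backend(user_id: str, new_weight: int = 10) -> str:
    """Deterministically route a user: roughly new_weight% of users go to
    the new cluster, the rest stay on the old one. Hashing the user ID
    makes the choice sticky, so a customer never flips back and forth."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "new-eks" if bucket < new_weight else "old-ecs"
```

Ramping up is then just raising `new_weight` from 10 toward 100, and any issue found early affects only the small cohort already switched.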
Bart Farrell: Radek, what's next for you?
Radosław Miernik: What's next for me? First of all, I have to finish my PhD this year, so that's one thing. Then we have a couple of topics regarding this whole Kubernetes cluster. For example, we want to look into Pyroscope from Grafana Labs — into APM and actual live profiling of the application in Grafana — because we have a different tool for that and we would like to integrate it there. I'm also interested in looking, at some point, into the tools that can monitor the cost of the cluster in real time. Right now we go to AWS and can see the costs based on a day or two, but we don't have a granular breakdown of, for example, how much CI cost us, how much production cost us, and how much went to the test environments. There are tools that can track it based on the nodes you are reserving via Karpenter, so we would also like to set that up and see how it goes.
Bart Farrell: Okay. And if people want to get in touch with you, what's the best way to do that?
Radosław Miernik: First of all, my email is both on my website and on my GitHub. If you go to either of those — radekmie.dev or radekmie on GitHub — you can copy the email if you want to reach out. I don't have LinkedIn; there is another person with the same first and last name, but that is not me. So email, preferably. I'm also on a couple of Slacks and Discords, but those are more niche: if you are from one of those communities, you know I'm there. If not, then email.
Bart Farrell: Yep, perfect. Email is what worked for us, so I can say it definitely functions. Thanks so much for sharing your time and your expertise with us. Best of luck on the PhD. Look forward to hearing more about that when that finishes, and hope our paths cross in the future. Take care.
Radosław Miernik: Yep. Thanks for having me. And take care.
