How to Keep Kubernetes Infrastructure Running When You Lose Your SRE

Losing a Site Reliability Engineer (SRE) can be a serious challenge for organizations relying on Kubernetes. SREs are crucial for maintaining the reliability and performance of Kubernetes environments, ensuring that applications are easy to deploy and scale. If your organization finds itself in this situation due to layoffs or when SREs leave for a new opportunity, here are some steps you can take to keep your Kubernetes infrastructure running effectively, both in the immediate aftermath of the change and long term.

1. Assess Current State & Documentation

The first step is to assess the current state of your Kubernetes environment and review any existing documentation. If your SRE is still available, ask them to document and share their knowledge with a broader team before they leave. That group might include someone on the DevOps team, cloud or platform team, infrastructure team, or potentially the development team. Depending on the size of your organization and its technology teams, you might have someone who can fill the SRE’s shoes in the short term. The goal is to make sure someone understands the cluster configurations, deployed applications, and ongoing maintenance tasks. If documentation isn’t available, you may need to audit the environment to get the lay of the land, because you will need this information to keep things running smoothly.
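If you do end up auditing the environment yourself, a small script can turn raw cluster output into a readable inventory. The sketch below is one possible starting point, not a complete audit: it assumes you have saved the output of `kubectl get deployments --all-namespaces -o json` and parsed it into a dictionary, and the function name `summarize_deployments` is purely illustrative.

```python
from collections import defaultdict


def summarize_deployments(deployment_list: dict) -> dict:
    """Group Deployments by namespace and list the container images each one runs.

    `deployment_list` is assumed to be the parsed JSON produced by
    `kubectl get deployments --all-namespaces -o json`.
    """
    inventory = defaultdict(dict)
    for item in deployment_list.get("items", []):
        meta = item.get("metadata", {})
        namespace = meta.get("namespace", "default")
        name = meta.get("name", "<unnamed>")
        # Container definitions live under spec.template.spec.containers.
        containers = (
            item.get("spec", {})
            .get("template", {})
            .get("spec", {})
            .get("containers", [])
        )
        inventory[namespace][name] = sorted(
            c.get("image", "<unknown>") for c in containers
        )
    return dict(inventory)
```

To use it, load the saved file with `json.load` and pass the result to the function. The output gives a new owner a quick view of which workloads exist, where they live, and which image versions they run. It covers Deployments only, so StatefulSets, DaemonSets, and CronJobs would need the same treatment.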

2. Empower Developers

If your SRE is no longer with the company, you might want to consider tasking developers with handling some Kubernetes-related tasks in the short term, with clear guardrails and runbooks so this doesn’t become unstructured, ad‑hoc work. This isn’t ideal, because of the fundamental difference in skill sets between developers and SREs. It also diverts developers from their core responsibilities, and their lack of specialized knowledge can increase the risk of errors, overload, and burnout.

In the short term, though, you can ask developers who have familiarity with containerization and Kubernetes concepts whether they can help out. While developers likely do not have the same level of expertise in infrastructure as an SRE, they can still perform some routine tasks, such as checking for updates, monitoring resource utilization, and verifying that applications are deployed correctly. Long term, though, this work belongs in a dedicated platform/SRE function or with a managed Kubernetes provider, not as “extra” responsibilities bolted onto application teams.
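A routine check like verifying that applications are deployed correctly is easier to hand to developers when it is scripted. Here is a minimal sketch, assuming the same parsed output of `kubectl get deployments --all-namespaces -o json` as input; the function name `find_unhealthy_deployments` is hypothetical, and "unhealthy" here means only that fewer replicas are ready than were requested.

```python
def find_unhealthy_deployments(deployment_list: dict) -> list:
    """Return (namespace, name, ready, desired) for under-replicated Deployments."""
    unhealthy = []
    for item in deployment_list.get("items", []):
        meta = item.get("metadata", {})
        # spec.replicas defaults to 1 when it is not set explicitly.
        desired = item.get("spec", {}).get("replicas", 1)
        # status.readyReplicas is omitted from the output when it is zero.
        ready = item.get("status", {}).get("readyReplicas", 0)
        if ready < desired:
            unhealthy.append(
                (
                    meta.get("namespace", "default"),
                    meta.get("name", "<unnamed>"),
                    ready,
                    desired,
                )
            )
    return unhealthy
```

An empty list means every Deployment has the replicas it asked for. A non-empty list is a prompt to follow the runbook, not a diagnosis, because this check says nothing about why the pods are not ready.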

3. Implement Automation

Ideally, your SRE already had automation in place, because automation significantly aids in maintaining a Kubernetes environment and minimizes the need for manual intervention. Infrastructure as Code (IaC) and GitOps tools, such as Terraform for provisioning and Argo CD for deployments, can help automate cluster configurations and application rollouts, reducing manual updates and the risk of errors and misconfigurations. By automating repetitive tasks, you can ensure consistency and reliability across your infrastructure even after your SRE leaves.
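The core idea behind this kind of automation is comparing what you declared with what is actually running. The sketch below illustrates that comparison in a simplified form. It is not how Terraform or Argo CD are implemented, and the function name `detect_drift` and the name-to-image mapping are assumptions made for the example.

```python
def detect_drift(desired: dict, live: dict) -> dict:
    """Compare desired and live workload images and report every difference.

    Both arguments map a workload name to its container image. A value of
    None in the result means the workload is missing on that side.
    """
    drift = {}
    for name in sorted(set(desired) | set(live)):
        want = desired.get(name)
        have = live.get(name)
        if want != have:
            drift[name] = {"desired": want, "live": have}
    return drift
```

In practice, the desired side would come from the manifests in your Git repository and the live side from the cluster. Tools like Argo CD run this sort of comparison continuously and can reconcile the differences for you, which is why having them in place matters so much once the person who used to catch drift by hand is gone.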

4. Keep Up with Cluster Updates & Security Patches

One immediate challenge after losing an SRE is keeping up with cluster updates and security patches. This can be quite time-consuming, particularly for those who aren’t as familiar with the ins and outs of Kubernetes management. The reality is, though, that you need to keep up with these tasks because falling behind can result in security vulnerabilities and performance issues.

New high and critical Kubernetes CVEs were disclosed throughout 2024 and 2025, and that pattern is likely to continue into 2026, which makes staying current on patches and Kubernetes/EKS versions non‑negotiable. Regularly review and apply updates to ensure your infrastructure remains secure and you’re using supported versions of Kubernetes, add-ons, and APIs. If your clusters run on Amazon EKS, you also need to stay aligned with AWS’s Kubernetes version support windows and upgrade guidance so you don’t end up running on deprecated control plane or node versions.
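One way to make "stay on a supported version" concrete is to check how far your cluster lags behind the newest release. The sketch below assumes the upstream Kubernetes practice of maintaining the three most recent minor releases. Managed services such as Amazon EKS publish their own support windows, so treat `supported_minors` as a parameter you set from your provider's documentation. The function names are illustrative, and you supply the latest version yourself.

```python
import re


def parse_minor(version: str) -> tuple:
    """Extract (major, minor) from strings like 'v1.29.3' or '1.29.3-eks-abc123'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if not match:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_within_support_window(
    cluster_version: str, latest_version: str, supported_minors: int = 3
) -> bool:
    """Return True if the cluster is one of the newest `supported_minors` minors."""
    cluster_major, cluster_minor = parse_minor(cluster_version)
    latest_major, latest_minor = parse_minor(latest_version)
    if cluster_major != latest_major:
        return False
    # A lag of 0 means the cluster is on the latest minor release.
    lag = latest_minor - cluster_minor
    return lag < supported_minors
```

You can read the cluster version from the `Server Version` line of `kubectl version`. Running a check like this on a schedule gives whoever inherits the cluster an early warning before a version falls out of support, rather than after.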

5. Communicate with Stakeholders

If customers or internal stakeholders are invested in your use of Kubernetes, you need to be transparent about any changes to your infrastructure or challenges you’re experiencing. Communicating openly about your strategy for managing the infrastructure can help build trust and ensure you’re managing expectations effectively. Some organizations transition to different infrastructure when they no longer have the in-house expertise to manage Kubernetes effectively, so be up front about what’s happening and what you think are the right next steps to ensure your organization’s applications and services are available, scalable, and secure. You don’t want to put your organization’s reputation at risk because you can’t maintain your infrastructure.

6. Look for a Long-Term Solution

If you only have one SRE in house managing your infrastructure, it can be really stressful if they’re laid off or move to another opportunity. You’ll need to decide pretty quickly whether you want to hire a new SRE (in which case, start advertising the moment you can), train existing team members to develop the necessary Kubernetes skills (also something you’ll want to do quickly), move to a different platform, or bring in a managed Kubernetes‑as‑a‑Service provider to keep your infrastructure running smoothly.

With the 2025–2026 tech labor market swinging between layoffs and hiring freezes on one hand and very narrow, senior "unicorn" role requirements on the other, replacing a senior SRE quickly is often unrealistic. That makes cross‑training and managed services even more important as part of your resilience plan. Evaluate your organizational structure and resource allocation to make sure you are as prepared as possible for future staff changes.

Consider Managed Kubernetes-as-a-Service

If you’re having trouble maintaining your Kubernetes environment in‑house, consider transitioning to a Managed Kubernetes‑as‑a‑Service provider. These providers have the experience and expertise to handle complex tasks, including cluster setup, Kubernetes updates, maintaining add-ons and APIs, and configuring your K8s platform for optimal scalability and efficiency. This allows your tech team to focus on the jobs they were hired to do instead of trying to figure out infrastructure management.

Managed Kubernetes‑as‑a‑Service also provides you with immediate access to Kubernetes expertise, which is particularly valuable during periods of staff transition, and it is increasingly a strategic choice rather than just a stopgap when someone leaves. Fairwinds provides the experience you need to keep everything running smoothly, continuously monitors for new Kubernetes CVEs and version changes, and handles upgrades and patching so you’re not relying on a single SRE to keep up with a rapidly moving ecosystem.

Benefits of Managed KaaS

Expertise: Provides access to extensive Kubernetes experience, reducing the need for in-house training and talent.
Scalability: Handles add-on updates and Kubernetes upgrades on an ongoing basis, ensuring high availability and performance without requiring you to research and implement them in-house.
Cost Efficiency: Reduces operational costs by offloading infrastructure management without requiring additional headcount.
Security: Reduces the risk of breaches and keeps your infrastructure running smoothly by ensuring adherence to security best practices.

Keep Your Infrastructure Humming

Losing a critical team member is always difficult, and losing an SRE is especially so, because SREs often have little in the way of backup, particularly at smaller organizations. There are a few ways you can keep everything running smoothly, though, including assessing and documenting your environment, empowering your developers, using automation as much as possible, keeping on top of updates and patches, communicating with stakeholders, and planning for the future.

If you need more specific or immediate help, Managed Kubernetes‑as‑a‑Service from Fairwinds offers expertise, scalability, and cost efficiency, so you can focus on your core business objectives without worrying about how to maintain a secure and efficient infrastructure.

This post was originally published on March 14, 2025 and updated to reflect new information.