Welcome to our deep dive into the world of Kubernetes, where we share some of the top lessons our site reliability engineers (SREs) have learned from years of managing this complex yet essential cloud-native technology. During a recent Kubernetes Clinic webinar, SRE Brian Bensky joined me, and we talked through our experience managing K8s for clients and helping them go beyond just running clusters to using Kubernetes as a platform for running applications successfully. Let’s walk through these lessons learned to help anyone navigating Kubernetes.
One of the first lessons highlighted in our session was the importance of determining whether Kubernetes is the right fit for your organization. Kubernetes offers significant advantages in scaling and managing containers efficiently, but it also carries substantial management overhead. For smaller-scale applications, there are alternatives that can provide the simplicity and cost-effectiveness you need without that overhead.
All of the major cloud providers offer ways to run containerized applications at a basic level quite easily without the overhead of Kubernetes, such as AWS Fargate, Azure Container Apps, and Google Cloud Run. Until you have substantial scaling needs and many containers and microservices working together, start with something like that and get your feet wet with running containers and the concepts around them. Keep in mind that while all the different tools are important, Kubernetes itself will always take the shape of the organization that it's in!
The Kubernetes ecosystem is vast and can be overwhelming due to its complexity and the number of plugins and add-ons available. When you’re getting started, look at exactly what your apps are doing. What are the key capabilities you need to have? What do you need to focus on? Often what you need to get started is a lot simpler than you think.
Sometimes we hear about teams thinking that they need service meshes right away — and they might — but chances are that they actually don’t. Start by focusing on the core things you’re trying to do and get really good at those things before you add more in. There’s no need to set up tooling that’s far beyond what your organization needs. Simple is easier to understand and update, and it makes it easier to ensure that anyone who needs to interact with it has the necessary skills.
Every company has different requirements. If you’re using the cloud, you can lean on managed cloud services for your databases, message queues, and so on. And remember, it takes a lot of experience to be a good database administrator, and even more to run databases well in a containerized environment. Keep your hands-on work requirements simple, and over time you can bring more complexity in-house as needed.
Perhaps predictably, the best place to learn about all the new things happening is the Kubernetes project itself. They have excellent change logs and release notes as new versions come out, which document all of the changes and deprecations that you need to keep up with. Kubernetes is very fast-moving; it releases a new minor version every four months, and within these releases, there could be multiple resource and API deprecations. Many other venues exist for keeping up with the people, projects, and communities who use and shape Kubernetes; these can be informative as well. A favorite of mine is the Google Kubernetes podcast.
Kubernetes is constantly introducing new alpha and beta features across different parts of the project. Being aware of these changes can help you understand what's coming on the roadmap, so you know what to look at, what to try out, and what a change might mean for the ecosystem. Kubernetes is fast-moving, and it does not have a long-term support version, so you need a plan to ensure your org can keep up with the upgrades.
In the same way Kubernetes is in a never-ending cycle of upgrades and changes, the tooling and components surrounding the ecosystem change too. For example, some popular add-ons you might use to extend capabilities are Datadog for monitoring, Argo CD for managing deployments, or an autoscaler like Karpenter. These are complex tools that each have their own update cycle. Try to keep add-ons and other tooling as up to date as possible so that most changes stay small, maybe a patch or a minor version. Staying on top of versioning makes a lot of potential problems go away, keeps you informed about what's coming soon, and frees you to focus on keeping your app running well. It is possible to keep things going smoothly with a fairly small crew of folks who know how Kubernetes works in your organization.
Certifications can be helpful for ensuring your team understands Kubernetes well. For example, the Certified Kubernetes Administrator (CKA) exam is very hands-on. The KodeKloud CKA course will give you a solid understanding of the ins and outs of how Kubernetes works, why it does what it does, and what it's built on top of, as well as fluency with Kubernetes commands. These training resources have been very valuable to us as well!
A lot of people we speak to experience issues with over-provisioning workloads, where they end up utilizing just 20% to 30% of the infrastructure they have provisioned. We’ve seen it a lot too, so much so that we wrote a tool to help with monitoring your application’s resource usage over time to help you define reasonable resource requests and limits to hit that sweet spot between reliability and cost efficiency. That open source tool is called Goldilocks, because it’s all about getting your resource utilization just right.
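As a minimal sketch of what right-sizing looks like once you know your usage, here is a Deployment fragment with explicit requests and limits. The app name, image, and values are hypothetical; a tool like Goldilocks can suggest numbers based on your own observed usage.

```yaml
# Illustrative Deployment with right-sized requests and limits.
# The name, image, and values are hypothetical examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: example-api
          image: registry.example.com/example-api:1.2.3
          resources:
            requests:
              cpu: 250m       # what the scheduler reserves for the pod
              memory: 256Mi
            limits:
              cpu: 500m       # ceiling before CPU throttling kicks in
              memory: 512Mi   # exceeding this gets the container OOM-killed
```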
Until recently, scaling was based on CPU and memory, but with newer tools (such as KEDA, which provides event-driven autoscaling) we're starting to be able to define our own metrics for scaling. If your app is very CPU- and memory-efficient but starts to get bogged down by a high volume of requests, there are ways to set up scaling based on the number of requests over time. You may even be able to control scaling based on particular endpoints. If you’re managing a fully distributed application, you can scale up only the microservices that are getting hammered.
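As a rough sketch of request-based scaling, here is what a KEDA ScaledObject might look like with a Prometheus trigger. It assumes KEDA and Prometheus are already installed; the Deployment name, Prometheus address, query, and threshold are all illustrative assumptions.

```yaml
# Hypothetical KEDA ScaledObject: scale on request rate instead of CPU/memory.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-api-requests
spec:
  scaleTargetRef:
    name: example-api              # Deployment to scale (assumed to exist)
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(rate(http_requests_total{app="example-api"}[2m]))
        threshold: "100"           # aim for roughly 100 requests/second per replica
```

The same pattern applies to other signals KEDA can read, such as queue depth, so each microservice can scale on the metric that actually reflects its load.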
The promise of the cloud was that you’d only pay for what you needed. Sounds great! But the challenge is figuring out what you need in order to know what to pay for. A tool like Goldilocks can also help with that. You can use Goldilocks to review historical usage; it will give you an idea of what your requests and limits should be at the CPU and memory levels. Fairwinds Insights can also do that and provide you with benchmarks for quality of service to help you manage costs more effectively.
Policy enforcement (such as ResourceQuotas) can also help you by making sure pods in a given namespace can’t use more than X amount of memory or CPU so people don’t overload clusters. You can also use policy to ensure that everything has requests and limits set, including your deployments and daemon sets. Otherwise, you may see big performance issues, which can increase your cloud bill, too.
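As an illustration, a ResourceQuota like the one below caps the total compute a namespace can request; the namespace name and numbers are hypothetical. Once a compute quota is active, pods in that namespace must declare requests and limits (or inherit defaults from a LimitRange) or they will be rejected, which reinforces the policy.

```yaml
# Hypothetical ResourceQuota: caps total CPU/memory for all pods in team-a.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-compute
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"       # sum of all pod CPU requests in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"         # sum of all pod CPU limits
    limits.memory: 40Gi
```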
Tools like Goldilocks can help you understand how much memory and CPU your pods are using. Then there are other tools, including Fairwinds Insights, that can translate that information into actual compute costs based on your cloud provider and the instance you're using for your nodes. This will give you a solid sense of what everything costs or what a specific workload costs. There are ways to separate workloads to identify which team is using a given workload and how much it is costing. Once you start to rein in the costs, the natural progression is to discover who is spending what within that overall spend and whether that cost makes sense.
Depending on how large your cluster is, it can be difficult to pinpoint Kubernetes costs. The easiest and most important thing you can do is to have a really good labeling policy for workloads, namespaces, and everything else. Separate teams into their own namespaces and then have labels for what team is responsible for each workload and who deployed it.
Once you put cost center codes in place, discovering exactly what's running becomes much, much easier. It also gives you a foundation for running other tools that control spend or identify who's using the most resources. This is a great way to find phantom deployments that are sitting out there, eating up resources but haven't been used in months. Cost center codes are also excellent for forecasting. If you're trying to prepare a budget and figure out what you're going to spend for the next year, this approach makes it much easier to calculate your cloud spend efficiently.
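A labeling convention might look something like the sketch below; the label keys (team, cost-center, owner) and values are assumptions, and the important part is agreeing on one standard and applying it consistently to namespaces, workloads, and pod templates.

```yaml
# Hypothetical labeling convention applied to a namespace and a workload.
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    team: payments
    cost-center: cc-1234
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: team-payments
  labels:
    app: checkout-api
    team: payments
    cost-center: cc-1234
    owner: payments-oncall     # who deployed it and who is responsible
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:                  # repeat labels on pods so cost tools can see them
        app: checkout-api
        team: payments
        cost-center: cc-1234
    spec:
      containers:
        - name: checkout-api
          image: registry.example.com/checkout-api:2.0.1
```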
Often, platform teams spend a lot of time providing Kubernetes support to developers. What better developer self-service looks like differs in every organization, depending on where you are in your Kubernetes journey. There are a lot of great open source applications that can help, like KubeApps. KubeApps provides a user interface that acts as a marketplace of deployments: people can click through, provision an application, and get it running with Ingress, with everything set up for them. This helps developers get things up and running without needing to touch Kubernetes through the command line at all, or even understand how it's running behind the scenes.
Another way to enable self-service for developers on your platform is to set up good RBAC policies for teams and individuals that limit them to one, two, or three namespaces and cut off the ability to create or delete certain types of resources. This ensures they can't get themselves into trouble. The most successful teams have self-service platforms that give developers the ability to do things and be flexible without needing someone with Kubernetes expertise to hold their hands.
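Here is a rough sketch of what that kind of RBAC boundary can look like: a Role scoped to a single namespace, bound to a team's group from your identity provider. The group name, namespace, and resource list are assumptions to adapt to your own setup.

```yaml
# Hypothetical namespace-scoped Role and RoleBinding for a development team.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "services", "configmaps", "deployments", "jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developer
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers        # mapped from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
```

Because there is no cluster-wide binding, the team has no access outside its own namespace, and resource types not listed in the Role stay off limits.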
There are many layers to security, beginning with the application and the container it runs in and extending all the way out to deployment. Start with a tool that runs at the application level, something like SonarQube, to check for simple things that often get overlooked. At the container level, you want a tool like Trivy, which is excellent for finding Common Vulnerabilities and Exposures (CVEs) so you can keep those patched as much as possible.
Once you’re in the cluster, the easiest win for security is to make sure you have properly configured your network policies. This will help you make sure that the applications are only able to communicate with approved entities within the cluster or a set of external endpoints. If you don’t have anything set up, a newly deployed application can access any other workload in your cluster, which can lead to serious security issues. This approach can start you moving towards a zero trust model.
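A common starting point is a default-deny ingress policy per namespace, plus explicit allowances for the traffic you expect; the namespace, labels, and port below are hypothetical. Keep in mind that NetworkPolicies are only enforced if your CNI plugin supports them.

```yaml
# Hypothetical NetworkPolicies: deny all ingress by default, then allow
# only the frontend to reach the API on its service port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes: ["Ingress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: example-api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```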
Managing Kubernetes effectively requires a blend of technical skills, strategic thinking, and ongoing learning. By starting simple, focusing on core needs, and gradually adopting more complex tools and practices, organizations can harness the full power of Kubernetes to drive innovation and efficiency in their operations.
At Fairwinds, we build and manage secure, reliable, and efficient infrastructure for our clients to deploy on. We also manage and handle pager coverage at the infrastructure level. This approach enables our clients to focus more on their business and get more sleep than they probably would if they had to answer infrastructure pages.
Want to focus on your apps and services, not your infrastructure? Fairwinds can help. Learn about managed Kubernetes-as-a-Service.