If you are managing a team of engineers using Kubernetes, how do you know what’s really happening in your clusters? That’s a problem we hear about a lot.
It’s a problem not often addressed, but it can lead to wasted time and money along with broken apps.
We talk a lot about the importance of liveness probes. Most engineers won’t be proactive about setting them until there is actual downtime (and even then it might not be a priority!).
At my previous company, an application went down one day. The developer assumed Kubernetes itself was broken, but we didn’t know what was actually happening. After a couple of hours, we realized that something inside the container had crashed but hadn’t brought down the entire app: the main process didn’t die, so Kubernetes never tried to restart it. If a liveness probe had been set, this never would have happened. And while an individual engineer may not see the need to fix a configuration that’s missing a liveness probe, if you are managing a team, simply writing a policy that says “set liveness probes” won’t keep you out of hot water.
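For reference, here’s a minimal sketch of what that fix looks like. The container name, image and health endpoint are hypothetical, and the thresholds are just reasonable starting points, not what we ran in that incident:

```yaml
# Hypothetical example: an HTTP liveness probe on a Deployment's container.
# If /healthz stops answering, the kubelet restarts the container instead of
# leaving a half-dead process running behind a still-alive main process.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                              # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example.com/my-app:1.2.3   # hypothetical image
          livenessProbe:
            httpGet:
              path: /healthz                # assumes the app exposes a health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
```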
In another example, I got an angry message from a colleague about an application that worked fine locally, but once it was deployed to Kubernetes, response times went from half a second to 20 seconds. The problem was that no resource requests had been set, and the node that pod landed on happened to have a noisy neighbor. Once a CPU request was set on the pod, performance went back to normal. Even if you have a policy telling engineers to set resource requests, it’s hard to track down who is actually following it.
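The fix itself is small. Here’s roughly what it looks like on the container spec; the names and values are illustrative, not what we actually used:

```yaml
# Illustrative only: requests tell the scheduler how much CPU and memory to
# reserve for the pod, which keeps noisy neighbors from starving it.
containers:
  - name: my-app                      # hypothetical container name
    image: example.com/my-app:1.2.3
    resources:
      requests:
        cpu: 500m                     # half a core reserved on the node
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```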
Engineering leaders need visibility into which clusters have liveness probes set, where resource requests are missing, and which clusters need to be fixed and how.
No company can afford to waste money, yet overprovisioning clusters is extremely common. With requests set too high, you burn money on memory and CPU that are never used. I’ve seen cases where a single overprovisioned workload was wasting up to $100 per month. If that’s happening across 30 workloads, $3,000 a month can easily go to waste. My brother was really fond of the expression “Pay attention to the ounces, and the pounds take care of themselves.” Most clusters probably don’t have a single workload wasting that much, but there are enough small adjustments to be made here and there that they add up to a lot.
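As a rough picture of what overprovisioning looks like in practice (the workload and numbers below are made up for illustration):

```yaml
# Made-up example: a container reserving 4 CPUs and 8Gi of memory while its
# actual usage peaks around 500m and 1Gi. The difference is paid for around
# the clock, whether it's used or not.
containers:
  - name: report-worker               # hypothetical workload
    image: example.com/report-worker:2.0
    resources:
      requests:
        cpu: "4"                      # observed usage peaks near 500m
        memory: 8Gi                   # observed usage peaks near 1Gi
```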
Gaining multi-cluster visibility into what workloads are costing you is difficult… very difficult. In fact, even looking at a single workload in a single cluster is hard. So providing evidence that you are saving money, or at least not wasting it, in Kubernetes can seem impossible.
Managers need visibility into where requests can be adjusted to save money, and they need it across all of their workloads.
Finally, there’s security. While there are plenty of solutions on the market that can help you identify images with vulnerabilities, seeing security alerts across one cluster, let alone all of them, often means running a number of different tools. Even if you do piece those tools together, you may not have a dashboard view of the results. You’ll need to spend time scanning each cluster, aggregating the results and then fixing the problems. That’s a lot of work to do while clusters are potentially exposed to security risks.
Here, engineering leaders need to be able to see the security status of their clusters so they know what they really need to worry about.
I’m part of the team that developed Fairwinds Insights, Kubernetes configuration validation software. Prior to joining Fairwinds, I spent time managing Kubernetes infrastructure. I’ve experienced first hand the problems that arise when configuration is not set properly, and I’ve also experienced the lack of visibility into what’s happening in Kubernetes.
Fairwinds Insights addresses these challenges for those responsible for managing multiple teams and multiple clusters by bringing data from different clusters together.
Fairwinds Insights is available to use for free. You can sign up here.
Insights offers a compare feature. If you are managing teams, you can jump into the dashboard and see what’s different between your staging and production environments: whether images differ, whether CPU and memory requests exist, and whether limits are set. You can also make sure that clusters are perfect mirrors of each other from staging to production.
The dashboard gives you multi-cluster visibility in one place. Without it, your team would need to:

1. Log in to each cluster from the command line
2. Run kubectl get all
3. Eyeball the two lists, or run them through a diff tool (something like the commands sketched below)

And that would be just for one small workload! It’s really, really, really painful.
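If you’re curious what that manual loop looks like, here is a rough sketch, assuming two kubeconfig contexts named staging and production and a namespace called my-app (all of those names are illustrative):

```bash
# Rough sketch of the manual approach; context and namespace names are made up.
kubectl --context staging get all -n my-app -o yaml > staging.yaml
kubectl --context production get all -n my-app -o yaml > production.yaml

# Then eyeball the differences for this one namespace...
diff staging.yaml production.yaml | less
```

Now multiply that by every namespace and every pair of clusters you care about.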
Next up in the dashboard, you can view action items across all the clusters in your organization and filter by your concern: reliability, efficiency or security.
Here you can see which clusters are missing liveness or readiness probes or an image pull policy, and where memory requests are set too low and could cause downtime. Managers get this holistic view while still being able to dig into each cluster. Fixing these configurations means downtime is avoided and performance improves.
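To make that concrete, these are the kinds of fields those action items point at. The snippet below is an illustrative fragment with made-up names and values, not Insights output:

```yaml
# Illustrative fragment: a readiness probe, an explicit image pull policy, and
# a memory request sized from real usage so the pod isn't squeezed onto an
# overcommitted node.
containers:
  - name: my-app                      # hypothetical container name
    image: example.com/my-app:1.2.3
    imagePullPolicy: IfNotPresent
    readinessProbe:
      httpGet:
        path: /ready                  # assumes the app exposes a readiness endpoint
        port: 8080
      periodSeconds: 10
    resources:
      requests:
        memory: 512Mi                 # sized from observed usage, not a guess
```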
My examples earlier gave some ballpark figures, whereas Fairwinds Insights gives exact ones. With Insights, you get a view into your actual usage and recommendations about that usage based on historical data.
When you gain visibility into efficiency across your clusters, you can either save money or prove to your manager that Kubernetes is running cost-effectively.
Fairwinds Insights also continuously monitors your clusters and gives you a single view of your security posture across all of them. In action items you’ll see security alerts by severity: danger or warning. When you dig into each action item, we give you information on how to fix the problem. You’ll be able to compare which clusters are more secure than others and demonstrate compliance.
What you get with Fairwinds Insights is a tool that helps you run Kubernetes better, and you’ll actually know what is happening inside your clusters.