There are many things you can monitor in Kubernetes, but you need to understand which of them are mission-critical. In a recent webinar, we explored what you should be monitoring in your Kubernetes platform, best practices to follow, and why Kubernetes monitoring is so critical to cloud-native application development. At the end, we had great questions, and we wanted to share the answers more widely.
It’s important to get an understanding of how much each workload is costing you today, why it is costing you that much, and what changes you might be able to make to reduce those costs. For instance, if you set your requests to one CPU for a given workload and it is actually only using half a CPU, you basically have half a CPU of waste there. In some cases, a workload might be under-provisioned, leading to reliability issues. By monitoring your workloads, you can determine whether any are constantly bumping up against CPU and memory thresholds, in which case you should expand your resource allocations accordingly. To get a clear picture, you need to analyze the cost of a running workload that has been operating over a period of time.
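To make that concrete, here is a minimal sketch of where requests and limits live in a workload definition; the workload name, image, and numbers are hypothetical, not a recommendation:

```yaml
# Hypothetical Deployment snippet: the container requests one CPU but only
# uses ~0.5 CPU in practice, so roughly half a CPU is wasted per replica.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api              # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # placeholder image
          resources:
            requests:
              cpu: "1"           # observed usage is ~500m, so "500m" would cut the waste
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 1Gi
```

Monitoring actual usage over time tells you whether those request values should come down (to reduce waste) or go up (to avoid throttling and evictions).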
Prometheus, Grafana, Datadog, and Fairwinds Insights complement each other nicely. Essentially, Prometheus, Grafana, and Datadog handle tier-one monitoring: they will tell you if your house is on fire right now. If your applications are crashing or about to crash, these tools will show you where the problems are. Fairwinds Insights will show you where you need to fix vulnerabilities, adjust over-provisioned workloads, correct configuration issues, and address other things like that. It complements the other monitoring tools.
There are multiple frameworks you can use. Internally, we use Datadog for our applications because it works well out of the box. There are also open standards, such as OpenTelemetry, which is a collection of APIs, SDKs, and tools you can use to instrument, generate, collect, and export telemetry data to analyze software performance and behavior. The ecosystem around OpenTelemetry includes both open source and paid solutions. Prometheus and Grafana are a great stack for observability within Kubernetes, but what you choose depends on how you want to balance ease of use, cost, and how well the platform is understood by the community against newer platforms that may offer different features.
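As a rough illustration of what OpenTelemetry looks like in practice, here is a minimal sketch of an OpenTelemetry Collector configuration that receives OTLP data and exposes metrics for Prometheus to scrape; the endpoint and pipeline choices are assumptions, not the only way to wire this up:

```yaml
# Minimal OpenTelemetry Collector config sketch (assumed setup):
# applications send OTLP data to the collector, which exposes the
# resulting metrics on an endpoint Prometheus can scrape.
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # hypothetical scrape endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```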
You want a platform or SRE team owning the core metrics for the nodes themselves, the services that run Kubernetes itself, the control plane, any add-ons you have running (such as cert-manager or NGINX Ingress), and anything else that supports applications running in Kubernetes. At the other end of the spectrum, the applications themselves should be owned by the application team: the logs they emit, the jobs that spin up for them, and their scaling events. The application team has the context to understand what a log message means when an application returns 500 errors and similar issues.
There is also some gray area, for example an application having trouble scaling or running out of memory and CPU, where resolving the issue might land on the application team or the SRE team, depending on the root cause of the problem. You’ll need some collaboration between the two teams. Usually, if the platform team is monitoring those things and can’t identify an obvious cause from their end, they'll open a ticket for the application team, indicate that the application is redlining, and ask them to resolve it.
Baselining is hard, in part because it is always an ongoing process. You can’t set up monitoring once and be done; you have to constantly refine what you're tracking from an observability standpoint, what you're showing in your dashboards from a monitoring standpoint, and what you're alerting on. This is an ongoing flywheel of refinement.
To start, make sure you're tracking the four golden signals (latency, traffic, errors, and saturation). If you're launching your application today, start watching those signals and learn what normal behavior looks like during that first week of your application's life. You need to review your dashboards constantly in those early days because you don't know what normal looks like yet (and you don't know what danger looks like yet, either).
At the beginning, you’re really experimenting and figuring out what a baseline looks like. Once you have a good intuition for how your application behaves under normal circumstances, you can start to set thresholds, for example alerting when latency or error rates climb well above what you’ve established as normal.
Then make a dashboard that shows those things front and center.
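Once those thresholds firm up, they often end up encoded as alerting rules. Here is a minimal sketch of what one might look like as a Prometheus rule; the metric names, the job label, and the 5% cutoff are hypothetical and depend entirely on how your application is instrumented:

```yaml
# Hypothetical Prometheus alerting rule: page when more than 5% of
# requests fail for 10 minutes. Metric and label names depend on your
# instrumentation and are placeholders here.
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="example-api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="example-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests to example-api are failing"
```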
The first best practice is to use namespaces. We've seen companies in which basically every app (and everything else) gets deployed into the same default namespace. This creates a lot of headaches. Everything is co-mingled together and it’s not possible for particular teams to separate things out or permission things differently. Namespaces provide a separation to make sure that app team A doesn't have access to app team B's things and vice versa. You can apply policies for namespace A that don't apply to namespace B, for example.
At the very least, separating namespaces by team or by application is a really good practice. There’s also the concept of hierarchical namespaces, where you can create a namespace and then create sub-namespaces within it. So you could have one namespace for all your applications and a sub-namespace for each individual application, or a sub-namespace per team and then a sub-sub-namespace for each application those teams manage.
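At its simplest, separating teams is just a namespace per team, which you can then scope policies and RBAC to independently; the team names below are hypothetical (hierarchical namespaces additionally require the Hierarchical Namespace Controller add-on):

```yaml
# Hypothetical per-team namespaces; policies, quotas, and RBAC can
# then be applied to each namespace independently.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-b
```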
The second iteration of this concept is adding labels to workloads, whether that's a team, an owner, or a cost center. That provides additional ways to track or attribute resources to a business unit or individual. You can also use labels to look at similar things across namespaces. For example, you might have apps labeled staging versus production, which gives your teams a good signal as to what's more important.
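Here is a sketch of what that labeling might look like on a workload; the label keys and values are examples of one possible convention, not a requirement:

```yaml
# Hypothetical label set on a Deployment: which team owns it, which
# cost center it bills to, and which environment it belongs to.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
  namespace: team-a
  labels:
    team: team-a
    owner: payments
    cost-center: cc-1234
    environment: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
        environment: production
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # placeholder image
```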
Whether you're using Kubernetes or not, monitoring is crucial. If you're not monitoring what's going on inside of your application environment, you're at risk for an outage. Monitoring shows your general health metrics and whether your application is behaving properly.
It might seem like your worst-case scenario is a hard outage. Your users can't access the app. You're losing revenue by the minute.
But what about when things are going wrong without a full-blown outage? Maybe 10% of your users are struggling to use the app, they're getting error messages, and the platform is saturated. If your users are having a degraded experience and you can't see it because you don't have the right metrics showing you that latency is through the roof or your database is under pressure (for example), that bad user experience can go on for hours or days before someone on your team notices. That can be worse than a hard outage, which you know about (and can take steps to resolve) much more quickly.
There's an open source solution called Velero that does something like this. Generally speaking, we advise our clients to store everything as infrastructure as code. Every cluster we manage is defined entirely as infrastructure as code, which means that if one of those clusters were to disappear, we could recreate it within minutes by taking all that infrastructure as code and reapplying it to a brand-new cluster. That's how you should be thinking about a backup strategy: all your state must be managed in a Git repo somewhere, so that if your cluster dies, disappears, or totally fails, it can be recreated from scratch using that infrastructure as code.
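If you do use Velero alongside that infrastructure-as-code approach, scheduled backups are typically defined declaratively too. This is a rough sketch of a Velero Schedule; the cron expression, namespace selection, and retention period are assumptions you would tune to your own recovery requirements:

```yaml
# Hypothetical Velero Schedule: back up all namespaces nightly and
# keep each backup for 30 days.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    ttl: 720h                    # 30-day retention
```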
We don’t have a hard recommendation for one or the other, but we use Datadog internally. It does a great job of collecting all of our logs and making them available in one place, where we can index them, decide how long to keep them, and so on. It grabs all the Kubernetes metrics out of the box; you just install the Datadog agent and it runs. Splunk probably has something very similar, so experiment with both on a sample or small Kubernetes cluster to start and see which one you like better.
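For reference, installing the Datadog agent usually comes down to a Helm install plus a small values file. The sketch below reflects common options in the Datadog Helm chart as best we recall; the exact keys can vary by chart version, so verify them against the chart's documentation:

```yaml
# Hypothetical Helm values for the Datadog agent chart.
datadog:
  apiKey: <your-api-key>       # placeholder; use a secret in practice
  site: datadoghq.com
  logs:
    enabled: true              # collect logs as well as metrics
    containerCollectAll: true  # tail logs from all containers
```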
Kubernetes is well known for being complex, and failing to monitor it will make that complexity much harder to manage. Make sure you're gathering your cluster's metrics, events, logs, and traces. Monitoring will help you understand what things look like today, how healthy your apps and infrastructure are, and how to make improvements. But monitoring alone isn’t enough; you also need to put alerting in place so that if things do go wrong, you get alerted and can resolve issues quickly.
If you need help with monitoring and alerting, reach out. Our team has the Kubernetes expertise to make sure your Kubernetes infrastructure is fast, secure, and stable.