Lots of people are talking about platforms, often under the name IDP, or internal developer platform. Essentially, this means a unified infrastructure that enables your developers to deliver applications. You may think of Kubernetes itself as a platform, but it really functions more as the foundation for one. A Kubernetes cluster by itself doesn't have all of the pieces necessary to enable development teams to deploy their code.
Out of the box, Kubernetes doesn’t provide any best practices or guardrails for deploying apps and services easily while still meeting security, reliability, and cost optimization needs, which is why the IDP concept is growing in popularity. In the past, we've talked about building a Kubernetes platform by doing continuous integration/continuous deployment (CI/CD) with GitOps and installing add-ons, such as cert-manager, ExternalDNS, and an ingress controller. Pulling all of these things together makes Kubernetes a holistic platform that you can deliver to a team of developers so they can deploy their applications.
We’ve also addressed the need for Kubernetes governance: the policies and guardrails that enforce and inform best practices and security requirements, and in some cases also enable compliance in your cluster. The last piece of the picture is the feedback loop, which is where monitoring comes in. When something is wrong or needs to change, you want to send that feedback to the code that's deploying it as well as the people deploying it. That means getting feedback to developers as quickly as possible in the tools they already work in (typically code repos or pull requests).
One thing that often confuses people is the difference between observability and monitoring. The terms are often used interchangeably, in part because they draw on the same overlapping metrics and data sources, but they do have their differences.
Observability is considered a bit more proactive than monitoring. It means having a system in place that produces outputs you can view and use to understand the state of your system or application. Similar to a car dashboard, observability tools collect metrics, logs, and traces from all parts of the system and make them available for analysis. A dashboard might show the health of your servers, including disk space, memory, and CPU utilization.
Monitoring uses some of the same metrics, but is more reactive. You can use monitoring to create indicators for how your system is performing, including the overall health of your system. Monitoring is like the Check Engine light on your car dashboard. Each car has a trigger that says: if a specified element reaches its threshold, alert the driver to the state of the system at that point in time. The same concept applies in Kubernetes.
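To make the Check Engine idea concrete, here is a minimal Python sketch of a threshold monitor. The Monitor class, the check function, the metric names, and the threshold values are all hypothetical and chosen only for illustration; real tools express the same idea through their own rule formats.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """A hypothetical threshold monitor: fires when a metric reaches its limit."""
    name: str
    threshold: float

    def is_triggered(self, value: float) -> bool:
        # "Reaches the threshold" is taken here to mean greater than or equal to it.
        return value >= self.threshold


def check(monitors, sample):
    """Return the names of the monitors triggered by one point-in-time sample."""
    return [
        m.name
        for m in monitors
        if m.name in sample and m.is_triggered(sample[m.name])
    ]


# Illustrative metric names and limits, not recommendations.
monitors = [
    Monitor("node_cpu_percent", 90.0),
    Monitor("node_memory_percent", 85.0),
    Monitor("node_disk_percent", 80.0),
]
sample = {
    "node_cpu_percent": 42.0,
    "node_memory_percent": 91.5,
    "node_disk_percent": 80.0,
}
triggered = check(monitors, sample)
print(triggered)
```

With this sample, the memory and disk monitors fire and the CPU monitor stays quiet. The monitor says nothing about why memory is high; that is the kind of question the observability data helps answer.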
For both monitoring and observability in Kubernetes, you’ll need to deploy some add-ons.
There are challenges in Kubernetes related to both observability and monitoring. In Kubernetes, a lot of things (specifically your pods and nodes) are ephemeral by nature. Nodes come up out of nowhere, and nodes disappear. The old, static view of "these are the machines that I should have" doesn't work anymore. Now you need a monitoring system that's very flexible.
You also have to consider that Kubernetes is distributed and multi-tenant by nature. In a distributed, multi-tenant, and ephemeral environment, things are a lot more complex than when you were monitoring a single machine that exclusively ran a web server.
Now, you may have two containers living on the same node competing for the same CPU resources or the same memory. Networking also becomes more complex because there is now a software-defined networking layer in between the host network and your workload. The level of complexity increases, as does the number of potential problems.
There are also a ton of metrics in Kubernetes environments. Kubernetes core components (the API server and controller manager, for example) are instrumented for metrics out of the box, as are many standard add-ons, such as metrics-server and ingress controllers. You’ve likely enabled metrics for your applications as well. Making sense of what all these metrics are, what they mean, and what's important (and not important) can be really overwhelming.
There are a lot of options for observability and monitoring in the Cloud Native Computing Foundation (CNCF) ecosystem. Prometheus (a CNCF project) and Grafana are two open source tools that usually go hand in hand, because Grafana is used as a front end, or dashboard, for Prometheus. Prometheus also has its own user interface that includes some visualization. Most Kubernetes components and add-ons are instrumented to provide metrics that Prometheus can scrape automatically. Grafana allows you to build dashboards that give you a holistic view of your system. Datadog is a commercial product that offers a lot of good features as well.
There are a lot of good tools, and the tool itself is less important than how you use it. Monitoring really comes down to how you monitor, what you monitor, and the theory of monitoring you’re using. Ultimately, you should choose a tool that serves the needs of your own environment best.
Monitoring is a point-in-time assessment of your system's health based on a threshold. On its own, monitoring gives you different options for how you observe triggers as they fire: you record things and display them against their thresholds on a dashboard. Some of these things may not be actionable; they may simply be information collected over time. And monitoring alone isn’t necessarily time sensitive. It could simply bring your attention to an issue that has been triggered 12 times in the last week and now merits investigation.
The next level on top of monitoring is alerting. Alerting generally follows monitoring because when you're monitoring for health, you may need to alert somebody that the monitor has been triggered (sometimes waking them up in the middle of the night). Alerts are generally time sensitive and need to be actionable, so when you do get the alert, there’s a specific action that must be taken.
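As a rough illustration of monitoring that is recorded over time rather than acted on immediately, the Python sketch below counts how often a monitor has fired in a trailing window. The function names, the seven-day window, and the sample history are assumptions made for this example; the threshold of 12 simply mirrors the figure used above.

```python
from datetime import datetime, timedelta


def triggers_in_window(events, now, window=timedelta(days=7)):
    """Count the recorded trigger timestamps that fall inside the trailing window."""
    start = now - window
    return sum(1 for ts in events if start <= ts <= now)


def needs_investigation(events, now, min_count=12):
    # min_count=12 mirrors the "12 times in the last week" example in the text.
    return triggers_in_window(events, now) >= min_count


now = datetime(2024, 1, 15, 9, 0)
# Hypothetical history: the monitor fired once every 12 hours over the last 10 days.
history = [now - timedelta(hours=12 * i) for i in range(20)]

print(triggers_in_window(history, now))
print(needs_investigation(history, now))
```

Nothing here pages anyone. The result is a flag that a person can review during working hours, which is the distinction between a monitor and an alert.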
There are a few layers to monitoring:
Cluster level monitoring: the overall performance of your cluster, including your control plane, your controllers, and more.
Node level monitoring: the nodes that your workloads and pods are running on.
Pod or container level monitoring: this type of monitoring dives into application monitoring.
You can monitor at all these different layers, and you can monitor the same sort of things at different layers (just to make things a bit more complicated).
Monitoring everything is a safe bet because it ensures that everything is covered, but there are trade-offs involved. The most obvious one is storage: you have to store all that information somewhere, which incurs a cost, and much of it may never yield a useful result. Monitoring everything also generates a lot of noise, which makes it difficult to identify what actually requires your attention.
Here is a list of things that the team at Fairwinds recommends monitoring:
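To get a feel for the storage trade-off, here is a back-of-the-envelope Python sketch. It uses the common rule of thumb that disk usage is roughly retention time multiplied by samples ingested per second multiplied by bytes per sample. The function name, the series counts, the scrape interval, the retention period, and the default of 2 bytes per sample are all assumptions for illustration; measure your own system before budgeting.

```python
def estimated_storage_bytes(active_series, scrape_interval_s, retention_days,
                            bytes_per_sample=2.0):
    """Rough disk estimate: retention time * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: 100,000 time series scraped every 30 seconds, kept for 15 days.
everything = estimated_storage_bytes(
    active_series=100_000, scrape_interval_s=30, retention_days=15
)
# The same cluster after dropping half of the series that nobody looks at.
trimmed = estimated_storage_bytes(
    active_series=50_000, scrape_interval_s=30, retention_days=15
)

print(f"{everything / 1e9:.2f} GB vs {trimmed / 1e9:.2f} GB")
```

Under these assumptions the estimate scales linearly with the number of series, so every metric you choose not to collect reduces both storage and noise.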
Monitor your pods and their top level controllers: are your controllers working as designed? Are there a lot of errors or latency or timeouts?
Monitor how much memory and CPU your workloads are using.
Monitor container-native metrics: track metrics for the orchestration layer, overall cluster health, and the life cycle events for your workloads.
Monitor cost: if the cost of your workloads goes over a certain amount in your cloud environment, you want to know about that. What does it cost to run your clusters? Can you forecast costs to plan a budget for the next year?
Monitor runtime security: Falco, for example, monitors your container runtime activity via kernel level system call tracing.
Monitor expected behavior and performance or activity of your workload: this will help you detect anomalies and catch potential issues earlier.
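One simple way to express "expected behavior" in the last item is a baseline plus a tolerance. The Python sketch below flags a value that sits more than a chosen number of standard deviations from the mean of recent history. This is only one possible technique, and the function name, the threshold of 3.0, and the sample request rates are assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than z_threshold standard deviations from the baseline mean."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is unexpected.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical requests-per-second readings for a workload under normal load.
baseline = [120, 118, 125, 122, 119, 121, 123, 120]

print(is_anomalous(baseline, 124))  # within normal variation
print(is_anomalous(baseline, 300))  # far outside the baseline
```

A check like this works best for metrics with a stable pattern. Workloads with strong daily or weekly cycles usually need a baseline that accounts for time of day.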
Google’s Site Reliability Engineering book has a very useful chapter on monitoring distributed systems. It describes four golden signals for a service that establish the base level of things that you should worry about:
Latency: How long does it take to service a request?
Traffic: How much traffic is coming in? How many requests per second or transactions per second are there?
Errors: What is the rate or percentage of requests that fail?
Saturation: How full is your service? It is sometimes easier to think of this as utilization. How much capacity does your system have for constrained resources like disk I/O or network throughput, and how much of that capacity is in use?
If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be reasonably well covered by monitoring.
Even though monitoring and alerting are important for maintaining the health of your system, make sure you only alert when you know there’s really a problem. You can send monitoring data and low priority alerts to Slack or a similar tool so that they don’t get lost but also don’t wake anyone up unnecessarily. Restrict alerting to issues that affect your users, are actionable, and can’t be resolved automatically.
By following these straightforward guidelines, you can minimize the list of monitors that you need to alert on. These suggestions can help you keep your systems running smoothly and ensure that you only get alerts on the problems that really need human intervention. When you’ve set up your IDP using the right guardrails and governance, you can make the most of Kubernetes’ self-healing nature and minimize those middle-of-the-night alerts.
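As a sketch of how the four golden signals might be computed for one time window, the Python example below summarizes a batch of request records. The Request class, the golden_signals function, the use of a nearest-rank 95th percentile for latency, and the sample numbers are all assumptions for illustration; in practice your metrics backend would normally calculate these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, used, capacity):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it.
        rank = math.ceil(0.95 * len(durations))
        latency_p95 = durations[rank - 1]
        error_rate = sum(1 for r in requests if not r.ok) / len(requests)
    else:
        latency_p95 = None
        error_rate = 0.0
    return {
        "latency_p95_ms": latency_p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": error_rate,
        "saturation": used / capacity,  # fraction of a constrained resource in use
    }


# Hypothetical window: 10 requests in 5 seconds, one failure, 80 of 100 units of capacity used.
requests = [Request(duration_ms=10.0 * (i + 1), ok=(i != 3)) for i in range(10)]
signals = golden_signals(requests, window_seconds=5, used=80, capacity=100)
print(signals)
```

Each value in the result can then be compared against its own threshold, with the saturation threshold set below 100% so that you hear about the problem before the resource is exhausted.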
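The routing guidance above can be sketched as a small decision function in Python. The Route and Signal types, the field names, and the example signals are hypothetical; the point is only the order of the questions: can it be automated, does it affect users and call for action, or is it just information.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the on-call engineer"
    CHAT = "post to a chat channel"
    AUTOMATE = "run an automated fix"


@dataclass
class Signal:
    name: str
    affects_users: bool
    actionable: bool
    can_automate: bool


def route(signal):
    """Decide where a triggered monitor should go."""
    if signal.can_automate:
        # If a machine can fix it, nobody needs to be woken up.
        return Route.AUTOMATE
    if signal.affects_users and signal.actionable:
        return Route.PAGE
    # Everything else is kept visible but low priority.
    return Route.CHAT


# Hypothetical examples of each outcome.
checkout_errors = Signal("checkout error rate high", True, True, False)
crashed_pod = Signal("pod crashed and can be restarted", True, True, True)
slow_batch_job = Signal("nightly batch job ran long", False, False, False)

for s in (checkout_errors, crashed_pod, slow_batch_job):
    print(f"{s.name}: {route(s).value}")
```

Only the first example reaches a human at night. The second is handled by automation, and the third lands in chat for review during working hours.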
Want to walk through a more complete discussion of these topics? Watch this Kubernetes Clinic webinar on demand.