The truth is that Kubernetes monitoring done right is a fantasy for most organizations. Monitoring is hard in any environment, and the difficulty is magnified in a dynamic, ever-changing Kubernetes environment. It is a serious problem.
While organizations commonly want availability insurance, few monitor their environments well for two main reasons:
When the average organization finally recognizes its need for application/system monitoring, the team is too overwhelmed just trying to keep infrastructure and applications “up” to have the capacity to look out for issues. Even monitoring the right things to identify the problems the application or infrastructure is facing on a day-to-day basis is beyond the reach of many organizations.
There are a number of consequences you’ll face without adequate monitoring (some that are universal, others that are exemplified in Kubernetes).
Insufficient monitoring introduces a lot of heavy work because you need to constantly check systems to ensure they reflect the state that you want.
What’s needed is monitoring and alerting that discovers unknown unknowns - otherwise referred to as observability. Kubernetes best practices recognize that monitoring is key and that it requires the right tools to optimize your monitoring capabilities. What needs to be monitored, and why? Here we suggest a few best practices.
With Kubernetes, you have to build monitoring systems and tooling that respond to the dynamic nature of the environment, with a focus on availability and workload performance. One typical approach is to collect all of the metrics you can and then use those metrics to try to solve any problem that occurs. This approach makes the operators’ jobs more complex because they need to sift through an excess of information to find what they really need. Open source projects like Prometheus and OpenMetrics and vendors like Datadog help standardize how metrics are collected and displayed. We suggest that Kubernetes best practices for monitoring include:
A genius of Kubernetes is that you can implement infrastructure as code (IaC) - the process of managing your IT infrastructure using config files. At Fairwinds, we take this a step further by implementing monitoring as code. We use Astro, an open source software project built by our team, to help achieve better productivity and cluster performance. Astro was built to work with Datadog. It watches objects in your cluster for defined patterns and manages Datadog monitors based on this state. As a controller that runs in a Kubernetes cluster, it subscribes to updates within the cluster. If a new Kubernetes deployment or other object is created in a cluster, Astro knows about it and creates monitors based on that state. Essentially, it provides a mechanism for dynamically creating and managing alerts in a way that Datadog can understand.
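To make the monitoring-as-code idea concrete, here is a minimal sketch of the reconcile step such a controller might perform: derive the monitors that should exist from the current cluster state, then compare them with the monitors that already exist. This is not Astro's implementation and makes no calls to Kubernetes or Datadog. The `Deployment` shape, the `monitoring/enabled` label, and the monitor naming scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Minimal stand-in for a Kubernetes Deployment (hypothetical shape)."""
    namespace: str
    name: str
    labels: dict = field(default_factory=dict)


def desired_monitors(deployments, match_label="monitoring/enabled"):
    """Return the set of monitor names that should exist for this cluster state.

    The label key is an assumption for this sketch, not a real convention.
    """
    return {
        f"{d.namespace}/{d.name}: replicas unavailable"
        for d in deployments
        if d.labels.get(match_label) == "true"
    }


def reconcile(deployments, existing_monitors):
    """Compare desired and existing monitors; return (to_create, to_delete)."""
    desired = desired_monitors(deployments)
    existing = set(existing_monitors)
    return sorted(desired - existing), sorted(existing - desired)


# Example cluster state: one deployment opts in to monitoring, one does not.
cluster = [
    Deployment("prod", "api", {"monitoring/enabled": "true"}),
    Deployment("prod", "batch", {}),
]
to_create, to_delete = reconcile(cluster, ["prod/old-service: replicas unavailable"])
print("create:", to_create)
print("delete:", to_delete)
```

A real controller would run this comparison every time it receives an update from the cluster, so monitors appear and disappear along with the workloads they describe.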
Because a diverse set of stakeholders is involved in monitoring cluster workloads, you must determine who is responsible for what from both an infrastructure and a workload standpoint. For instance, you want to make sure the right people are alerted at the right time to limit the noise of being alerted about things that do not pertain to you.
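One way to write that ownership down is as a small routing rule. The sketch below sends infrastructure alerts to a platform team and workload alerts to whichever team owns the namespace. The alert fields (`scope`, `namespace`), the team names, and the fallback behavior are hypothetical and would need to match your own alerting setup.

```python
def route_alert(alert, workload_owners, infra_team="platform-team"):
    """Pick the team to notify for an alert.

    Infrastructure alerts always go to the infrastructure team. Workload
    alerts go to the namespace owner, falling back to the infrastructure
    team when no owner is registered (an assumption of this sketch).
    """
    if alert.get("scope") == "infrastructure":
        return infra_team
    return workload_owners.get(alert.get("namespace"), infra_team)


# Hypothetical ownership map: namespace -> owning team.
OWNERS = {"payments": "payments-team", "search": "search-team"}

alerts = [
    {"scope": "infrastructure", "name": "node disk pressure"},
    {"scope": "workload", "namespace": "payments", "name": "high error rate"},
    {"scope": "workload", "namespace": "sandbox", "name": "pod crash loop"},
]
for alert in alerts:
    print(alert["name"], "->", route_alert(alert, OWNERS))
```

Keeping the map explicit means an alert without a clear owner is visible as a gap rather than as noise in everyone's inbox.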
Monitoring tooling must be flexible enough to meet complex demands, yet easy enough to set up quickly so that we can move beyond tier 1 monitoring (e.g., “Is it even working?”). Tier 2 monitoring requires dashboards that reveal where security vulnerabilities are, whether or not compliance standards are being met, and targeted ways to improve.
Impact and urgency are key criteria that must be identified and assessed on an ongoing basis. Regarding impact, it is critical to be able to determine if an alert is actionable, the severity based on impact, and the number of users or business services that are or will be affected. Urgency also comes into play. For example, does the problem need to be fixed right now, in the next hour, or in the next day?
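These criteria can be combined into a simple triage rule. The sketch below is one possible scheme, not a standard: the impact levels, the urgency levels (taken from the "right now, next hour, next day" question above), the multiplication, and the thresholds are all assumptions you would tune for your own team.

```python
# Hypothetical scales; higher numbers mean more impact or more urgency.
IMPACT = {"low": 1, "medium": 2, "high": 3}
URGENCY = {"next_day": 1, "next_hour": 2, "now": 3}


def triage(actionable, impact, urgency):
    """Decide how to handle an alert: 'page', 'ticket', or 'log-only'.

    Alerts nobody can act on never page anyone. Otherwise the impact and
    urgency scores are multiplied and compared against assumed thresholds.
    """
    if not actionable:
        return "log-only"
    score = IMPACT[impact] * URGENCY[urgency]
    if score >= 6:
        return "page"
    if score >= 3:
        return "ticket"
    return "log-only"


print(triage(True, "high", "now"))
print(triage(True, "medium", "next_hour"))
print(triage(True, "low", "next_day"))
print(triage(False, "high", "now"))
```

The exact numbers matter less than agreeing on them ahead of time and revisiting them as the workloads and their users change.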
It is difficult to always know what to monitor ahead of time, so you need at least enough context to figure out what’s going wrong when someone inevitably gets woken up in the middle of the night and needs to bring everything back online. Without this level of understanding, your team cannot work out what should be monitored or decide when an alert is worth the noise it adds.