I’ve had a lot of arguments about monitoring and logging during my career in operations. Many of these arguments were centered around the idea that I wanted to monitor and log “too much”. That’s not necessarily wrong, but I’ve always thought of it more as “monitor and log everything.”
The reason for that is really simple. When (it’s not “if”) something goes wrong, you’ll need data to help understand and troubleshoot. If you don’t monitor and log all the things, then you’ll always be working at least partially in the dark. I’m sure a philosophical argument could be made about never being able to truly monitor everything, but I think the point is clear.
That being said, not every data point you collect will be a key performance indicator (KPI), nor do you need to set an alarm for every data point. You also don’t need to retain all that data for 30 years. Keep your KPIs for longer if it makes sense. The average response time of your app, for example, could be useful even a year later, but the disk utilization of the web servers likely not so much.
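To make the retention idea concrete, here is a minimal sketch of per-metric retention windows. The metric names, the durations, and the `prune` helper are all made up for illustration; pick windows that fit your own KPIs.

```python
from datetime import datetime, timedelta

# Hypothetical retention windows. The metric names and durations are
# illustrative assumptions, not settings from any particular tool.
RETENTION = {
    "app.response_time.avg": timedelta(days=730),  # KPI: keep long-term
    "web.disk.utilization": timedelta(days=30),    # operational: short-lived
}
DEFAULT_RETENTION = timedelta(days=90)


def is_expired(metric, recorded_at, now):
    """Return True when a data point is older than its retention window."""
    window = RETENTION.get(metric, DEFAULT_RETENTION)
    return now - recorded_at > window


def prune(points, now):
    """Keep only (metric, recorded_at, value) tuples still inside their window."""
    return [p for p in points if not is_expired(p[0], p[1], now)]
```

With these assumed windows, a year-old response time sample survives a prune while a year-old disk utilization sample is dropped. In practice the time series database or hosted service usually enforces retention for you; the point is only that the window is a per-metric decision.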
We are also often asked to recommend and implement monitoring. It’s pretty simple. Use services like Datadog and Loggly.
Unless your core business is monitoring/logging, you should leave the management of this cost sink to people who really and truly care about it because it’s their business. You will never make money from monitoring. Instead of diverting your precious resources to it, you will be much better off using them for revenue generation.
There is one exception and that is scale. Many of the monitoring and logging services can become expensive when you’re running a lot of systems. In that case the economies of scale may tip in favor of putting together a custom solution.
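A rough way to see where that tipping point sits is a break-even calculation. The sketch below assumes a hosted service billed per host per month and a self-run setup with a fixed monthly staffing cost plus a smaller per-host infrastructure cost. Every number and parameter name here is hypothetical; real pricing is often tiered or volume-based.

```python
import math


def monthly_saas_cost(hosts, price_per_host):
    """Hosted service: cost grows linearly with the number of hosts."""
    return hosts * price_per_host


def monthly_self_hosted_cost(hosts, infra_per_host, staff_cost):
    """Self-run: fixed staffing cost plus per-host infrastructure cost."""
    return staff_cost + hosts * infra_per_host


def break_even_hosts(price_per_host, infra_per_host, staff_cost):
    """Smallest host count at which self-hosting is no more expensive.

    Returns None when self-hosting never catches up, i.e. when the
    per-host infrastructure cost is not below the service price.
    """
    saving_per_host = price_per_host - infra_per_host
    if saving_per_host <= 0:
        return None
    return math.ceil(staff_cost / saving_per_host)


# Assumed figures: 15 per host for the service, 3 per host for our own
# infrastructure, 12000 per month of staff time to run it.
print(break_even_hosts(15, 3, 12000))
```

Below the break-even count the service is cheaper even though the tools themselves are free, because the staffing cost dominates.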
Sensu is a nice tool for alarming and works much better in a cloud environment with ephemeral machines than the venerable Nagios and its forks. The biggest reason for that is the subscription-based monitoring of the clients and the ability to distribute the load much more easily than with Nagios. Much like Nagios, Sensu doesn’t come with a way to track and graph the data points it collects. Also like Nagios, that’s not hard to add. A good choice is InfluxDB with Grafana. Graphite is another choice instead of InfluxDB, but it’s harder to set up and maintain.
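To show why subscriptions suit ephemeral machines, here is a toy model of the idea. This is not Sensu’s API or configuration format; the `Scheduler` class and the check and subscription names are invented for illustration. Checks target named subscriptions, clients declare which subscriptions they belong to, and nobody maintains a per-host list.

```python
class Scheduler:
    """Toy model of subscription-based check scheduling (not Sensu's API)."""

    def __init__(self):
        self.checks = {}   # check name -> set of subscriptions it targets
        self.clients = {}  # client name -> set of subscriptions it joined

    def add_check(self, name, subscribers):
        self.checks[name] = set(subscribers)

    def register(self, client, subscriptions):
        """A new machine announces itself; no per-host check config needed."""
        self.clients[client] = set(subscriptions)

    def deregister(self, client):
        """A terminated machine simply disappears from the client list."""
        self.clients.pop(client, None)

    def checks_for(self, client):
        """Return the sorted check names whose targets overlap the client's subscriptions."""
        subs = self.clients.get(client, set())
        return sorted(n for n, targets in self.checks.items() if targets & subs)


scheduler = Scheduler()
scheduler.add_check("check_disk", ["base"])
scheduler.add_check("check_http", ["webserver"])
scheduler.register("web-1", ["base", "webserver"])
print(scheduler.checks_for("web-1"))
```

When an autoscaled machine comes up with the same subscriptions, it picks up the same checks without anyone editing a host definition, which is the part that gets painful with statically configured hosts.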
On the logging front, the big hitter is the ELK Stack. Without a lot of fussing you can aggregate all of your logs and work with them through search and visualizations. There is also a really good community that can help you parse and analyze log formats.
Ultimately it comes down to cost. With a lot of devices to monitor and lots of logs to aggregate, you might consider running things yourself. The cost for a service can add up quickly as scale increases. Just be sure to remember that while you can run some very good tools for free, the setup and maintenance still need to happen. If you have the staff and skill to make this unproblematic, go for it. Otherwise you’ll be better off focusing on your product and creating revenue than venturing into the rabbit hole that is monitoring, logging, and metrics collection.