Metrics, monitoring, dashboards: Notes from Google's SRE Book (#7)

It's very easy to build dashboards in Datadog, New Relic, CloudWatch, or whatever monitoring system you use, but unless you know which metrics need to be monitored for your SLOs, there are endless ways to miss a failure. There can be hundreds of reasons for your application to stop working, and monitoring container status, disk space, or CPU utilization alone is not going to catch them.

Below I quote some ideas from the book that helped Google evolve to the high availability it maintains today, and that helped me appreciate the art of monitoring. I am fascinated by how Google applied its problem-solving and design skills to build something this useful and efficient. It tells us that even if we think we understand simplicity, readability, or scaling, we probably don't.


Google’s monitoring systems don’t just measure simple metrics, such as the average response time of an unladen European web server; we also need to understand the distribution of those response times across all web servers in that region. This knowledge enables us to identify the factors contributing to the latency tail.
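To make the point about distributions concrete, here is a minimal sketch of my own (not from the book) that compares the mean with a couple of percentiles. The function name `latency_summary`, the nearest-rank percentile method, and the sample data are all assumptions chosen for illustration.

```python
import statistics


def latency_summary(latencies_ms):
    """Return the mean plus two percentiles for a list of latencies."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: the smallest value that has at least
        # p percent of the samples at or below it.
        rank = max(1, -(-len(ordered) * p // 100))  # ceiling division
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }


# Hypothetical sample: 98 fast requests and 2 very slow ones.
samples = [10] * 98 + [2000] * 2
summary = latency_summary(samples)
print(summary)
```

With this made-up sample the mean sits near 50 ms, which describes neither the typical request (10 ms) nor the slow tail (2000 ms). Only the percentiles show both.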

- - -

This new model made the collection of time-series a first-class role of the monitoring system, and replaced those check scripts with a rich language for manipulating time-series into charts and alerts. ... the history of the collected data can be used for that alert computation as well.

- - -

Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format; this enables mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup.
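As a rough illustration of why a common exposition format helps, here is a sketch in which every target serializes its metrics the same way, so the collector needs exactly one parser. The one-line-per-metric `name value` layout and the function names are my assumptions for the example, not a specification of Borgmon's actual format.

```python
def render_metrics(metrics):
    """Serialize a dict of metric name -> number, one 'name value' line each."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


def parse_metrics(text):
    """Parse the same format back into a dict of name -> float."""
    parsed = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split()
        parsed[name] = float(value)
    return parsed


# Hypothetical metrics exported by one application.
page = render_metrics({"http_requests": 37, "errors_total": 12})
collected = parse_metrics(page)
print(page)
print(collected)
```

The collector never runs application-specific check scripts here; it only fetches text and parses it, which is the part that scales.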

- - -

To facilitate mass collection, the metrics format had to be standardized.

- - -

Typically, a team runs a single Borgmon per cluster, and a pair at the global level.

- - -

Each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default.

- - -

... using service discovery reduces the cost of maintaining it and allows the monitoring to scale.

- - -

... Borgmon rules, consists of simple algebraic expressions that compute time-series from other time-series. These rules can be quite powerful because they can query the history of a single time-series (i.e., the time axis), query different subsets of labels from many time-series at once (i.e., the space axis), and apply many mathematical operations.
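The two axes are easier to see in code. The sketch below is plain Python, not Borgmon's rule language: `rate` looks back over the history of one counter (the time axis), and `sum_without` aggregates across many labelled series (the space axis). Both function names and the sample data are hypothetical, and counter resets are deliberately not handled.

```python
def rate(samples, window_s):
    """Per-second increase of a counter over the trailing window (time axis).

    samples: list of (timestamp_s, counter_value), sorted by timestamp.
    """
    latest_t, latest_v = samples[-1]
    in_window = [(t, v) for t, v in samples if latest_t - t <= window_s]
    first_t, first_v = in_window[0]
    if latest_t == first_t:
        return 0.0  # a single sample carries no rate information
    return (latest_v - first_v) / (latest_t - first_t)


def sum_without(series, drop_label):
    """Sum values across series after dropping one label (space axis)."""
    totals = {}
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[key] = totals.get(key, 0.0) + value
    return totals


# Hypothetical request counters for two tasks of one job.
task_counters = {
    ("web", "0"): [(0, 100), (60, 160), (120, 220)],
    ("web", "1"): [(0, 50), (60, 80), (120, 110)],
}

# Rule 1: a per-task rate computed from each counter's history.
task_rates = [
    ({"job": job, "task": task}, rate(samples, window_s=120))
    for (job, task), samples in task_counters.items()
]

# Rule 2: a per-job rate computed from the per-task rates.
job_rates = sum_without(task_rates, drop_label="task")
print(job_rates)
```

Note how the second rule consumes the output of the first: a time-series computed from other time-series.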

- - -

... it’s better to use counters, because they don’t lose meaning when events occur between sampling intervals. Should any activity or changes occur between sampling intervals, a gauge collection is likely to miss that activity.
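A tiny simulation shows the difference. The traffic pattern and the 10-second sampling interval below are made up for illustration: a burst happens entirely between two samples, so the gauge never sees it, while the counter still carries it.

```python
# Hypothetical per-second activity: a burst of 500 requests at seconds
# 13 and 14, silence otherwise.
requests_per_second = [0] * 30
requests_per_second[13] = 300
requests_per_second[14] = 200

# The collector samples every 10 seconds.
sample_times = [0, 10, 20]

# Gauge: the instantaneous value at each sample time.
gauge_samples = [requests_per_second[t] for t in sample_times]

# Counter: the running total at each sample time.
counter_samples = [sum(requests_per_second[: t + 1]) for t in sample_times]

# The increase between consecutive counter samples recovers the burst.
increases = [b - a for a, b in zip(counter_samples, counter_samples[1:])]

print(gauge_samples)
print(counter_samples)
print(increases)
```

Every gauge sample reads zero, yet the counter's increase over the second interval accounts for all 500 requests.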

- - -

... example also uses a Google convention that helps readability. Each computed variable name contains a colon-separated triplet indicating the aggregation level, the variable name, and the operation that created that name.
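To show what the triplet looks like in practice, here is a small helper of my own that splits such a name into its three parts. The `RuleName` type and `parse_rule_name` function are hypothetical, and the sample name follows the pattern described in the quote.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    level: str      # aggregation level, e.g. "task" or "dc"
    variable: str   # the underlying variable name
    operation: str  # the operation that produced this series


def parse_rule_name(name):
    """Split a 'level:variable:operation' name into its three parts."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected level:variable:operation, got {name!r}")
    return RuleName(*parts)


parsed = parse_rule_name("dc:http_requests:rate10m")
print(parsed)
```

Reading the name left to right tells you the series is aggregated per datacenter, derived from `http_requests`, and computed as a 10-minute rate.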

- - -

... teams send their page-worthy alerts to their on-call rotation and their important but subcritical alerts to their ticket queues. All other alerts should be retained as informational data for status dashboards.
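The routing policy in that quote can be sketched in a few lines. The severity strings and destination names below are assumptions for the example, not anything the book prescribes. The one detail taken from the quote is that anything that is neither a page nor a ticket falls through to the dashboard.

```python
# Hypothetical destinations; real systems would use their own identifiers.
DESTINATIONS = {
    "page": "oncall-rotation",
    "ticket": "ticket-queue",
}
DEFAULT_DESTINATION = "status-dashboard"


def route(alert):
    """Return the destination for an alert based on its severity."""
    return DESTINATIONS.get(alert.get("severity"), DEFAULT_DESTINATION)


alerts = [
    {"name": "HighErrorRatio", "severity": "page"},
    {"name": "DiskFillingSlowly", "severity": "ticket"},
    {"name": "CacheHitRateDipped", "severity": "info"},
    {"name": "Unlabelled"},
]
routed = [route(alert) for alert in alerts]
print(routed)
```

Defaulting to the dashboard keeps an unlabelled alert from waking anyone up, while still retaining it as informational data.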

- - -

This decoupling allows the size of the system being monitored to scale independently of the size of alerting rules. These rules cost less to maintain because they’re abstracted over a common time-series format. New applications come ready with metric exports in all components and libraries to which they link, ...

- - -

The last quote could be taken as an action item from this post.
