Metrics, monitoring, dashboards: Notes from Google's SRE Book (#7)

It's very easy to make dashboards in Datadog or New Relic or CloudWatch or whatever monitoring system you use, but unless you know which metrics need to be monitored for your SLOs, there are countless ways to miss a failure. There can be hundreds of reasons for your application to break, and monitoring container status or disk space or CPU utilization is just not going to catch most of them.

Below I quote some ideas from the book that helped Google evolve to maintaining many nines of availability, and that made me appreciate the art of monitoring. I am fascinated by how Google leveraged all those problem-solving and design skills to make something this useful and efficient, telling us that even if we think we understand the ideas of simplicity or readability or scaling, we probably don't.


Google’s monitoring systems don’t just measure simple metrics, such as the average response time of an unladen European web server; we also need to understand the distribution of those response times across all web servers in that region. This knowledge enables us to identify the factors contributing to the latency tail.
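
To see why, here is a minimal Python sketch (mine, not the book's): a handful of slow outliers barely moves the mean, while the 99th percentile exposes the tail.

```python
# A sketch: a few slow outliers barely move the mean, while the 99th
# percentile exposes the latency tail.
import random
import statistics

random.seed(42)
# 990 healthy ~50ms responses plus 10 pathological 2-second ones
latencies_ms = [random.gauss(50, 5) for _ in range(990)] + [2000.0] * 10

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
print(f"mean={statistics.mean(latencies_ms):.0f}ms  "
      f"p50={cuts[49]:.0f}ms  p99={cuts[98]:.0f}ms")
# the mean (~70ms) looks fine; p99 reveals the multi-second tail
```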

- - -

This new model made the collection of time-series a first-class role of the monitoring system, and replaced those check scripts with a rich language for manipulating time-series into charts and alerts. ... the history of the collected data can be used for that alert computation as well.
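
A tiny sketch of that last point, in plain Python rather than Borgmon's rule language: because history is kept, an alert can require a condition to hold over a window instead of at a single instant.

```python
# A sketch of "history used for alert computation": instead of a point-in-
# time check script, the alert is evaluated over the stored time-series,
# e.g. fire only if the error ratio stayed high for the whole window.
def should_alert(error_ratio_series: list[float],
                 threshold: float = 0.01, window: int = 10) -> bool:
    """Fire only if every sample in the trailing window breaches."""
    recent = error_ratio_series[-window:]
    return len(recent) == window and all(v > threshold for v in recent)

print(should_alert([0.001] * 50 + [0.02] * 10))  # True: sustained breach
print(should_alert([0.001] * 59 + [0.02]))       # False: a single blip
```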

- - -

Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format; this enables mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup.
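
Borgmon's format isn't public, but Prometheus, its open-source descendant, works the same way. A minimal sketch using the real prometheus_client library:

```python
# A minimal sketch with prometheus_client: the process serves its metrics
# in a common text format over HTTP, and the collector simply scrapes the
# page -- no custom check scripts, no subprocess execution.
import time
from prometheus_client import Counter, start_http_server

HTTP_REQUESTS = Counter("http_requests_total",
                        "Total HTTP requests served", ["code"])

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics in the text format
    while True:
        HTTP_REQUESTS.labels(code="200").inc()
        time.sleep(1)
```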

- - -

To facilitate mass collection, the metrics format had to be standardized.

- - -

Typically, a team runs a single Borgmon per cluster, and a pair at the global level.

- - -

... each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default.

- - -

... using service discovery reduces the cost of maintaining it and allows the monitoring to scale.
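
A sketch of the idea, assuming a hypothetical DNS name that resolves to every task in the job:

```python
# A sketch (the DNS name is hypothetical): scrape targets come from
# service discovery, not from a hand-maintained host list that rots as
# tasks move between machines.
import socket

def discover_targets(job_dns_name: str, port: int = 8000) -> list[str]:
    """Return a scrape URL for every address the service name resolves to."""
    infos = socket.getaddrinfo(job_dns_name, port,
                               family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
    return sorted({f"http://{info[4][0]}:{port}/metrics" for info in infos})

# targets = discover_targets("web.tasks.example.com")  # hypothetical name
```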

- - -

... Borgmon rules, consist of simple algebraic expressions that compute time-series from other time-series. These rules can be quite powerful because they can query the history of a single time-series (i.e., the time axis), query different subsets of labels from many time-series at once (i.e., the space axis), and apply many mathematical operations.
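
Here is a Python sketch of those two axes (Borgmon's actual rule syntax is different; the series names follow the convention described in the next quote):

```python
# Time axis: rate() reads one series' own history.
# Space axis: the job-level series aggregates across task labels.
from collections import defaultdict

# per-task counter history: (job, task) -> [(timestamp_s, value), ...]
history = {
    ("web", "task0"): [(0, 100), (600, 700)],
    ("web", "task1"): [(0, 50), (600, 350)],
}

def rate(samples):
    """Time axis: per-second rate from one series' own history."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# task:requests:rate10m -- one derived series per task
task_rate = {labels: rate(s) for labels, s in history.items()}

# job:requests:rate10m -- space axis: aggregate across task labels
job_rate = defaultdict(float)
for (job, _task), r in task_rate.items():
    job_rate[job] += r

print(task_rate)       # {('web', 'task0'): 1.0, ('web', 'task1'): 0.5}
print(dict(job_rate))  # {'web': 1.5}
```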

- - -

... it’s better to use counters, because they don’t lose meaning when events occur between sampling intervals. Should any activity or changes occur between sampling intervals, a gauge collection is likely to miss that activity.
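
A quick illustration of the difference:

```python
# A burst that rises and falls between two scrapes vanishes from a gauge
# (instantaneous value) but survives in the deltas of a monotonic counter.
per_second_events = [1, 50, 1, 1, 1, 1]   # burst at t=1, between samples

counter = 0
counter_samples, gauge_samples = [], []
for t, n in enumerate(per_second_events):
    counter += n
    if t in (2, 5):                        # scrape every 3 seconds
        counter_samples.append(counter)    # cumulative count survives
        gauge_samples.append(n)            # only the value "right now"

print(gauge_samples)    # [1, 1]   -> the burst is invisible
print(counter_samples)  # [52, 55] -> delta 52 vs 3 exposes the burst
```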

- - -

... example also uses a Google convention that helps readability. Each computed variable name contains a colon-separated triplet indicating the aggregation level, the variable name, and the operation that created that name.
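
Decoding the names from the sketch above: in `task:requests:rate10m`, `task` is the aggregation level, `requests` is the variable, and `rate10m` is the operation that created it; aggregating those per-task series by job yields `job:requests:rate10m`.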

- - -

... teams send their page-worthy alerts to their on-call rotation and their important but subcritical alerts to their ticket queues. All other alerts should be retained as informational data for status dashboards.
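
The routing policy itself is simple enough to sketch (the sink names here are mine, not Google's):

```python
# Page-worthy alerts go to the on-call rotation, important but subcritical
# ones to the ticket queue, everything else to dashboards only.
from enum import Enum

class Severity(Enum):
    PAGE = 1     # wake a human now
    TICKET = 2   # important but subcritical, fix within days
    INFO = 3     # context only

def route(alert_name: str, severity: Severity) -> str:
    if severity is Severity.PAGE:
        return f"pager://oncall-rotation/{alert_name}"
    if severity is Severity.TICKET:
        return f"ticketqueue://team/{alert_name}"
    return f"dashboard://status/{alert_name}"

print(route("ErrorBudgetBurn", Severity.PAGE))
```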

- - -

This decoupling allows the size of the system being monitored to scale independently of the size of alerting rules. These rules cost less to maintain because they’re abstracted over a common time-series format. New applications come ready with metric exports in all components and libraries to which they link, ...

- - -

The last quote could be taken as an action item from this post.

Releases, builds, versions: Notes from Google's SRE Book (#6)

About the messy releases.

Making sure that our tools behave correctly by default and are adequately documented makes it easy for teams to stay focused on features and users, rather than spending time reinventing the wheel (poorly) when it comes to releasing software.

- - -

… run their own release processes. Although we have thousands of engineers and products, we can achieve a high release velocity.

- We have embraced the philosophy that frequent releases result in fewer changes between versions. This approach makes testing and troubleshooting easier.

- - -

Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine. Instead, builds depend on known versions of build tools, such as compilers, and dependencies, such as libraries. The build process is self-contained and must not rely on services that are external to the build environment.
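
Google's build system is Blaze (open sourced as Bazel); as a toy illustration of the principle, not of that system's API, a hermetic build step can refuse any input whose content hash isn't pinned:

```python
# Toy illustration of hermeticity: the build consumes only dependencies
# whose exact content hash is pinned in the build definition, never
# whatever happens to be installed on the build machine.
import hashlib
import sys

PINNED_DEPS = {  # hypothetical lockfile: path -> expected SHA-256
    "third_party/libfoo-1.4.2.tar.gz":
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_hermetic_inputs() -> None:
    for path, expected in PINNED_DEPS.items():
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != expected:
            sys.exit(f"non-hermetic input {path}: got {digest}, "
                     f"pinned {expected}")

# verify_hermetic_inputs()  # run before the compiler ever starts
```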

- - -

Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline.

- - -

We also recommend creating releases at the revision number (version) of the last continuous test build that successfully completed all tests. These measures decrease the chance that subsequent changes made to the mainline will cause failures during the build performed at release time.

- - -

Most companies deal with the same set of release engineering problems regardless of their size or the tools they use: How should you handle versioning of your packages? Should you use a continuous build and deploy model, or perform periodic builds? How often should you release? What configuration management policies should you use? What release metrics are of interest?

- - -

The release engineer needs to understand the intention of how the code should be built and deployed. The developers shouldn’t build and “throw the results over the fence” to be handled by the release engineers.

- - -

In fact, SRE’s experience has found that reliable processes tend to actually increase developer agility: rapid, reliable production rollouts make changes in production easier to see. As a result, once a bug surfaces, it takes less time to find and fix that bug. Building reliability into development allows developers to focus their attention on what we really do care about—the functionality and performance of their software and systems.

- - -

… when you consider a web service that’s expected to be available 24/7, to some extent, every new line of code written is a liability.

- - -

If we release 100 unrelated changes to a system at the same time and performance gets worse, understanding which changes impacted performance, and how they did so, will take considerable effort or additional instrumentation.
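
Even in the best case, with a perfectly reproducible benchmark, isolating the culprit is a bisection over the bundled changes; a sketch with a stand-in benchmark:

```python
# Isolating one regressing change among 100 bundled ones costs about
# log2(100) ≈ 7 build-and-measure cycles, versus zero extra work when each
# change ships alone. `is_slow` stands in for a real "build this prefix of
# changes and benchmark it" step.
def first_bad_change(changes: list[str], is_slow) -> str:
    """Binary-search for the first change whose inclusion regresses perf."""
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(changes[: mid + 1]):   # build with changes[0..mid]
            hi = mid
        else:
            lo = mid + 1
    return changes[lo]

changes = [f"change{i}" for i in range(100)]
print(first_bad_change(changes, lambda cs: "change73" in cs))  # change73
```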

- - -

A point that will surely be perceived negatively by most overly ambitious startups:

Every time we say "no" to a feature, we are not restricting innovation; we are keeping the environment uncluttered of distractions so that focus remains squarely on innovation, and real engineering can proceed.


- - -

Toil, complaints, understanding role: Notes from Google's SRE Book (#5)

This chapter was a particularly interesting read. It deals with the suboptimal ways of doing day-to-day work that go unnoticed: such work is tiring and gives a false sense of accomplishment.

Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.

- - -

Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.

- There’s a floor on the amount of toil any SRE has to handle if they are on-call. A typical SRE has one week of primary on-call and one week of secondary on-call in each cycle (for discussion of primary versus secondary on-call shifts, see Chapter 11). It follows that in a 6-person rotation, at least 2 of every 6 weeks are dedicated to on-call shifts and interrupt handling, which means the lower bound on potential toil is 2/6 = 33% of an SRE’s time. In an 8-person rotation, the lower bound is 2/8 = 25%.
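
The arithmetic, spelled out:

```python
# One primary plus one secondary on-call week per cycle puts a floor of
# 2/n on toil for an n-person rotation.
for rotation_size in (6, 8):
    print(f"{rotation_size}-person rotation: "
          f"toil floor = 2/{rotation_size} = {2 / rotation_size:.0%}")
# 6-person rotation: toil floor = 2/6 = 33%
# 8-person rotation: toil floor = 2/8 = 25%
```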

- - -

It’s fine in small doses, and if you’re happy with those small doses, toil is not a problem. Toil becomes toxic when experienced in large quantities. If you’re burdened with too much toil, you should be very concerned and complain loudly.

- Your career progress will slow down or grind to a halt if you spend too little time on projects.

- - -

We work hard to ensure that everyone who works in or with the SRE organization understands that we are an engineering organization. Individuals or teams within SRE that engage in too much toil undermine the clarity of that communication and confuse people about our role.

- - -

If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE. Other teams may also start expecting SREs to take on such work.


- - -