System behavior, percentiles, SLOs: Notes from Google's SRE Book (#4)

The book introduces a good alternative to the averages commonly used in monitoring, with practical examples. Most people know that averages alone are not enough, but here, I feel, is a solution:

because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics.

- For example, measuring the API’s response latency does not capture poor user-perceived latency, which could be caused by a slow page load or by problems with the page’s JavaScript.

- - -

We generally prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. … … User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values

- An average can look healthy while hiding a large number of slow requests. Percentiles tell you what share of the total requests was served within a given time and what share was not. Comparing a high percentile (say, the 99th) with the median shows how much response times vary. Consistent behavior, meaning less variation, matters more for reliability than being really fast at some times and slow at others.
- Another example: look at instantaneous load rather than average load to see how well the system is serving real-time traffic.
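
To make the point about averages concrete, here is a minimal sketch in Python. The sample data is made up, and `percentile` and `fraction_within` are hypothetical helpers written for this note (nearest-rank method), not something taken from the book.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def fraction_within(values, threshold):
    """Share of samples at or below the threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)


# Hypothetical sample: 95 requests at 100 ms and 5 requests at 3000 ms.
latencies_ms = [100] * 95 + [3000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
within_200ms = fraction_within(latencies_ms, 200)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
print(f"served within 200 ms: {within_200ms:.0%}")
```

In this made-up sample the mean is 245 ms, which looks tolerable, while the 99th percentile is 3000 ms: one request in twenty takes three seconds, and only the percentile view shows it.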

Standard templates for SLIs also save effort and confusion. A template fixes things like the measurement window, how frequently to measure, which clusters to include, how the data is acquired (for example, at the server or at the client), and whether latency means time to first byte or time to last byte.
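
One way to picture such a template is as a small record type. This is only a sketch: the class name `SliTemplate`, its field names, and the example values are all my own assumptions, not a format the book prescribes.

```python
from dataclasses import dataclass
from typing import Tuple

# Allowed values are assumptions made for this sketch.
MEASUREMENT_POINTS = ("server", "client")
LATENCY_DEFINITIONS = ("time_to_first_byte", "time_to_last_byte")


@dataclass(frozen=True)
class SliTemplate:
    """Hypothetical template that pins down how an SLI is measured."""
    name: str
    measurement_window_s: int     # window the values are aggregated over
    measurement_interval_s: int   # how frequently a measurement is taken
    clusters: Tuple[str, ...]     # which clusters are included
    measured_at: str              # where the data is acquired
    latency_definition: str       # what "latency" means for this SLI

    def __post_init__(self):
        # Reject values outside the agreed vocabulary so that every
        # SLI built from the template means the same thing.
        if self.measured_at not in MEASUREMENT_POINTS:
            raise ValueError(f"measured_at must be one of {MEASUREMENT_POINTS}")
        if self.latency_definition not in LATENCY_DEFINITIONS:
            raise ValueError(
                f"latency_definition must be one of {LATENCY_DEFINITIONS}"
            )


# Example values are made up for illustration.
api_latency = SliTemplate(
    name="api_latency",
    measurement_window_s=60,
    measurement_interval_s=10,
    clusters=("cluster-a", "cluster-b"),
    measured_at="server",
    latency_definition="time_to_last_byte",
)
```

With the choices written down once, two teams reporting "latency" are at least talking about the same measurement.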

- - -

Start by thinking about (or finding out!) what your users care about, not what you can measure.

- - -

Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.

- - -

A good SLO is a helpful, legitimate forcing function for a development team.

- - -

Compare the SLIs to the SLOs, and decide whether or not action is needed. … Without the SLO, you wouldn’t know whether (or when) to take action.
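
As a rough sketch of that comparison, assume a latency SLI defined as the fraction of requests served within a threshold, and an SLO that sets a target for that fraction. The function names, the 500 ms threshold, the 99% target, and the sample data are all hypothetical.

```python
def good_fraction(latencies_ms, threshold_ms):
    """SLI: share of requests served at or below the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for v in latencies_ms if v <= threshold_ms)
    return good / len(latencies_ms)


def needs_action(sli, slo_target):
    """Action is needed when the measured SLI falls below the SLO target."""
    return sli < slo_target


# Hypothetical measurement window: 95 fast requests and 5 slow ones.
window_ms = [100] * 95 + [3000] * 5

sli = good_fraction(window_ms, threshold_ms=500)  # 95 of 100 are fast enough
action = needs_action(sli, slo_target=0.99)       # assumed SLO: 99% within 500 ms

print(f"SLI={sli:.2%}, action needed: {action}")
```

Here the SLI comes out at 95% against a 99% target, so the check says to act. Without the target, the 95% figure alone would not tell you whether anything is wrong.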


- - -
