Downtime, outages, service risk: Notes from Google's SRE Book (#2)

I failed to deliver on my commitment to read a few pages of the book each day, but it hasn't demotivated me from continuing, because the book is too good.


To make this problem tractable and consistent across many types of systems we run, we focus on unplanned downtime.

- For service risk, it is not obvious how to reduce all the potential factors of degraded performance into a single metric. Degraded or unreliable performance can affect user satisfaction, revenue, and trust, and most of these factors are hard to measure.
- A metric that captures unexpected service downtime can serve as the property of the system we want to optimize. Unplanned downtime is not just when the system is completely down; it can also cover failed requests, delays in serving requests, or anything else that affects users.

- - -

… instead of using metrics around uptime, we define availability in terms of the request success rate.

- measuring service risk in terms of objective metrics:
  1. time-based availability over a period (last 30 days or last 3 months): uptime / (uptime + downtime)
  2. another measure could be 'request success rate': successful/total-requests over a 1-day window
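A quick sketch of both metrics side by side, using made-up numbers (the window sizes and counts here are illustrative, not from the book):

```python
# Two ways to measure availability, with hypothetical numbers.

# 1. Time-based: uptime / (uptime + downtime) over a 30-day window.
uptime_minutes = 43_150    # minutes the service was up
downtime_minutes = 50      # minutes of unplanned downtime
time_based = uptime_minutes / (uptime_minutes + downtime_minutes)
print(f"time-based availability: {time_based:.4%}")

# 2. Request-based: successful requests / total requests over a window.
successful_requests = 9_987_000
total_requests = 10_000_000
success_rate = successful_requests / total_requests
print(f"request success rate: {success_rate:.4%}")  # 99.8700%
```

The request-based metric also works for systems that are not continuously serving (or that are partially degraded), where "uptime" is hard to define.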

The second metric can be adapted for batch processes, pipelines, storage, etc. with minimal modification. For example, for a pipeline that reads records from a CSV file, transforms them, and writes them to a database, the ratio of records processed successfully to total records attempted is a similar availability measure.
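The pipeline case above could be sketched like this (a minimal illustration; the function name and the success/failure counts are my own, not from the book):

```python
from typing import Iterable

def pipeline_availability(outcomes: Iterable[bool]) -> float:
    """Availability of a batch pipeline: records processed successfully
    divided by all records attempted (a yield-style success rate)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 1.0  # nothing attempted, nothing failed
    return sum(outcomes) / len(outcomes)

# Hypothetical run: 998 of 1000 CSV records made it into the database.
run = [True] * 998 + [False] * 2
print(pipeline_availability(run))  # 0.998
```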

- - -

In 2006, … … We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.

- - -

Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.
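To make "same absolute number of errors" concrete, here is a back-of-the-envelope comparison with invented traffic numbers: a constant 0.05% failure rate over a month produces the same error count as one short full-site outage.

```python
# Hypothetical traffic: two failure shapes, same absolute error count.
requests_per_day = 1_000_000
days = 30
total_requests = requests_per_day * days

# (a) A constant 0.05% failure rate, spread evenly over the month.
constant_errors = int(total_requests * 0.0005)
print(constant_errors)  # 15000

# (b) One full-site outage producing the same number of errors:
# that many failed requests is this fraction of a single day's traffic.
outage_fraction = constant_errors / requests_per_day
outage_minutes = outage_fraction * 24 * 60
print(round(outage_minutes, 1))  # ~21.6 minutes of total outage
```

Same error budget spent, very different user experience: a steady trickle of failures may go unnoticed by most users, while a 21-minute full outage is visible to everyone at once.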

- - -

The requirements for building and running infrastructure components differ from the requirements for consumer products in a number of ways.

- For example, consumer services that use BigTable in the path of a user request need low latency and nearly empty request queues, whereas infrastructure services that use BigTable for offline analysis care more about throughput and want the task queue to never be empty.

- - -

Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, … 

- such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.

- - -

A point which I feel is very much applicable to startups:

Usually, pre-existing teams have worked out some kind of informal balance between them as to where the risk/effort boundary lies. Unfortunately, one can rarely prove that this balance is optimal, rather than just a function of the negotiating skills of the engineers involved. Nor should such decisions be driven by politics, fear, or hope.


- Instead, it's better to come up with an objective metric, agreed upon by both the engineering and product teams, that can be used to guide the negotiations in a reproducible way and without resentment. The more data-based the decision, the easier it is to defend.

- - -
