SRE skills, goals, culture: Notes from Google's SRE Book (#1)

I will be posting my notes from the famous Google SRE book: whatever I felt was important in a section or a paragraph. I didn't look for similar notes or summaries that may already be available on the internet. This is mainly for my own motivation, to help me keep reading and to post some of my learnings here.

The plan is to have a post each week for 10 weeks in a row. Let's see how it works out.

The following is the first set of my notes from Sep 30:


50-60% of SREs are Google software engineers (SEs); the other 40-50% are very close to the SE qualifications, with 85-99% of the required skill set, and in addition have a set of technical skills that is useful to SRE but rare for most SEs …

- common additional skills for SRE:
  • UNIX system internals
  • networking (Layer 1 to Layer 3)

- - -

The team tasked with managing a service needs to code or it will drown. Google places a 50% cap on the aggregate “ops” work for all SREs: tickets, on-call, manual tasks, etc.

- - -

... have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development change their practices. This could mean shifting some of the ops work to the dev team or, at times, adding staff.

- while this may seem ridiculous at first, a principle set at the core has to be followed so that things do not fall apart later. SREs should constantly have the bandwidth to engage in creative, autonomous engineering (which is the ideal goal). For a company that aims to scale to millions, failing to manage this is a weakness in its foundation from the very beginning.
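
To make the "measure how SRE time is spent" step concrete for myself, here is a minimal sketch. The category names, the `team_week` numbers and both helper functions are my own assumptions for illustration; the book does not prescribe any particular tooling.

```python
# Hypothetical categories counted as "ops" work, following the quote above.
OPS_CATEGORIES = {"tickets", "on-call", "manual tasks"}
OPS_CAP = 0.50  # the 50% cap on aggregate ops work


def ops_fraction(hours_by_category):
    """Return the share of all logged hours that went to ops work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for cat, h in hours_by_category.items() if cat in OPS_CATEGORIES)
    return ops / total


def over_cap(hours_by_category, cap=OPS_CAP):
    """True when the team spends more than the cap on ops work."""
    return ops_fraction(hours_by_category) > cap


# Made-up weekly hours for one team (an assumption, not real data).
team_week = {"tickets": 60, "on-call": 50, "manual tasks": 20, "development": 70}

if over_cap(team_week):
    print(f"ops share is {ops_fraction(team_week):.0%}: shift ops work or add staff")
```

A team that stays over the cap week after week is the signal to change practices, as the quote describes.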

- - -

Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

- - -

SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. … An outage is no longer a “bad” thing: it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

- 100% is the wrong reliability target. No user can tell the difference between a 100% and a 99.999% available system, so putting a huge effort into that last 0.001% has no benefit. The reliability target should be set by asking what level of availability users will be happy with, given how they use the product. `1 - that target` is the error budget, which we can spend on anything we want, for example taking risks while launching new features quickly.
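
The `1 - that target` arithmetic is easy to try out. The sketch below is my own illustration: the function names, the 30-day window and the request counts are assumptions, not something taken from the book.

```python
def error_budget(target):
    """Error budget as a fraction: 1 - availability target."""
    return 1.0 - target


def allowed_downtime_minutes(target, period_days=30):
    """Minutes of full downtime the budget allows over the period."""
    return error_budget(target) * period_days * 24 * 60


def remaining_budget(target, total_requests, failed_requests):
    """Failed requests we can still afford in a request-based budget."""
    return error_budget(target) * total_requests - failed_requests


for target in (0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"target {target}: about {minutes:.2f} minutes of downtime per 30 days")
```

With a 99.9% target over 30 days the budget works out to roughly 43 minutes; at 99.999% it is under half a minute, which shows how expensive each extra nine becomes.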

- - -

The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better.

- when humans are necessary in emergency response, thinking through and recording the best practices ahead of time in a “playbook” produces roughly a threefold improvement in mean time to repair (MTTR) compared to acting without preparation (“winging it”). This also gives newcomers the chance to learn about the system and leaves the company less dependent on any single person (the “hero”).

- - -

By removing humans from the loop (of change management), these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.

- changes in a live system are a leading source of outages. Progressive rollouts, quick detection of problems, and rolling back changes when problems arise minimize the number of users exposed.
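
Here is how I picture the rollout, detect, roll back loop as a minimal sketch. The stage fractions, the error-rate threshold and the `error_rate_at` callback are all hypothetical; a real system would read its monitoring instead of calling a function like this.

```python
def progressive_rollout(stages, error_rate_at, max_error_rate):
    """Expand exposure stage by stage and roll back at the first bad stage.

    stages: increasing fractions of users, e.g. [0.01, 0.10, 0.50, 1.0]
    error_rate_at: callable giving the observed error rate at an exposure
    Returns (final_exposure, rolled_back, peak_exposure).
    """
    peak_exposure = 0.0
    for fraction in stages:
        peak_exposure = fraction
        if error_rate_at(fraction) > max_error_rate:
            # Problem detected: roll back so nobody stays on the new version.
            return 0.0, True, peak_exposure
    return peak_exposure, False, peak_exposure


STAGES = [0.01, 0.10, 0.50, 1.0]  # hypothetical rollout stages

# A bad release that fails for everyone is caught at the first stage.
final, rolled_back, peak = progressive_rollout(STAGES, lambda f: 0.2, 0.01)
print(f"rolled back: {rolled_back}, users exposed at most: {peak:.0%}")
```

The point of the small first stage is visible in `peak_exposure`: a release that is broken from the start only ever reaches the first slice of users before it is rolled back.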

- - -

… that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity (for future demand) is in place by the time it is needed.

- both organic and inorganic demand should be taken into account when forecasting demand, and capacity should then be provisioned accordingly. Regular load testing should be done to correlate infrastructure capacity to service capacity.
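
A rough sketch of that forecast-then-provision step follows. The compound growth model, the headroom factor and every number are my own assumptions; `qps_per_server` stands in for whatever figure load testing gives you.

```python
import math


def forecast_demand(current_qps, monthly_growth, months, inorganic_qps=0.0):
    """Organic demand grows at a compound monthly rate; inorganic demand
    (launches, marketing campaigns, etc.) is added on top as a lump sum."""
    organic = current_qps * (1 + monthly_growth) ** months
    return organic + inorganic_qps


def servers_needed(demand_qps, qps_per_server, headroom=0.0):
    """Servers required for the demand, with optional spare headroom.

    qps_per_server is assumed to come from load testing, which ties
    raw infrastructure capacity to service capacity.
    """
    return math.ceil(demand_qps * (1 + headroom) / qps_per_server)


# Hypothetical inputs: 1000 QPS today, 5% monthly growth, a launch adding 500 QPS.
demand = forecast_demand(1000, 0.05, months=6, inorganic_qps=500)
print(f"forecast: {demand:.0f} QPS, servers: {servers_needed(demand, 100, headroom=0.3)}")
```

If load tests show that `qps_per_server` has dropped after a release, the same demand needs more servers, which is why the load testing has to be regular.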

- - -

Resource use is a function of load, capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software.

- - -

Reference book:
Site Reliability Engineering, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly). Copyright 2016 Google, Inc., 978-1-491-92912-4.

