System behavior, percentiles, SLOs: Notes from Google's SRE Book (#4)

This chapter introduces a good alternative to the averages commonly used in monitoring, with practical examples. Everyone knows that taking averages alone is not enough, but here, I feel, is a solution:

because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics.

- For example, measuring the response latency of the API does not capture poor user-perceived latency, which could be due to a slow page load or problems with the page's JavaScript.

- - -

We generally prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. … … User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values

- Averaging can hide a large number of slow requests behind a good-looking mean. Percentiles tell you what fraction of the total requests were served under a given value and what fraction was not, and the behavior of an nth-percentile value shows you the variance in response times. Consistent behavior, i.e. low variance, matters more for reliability than being really fast at some times and slow at others.
- Another example: measuring instantaneous load rather than average load, to see how well the system serves real-time traffic.
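A tiny sketch of this point, using made-up latency numbers: the mean looks healthy while a high percentile exposes the slow tail.

```python
# Hypothetical latencies: 95 fast requests and 5 very slow ones (ms).
import statistics

latencies_ms = [100] * 95 + [5000] * 5

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct% of the samples are at or below it."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print(statistics.mean(latencies_ms))  # 345  -- looks acceptable
print(percentile(latencies_ms, 50))   # 100  -- the typical request
print(percentile(latencies_ms, 99))   # 5000 -- the tail users actually feel
```

Five percent of users here wait fifty times longer than the median, and the mean gives no hint of it.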

Also, having standard templates for SLIs helps save effort and avoid confusion: the measurement window, how frequently to measure, which clusters to include, how the data is acquired (e.g., at the server side or the client side), whether latency means time to first byte or time to last byte, and so on.

- - -

Start by thinking about (or finding out!) what your users care about, not what you can measure.

- - -

Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.

- - -

A good SLO is a helpful, legitimate forcing function for a development team.

- - -

Compare the SLIs to the SLOs, and decide whether or not action is needed. … Without the SLO, you wouldn’t know whether (or when) to take action.


- - -

Defining objectives, performance indicators, agreements: Notes from Google's SRE Book (#3)

SLIs, SLOs, and SLAs: concepts that many Indian tech companies lack.

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).

- An SLI is a quantitative measure of a level of service.
- An SLO is a bound on an SLI (an upper bound, a lower bound, or a target).
- An SLA spells out the consequences of meeting or missing the SLOs.

Seems like an SLA is not just something you can shut people up with, but something more objective.

- - -

Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is, and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.

- - -

On the Chubby example:

In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.

- - -

SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs.

- - -

Whether or not a particular service has an SLA, it’s valuable to define SLIs and SLOs and use them to manage the service.

- - -

Downtime, outages, service risk: Notes from Google's SRE Book (#2)

I failed to deliver on my own commitment to read some pages of the book each day, but that hasn't demotivated me from continuing, because the book is too good.


To make this problem tractable and consistent across many types of systems we run, we focus on unplanned downtime.

- For service risk, it is not clear how to reduce all the potential factors of degraded performance into a single metric. Degraded or unreliable performance can cost user satisfaction, revenue, and trust, and most of these factors are hard to measure.
- A metric that captures unexpected service downtime can be the property of the system we optimize. Unplanned downtime is not just the time the system was down; it can also cover failed requests, delayed responses, or anything else that affects users.

- - -

… instead of using metrics around uptime, we define availability in terms of the request success rate.

- measuring service risk in terms of objective metrics:
  1. time-based availability over a period (the last 30 days or the last 3 months): uptime / (uptime + downtime)
  2. request success rate: successful requests / total requests, over, say, a 1-day window

The second metric can be adapted to batch processes, pipelines, storage, etc. with minimal modification. For example, for a pipeline that reads records from a CSV file, transforms them, and puts them in a database, the ratio of records processed successfully to total records is a similar availability measure.
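A minimal sketch of the two metrics, with made-up numbers (the function names are mine):

```python
def time_based_availability(uptime_s, downtime_s):
    """Availability = uptime / (uptime + downtime)."""
    return uptime_s / (uptime_s + downtime_s)

def request_success_rate(successful, total):
    """Availability = successful requests / total requests."""
    return successful / total

# A 30-day window (2,592,000 s) with 43 minutes (2,580 s) of downtime:
print(time_based_availability(uptime_s=2_589_420, downtime_s=2_580))

# A pipeline run: 9,980 of 10,000 CSV records processed successfully.
print(request_success_rate(successful=9_980, total=10_000))  # 0.998
```

The second function works unchanged whether "total" counts HTTP requests or pipeline records, which is exactly why it generalizes beyond serving systems.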

- - -

In 2006, … … We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.

- - -

Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.

- - -

The requirements for building and running infrastructure components differ from the requirements for consumer products in a number of ways.

- For example, consumer services using Bigtable in the path of a user request need low latency and request queues that are empty almost all the time, whereas infrastructure services using Bigtable for offline analysis want the queue of tasks to never be empty, prioritizing throughput.

- - -

Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, … 

- such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.

- - -

A point which I feel is very much applicable to startups:

Usually, pre-existing teams have worked out some kind of informal balance between them as to where the risk/effort boundary lies. Unfortunately, one can rarely prove that this balance is optimal, rather than just a function of the negotiating skills of the engineers involved. Nor should such decisions be driven by politics, fear, or hope.


- Instead, it's better to come up with an objective metric, agreed upon by both the engineering team and the product team, that can guide the negotiations in a reproducible way and without resentment. The more data-based the decision, the easier it is to defend.

- - -

AWS instance reservation simplified

A while ago I was working on AWS cost reduction at my company, and reserving instances is an essential step in such a project. This required me to go through many AWS docs on the different reservation options to get the whole picture. I have tried to summarize the important points here, some of which are easy to miss if you don't go through all the documents carefully:
Reservation, Modification, and Exchange are different operations.
  • How much you reserve and what you do with the reservations is up to you; discounts get applied accordingly.
  • When modifying a reservation, the instance footprint must match: normalization factor × number of instances, within the same family (e.g., m4).
  • You can't modify to a smaller footprint. If you need a larger reserved footprint, simply purchase more reserved instances.
  • You cannot modify instance types that come in only one size, like i3.metal.
  • The instances you want to modify cannot be listed on the Marketplace at the same time, of course.
  • While an exchange lets you swap the instances for any type or size, you cannot control the number of instances: AWS gives you however many instances are needed to keep the resulting cost equal to or higher than what you are paying for the original instances.
  • An IMPORTANT corollary: you can exchange any number of instances of any types, but only for a single target instance type.
  • This also means you can split a reservation and exchange the parts for different instance types.
The math works out as long as we keep the total normalization units the same or higher and, in most cases, keep the instance family the same as well.
For example, in the case of a modification, let’s say you already own an RI for a c4.8xlarge. This RI now applies to any usage of a Linux/UNIX C4 instance with shared tenancy in the region. This could be:
One c4.8xlarge instance, or
Two c4.4xlarge instances, or
Four c4.2xlarge instances, or
Sixteen c4.large instances, or
One c4.4xlarge and eight c4.large, and more.
For example, in the case of an exchange, if you have a t2.large CRI and want to change it into a smaller t2.medium instance plus an m3.medium instance:
Step 1. Modify the t2.large CRI by splitting it into two t2.medium CRIs.
Step 2. Exchange one of the new t2.medium CRIs for an m3.medium CRI.
It's easier to think about reserving in terms of normalization units now.
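That normalization-unit arithmetic can be sketched in a few lines. The per-size factors below are copied from AWS's published normalization table (an assumption worth verifying against the current docs):

```python
# Normalization units per instance size, per AWS's RI documentation.
SIZE_UNITS = {
    "nano": 0.25, "micro": 0.5, "small": 1, "medium": 2,
    "large": 4, "xlarge": 8, "2xlarge": 16, "4xlarge": 32,
    "8xlarge": 64, "16xlarge": 128,
}

def footprint(instance_type, count):
    """Total normalization units for `count` instances of e.g. 'c4.8xlarge'."""
    family, size = instance_type.split(".")
    return SIZE_UNITS[size] * count

# One c4.8xlarge RI is 64 units...
original = footprint("c4.8xlarge", 1)
# ...so any same-family combination with the same footprint is a valid modification:
assert footprint("c4.4xlarge", 2) == original
assert footprint("c4.2xlarge", 4) == original
assert footprint("c4.large", 16) == original
assert footprint("c4.4xlarge", 1) + footprint("c4.large", 8) == original
```

The same check explains the exchange example: splitting one t2.large (4 units) into two t2.medium CRIs (2 units each) preserves the footprint.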
DOUBT:
I wonder if someone would do something like this:
  • You have a reservation with two t2.micro instances (giving you a footprint of 1) and a reservation with one t2.small instance (giving you a footprint of 1). You merge both reservations to a single reservation with one t2.medium instance—the combined instance size footprint of the two original reservations equals the footprint of the modified reservation.
  • In both cases, you will get charged the same. Not sure.

SRE skills, goals, culture: Notes from Google's SRE Book (#1)

I will be posting my notes from the famous Google SRE book: whatever I felt was important in a section or a paragraph. I haven't looked at the similar notes and summaries that may be available on the internet; this is just for my own motivation, to help me continue reading and post some of my learnings here.

The plan is to have a post each week for 10 weeks in a row. Let's see how it works out.

The following is the first set of my notes from Sep 30:


50-60% are SEs and 40-50% are close to SEs, with 85-99% of the skill set required, and who in addition had a set of technical skills that is useful to SRE but rare for most SEs …

- common additional skills for SRE:
  • UNIX system internals
  • networking (Layer 1 to Layer 3)

- - -

the team tasked with managing a service needs to code or it will drown. Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc.

- - -

... have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development change their practices. This could mean shifting some of the ops work to the dev team, or at times, adding staff to the team.

- While this may seem ridiculous at the time, the principle set at the core has to be followed in order not to fall apart later. SREs should constantly have the bandwidth to engage in creative, autonomous engineering (which is the ideal goal). Not being able to manage this is a weakness in the very foundation of a company that aims to scale to millions.

- - -

Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

- - -

SRE’s goal is no longer ”zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. … An outage is no longer a “bad” thing — it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

- 100% is the wrong reliability target. No user can tell the difference between a 100% and a 99.999% available system, so putting a huge effort into that last 0.001% has no benefit. The reliability target should be set by asking what level of availability will make users happy, given how they use the product. `1 - that target` is the error budget, which we can spend on anything we want, for example taking risks to launch new features quickly.
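A back-of-the-envelope sketch of turning an availability target into an error budget (the window length and targets here are hypothetical):

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed 'bad' minutes per window for a given availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60

print(error_budget_minutes(0.999))    # ~43.2 minutes per 30 days
print(error_budget_minutes(0.99999))  # ~0.43 minutes -- the cost of chasing nines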

- - -

The hero jack-of-all-trades on-call engineer does work, but the practice on-call engineer armed with a playbook works much better.

- when humans are necessary in emergency response, thinking through and recording the best practices ahead of time in a “playbook” produces roughly a three times improvement in mean-time-to-repair as compared to acting without preparation (“winging it”). This also ensures the newcomers get the chance to learn about the system and the company has less dependency on any single person (“hero”).

- - -

By removing humans from the loop (of change management), these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.

- changes in a live system are the prominent source of outages. Having progressive rollouts, quickly detecting problems and rolling back changes in case of problems, minimize the number of users exposed.

- - -

… that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity (for future demand) is in place by the team it is needed.

- both organic and inorganic demand should be taken into account while demand forecasting and then, provision accordingly. Regular load testing to correlate infrastructure-capacity to service-capacity should be done.

- - -

Resource use is a function of load, capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software.

- - -

Reference book:
Site Reliability Engineering, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly). Copyright 2016 Google, Inc., 978-1-491-92912-4.