Metrics, monitoring, dashboards: Notes from Google's SRE Book (#7)

It's very easy to make dashboards in Datadog or New Relic or CloudWatch or whatever monitoring system you use, but unless you know which metrics need to be monitored for your SLOs, there are countless ways to miss problems. There can be hundreds of reasons for your application to stop working, and monitoring container status or disk space or CPU utilization is just not going to catch most of them.

Below I quote some ideas from the book that helped Google evolve its monitoring to the point of sustaining many nines of availability, and that made me appreciate the art of monitoring. I am fascinated by how Google leveraged all those problem-solving and design skills to make something this useful and efficient; it tells us that even if we think we understand the ideas of simplicity or readability or scaling, we probably don't.


Google’s monitoring systems don’t just measure simple metrics, such as the average response time of an unladen European web server; we also need to understand the distribution of those response times across all web servers in that region. This knowledge enables us to identify the factors contributing to the latency tail.

- - -

This new model made the collection of time-series a first-class role of the monitoring system, and replaced those check scripts with a rich language for manipulating time-series into charts and alerts. ... the history of the collected data can be used for that alert computation as well.

- - -

Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format; this enables mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup.

- - -

To facilitate mass collection, the metrics format had to be standardized.

- - -

Typically, a team runs a single Borgmon per cluster, and a pair at the global level.

- - -

each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default.
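
To make the idea concrete, here is a minimal sketch of my own (not Google's code): a process exposing its exported variables over HTTP in a plain-text format that a collector could scrape. The /varz path follows the convention the book mentions; the variable names and values are made up.

```python
# A process serving its in-memory counters/gauges over HTTP so a monitoring
# system can scrape them cheaply, without running scripts or subprocesses.
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process "exported variables"; real code would update these from request handlers.
EXPORTED_VARS = {
    "http_requests_total": 1042,   # a counter
    "errors_total": 7,             # a counter
    "queue_length": 3,             # a gauge
}

class VarzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/varz":
            self.send_response(404)
            self.end_headers()
            return
        body = "\n".join(f"{name} {value}" for name, value in EXPORTED_VARS.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8000), VarzHandler).serve_forever()
```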

- - -

... using service discovery reduces the cost of maintaining it and allows the monitoring to scale.

- - -

... Borgmon rules, consists of simple algebraic expressions that compute time-series from other time-series. These rules can be quite powerful because they can query the history of a single time-series (i.e., the time axis), query different subsets of labels from many time-series at once (i.e., the space axis), and apply many mathematical operations.
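
To get a feel for what such a rule does, here is a toy sketch of my own (not Borgmon syntax, and the data is made up): a derived time-series is computed by looking back along the time axis of each input series and summing across a label on the space axis.

```python
# Samples of a per-task request counter: {labels -> list of (timestamp_sec, value)}.
series = {
    ("job=webserver", "instance=host0"): [(0, 100), (10, 160), (20, 240)],
    ("job=webserver", "instance=host1"): [(0,  90), (10, 130), (20, 210)],
}

def rate(points, window):
    """Per-second increase of a counter over the last `window` seconds (time axis)."""
    recent = [p for p in points if p[0] >= points[-1][0] - window]
    return (recent[-1][1] - recent[0][1]) / (recent[-1][0] - recent[0][0])

# Space axis: sum the per-instance rates into a single job-level value.
job_qps = sum(rate(points, window=20) for points in series.values())
print(job_qps)  # 7.0 + 6.0 = 13.0 requests/second
```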

- - -

... it’s better to use counters, because they don’t lose meaning when events occur between sampling intervals. Should any activity or changes occur between sampling intervals, a gauge collection is likely to miss that activity.
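
A small sketch of my own to illustrate the point (the numbers and the scrape times are made up):

```python
# True per-second request counts, including a burst at t=4..6 between scrapes.
per_second_requests = [1, 1, 1, 1, 50, 50, 50, 1, 1, 1]

# A counter accumulates every event, so scraping it only at the start and the end
# still shows that 157 requests happened in between.
counter_at_start = 0
counter_at_end = sum(per_second_requests)
print(counter_at_end - counter_at_start)   # 157 -> the burst shows up in the rate

# A gauge reports only the instantaneous value at scrape time, so sampling
# "current requests per second" at the two scrape times sees 1 and 1; the burst is invisible.
gauge_at_start = per_second_requests[0]    # 1
gauge_at_end = per_second_requests[-1]     # 1
print(gauge_at_start, gauge_at_end)
```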

- - -

... example also uses a Google convention that helps readability. Each computed variable name contains a colon-separated triplet indicating the aggregation level, the variable name, and the operation that created that name.
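
For instance, names following that convention might look like the (hypothetical) ones below; Prometheus, which descends from this lineage, documents a very similar level:metric:operations pattern for its recording rules.

```python
# Hypothetical metric names following the level:name:operation convention.
names = [
    "task:http_requests:rate10s",   # per-task request rate over 10 seconds
    "job:http_requests:rate10s",    # the same rate, aggregated to the job level
    "job:errors:ratio_rate10s",     # an error ratio computed from two rates
]
for name in names:
    level, variable, operation = name.split(":")
    print(f"{variable!r} aggregated at {level!r} level, produced by {operation!r}")
```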

- - -

... teams send their page-worthy alerts to their on-call rotation and their important but subcritical alerts to their ticket queues. All other alerts should be retained as informational data for status dashboards.

- - -

This decoupling allows the size of the system being monitored to scale independently of the size of alerting rules. These rules cost less to maintain because they’re abstracted over a common time-series format. New applications come ready with metric exports in all components and libraries to which they link, ...

- - -

The last quote could be taken as an action item from this post.

Releases, builds, versions: Notes from Google's SRE Book (#6)

About the messy releases.

Making sure that our tools behave correctly by default and are adequately documented makes it easy for teams to stay focused on features and users, rather than spending time reinventing the wheel (poorly) when it comes to releasing software.

- - -

… run their own release processes. Although we have thousands of engineers and products, we can achieve a high release velocity.

- We have embraced the philosophy that frequent releases result in fewer changes between versions. This approach makes testing and troubleshooting easier.

- - -

Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine. Instead, builds depend on known versions of build tools, such as compilers, and dependencies, such as libraries. The build process is self-contained and must not rely on services that are external to the build environment.

- - -

Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline.

- - -

We also recommend creating releases at the revision number (version) of the last continuous test build that successfully completed all tests. These measures decrease the chance that subsequent changes made to the mainline will cause failures during the build performed at release time.

- - -

Most companies deal with the same set of release engineering problems regardless of their size or the tools they use: How should you handle versioning of your packages? Should you use a continuous build and deploy model, or perform periodic builds? How often should you release? What configuration management policies should you use? What release metrics are of interest?

- - -

The release engineer needs to understand the intention of how the code should be built and deployed. The developers shouldn’t build and “throw the results over the fence” to be handled by the release engineers.

- - -

In fact, SRE’s experience has found that reliable processes tend to actually increase developer agility: rapid, reliable production rollouts make changes in production easier to see. As a result, once a bug surfaces, it takes less time to find and fix that bug. Building reliability into development allows developers to focus their attention on what we really do care about—the functionality and performance of their software and systems.

- - -

… when you consider a web service that’s expected to be available 24/7, to some extent, every new line of code written is a liability.

- - -

If we release 100 unrelated changes to a system at the same time and performance gets worse, understanding which changes impacted performance, and how they did so, will take considerable effort or additional instrumentation.

- - -

A point that will surely be perceived negatively by most overly ambitious startups:

Every time we say "no" to a feature, we are not restricting innovation; we are keeping the environment uncluttered of distractions so that focus remains squarely on innovation, and real engineering can proceed.


- - -

Toil, complaints, understanding role: Notes from Google's SRE Book (#5)

This chapter was a particularly interesting read. It deals with the suboptimal ways of doing day-to-day work that usually go unnoticed; such work is tiring and gives a false sense of accomplishment.

Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.

- - -

Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.

- There’s a floor on the amount of toil any SRE has to handle if they are on-call. A typical SRE has one week of primary on-call and one week of secondary on-call in each cycle (for a discussion of primary versus secondary on-call shifts, see the chapter on being on-call). It follows that in a 6-person rotation, at least 2 of every 6 weeks are dedicated to on-call shifts and interrupt handling, which means the lower bound on potential toil is 2/6 = 33% of an SRE’s time. In an 8-person rotation, the lower bound is 2/8 = 25%.

- - -

It’s fine in small doses, and if you’re happy with those small doses, toil is not a problem. Toil becomes toxic when experienced in large quantities. If you’re burdened with too much toil, you should be very concerned and complain loudly.

- Your career progress will slow down or grind to a halt if you spend too little time on projects.

- - -

We work hard to ensure that everyone who works in or with the SRE organization understands that we are an engineering organization. Individuals or teams within SRE that engage in too much toil undermine the clarity of that communication and confuse people about our role.

- - -

If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE. Other teams may also start expecting SREs to take on such work.


- - -

System behavior, percentiles, SLOs: Notes from Google's SRE Book (#4)

This chapter introduces a good alternative to the averages commonly used in monitoring, with practical examples. Everyone knows that taking averages is just not enough, but here, I feel, is a solution:

because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics.

- For example, measuring the response latency of the API does not capture poor user-perceived latency, which could be due to a slow page load or problems with the page’s JavaScript.

- - -

We generally prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. … … User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values

- An average can look good while hiding a large number of slow requests. Percentiles tell you what fraction of the total requests were served under a given value and what fraction wasn’t. The behavior of an nth-percentile value tells you the variance in response times. Consistent behavior, that is, less variation, determines reliability more than being really fast some of the time and slow at other times (see the sketch after this list).
- Another example is measuring instantaneous load rather than average load, to see how well a system is serving the real-time load.
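
Here is a small sketch with made-up latencies showing how a mean hides a slow tail while percentiles expose it (the nearest-rank percentile below is a simplification of what real monitoring systems compute):

```python
import statistics

# 95 fast requests and 5 very slow ones, in milliseconds
latencies_ms = [100] * 95 + [4000] * 5

def percentile(values, p):
    """Nearest-rank percentile: the value at or below which roughly p% of requests fall."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(statistics.mean(latencies_ms))   # 295.0 ms -- looks acceptable
print(percentile(latencies_ms, 50))    # 100 ms
print(percentile(latencies_ms, 99))    # 4000 ms -- the slow tail is visible
```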

Also, having standard templates for SLIs, covering the measurement window, how frequently to measure, which clusters to include, how the data is acquired (for example, on the server side or the client side), whether latency means time to first byte or time to last byte, and so on, helps save effort and avoid confusion.
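
As an illustration, such a template could be as simple as a checklist of fields to fill in for every SLI; the field names and values below are my own, not taken from the book.

```python
# A hypothetical SLI specification template.
latency_sli_spec = {
    "name": "frontend_request_latency",
    "measurement_window": "28d",                  # rolling window the SLO is evaluated over
    "sampling_frequency": "10s",                  # how often the metric is collected
    "clusters_included": "all user-facing clusters",
    "measured_at": "client side",                 # vs. server side
    "latency_definition": "time to last byte",    # vs. time to first byte
    "aggregation": "99th percentile",
}
```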

- - -

Start by thinking about (or finding out!) what your users care about, not what you can measure.

- - -

Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.

- - -

A good SLO is a helpful, legitimate forcing function for a development team.

- - -

Compare the SLIs to the SLOs, and decide whether or not action is needed. … Without the SLO, you wouldn’t know whether (or when) to take action.


- - -

Defining objectives, performance indicators, agreements: Notes from Google's SRE Book (#3)

SLIs, SLOs, and SLAs: concepts that many Indian tech companies lack.

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).

- An SLI is a quantitative measure of a level of service.
- An SLO is a target value or bound (upper, lower, or a range) for an SLI.
- An SLA spells out the consequences of meeting or missing the SLOs.

Seems like an SLA is not just something you can shut people up with, but something more objective.
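
A toy sketch with made-up numbers of how the three relate: the SLI is what we measure, the SLO is the internal target, and the SLA is the (usually looser) external promise with consequences attached.

```python
successful_requests = 999_100
total_requests = 1_000_000

sli_availability = successful_requests / total_requests   # the measurement: 0.9991
slo_target = 0.999                                         # internal objective
sla_target = 0.995                                         # contractual promise

print(f"SLI = {sli_availability:.4%}")
print("SLO met" if sli_availability >= slo_target else "SLO missed -> act")
print("SLA met" if sli_availability >= sla_target else "SLA breached -> consequences, e.g., credits")
```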

- - -

Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is, and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.

- - -

On the Chubby example:

In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.

- - -

SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs.

- - -

Whether or not a particular service has an SLA, it’s valuable to define SLIs and SLOs and use them to manage the service.

- - -

Downtime, outages, service risk: Notes from Google's SRE Book (#2)

I failed to deliver on my own commitment of reading a few pages of the book each day, but that hasn't demotivated me from continuing, because the book is too good.


To make this problem tractable and consistent across many types of systems we run, we focus on unplanned downtime.

- For service risk, it is not clear how to reduce all the potential factors of degraded performance into a single metric. Degraded or unreliable performance can cost user satisfaction, revenue, and trust, and most of these factors are hard to measure.
- A metric that measures unplanned service downtime can be the property of the system we want to optimize. Unplanned downtime is not just the time the system was down; it can also cover failed requests, delays in serving requests, or anything else that affects users.

- - -

… instead of using metrics around uptime, we define availability in terms of the request success rate.

- measuring service risk in terms of objective metrics:
  1. time-based availability over a period of time (last 30 days or last 3 months): uptime / (uptime + downtime)
  2. another measure is the request success rate: successful requests / total requests, over, say, a 1-day window

The second metric can be adapted for batch processes, pipelines, storage, etc., with minimal modification. For example, for a pipeline that reads records from a CSV file, transforms them, and writes them to a database, the ratio of records processed successfully to total records read can serve as a similar availability measure.
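
A worked sketch of both measures, with made-up numbers:

```python
# 1. Time-based availability over the last 30 days
downtime_minutes = 43.2
total_minutes = 30 * 24 * 60                       # 43,200 minutes in 30 days
time_availability = (total_minutes - downtime_minutes) / total_minutes
print(f"{time_availability:.3%}")                  # 99.900%

# 2. Request success rate over a 1-day window
successful, total = 9_990_000, 10_000_000
print(f"{successful / total:.3%}")                 # 99.900%

# The same ratio works for a batch pipeline: records written successfully / records read.
records_read, records_written = 1_000_000, 999_500
print(f"{records_written / records_read:.3%}")     # 99.950%
```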

- - -

In 2006, … … We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.

- - -

Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.

- - -

The requirements for building and running infrastructure components differ from the requirements for consumer products in a number of ways.

- For example, consumer services using BigTable in the path of a user request need low latency and want request queues to be empty almost all the time, whereas infrastructure services using BigTable for offline analysis want the queue of tasks to never be empty and care more about throughput.

- - -

Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, … 

- such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.

- - -

A point which I feel is very much applicable to startups:

Usually, pre-existing teams have worked out some kind of informal balance between them as to where the risk/effort boundary lies. Unfortunately, one can rarely prove that this balance is optimal, rather than just a function of the negotiating skills of the engineers involved. Nor should such decisions be driven by politics, fear, or hope.


- Instead, it’s better to come up with an objective metric, agreed upon by both the engineering team and the product team, which can be used to guide the negotiations in a reproducible way and without resentment. The more data-based the decision, the easier it is to defend.

- - -

AWS instance reservation simplified

A while ago I was working on AWS cost reduction at my company, and reserving instances is one essential step in such a project. This required me to go through many AWS docs on the different reservation options provided to get the whole picture. I have tried to summarize the important points here, some of which could be missed if one does not go through all the documents carefully:
Reserving, modifying, and exchanging are different operations.
  • How much you want to reserve and what you do with the reservations is up to you; discounts get applied accordingly.
  • When modifying a reservation, the instance footprint must match (footprint = normalization factor × number of instances, within the same family, for example, m4).
  • You can't modify a reservation down to a smaller footprint. If you need a larger reserved footprint, simply purchase more reserved instances.
  • You cannot modify reservations for instance types that come in only one size, like i3.metal.
  • The reservations you want to modify cannot be listed on the marketplace at the same time, of course.
  • Exchanging lets you swap a reservation for any instance type or size, but you cannot control the number of instances. AWS will give you however many instances are needed to keep the resulting cost equal to or higher than what you are paying for the original instances.
  • Another IMPORTANT point: you can exchange any number of instances of any types, but only for a single target instance type per exchange.
  • This also means you can split a reservation and exchange the parts for different instance types.
The math works out as long as we keep the total normalization units the same or higher and, in most cases, keep the instance family the same as well.
For example, in the case of modification, let’s say you already own an RI for a c4.8xlarge. This RI now applies to any usage of a Linux/UNIX C4 instance with shared tenancy in the region. This could be:
One c4.8xlarge instance, or
Two c4.4xlarge instances, or
Four c4.2xlarge instances, or
Sixteen c4.large instances, or
One c4.4xlarge and eight c4.large, and more.
For example, in the case of an exchange, suppose you have a t2.large Convertible RI (CRI) and want to change it to a smaller t2.medium instance plus an m3.medium instance:
Step 1. Modify the one t2.large CRI by splitting it into two t2.medium CRIs.
Step 2. Exchange one of the new t2.medium CRIs for an m3.medium CRI.
It's easier to think about reserving in terms of normalization units now.
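
Here is a small sketch of that footprint arithmetic, using the normalization factors AWS publishes for instance sizes (only the sizes mentioned in this post are included):

```python
NORMALIZATION = {
    "micro": 0.5, "small": 1, "medium": 2, "large": 4,
    "xlarge": 8, "2xlarge": 16, "4xlarge": 32, "8xlarge": 64,
}

def footprint(instances):
    """Total normalization units of an {instance_type: count} mapping."""
    return sum(NORMALIZATION[t.split(".")[1]] * n for t, n in instances.items())

original = footprint({"c4.8xlarge": 1})                        # 64 units
option_a = footprint({"c4.4xlarge": 2})                        # 64 units
option_b = footprint({"c4.4xlarge": 1, "c4.large": 8})         # 32 + 32 = 64 units
print(original, option_a, option_b)                            # equal footprints -> valid modifications

merged = footprint({"t2.micro": 2}) + footprint({"t2.small": 1})   # 1 + 1 = 2 units
target = footprint({"t2.medium": 1})                               # 2 units
print(merged == target)   # True, so the merge described in the doubt below is allowed
```
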
DOUBT:
I wonder if someone would do something like this:
  • You have a reservation for two t2.micro instances (giving you a footprint of 1) and a reservation for one t2.small instance (giving you a footprint of 1). You merge both reservations into a single reservation for one t2.medium instance; the combined instance size footprint of the two original reservations equals the footprint of the modified reservation.
  • In both cases, you would presumably get charged the same, but I'm not sure.

SRE skills, goals, culture: Notes from Google's SRE Book (#1)

I will be putting up my notes from the famous Google SRE book, whatever I felt was important from a section or a paragraph. I didn't look for similar notes or summaries that may already be available on the internet; this is just to keep me motivated to continue reading and to post some of my learnings here.

The plan is to have a post each week for 10 weeks in a row. Let's see how it works out.

The following is the first set of my notes from Sep 30:


50-60% are SEs, and 40-50% are candidates who were very close to SE qualifications (85-99% of the skill set required) and who, in addition, had a set of technical skills that are useful to SRE but rare for most SEs …

- common additional skills for SRE:
  • UNIX system internals
  • networking (Layer 1 to Layer 3)

- - -

the team tasked with managing a service needs to code or it will drown. Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc.

- - -

... have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development change their practices. This could mean shifting some of the ops work back to the dev team, or, at times, adding staff to the team.

- While this may seem ridiculous at first, the principle set at the core has to be followed in order to not fall apart later. SREs should constantly have the bandwidth to engage in creative, autonomous engineering (which is the ideal goal). Not being able to manage this is a weakness, right from the very beginning, in the foundation of a company that aims to scale to millions.

- - -

Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

- - -

SRE’s goal is no longer ”zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. … An outage is no longer a “bad” thing — it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

- 100% is the wrong reliability target. No user can tell the difference between a 100% and a 99.999% available system, so putting huge effort into that last 0.001% has no benefit. What level of availability the users will be happy with, given how they use the product, should be taken into account while setting the reliability target. `1 - that target` is the error budget, which we can spend on anything we want, for example, taking risks while launching new features quickly.
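
A quick worked example of the budget, assuming a 99.9% availability SLO (the SLO value and traffic numbers are mine):

```python
slo = 0.999
error_budget = 1 - slo                                 # 0.1% of time or requests may fail

# Expressed as downtime per rolling 30-day window:
minutes_per_30_days = 30 * 24 * 60                     # 43,200 minutes
print(round(error_budget * minutes_per_30_days, 1))    # ~43.2 minutes of "allowed" downtime

# Expressed as failed requests out of 10 million served in that window:
print(round(error_budget * 10_000_000))                # ~10,000 requests may fail before the budget is spent
```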

- - -

The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better.

- When humans are necessary in emergency response, thinking through and recording the best practices ahead of time in a “playbook” produces roughly a three-times improvement in mean time to repair as compared to acting without preparation (“winging it”). This also ensures that newcomers get the chance to learn about the system and that the company has less dependency on any single person (“hero”).

- - -

By removing humans from the loop (of change management), these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.

- Changes to a live system are a prominent source of outages. Progressive rollouts, quickly detecting problems, and rolling back changes when problems appear minimize the number of users exposed (a rough sketch of the idea follows).
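
A hedged sketch of the progressive-rollout idea (this is my own toy version, not Google's release machinery; the stages, threshold, and monitoring hook are all assumptions):

```python
# Push a change to increasing fractions of traffic, watch an error-rate signal,
# and roll back automatically instead of waiting for a human to notice.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of traffic on the new version
ERROR_RATE_THRESHOLD = 0.001                # assumed acceptable error rate

def measured_error_rate(fraction):
    """Placeholder for reading the real monitoring signal for the canaried traffic."""
    return 0.0004                           # pretend the release looks healthy

def rollout(deploy, rollback):
    for fraction in ROLLOUT_STAGES:
        deploy(fraction)                    # e.g., repoint `fraction` of traffic
        if measured_error_rate(fraction) > ERROR_RATE_THRESHOLD:
            rollback()                      # only a small slice of users was exposed
            return False
    return True

ok = rollout(deploy=lambda f: print(f"serving {f:.0%} on the new version"),
             rollback=lambda: print("rolling back"))
print("release complete" if ok else "release aborted")
```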

- - -

… that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity (for future demand) is in place by the time it is needed.

- Both organic and inorganic demand should be taken into account when forecasting demand, and provisioning should follow the forecast. Regular load testing should be done to correlate raw infrastructure capacity to service capacity.

- - -

Resource use is a function of load, capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software.

- - -

Reference book:
Site Reliability Engineering, edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly). Copyright 2016 Google, Inc., 978-1-491-92912-4.


Learning brush strokes I

Today I was looking at one of the glass paintings I made in my junior school years. I had drawn two butterflies freehand using a black liner. It's a small thing I noticed, but although the two butterflies have different sizes and shapes, they have the same curves. The depth, the arcs at a very fine level (half a centimeter), reflect the same hand that drew them; anyone looking at the two pictures can infer it.

I wonder if there is a way to learn this: not just what type of object it is or what species of butterfly it is, but the curves, brush strokes, and drawing style that are identifiable to a person and truly personalized. Learning the way different painters move their hands, which renders every drawing ever created unique.

 

Hope to see more on it.