GSoC Mentor Summit 2019

I was lucky to get the chance to attend the Google Summer of Code Mentor Summit two years in a row. This year it was hosted in Munich, the first time outside Silicon Valley, from October 17 to October 20. I was a student participant with the MacPorts organization in 2017, and later took on the role of org admin in 2018 and co-mentor in 2019.

If you are not familiar with Google Summer of Code, it is a global program focused on introducing students to open-source software development. Student participants are paired with mentors from the participating organizations, gaining exposure to real-world software development practices and techniques. Students get to spend the break between their school semesters earning a stipend while working in areas related to their interests. Each autumn, Google invites two mentors from every participating organization to an unconference.

MacPorts is a package manager for macOS that aims to simplify the packaging and installation of open-source software on Mac systems. It is the only package management system that supports macOS as far back as 10.5 for a majority of its packages. Luckily or unluckily, Homebrew wasn't there.

Together with Mojca Miklavec, I mentored Arjun Salyan, who did a great job implementing a long-standing feature request for MacPorts: a web interface for viewing ports' information along with open issues, build history, and user installation statistics. We had a two-hour Hangouts call on Sunday afternoons last summer to discuss progress. We left a huge list of improvements and features for another summer, maybe. GitHub Issues and the dev mailing list made it convenient for the community to review code and give feedback alongside the development.

My blurred shot of Allianz Arena
Google arranged a scavenger hunt across the city of Munich and provided us with all-day public transit passes. Munich is a very well-connected city and is home to centuries-old buildings even in the city center. You can basically see glass elevators rising right out of the middle of the roads.

The two remaining days were reserved for unconference sessions on various topics. I attended a number of different sessions but also used the “hallway track” to get in touch with others. I had a far better experience this time than at my first summit in 2018. It definitely felt good to discuss the Python packaging issues I am having at work with Francesco Bonazzi, who works on SymPy and on symbolic computing at the Max Planck Institute in Germany. I also got feedback (or an enhancement request, I'd say) on my migration project from 2017 from a senior developer at RTEMS.

There were multiple notable sessions. One of them discussed the possibility of having a GSoC-like program for documentation and technical writers, the challenges behind it, and how people who are not accustomed to coding practices and version control could leverage GitHub.

A set of sessions I attended was about the growth of smaller open-source communities, how they can finance themselves and employ people, and how more permissive and less vague software licenses may help. This might require convincing large companies to use open source. A big problem we face at MacPorts is the availability of free hardware for testing pull requests and building binaries, given how expensive Macs are. It has been solved to a large extent by the recent release of GitHub Actions and Azure Pipelines, which provide 6 hours of build time, a significant increase from the 50 minutes at Travis CI, but it is still a hurdle in bringing new contributors on board.

People associated with universities and research labs had a very interesting series of sessions on open science. It does help to know that work done in physics, biology, or neuroscience needs to be reproducible and accessible for its impact to be practically realized. There is a huge lack of practice and awareness in science academia around documenting, coding, and releasing collected datasets and experiment details in a manner that lets them be reused by scientific communities outside their own research groups. Scanning the brains of 20 people from North America for hints of drug use is largely ineffective unless the results can be reproduced across demographics.

My favorite session was a loose discussion on the importance of #offtopic or #random channels, hosted by coala's Maximilian Scholz. Sharing content unrelated to work and just "fooling around" helps developers get to know the personalities in the community, including its senior members, and hence creates more potential to connect and, in some cases, lifelong friendships.

I noticed a number of projects building apps to help younger kids learn to code through mobile games (I think?). I made friends with people from NumFOCUS (which supports SciPy, NumPy, and pandas, among many other scientific computing tools) and Zulip (an open-source alternative to Slack).

I found it fascinating how the Google OSPO team managed to get 330 people from 42 different countries to agree on the discussion topics. They put up a board with time slots and rooms, and people suggested topics and indicated interest using mere post-it notes and dot stickers. No exhausting long discussions or voting procedures just for deciding the topics.

I thoroughly enjoyed the evenings and discussions with other open-source projects. Google had authentic Bavarian cuisine arranged for us. This certainly included the beer, of which we managed to finish 3 whole barrels on the first evening itself. I cannot forget the chocolate room, full of the weirdest chocolates which people brought from all corners of the world.

25% on-call rule, paging, actionable alerts: Notes from Google's SRE Book (#8)

SREs are on-call for the services they support. The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems. These problems, which typically fall in the operational domain, exist at a scale that would be intractable without software engineering solutions.

- - -

The quantity of on-call can be calculated by the percent of time spent by engineers on on-call duties. The quality of on-call can be calculated by the number of incidents that occur during an on-call shift.

- - -

Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month. For dual-site teams, a reasonable minimum size of each team is six, both to honor the 25% rule and to ensure a substantial and critical mass of engineers for the team.
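
As a back-of-the-envelope check, here is a minimal Python sketch (my own, not from the book) that derives those minimum team sizes from the 25% cap:

```python
import math

def min_rotation_size(simultaneous_oncall: int, max_oncall_fraction: float) -> int:
    """Smallest team that keeps each engineer's on-call share within the cap.

    With `simultaneous_oncall` engineers on-call at any moment (e.g. primary +
    secondary) and equal-length shifts, each engineer's share of on-call time
    is simultaneous_oncall / team_size, which must not exceed the cap.
    """
    return math.ceil(simultaneous_oncall / max_oncall_fraction)

# Single-site team, primary + secondary on-call, 25% cap -> 8 engineers,
# i.e. one primary and one secondary week per engineer per month.
print(min_rotation_size(simultaneous_oncall=2, max_oncall_fraction=0.25))  # 8

# A dual-site setup splits the rotation across two teams of 6 (12 engineers),
# keeping each share at 2/12 ≈ 17% while preserving critical mass per site.
```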

- - -

We’ve found that on average, dealing with the tasks involved in an on-call incident—root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs—takes 6 hours. It follows that the maximum number of incidents per day is 2 per 12-hour on-call shift. In order to stay within this upper bound, the distribution of paging events should be very flat over time ...
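
Spelling out the arithmetic behind that cap (the numbers are taken from the quote above, nothing else assumed):

```python
# Each incident costs roughly 6 hours end to end (root-cause analysis,
# remediation, postmortem, bug fixes), so a 12-hour on-call shift can
# absorb at most two of them before the engineer runs out of time.
hours_per_incident = 6
shift_hours = 12
max_incidents_per_shift = shift_hours // hours_per_incident
print(max_incidents_per_shift)  # 2
```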

- - -

Google offers time-off-in-lieu or straight cash compensation, capped at some proportion of overall salary. The compensation cap represents, in practice, a limit on the amount of on-call work that will be taken on by any individual.

- - -

To make sure that the engineers are in the appropriate frame of mind to leverage the latter mindset, it’s important to reduce the stress related to being on-call.

- - -

And the most important point here, relevant to companies trying to monitor their services at any scale:
... when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.

- - -

Paging alerts should be aligned with the symptoms that threaten a service’s SLOs. All paging alerts should also be actionable. Low-priority alerts that bother the on-call engineer every hour (or more frequently) disrupt productivity, and the fatigue such alerts induce can also cause serious alerts to be treated with less attention than necessary.

Sometimes the changes that cause operational overload are not under the control of the SRE teams. For example, the application developers might introduce changes that cause the system to be more noisy, less reliable, or both. In this case, it is appropriate to work together with the application developers to set common goals to improve the system.

In extreme cases, SRE teams may have the option to "give back the pager"—SRE can ask the developer team to be exclusively on-call for the system until it meets the standards of the SRE team in question.

- - -

... while knowledge gaps are discovered only when an incident occurs. ... To counteract this eventuality, SRE teams should be sized to allow every engineer to be on-call at least once or twice a quarter, thus ensuring that each team member is sufficiently exposed to production.
- - -

Metrics, monitoring, dashboards: Notes from Google's SRE Book (#7)

It's very easy to make dashboards in Datadog or New Relic or CloudWatch or whatever monitoring system you use, but unless you know which metrics need to be monitored for your SLOs, there are infinite ways to miss the problems that matter. There can be hundreds of reasons for your application to not run, and monitoring the container status or disk space or CPU utilization is just not going to catch all of them.

Below I quote some ideas from the book that helped Google evolve toward maintaining so many nines of availability, and that made me appreciate the art of monitoring. I am fascinated by how Google leveraged all those problem-solving and design skills to build something this useful and efficient, showing that even if we think we understand the ideas of simplicity or readability or scaling, we probably don't.


Google’s monitoring systems don’t just measure simple metrics, such as the average response time of an unladen European web server; we also need to understand the distribution of those response times across all web servers in that region. This knowledge enables us to identify the factors contributing to the latency tail.

- - -

This new model made the collection of time-series a first-class role of the monitoring system, and replaced those check scripts with a rich language for manipulating time-series into charts and alerts. ... the history of the collected data can be used for that alert computation as well.

- - -

Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format; this enables mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup.

- - -

To facilitate mass collection, the metrics format had to be standardized.

- - -

Typically, a team runs a single Borgmon per cluster, and a pair at the global level.

- - -

each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default.

- - -

... using service discovery reduces the cost of maintaining it and allows the monitoring to scale.

- - -

... Borgmon rules, consists of simple algebraic expressions that compute time-series from other time-series. These rules can be quite powerful because they can query the history of a single time-series (i.e., the time axis), query different subsets of labels from many time-series at once (i.e., the space axis), and apply many mathematical operations.
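
To make the two axes concrete, here is a small Python sketch of the same idea (this is not Borgmon syntax; the labels and numbers are made up):

```python
# Each time-series is a list of (timestamp, counter_value) samples,
# keyed by its labels (here just the task name).
series = {
    "webserver-0": [(0, 100), (60, 160), (120, 280)],
    "webserver-1": [(0, 200), (60, 230), (120, 290)],
}

def rate(samples):
    """Per-second rate over the time axis, from the first to the last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Query the time axis: one rate per labelled time-series.
per_task_rate = {task: rate(samples) for task, samples in series.items()}

# Aggregate over the space axis: sum the rates across all tasks.
total_rate = sum(per_task_rate.values())
print(per_task_rate, total_rate)  # {'webserver-0': 1.5, 'webserver-1': 0.75} 2.25
```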

- - -

... it’s better to use counters, because they don’t lose meaning when events occur between sampling intervals. Should any activity or changes occur between sampling intervals, a gauge collection is likely to miss that activity.
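
A toy illustration of that difference in plain Python (no particular monitoring client; the scenario is made up):

```python
# A counter only ever increases, so the difference between two samples bounds
# the activity in between. A gauge is a point-in-time reading, so anything
# that happens entirely between two samples is invisible to it.
requests_total = 0        # counter: cumulative requests served
requests_in_flight = 0    # gauge: requests being handled right now

def handle_burst(n):
    global requests_total, requests_in_flight
    requests_in_flight = n
    requests_total += n
    requests_in_flight = 0  # the burst finishes before the next scrape

sample_before = requests_total
handle_burst(500)           # happens entirely between two sampling intervals
sample_after = requests_total

print(sample_after - sample_before)  # 500 -> the counter still shows the burst
print(requests_in_flight)            # 0   -> the gauge never saw it
```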

- - -

... example also uses a Google convention that helps readability. Each computed variable name contains a colon-separated triplet indicating the aggregation level, the variable name, and the operation that created that name.
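
For illustration only (the name below is my own, not an example from the book), such a computed variable might look like this:

```python
# aggregation level : variable name : operation that created it
name = "task:http_requests:rate10m"
aggregation_level, variable_name, operation = name.split(":")
print(aggregation_level, variable_name, operation)  # task http_requests rate10m
```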

- - -

... teams send their page-worthy alerts to their on-call rotation and their important but subcritical alerts to their ticket queues. All other alerts should be retained as informational data for status dashboards.

- - -

This decoupling allows the size of the system being monitored to scale independently of the size of alerting rules. These rules cost less to maintain because they’re abstracted over a common time-series format. New applications come ready with metric exports in all components and libraries to which they link, ...
- - -

The last quote could be taken as an action item from this post.

Releases, builds, versions: Notes from Google's SRE Book (#6)

About the messy releases.

Making sure that our tools behave correctly by default and are adequately documented makes it easy for teams to stay focused on features and users, rather than spending time reinventing the wheel (poorly) when it comes to releasing software.

- - -

… run their own release processes. Although we have thousands of engineers and products, we can achieve a high release velocity.

- We have embraced the philosophy that frequent releases result in fewer changes between versions. This approach makes testing and troubleshooting easier.

- - -

Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine. Instead, builds depend on known versions of build tools, such as compilers, and dependencies, such as libraries. The build process is self-contained and must not rely on services that are external to the build environment.

- - -

Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline.

- - -

We also recommend creating releases at the revision number (version) of the last continuous test build that successfully completed all tests. These measures decrease the chance that subsequent changes made to the mainline will cause failures during the build performed at release time.

- - -

Most companies deal with the same set of release engineering problems regardless of their size or the tools they use: How should you handle versioning of your packages? Should you use a continuous build and deploy model, or perform periodic builds? How often should you release? What configuration management policies should you use? What release metrics are of interest?

- - -

The release engineer needs to understand the intention of how the code should be built and deployed. The developers shouldn’t build and “throw the results over the fence” to be handled by the release engineers.

- - -

In fact, SRE’s experience has found that reliable processes tend to actually increase developer agility: rapid, reliable production rollouts make changes in production easier to see. As a result, once a bug surfaces, it takes less time to find and fix that bug. Building reliability into development allows developers to focus their attention on what we really do care about—the functionality and performance of their software and systems.

- - -

… when you consider a web service that’s expected to be available 24/7, to some extent, every new line of code written is a liability.

- - -

If we release 100 unrelated changes to a system at the same time and performance gets worse, understanding which changes impacted performance, and how they did so, will take considerable effort or additional instrumentation.

- - -

A point that is sure to be perceived negatively by most overly ambitious startups:

Every time we say "no" to a feature, we are not restricting innovation; we are keeping the environment uncluttered of distractions so that focus remains squarely on innovation, and real engineering can proceed.


- - -

Toil, complaints, understanding role: Notes from Google's SRE Book (#5)

This chapter was a particularly interesting read. It deals with the day-to-day, non-optimal ways of doing work that go unnoticed; such work is tiring and gives a false sense of accomplishment.

Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.

- - -

Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.

- There’s a floor on the amount of toil any SRE has to handle if they are on-call. A typical SRE has one week of primary on-call and one week of secondary on-call in each cycle (for discussion of primary versus secondary on-call shifts, see ...). It follows that in a 6-person rotation, at least 2 of every 6 weeks are dedicated to on-call shifts and interrupt handling, which means the lower bound on potential toil is 2/6 = 33% of an SRE’s time. In an 8-person rotation, the lower bound is 2/8 = 25%.

- - -

It’s fine in small doses, and if you’re happy with those small doses, toil is not a problem. Toil becomes toxic when experienced in large quantities. If you’re burdened with too much toil, you should be very concerned and complain loudly.

- Your career progress will slow down or grind to a halt if you spend too little time on projects.

- - -

We work hard to ensure that everyone who works in or with the SRE organization understands that we are an engineering organization. Individuals or teams within SRE that engage in too much toil undermine the clarity of that communication and confuse people about our role.

- - -

If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE. Other teams may also start expecting SREs to take on such work.


- - -

System behavior, percentiles, SLOs: Notes from Google's SRE Book (#4)

This chapter introduced a good alternative to the averages typically used in monitoring, with practical examples. Usually everyone knows that taking averages is just not enough, but here, I feel, is a solution:

because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics.

- For example, measuring the response latency of the API does not capture poor user-perceived latency caused by a slow page load or problems with the page’s JavaScript.

- - -

We generally prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. … … User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values

- Averaging can hide a large number of slow requests behind what still looks like a good average. Percentiles tell you what fraction of the total requests were served under a given value and what fraction was not, and looking at the nth-percentile values tells you the variance in response times. Consistent behavior, that is, less variation, determines reliability more than being really fast some of the time and slow at other times (see the sketch below).
- Another example could be measuring instantaneous load rather than average load, to see how well a system is serving the real-time load.
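
A tiny sketch of that point with made-up latencies: the mean looks acceptable while the percentiles expose the slow tail.

```python
import statistics

# 95 fast requests and 5 very slow ones (milliseconds) -- made-up numbers.
latencies = [100] * 95 + [5000] * 5

mean = statistics.mean(latencies)
p50 = statistics.quantiles(latencies, n=100)[49]   # 50th percentile
p99 = statistics.quantiles(latencies, n=100)[98]   # 99th percentile

print(f"mean={mean:.0f}ms p50={p50:.0f}ms p99={p99:.0f}ms")
# mean=345ms p50=100ms p99=5000ms -> the average hides the 5% of users stuck at 5s
```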

Also, having standard templates for various SLIs (the measurement window, how frequently to measure, which clusters to include, how data is acquired, for example on the server side or the client side, and whether latency means time to first byte or time to last byte) helps save effort and confusion.
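
A sketch of what such a template could capture, written as a plain Python dataclass (the fields and example values are mine, just to illustrate the idea):

```python
from dataclasses import dataclass

@dataclass
class SLITemplate:
    name: str
    measurement_window: str   # e.g. a rolling 28 days
    sample_interval: str      # how frequently to measure
    clusters: str             # which clusters/regions to include
    measured_at: str          # "server" or "client"
    latency_definition: str   # time to first byte vs. time to last byte

api_latency = SLITemplate(
    name="api_latency_p99",
    measurement_window="rolling 28 days",
    sample_interval="10s",
    clusters="all serving clusters",
    measured_at="client",
    latency_definition="time to last byte",
)
```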

- - -

Start by thinking about (or finding out!) what your users care about, not what you can measure.

- - -

Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.

- - -

A good SLO is a helpful, legitimate forcing function for a development team.

- - -

Compare the SLIs to the SLOs, and decide whether or not action is needed. … Without the SLO, you wouldn’t know whether (or when) to take action.


- - -