umesh k singla: 2019

I was lucky to attend the Google Summer of Code Mentor Summit two years in a row. This year it was hosted in Munich, the first time outside Silicon Valley, from October 17 to October 20. I was a student participant with the MacPorts organization in 2017, and later took on the roles of org admin in 2018 and co-mentor in 2019.

If you are not familiar with Google Summer of Code, it is a global program focused on introducing students to open-source software development. Student participants are paired with a mentor from a participating organization, gaining exposure to real-world software development and techniques. Students have the opportunity to spend the break between their school semesters earning a stipend while working in areas related to their interests. Each autumn, Google invites two mentors from each participating organization to an unconference, the Mentor Summit.

MacPorts is a package manager for macOS that tries to simplify the packaging and installation of open-source software on Mac systems. It is the only package management system that supports macOS as far back as 10.5 for a majority of its packages. Luckily or unluckily, Homebrew wasn't at the summit.

Together with Mojca Miklavec, I mentored Arjun Salyan, who did a great job implementing a long-requested MacPorts feature: a web interface for viewing ports' information along with open issues, build history, and user installation statistics. Over the summer, we had a two-hour Hangouts call on Sunday afternoons to discuss progress. We left a long list of improvements and features for, maybe, another summer. GitHub Issues and the dev mailing list made it convenient for the community to review code and give feedback alongside the development.

My blurred shot of Allianz Arena

Google arranged a scavenger hunt across the city of Munich and provided us with all-day public transit passes. Munich is a very well-connected city and is home to centuries-old buildings even in the middle of the city. You can basically see glass elevators in the middle of the roads.

The two remaining days were reserved for unconference sessions on various topics. I attended a number of different sessions but also used the “hallway track” to get in touch with others. I had a far better experience this time than my first time at the summit in 2018. It definitely felt good to discuss the Python packaging issues I am having at work with Francesco Bonazzi, who works on SymPy and also on symbolic computing at the Max Planck Institute in Germany. I also got feedback (or an enhancement request, I'd say) on my 2017 migration project from a senior developer at RTEMS.

There were multiple notable sessions. One of them discussed the possibility of having a GSoC-like program for documentation and technical writers, and the challenges behind it, such as how people who are not accustomed to coding practices and version control can make use of GitHub.

A set of sessions I attended was about the growth of smaller open-source communities: how they can finance themselves and employ people, and how more permissive and less vague software licenses may help. This might require convincing large companies to use open source. A big problem we face at MacPorts is the availability of free hardware for testing pull requests and building binaries, because Macs are expensive. The recent release of GitHub Actions and Azure Pipelines has solved this to a large extent, providing 6 hours of build time, a significant increase over the 50 minutes at Travis CI, but it is still a hurdle in bringing new contributors on board.

People associated with universities and research labs had a very interesting series of sessions on open science. It does help to know how the work done in physics, biology, or neuroscience needs to be reproducible and accessible for its impact to be realized in practice. There is a huge lack of practice and awareness among academic scientists when it comes to documenting, coding, and releasing their collected datasets and experiment details so that they can be reused by scientific communities outside their research groups. Scanning the brains of 20 people from North America for hints of drug use is largely ineffective unless it can be reproduced across demographics.

My favorite session was a loose discussion on the importance of #offtopic or #random channels, hosted by coala's Maximilian Scholz. Sharing content unrelated to work and just "fooling around" helps developers get to know the personality of the community and its senior members, which creates more potential to connect and, in some cases, to form lifelong friendships.

I noticed a number of projects building apps that help younger kids learn to code through mobile games (I think?). I made friends with people from NumFOCUS (which supports SciPy, NumPy, and pandas, among many other scientific computing tools) and Zulip (an open-source alternative to Slack).

I found it fascinating how the Google OSPO team managed to get 330 people from 42 different countries to agree on the discussion topics. They put up a board with time slots and rooms, and people suggested topics and indicated interest using mere post-it notes and dot stickers. No exhausting long discussions or voting procedures just for deciding the topics.

I thoroughly enjoyed the evenings and discussions with other open-source projects. Google had authentic Bavarian cuisine arranged for us. This certainly included the beer, of which we managed to finish 3 whole barrels on the first evening itself. I cannot forget the chocolate room, full of the weirdest chocolates which people brought from all corners of the world.

SREs are on-call for the services they support. The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems. These problems, which typically fall in the operational domain, exist at a scale that would be intractable without software engineering solutions.

- - -

The quantity of on-call can be calculated by the percent of time spent by engineers on on-call duties. The quality of on-call can be calculated by the number of incidents that occur during an on-call shift.
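
As a rough illustration of those two measures, here is a minimal Python sketch. The `Shift` record, the function names, and the sample numbers are all made up for this example; the book does not prescribe any particular data model, and "quality" is approximated here as the mean number of incidents per shift.

```python
from dataclasses import dataclass


@dataclass
class Shift:
    """One on-call shift (hypothetical record, not from the book)."""
    engineer: str
    hours: float    # hours spent on on-call duties during the shift
    incidents: int  # incidents that occurred during the shift


def on_call_quantity(shifts, engineer, working_hours):
    """Percent of an engineer's working hours spent on on-call duties."""
    on_call_hours = sum(s.hours for s in shifts if s.engineer == engineer)
    return 100.0 * on_call_hours / working_hours


def on_call_quality(shifts):
    """Mean number of incidents per shift (lower is better)."""
    if not shifts:
        return 0.0
    return sum(s.incidents for s in shifts) / len(shifts)


# Made-up sample data: three shifts over a period with 480 working hours.
shifts = [Shift("alice", 40, 1), Shift("bob", 40, 3), Shift("alice", 40, 0)]
print(on_call_quantity(shifts, "alice", working_hours=480))
print(on_call_quality(shifts))
```

Tracking both numbers per engineer, rather than per team, makes it easier to spot when one person is carrying a disproportionate share of the load.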

- - -

Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month. For dual-site teams, a reasonable minimum size of each team is six, both to honor the 25% rule and to ensure a substantial and critical mass of engineers for the team.
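
The single-site arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as I read it: the function names are mine, and it covers just the single-site case (the dual-site figure of six also rests on the "critical mass" argument, which a formula like this does not capture).

```python
import math
from fractions import Fraction


def min_team_size(concurrent_on_call, max_on_call_fraction):
    """Smallest team in which nobody exceeds the on-call time cap."""
    return math.ceil(Fraction(concurrent_on_call) / Fraction(max_on_call_fraction))


def weeks_between_shifts(team_size, concurrent_on_call):
    """With week-long shifts, each engineer is on-call one week in this many."""
    return Fraction(team_size, concurrent_on_call)


# Two people on-call at all times (primary and secondary), 25% cap.
size = min_team_size(2, Fraction(1, 4))
print(size)                           # 8
print(weeks_between_shifts(size, 2))  # 4, i.e. roughly one week every month
```

Using `Fraction` instead of floats keeps the ceiling exact for caps like one third, where floating-point rounding could otherwise push the result up by one.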

- - -

We’ve found that on average, dealing with the tasks involved in an on-call incident (root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs) takes 6 hours. It follows that the maximum number of incidents per day is 2 per 12-hour on-call shift. In order to stay within this upper bound, the distribution of paging events should be very flat over time ...
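
To make that bound concrete, here is a small Python sketch. The helper names and the sample incident counts are invented for illustration; only the 6-hour and 12-hour figures come from the quote.

```python
def max_incidents_per_shift(shift_hours=12, hours_per_incident=6):
    """Upper bound on incidents that can be fully handled within one shift."""
    return shift_hours // hours_per_incident


def over_budget_shifts(incident_counts, limit):
    """Indexes of shifts whose incident count exceeds the limit."""
    return [i for i, n in enumerate(incident_counts) if n > limit]


limit = max_incidents_per_shift()
print(limit)  # 2

# Made-up incident counts for five consecutive shifts.
print(over_budget_shifts([0, 1, 3, 2, 5], limit))  # [2, 4]
```

A check like this could run over the paging history at the end of each rotation to flag the shifts that went over budget.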

- - -

Google offers time-off-in-lieu or straight cash compensation, capped at some proportion of overall salary. The compensation cap represents, in practice, a limit on the amount of on-call work that will be taken on by any individual.

- - -

To make sure that the engineers are in the appropriate frame of mind to leverage the latter mindset [rational, focused, and deliberate thinking, as opposed to intuitive, rapid reaction], it’s important to reduce the stress related to being on-call.

- - -

And the most important point here, relevant to any company monitoring its services at any scale:
... when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.

- - -

Paging alerts should be aligned with the symptoms that threaten a service’s SLOs. All paging alerts should also be actionable. Low-priority alerts that bother the on-call engineer every hour (or more frequently) disrupt productivity, and the fatigue such alerts induce can also cause serious alerts to be treated with less attention than necessary.
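
One way to read this rule is as a routing decision: only alerts that are both SLO-threatening and actionable should page a human. The Python sketch below assumes a hypothetical `Alert` record and made-up `"page"`, `"ticket"`, and `"log"` destinations; it does not reflect any real alerting tool's API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert description used only for this example."""
    name: str
    threatens_slo: bool  # is this a symptom that endangers the service's SLO?
    actionable: bool     # can the on-call engineer do something about it now?


def route(alert):
    """Page only for actionable, SLO-threatening alerts; never page otherwise."""
    if alert.threatens_slo and alert.actionable:
        return "page"
    if alert.actionable:
        return "ticket"  # worth fixing, but not worth interrupting someone
    return "log"         # kept for later inspection only


print(route(Alert("error budget burning fast", True, True)))  # page
print(route(Alert("disk 70% full", False, True)))             # ticket
print(route(Alert("single task restarted", False, False)))    # log
```

Anything that would page every hour without passing both checks is a candidate for demotion to a ticket or a log line.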

Sometimes the changes that cause operational overload are not under the control of the SRE teams. For example, the application developers might introduce changes that cause the system to be more noisy, less reliable, or both. In this case, it is appropriate to work together with the application developers to set common goals to improve the system.

In extreme cases, SRE teams may have the option to "give back the pager"—SRE can ask the developer team to be exclusively on-call for the system until it meets the standards of the SRE team in question.

- - -

... while knowledge gaps are discovered only when an incident occurs. ... To counteract this eventuality, SRE teams should be sized to allow every engineer to be on-call at least once or twice a quarter, thus ensuring that each team member is sufficiently exposed to production.
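
The "once or twice a quarter" guidance implies an upper bound on team size, just as the 25% rule implies a lower one. Here is a back-of-the-envelope Python sketch; it assumes week-long shifts, 13 weeks per quarter, and two people on-call at a time, and the function name is mine.

```python
def max_team_size(shifts_per_quarter, concurrent_on_call, min_shifts_per_engineer):
    """Largest team in which everyone still gets the minimum number of shifts."""
    slots = shifts_per_quarter * concurrent_on_call
    return slots // min_shifts_per_engineer


# 13 week-long shifts per quarter, primary plus secondary on-call.
print(max_team_size(13, 2, 1))  # 26 engineers if once a quarter is enough
print(max_team_size(13, 2, 2))  # 13 engineers if everyone should go twice
```

Under these assumptions, a single-site team would sit somewhere between 8 and 26 engineers to satisfy both rules at once.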

- - -

GSoC Mentor Summit 2019

25% on-call rule, paging, actionable alerts: Notes from Google's SRE Book (#8)