This chapter was a particularly interesting read. It deals with the day-to-day, non-optimal ways of doing work that often go unnoticed. Such work is tiring and gives a false sense of accomplishment.
Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.
- - -
Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.
- There’s a floor on the amount of toil any SRE has to handle if they are on-call. A typical SRE has one week of primary on-call and one week of secondary on-call in each cycle. It follows that in a 6-person rotation, at least 2 of every 6 weeks are dedicated to on-call shifts and interrupt handling, which means the lower bound on potential toil is 2/6 = 33% of an SRE’s time. In an 8-person rotation, the lower bound is 2/8 = 25%.
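The arithmetic behind that floor can be sketched in a few lines of Python. This is only an illustration, not anything from the chapter itself: the function name `on_call_toil_floor` and its parameters are hypothetical, and it assumes one cycle lasts as many weeks as there are people in the rotation, with each person taking two on-call weeks (one primary, one secondary) per cycle.

```python
from fractions import Fraction


def on_call_toil_floor(rotation_size: int, on_call_weeks: int = 2) -> Fraction:
    """Return the minimum share of an SRE's time spent on-call per cycle.

    Assumption: a cycle is `rotation_size` weeks long, and each person
    covers `on_call_weeks` of them (one primary plus one secondary).
    """
    if rotation_size < on_call_weeks:
        raise ValueError("rotation_size must be at least on_call_weeks")
    return Fraction(on_call_weeks, rotation_size)


TOIL_CAP = Fraction(1, 2)  # the advertised 50% goal

for size in (6, 8):
    floor = on_call_toil_floor(size)
    # Compare the on-call floor with the 50% cap to see the remaining headroom.
    headroom = TOIL_CAP - floor
    print(f"{size}-person rotation: floor {float(floor):.0%}, "
          f"headroom under the 50% cap {float(headroom):.0%}")
```

Under these assumptions, a smaller rotation leaves less room between the on-call floor and the 50% goal, so any other toil pushes a small team past the cap sooner.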
- - -
Toil is fine in small doses, and if you’re happy with those small doses, it is not a problem. Toil becomes toxic when experienced in large quantities. If you’re burdened with too much toil, you should be very concerned and complain loudly.
- Your career progress will slow down or grind to a halt if you spend too little time on projects.
- - -
We work hard to ensure that everyone who works in or with the SRE organization understands that we are an engineering organization. Individuals or teams within SRE that engage in too much toil undermine the clarity of that communication and confuse people about our role.
- - -
If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE. Other teams may also start expecting SREs to take on such work.
- - -