Reclaim SRE- Let's put Engineering back into Site Reliability Engineering

The Problem

SRE in name only

Every week, engineers post something like this:

"So I have an SRE role but 80% of our tasks are support and 20% are responding to alerts... What can I do to be an SRE or apply for an SRE job? I have almost 3 years of experience."

r/sre, every week, ad nauseam

This engineer isn't as confused about SRE as their company is. They have a title that means one thing and responsibilities that mean something completely different. It feels like a bait and switch.

This misperception of "SRE" has sadly spread faster than the actual practice. That gap is costing engineers their career trajectory- and companies the opportunity to truly invest in better reliability.

The definition

What SRE actually is

SRE was defined by Google's Ben Treynor Sloss in 2003, first articulated publicly in a 2014 SREcon talk, and introduced fully in the 2016 SRE Book. The core idea is simple and specific: apply software engineering to solve operations problems. The goal is not to add more humans to soak up manual effort and incidents- it's to eliminate the sources of those things in the first place.

SRE concerns itself with availability, latency, performance, efficiency, change management, and capacity planning- in addition to monitoring and incident response. A team pigeonholed into only the last two concerns is not practicing SRE.

Key texts: the Google SRE Book (free online) and the SRE Workbook (free online).

The original rules

Ben Treynor Sloss's founding practices

These are the practices Treynor Sloss articulated when he invented SRE at Google. They're specific to Google- but the spirit of what he was attempting quickly emerges.

Source: Ben Treynor Sloss, Google - SREcon14

Hire only coders.

Have an SLA for your service.

Measure and report performance against the SLA.

Use Error Budgets and gate launches on them.

Have a common staffing pool for SRE and Developers.

Have excess Ops work flow to the Dev team.

Cap SRE operational load at 50%.

Share 5% of Ops work with the Dev team.

On-call teams should have at least 8 people at one location, or 6 people at each of multiple locations.

Aim for a maximum of two events per on-call shift.

Do a postmortem for every event.

Postmortems are blameless and focus on process and technology, not people.
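
Two of the rules above are numeric (the 50% cap on operational load and the two-events-per-shift target), so a team can check itself against them. The Python below is a minimal sketch of such a check, not anything Google published: the `TeamWeek` fields, the hour-based way of measuring load, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass

OPS_LOAD_CAP = 0.50        # from the list: "Cap SRE operational load at 50%"
MAX_EVENTS_PER_SHIFT = 2   # from the list: "a maximum of two events per on-call shift"


@dataclass(frozen=True)
class TeamWeek:
    ops_hours: float               # hypothetical: hours spent on tickets, pages, manual work
    total_hours: float             # hypothetical: all hours the team worked that week
    events_per_shift: tuple[int, ...]  # number of events in each on-call shift


def check_week(week: TeamWeek) -> list[str]:
    """Return a list of findings; an empty list means both limits were respected."""
    if week.total_hours <= 0:
        raise ValueError("total_hours must be positive")

    findings = []
    ops_load = week.ops_hours / week.total_hours
    if ops_load > OPS_LOAD_CAP:
        findings.append(
            f"ops load {ops_load:.0%} exceeds the {OPS_LOAD_CAP:.0%} cap: "
            "excess Ops work should flow to the Dev team"
        )

    busy_shifts = [n for n in week.events_per_shift if n > MAX_EVENTS_PER_SHIFT]
    if busy_shifts:
        findings.append(
            f"{len(busy_shifts)} shift(s) had more than {MAX_EVENTS_PER_SHIFT} events"
        )
    return findings


week = TeamWeek(ops_hours=130, total_hours=200, events_per_shift=(1, 3, 0, 4))
for finding in check_week(week):
    print(finding)
```

How a team counts "ops hours" is the hard part in practice; the sketch only shows that once the numbers exist, the rules are easy to enforce mechanically.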

SRE practices, distilled

Five practices for any team

Google's rules were written for Google. Most teams aren't running global infrastructure with thousands of engineers. These five practices translate the spirit of SRE for everyone else.

Operational responsibility is shared and managed via automation

The team is incentivized to eliminate sources of toil, not delegate them to someone else.

Customer success is quantitatively measured

Service Level Objectives provide an empirical and shared view of the users' experience and how the team's actions affect it. Reliability is not a feeling- it can be measured, and measurements inform decisions.
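
As a minimal sketch of what "quantitatively measured" can look like, the Python below computes an availability SLI from request counts and compares it with an SLO target. The 99.9% target, the request counts, and the function names are illustrative assumptions, not figures from the SRE Book or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float     # fraction of good events in the measurement window
    target: float  # SLO target, e.g. 0.999 for "99.9% of requests succeed"
    met: bool      # True when the SLI is at or above the target


def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good."""
    if good_events < 0 or good_events > total_events:
        raise ValueError("counts must satisfy 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the SLO
    return good_events / total_events


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compare the measured SLI against the SLO target."""
    sli = availability_sli(good_events, total_events)
    return SloReport(sli=sli, target=target, met=sli >= target)


# Hypothetical 30-day window: 1,000,000 requests, 588 of them failed.
report = evaluate_slo(good_events=999_412, total_events=1_000_000, target=0.999)
print(f"SLI={report.sli:.4%} target={report.target:.1%} met={report.met}")
```

The point of the sketch is the shape of the measurement: a ratio of good events to all events, judged against a target the team agreed on in advance.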

Error budgets inform work prioritization

If production no longer performs at the level required for customer success, reliability improvements take top priority- which may mean pausing new feature releases, deprioritizing feature work, or both. Conversely, a healthy error budget is permission to take more risks.
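
The arithmetic behind an error budget is small enough to sketch. Assuming a request-based SLO, the budget is the number of failures the target allows in a window, and the remaining budget drives the prioritization described above. The function names, the 25% caution threshold, and the sample counts are hypothetical; a real policy would be whatever the team has agreed to enforce.

```python
def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative when overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # assumption: no traffic means no budget has been spent

    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


def release_policy(budget_remaining: float) -> str:
    """Map the remaining budget to a prioritization decision (illustrative thresholds)."""
    if budget_remaining <= 0.0:
        return "freeze: pause feature releases, reliability work takes top priority"
    if budget_remaining < 0.25:  # hypothetical caution threshold
        return "caution: slow releases and pull reliability fixes forward"
    return "healthy: normal release pace, room to take more risk"


# Hypothetical window: 99.9% target, 1,000,000 requests, 588 failures.
remaining = error_budget_remaining(999_412, 1_000_000, 0.999)
print(f"{remaining:.1%} of the error budget left -> {release_policy(remaining)}")
```

With a 99.9% target and one million requests, the budget is about 1,000 failed requests; 588 failures leaves roughly 41% of it, while 1,500 failures would overspend it and trigger the freeze branch.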

The team learns from failure in a blameless way

We create psychological safety so that we can speak openly about the contributing factors to an incident. Ben Treynor Sloss said it best: focus on process and technology, not people.

On-call rotations are properly staffed and humane

24/7 production coverage exists without burning out engineers. On-call pain points are closely monitored and discussed as an engineering problem to be solved- not a permanent situation to be endured.

The line

SRE vs. not SRE

If your day-to-day work looks like the "This is not SRE" list below, you are not doing SRE- regardless of your title.

This is SRE

Toil reduction through writing code
Defined SLOs with error budgets
Past SLO performance drives roadmap decisions
Consistently enforced policy for SLO violations
Properly staffed on-call rotations
Blameless postmortems that drive continuous improvement
Developers take part in operational responsibilities
SREs on the same salary band and level progression as SWEs

This is not SRE

Coding limited to YAML and IaC tools
Ticket triage and customer troubleshooting
Alert babysitting with no means to remediate
No SLOs, error budgets, or consistent enforcement
Postmortems that produce no action items
Being the team developers throw Ops work to
On-call burnout treated as normal
SREs on a separate, lower-ceiling track than SWEs

What to do about it

If you're a practitioner

Name the gap

Use this page. Show it to your manager. Ask: "What's our SLO strategy?" If there isn't one, there's your problem. Part of our responsibility as practitioners is to apply for and work at companies that actually practice SRE.

If you're recruiting

Be honest in the job description

If the role is primarily product support and on-call, reconsider the title before posting. Candidates who know SRE will see through it quickly- and the ones who don't will struggle in the role. Either way, you're paying for the mismatch. SRE is also a senior role by nature- candidates without a background in software engineering or operations rarely have the foundation to succeed in it.

If you're a leader

Build the program before you hire

Before posting an SRE role, answer these questions: Where are we in our SLO journey? Do we enforce a policy for SLO violations yet? Is toil tracked? Are we learning from failure? Build the foundation before you hire for it- or make building it part of the job description.

If you're job searching

Interview the interviewer

Read the job description carefully. Does it mention the SRE practices above? During the interview, ask questions like "What are your SLOs?" and "How did you handle your last major incident?" The answers tell you whether SRE is in place, in progress, or just a title.

SRE is a discipline, not just a title. Use the term like it means something- because it does.

You are welcome to share and use this resource (with attribution).

Changelog

v1.0- 20160511 - Initial launch
v1.1- 20160512 - Added compensation parity, link to SREcon talk