Site Reliability Engineering means something.
Somewhere along the way, it became synonymous with monitoring and incident response.
This page was built to help reclaim the original definition.
Every week, engineers post something like this:
"So I have an SRE role but 80% of our tasks are support and 20% are responding to alerts... What can I do to be an SRE or apply for an SRE job? I have almost 3 years of experience."
r/sre, every week, ad nauseam
This engineer isn't as confused about SRE as much as their company is. They have a title that means one thing, and responsibilities that are completely different. It feels like a bait and switch.
This misperception of "SRE" has sadly spread faster than the actual practice. That gap is costing engineers their career trajectory- and companies the opportunity to truly invest in better reliability.
SRE was defined by Google's Ben Treynor Sloss in 2003, first articulated publicly in a 2014 SREcon talk, and introduced fully in the 2016 SRE Book. The core idea is simple and specific: apply software engineering to solve operations problems. The goal is not to add more humans to soak up manual effort and incidents- it's to eliminate the sources of those things in the first place.
SRE concerns itself with availability, latency, performance, efficiency, change management, and capacity planning- in addition to monitoring and incident response. Teams that are pigeonholed to only the last two concerns aren't a true SRE practice.
Key texts: the Google SRE Book (free online) and the SRE Workbook (free online).
These are the practices Treynor Sloss articulated when he invented SRE at Google. They're specific to Google- but the spirit of what he was attempting quickly emerges.
Google's rules were written for Google. Most teams aren't running global infrastructure with thousands of engineers. These five practices translate the spirit of SRE for everyone else.
The team is incentivized to eliminate sources of toil, not delegate them to someone else.
Service Level Objectives provide an empirical and shared view of the users' experience and how the team's actions affect it. Reliability is not a feeling- it can be measured, and measurements inform decisions.
If production no longer performs at the level required for customer success, reliability improvements take top priority- which may mean pausing new feature releases, deprioritizing feature work, or both. Conversely, a healthy error budget is permission to take more risks.
We create psychological safety so that we can speak openly about the contributing factors to an incident. Ben Treynor Sloss said it best: focus on process and technology, not people.
24/7 production coverage exists without burning out engineers. On-call pain points are closely monitored and discussed as an engineering problem to be solved- not a permanent situation to be endured.
If your day-to-day work looks like the column on the right, you are not doing SRE- regardless of your title.
Use this page. Show it to your manager. Ask: "What's our SLO strategy?" If there isn't one, there's your problem. Part of our responsibility as practitioners is to apply for and work at jobs that actually practice SRE.
If the role is primarily product support and on-call, reconsider the title before posting. Candidates who know SRE will see through it quickly- and the ones who don't will struggle in the role. Either way, you're paying for the mismatch. SRE is also a senior role by nature- candidates without a background in software engineering or operations rarely have the foundation to succeed in it.
Before posting an SRE role, answer these questions: Where are we in our SLO journey? Do we enforce SLO violations, yet? Is toil tracked? Are we learning from failure? Build the foundation before you hire for it- or make building it part of the job description.
Read the job description carefully. Does it mention the above SRE practices? During the interview, ask questions, like: "What are your SLOs?" and "How did you handle your last major incident?" The answers tell you whether SRE is in place, in progress, or just a title.
SRE is a discipline, not just a title. Use the term like it means something- because it does.
You are welcome to share and use this resource (with attribution).