Site Reliability Engineering has a meaning. Somewhere along the way, it became a job title handed out to on-call support staff. Let's fix that.
Every week, engineers post something like this:
"So I have role of SRE but 80% of our tasks are support and 20% are responding to alerts... What can I do to be an SRE or apply for an SRE job. I have almost 3 years of experience."
- r/sre, recurring post, every week
This engineer isn't confused about SRE. Their company is. They've been handed a title that means something and a job that means something else.
This isn't a small problem. The misuse of "SRE" is widespread enough that it's actively harming practitioners' careers, poisoning job searches, and letting organizations avoid doing the hard work of actually improving reliability.
SRE was defined by Google's Ben Treynor Sloss in 2003 and documented publicly in 2016. The core idea is simple and specific: software engineers applying engineering discipline to operations problems. The goal is not to add more humans to react faster; it's to eliminate the conditions that make reliability fragile in the first place.
Key texts: the Google SRE Book (free online) and the SRE Workbook (free online).
These are the practices Treynor Sloss articulated when he invented SRE at Google. They are specific, structural, and deliberately hard to fake.
Google's rules were written for Google's scale. Most organizations aren't Google-shaped. These five practices translate the spirit of SRE into something any engineering team can implement and audit against, regardless of size.
The team is incentivized to eliminate sources of toil, not accommodate them. Operational burden flows back to developers when SREs are overloaded, creating shared accountability for reliability.
Service Level Objectives provide a clear, shared picture of how the team's actions affect production. Reliability is not a feeling: it can be measured, and measurements inform decisions.
If production no longer performs at the level required for customer success, reliability improvements move to the top of the roadmap, which may mean pausing feature launches, deprioritizing other work, or both. Conversely, a healthy error budget is permission to take more risk. Both signals matter.
There is sufficient psychological safety to speak openly about the contributing factors to an incident. Postmortems produce systemic change, not scapegoats, not shrugs.
24/7 production coverage exists without destroying the team's wellbeing. On-call load is tracked, capped, and treated as an engineering problem to be solved, not a permanent condition to be endured.
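The SLO and error-budget practices above come down to simple arithmetic, and it can help to see it written out. The following is a minimal sketch, not a prescribed implementation: the names `SloWindow` and `release_decision`, the request-based SLO, and the 25% caution threshold are all assumptions made for illustration. Your own SLOs and error budget policy should come from what your customers actually need.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window, e.g. a rolling 30 days (assumed)."""
    slo_target: float     # 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    @property
    def allowed_failures(self) -> float:
        # The error budget: how many failures the SLO tolerates in this window.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; zero or below means it is gone.
        if self.allowed_failures <= 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_decision(window: SloWindow, caution_below: float = 0.25) -> str:
    """Map the remaining budget to a policy signal.

    The three tiers and the 25% threshold are hypothetical; a real policy
    is something the team and its stakeholders agree on in advance.
    """
    remaining = window.budget_remaining
    if remaining <= 0:
        return "freeze"    # budget spent: reliability work tops the roadmap
    if remaining < caution_below:
        return "caution"   # budget nearly spent: slow down risky launches
    return "ship"          # healthy budget: permission to take more risk


healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(release_decision(healthy))
```

With a 99.9% target over one million requests, the budget is roughly 1,000 failed requests. Spending 400 of them leaves about 60% of the budget, so this sketch signals that shipping is fine; at 1,200 failures the same function signals a freeze. The point is that both outcomes are decided by a number everyone can see, not by whoever argues loudest.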
These aren't edge cases. If your day-to-day looks like the right column, you are not doing SRE, regardless of your title.
Use this page. Show it to your manager. Ask: "What are our SLOs?" If there aren't any, you have your diagnosis. Then decide: push for change, or move on.
If the role is primarily support and monitoring, call it Platform Operations or Production Support. Mislabeling wastes everyone's time and burns out the engineers you hire.
Before posting an SRE role, answer three questions: Do we have SLOs? Can SREs freeze releases? Is toil tracked? Build the foundation before you hire for it.
Ask: "What are your SLOs?" and "Has an SRE ever blocked a deploy for reliability reasons?" The answers tell you whether SRE is real here or just a title.
This is a living document. The goal is clarity, not gatekeeping; there are many paths to good reliability engineering. But words mean things, and right now the word "SRE" is doing no one any favors.
Share it. Link it. Translate it. Use it however you want.