Over the past year or so, we have spoken with quite a few prospective users who have defined their responsibilities as site reliability engineering (SRE). If, like me, you’re not familiar with the term, I’ll save you the Google search. SRE is a discipline that incorporates aspects of software engineering and applies that to IT operations problems. Practitioners aim to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.” And its origins can also be traced back to 2003 and Google when Ben was hired to lead software engineers to run a production environment.
The site reliability engineering footprint at Google is now larger than 1,500 engineers. Many products have small to medium sized SRE teams supporting them, though not all products do. The SRE processes that have been honed over the years are being used by other, mainly large scale, companies that are also starting to implement this paradigm, including ServiceNow, Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, Target, IBM, Xero, Oracle, Zalando, Acquia, and GitHub.
The people we talk to on a daily basis are typically charged with operational management of their company’s cloud infrastructure, and thus governing and controlling costs (that’s where we come in). I got to wondering, how is this approached differently by, say, a site reliability engineer vs. someone who labels himself as “DevOps”?
How Does Site Reliability Engineering Compare to DevOps?
In simple terms, the difference between SREs and DevOps seems clear based on our conversations with folks. SREs are engineers focused on production environments, while DevOps is a philosophy as well as a role. DevOps folks are definitely less concerned with production vs. non-production, and more concerned with the overall cloud management and operations. Side note, DevOps was coined around 2008, so an SRE actually predates a DevOps engineer.
A site reliability engineer (SRE) will spend up to 50% of their time doing “ops” related work such as issues, on-call, and manual intervention. Since the software system that an SRE oversees is expected to be highly automatic and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling or automation. The ideal SRE candidate is a highly skilled system administrator with knowledge of code and automation.
When I first encountered it, site reliability engineering just seemed like another buzzword to replace “IT” or “Ops”. As I read more on it, I understand that it’s more about the people and the process and less about the technology. There is rarely a mention of the underlying infrastructure or tools, and it seems like the main requirement is just the desire to improve. With that, you can align your development and operations (funny, right – DevOps) around the discipline of SRE.
Should Your Company Implement a Site Reliability Engineering Approach?
So while all the hype is around implementing DevOps in your organization, should you really be adopting the idea of site reliability engineering? It certainly makes sense based on the name alone, as “site reliability” is synonymous with “business availability” in our modern internet-connected culture. Any downtime for your service or application means lost revenue and dissatisfied customers, which means the business takes a hit. Using site reliability engineering to keep things running smoothly, while employing DevOps principles to improve those smooth-running processes, seems to be the best combination to really empower your company.