SRE in practice: 5 insights from Google's experience

Matthew Skelton, CEO/CTO at Conflux, co-author of Team Topologies, shares key insights into Site Reliability Engineering (SRE) based on Google’s experience achieving reliability at scale. In a discussion with David Ferguson, Global Lead for Customer Reliability Engineering at Google, Matthew highlights that to establish an SRE function, you need to build the right organizational environment.

 

Key takeaways:

  • A separate SRE team is always optional at Google

  • Software product teams need to care about operability from Day 1

  • Many of the benefits of the SRE approach do not need a separate SRE team

  • Choosing business-relevant SLOs is one of the most important aspects of SRE

  • Reliability is about more than just avoiding downtime

 

As Site Reliability Engineering (SRE) becomes more commonplace across IT organizations, what lessons can you learn from Google, one of the originators of SRE?

I have recently been helping several organizations to understand and adopt SRE practices; as part of this, I spoke to David Ferguson, Global Lead for Customer Reliability Engineering at Google, to understand how SRE actually works at Google.

Many organizations face difficult challenges in building reliable services. They often find that simply renaming an Operations team "SRE" doesn't meaningfully solve the problem. And even if they have staff with SRE skills, they still need to create an organizational environment that sets those people up for success.

Google's underlying motivation and constraints in creating SRE can provide useful insights into how other organizations might approach deploying SRE in their own environments.

 

A separate SRE team is always optional at Google

One of the most important aspects of SRE at Google is that only some services get SRE involvement. That’s correct: SRE is optional. Software development teams cannot assume they will get SRE support for their software.

Make it a privilege to have SRE involvement, not mandatory.
— David Ferguson, Global Lead for Customer Reliability Engineering at Google

Think of it like this: a software engineering team should expect to build and run its software entirely by itself, possibly forever. That is the default position at Google. If the service never reaches the scale to merit SRE involvement, then the software engineering team must continue to build and run the software in Production. This is shown as scenario 1 in the diagram below (where “SA” stands for “Stream-aligned”, the name we use for cross-functional software engineering teams in the book I co-authored, Team Topologies).

As a software service begins to scale and needs some operational expertise to help improve its resilience, an SRE team might help the Stream-aligned software engineering team to understand scaling, reliability, observability, and how to use the platform components that the SRE team manages - scenario 2 in the diagram below. However, at this point, the SRE team is not yet running the software in Production.

Finally, if the Stream-aligned software engineering team can persuade the SRE team that their software is operationally ready, the SRE team might choose to partner with the software engineering team to run the software in Production, ensuring reliability as the software scales (scenario 3 below).

IMAGE: Fig 4.3 (edited) from the book Team Topologies by Matthew Skelton and Manuel Pais, IT Revolution Press, 2019

Scenario 1 shows the default position of a Stream-aligned software engineering team building and running their software by themselves.

Scenario 2 shows an SRE team helping the Stream-aligned software engineering team to understand operability better.

Scenario 3 shows the SRE team taking responsibility for running the software in Production.


Software engineering teams must demonstrate a high degree of operability and service readiness before an SRE team will partner with them to help improve reliability. SRE teams are free to turn down a request for help from a software engineering team if they think the operational burden would be too high or if there are no clear opportunities for engineering projects. Is your SRE team empowered to do that?

 

Software product teams need to care about operability from Day 1

A software engineering team at Google cannot insist on the help of an SRE team; instead, they must demonstrate that their software is ready for Production by a continual focus on operability. Only by showing that their software is operationally ready will an SRE team be persuaded to partner with the software engineering team to help with reliability as the software scales.

As part of this focus on operability, it’s important to make sure that organizational incentives (including pay and career progression) help to drive the right behaviors.

...we incentivize software engineers to get things into and used in production (not just checked-in and passing tests). We incentivize site reliability engineers to actually identify and do engineering work on services. Software engineers still need to have some contact with Production ... through looking after things in early-life, secondments, business hours oncall, [etc.]...
— David Ferguson, Global Lead for Customer Reliability Engineering at Google

In this way, Google avoids a “hard boundary” between software engineering teams and SRE teams by making sure that software engineers have incentives to understand Production and SREs have incentives to understand the business context of the software, thus encouraging shared ownership. 

 

Many of the benefits of the SRE approach do not need a separate SRE team

A key part of the SRE discipline is the use of the Error Budget to control when to focus on improving reliability and operability. But the Error Budget model can work without a separate SRE team - it just needs discipline from the software team and product owner to “play by the rules” of the Error Budget and stop deploying new features when the availability target (and therefore the Error Budget) for that month has been breached.

David Ferguson of Google calls the Error Budget mechanism “an appropriate control process for agile teams”. He describes the Error Budget discipline like this:

1. Define what matters to your users

2. Measure it and define the guard-rails that you care about

3. Decide what you are going to do when you hit the guard-rails

4. When you hit the guard-rails, actually do the thing you said you would

5. Be transparent with your data and your actions.

When outlined like this, it’s clear that a single Stream-aligned software team can be empowered to enact Error Budgets without a separate SRE team. What we’re doing here is keeping a laser-like focus on what matters to the end-user and making sure that we know when the end-user experience is starting to degrade. 
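
To make the arithmetic behind steps 2 to 4 concrete, here is a minimal sketch (in Python) of request-based Error Budget accounting. The SLO target, request counts, and function name are illustrative assumptions for this article, not Google's tooling.

```python
# Minimal, illustrative sketch of request-based Error Budget accounting.
# The SLO target, request counts, and names are assumptions for this example.

def error_budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Return the fraction of this period's error budget still unspent (negative = breached)."""
    allowed_bad = (1.0 - slo_target) * total_requests  # e.g. a 99.9% SLO allows 0.1% bad requests
    if allowed_bad == 0:
        return 0.0 if bad_requests else 1.0
    return 1.0 - (bad_requests / allowed_bad)

# Example: a 99.9% SLO, 10 million requests this month, 7,500 of them failed or were too slow.
remaining = error_budget_remaining(0.999, 10_000_000, 7_500)

if remaining <= 0:
    print("Error budget breached: pause feature rollouts and prioritize reliability work.")
else:
    print(f"{remaining:.0%} of this month's error budget remains; feature work can continue.")
```

The design choice worth noting is that the budget is expressed in failed requests rather than wall-clock downtime, which is exactly what lets a product owner and a Stream-aligned team apply the guard-rail without any separate SRE tooling.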

 

Choosing business-relevant SLOs is one of the most important aspects of SRE

At the heart of the SRE approach is the Service Level Objective (SLO) for the application or service that is being run by the SRE team. An SLO is a performance or availability target for that service: a degree of performance or availability that meets business expectations at an acceptable cost.

Conformance to SLOs is measured through the use of a Service Level Indicator (SLI), or perhaps several SLIs. An SLI is a single quantitative measure of some aspect of the behavior or performance of a service or system. SLIs are generally closely tied to characteristics that users of the service care about: response time (for web applications), durability (for data persistence), error rate, or perhaps the availability of a multi-step flow.
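
As a concrete illustration, the sketch below computes a simple “good request” SLI from per-request records. The field names and the 300 ms latency threshold are assumptions chosen for this example, not a prescribed implementation.

```python
# Hypothetical sketch: computing a simple "good request" SLI from per-request records.
# Field names and the 300 ms latency threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int
    latency_ms: float

def good_request_sli(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of requests that both succeeded and were fast enough: the SLI."""
    if not requests:
        return 1.0  # no traffic: treat as meeting the objective
    good = sum(
        1 for r in requests
        if r.status_code < 500 and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)

# An SLO is then a target for this SLI over a window, e.g. >= 99.9% over a rolling 30 days.
sample = [Request(200, 120.0), Request(200, 480.0), Request(503, 90.0), Request(200, 210.0)]
print(f"SLI for this sample: {good_request_sli(sample):.2%}")  # 2 of 4 requests are good -> 50.00%
```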

Synthetic transaction monitoring is worth having in place because it is driven from external locations, generating an experience similar to that seen by end-users. However, teams at Google often go beyond synthetic monitoring.

With many services we are actually scoring each and every interaction with the service. Synthetic monitoring is useful because it represents an expected load on the system, but it rarely covers the full breadth of interactions that matter (particularly when you have horizontally scaled components, where the synthetics may tend to only probe one instance of each clustered service).
— David Ferguson, Global Lead for Customer Reliability Engineering at Google

This focus on the ways in which users experience the running services and applications is a key aspect of the SRE approach. Instead of simply monitoring “uptime” (whether a process is running or a webpage is present), SLOs driven by SLIs encourage an attention to the quality of the interaction experienced by the user, which is ultimately one of the most important criteria for successful software.

 

Reliability is about more than just avoiding downtime

If a service is “down”, then it’s fairly straightforward to understand the impact on users: they cannot use the system at all. However, end-users can also be affected by services performing slowly, intermittently, or in unusual ways at certain times. These more nuanced aspects of reliability are crucial to measure as part of SRE.

Our approach isn’t just about hard downtime. An SLO of 99.9% over a month means that around 1 in 1,000 requests fail (or go too slow) over that period. This approach weights in the real-time aspects of reliability - temporary slowness during garbage collection or peak time. It also biases towards busy-hour - if 10% of your traffic is at busy hour, then you will burn error budget much faster if you have an outage during that time.

It also makes it clearer that everyone has a part to play. When you just focus on outages, it tends to look like operations’ fault and we convince ourselves that it can’t/won’t happen again. When you look at request-by-request performance, you’ll quickly see that you never really have 100% and you don’t want to spend the time or money getting to 100%.
— David Ferguson, Global Lead for Customer Reliability Engineering at Google

Modern large-scale software needs high-quality metrics on performance covering request/response time, latency, throughput, variability, outliers, data persistence, and more. These things all contribute to the reliability of the software as seen by end-users, so we need to measure and understand all these dimensions.
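
As a rough illustration of the busy-hour effect David describes (all traffic figures below are made-up assumptions, not Google data), the following sketch compares how much of a monthly, request-based error budget a 30-minute total outage burns during a busy hour versus a quiet hour.

```python
# Illustrative arithmetic for why request-based SLOs weight reliability towards busy hours.
# All traffic figures below are assumptions made for this example.

MONTHLY_REQUESTS = 100_000_000
SLO = 0.999
ERROR_BUDGET = (1 - SLO) * MONTHLY_REQUESTS          # 100,000 "bad" requests allowed per month

HOURS_IN_MONTH = 30 * 24
busy_hour_requests = 0.10 * MONTHLY_REQUESTS         # assume 10% of traffic lands in one busy hour
quiet_hour_requests = 0.90 * MONTHLY_REQUESTS / (HOURS_IN_MONTH - 1)  # remainder spread evenly

for label, hourly_requests in [("busy hour", busy_hour_requests), ("quiet hour", quiet_hour_requests)]:
    failed_requests = hourly_requests / 2            # a 30-minute total outage fails half the hour's traffic
    budget_burned = failed_requests / ERROR_BUDGET
    print(f"30-minute outage in a {label}: burns {budget_burned:.0%} of the monthly error budget")
```

Even with these invented numbers, the asymmetry is stark: the same outage costs roughly 80 times more of the budget when it hits peak traffic.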

David Ferguson again: “done right, [with SRE] we are just helping the organisation keep to the promises/decisions it made about how good it wanted its products to be”.


Summary: leverage the underlying dynamics of SRE, not just the name

So what can we learn from SRE at Google? The SRE approach can clearly be a key part of success with large-scale cloud software. However, simply adding a separate SRE team misunderstands how Google actually implements SRE. In fact, SRE teams are optional at Google and software engineering teams must work hard to persuade SRE teams that their software has good operability.

Avoid SRE being mandatory. Keep it as a privilege to have SREs helping you with your thing. If you don’t care about your reliability, they shouldn’t have to either.
— David Ferguson, Global Lead for Customer Reliability Engineering at Google

So, keep SRE teams as a privilege for the most deserving services, define your Error Budget and use that as a control mechanism for software teams, keep a relentless focus on the operability of the software you’re building, and choose SLIs and SLOs wisely to make sure that you’re measuring what users actually care about.

To improve your approaches to SRE and operability, see The Site Reliability Workbook from Google, which includes practical tips on SRE implementation based on work with Google’s customers, and also the Team Guide to Software Operability from Conflux Books, which contains practical, team-focused techniques for enhancing operability in modern software.

Whether you're seeking clarity on your approach or ready for comprehensive transformation, Conflux supports organizations through four engagement types:

  • Assessments & expert sense-checks: for organizations seeking clarity on their approach

  • Workshops & training: for leadership teams building fast flow capabilities

  • Leadership ‘right-hand’ support: for individuals navigating organizational change

  • Driving transformations: for teams and organizations ready for comprehensive change

Book a 90-minute discovery call with our experts. We'll explore your specific challenges, calculate your ROI potential, and build your transformation roadmap.

Book your discovery call
Matthew Skelton - Conflux

CEO/CTO and Founder of Conflux

Matthew Skelton is one of the foremost leaders in modern organizational dynamics for fast flow, drawing on Team Topologies, Adapt Together™, and related practices to support organizations with transformation towards a sustainable fast flow of value and true business agility via holistic innovation.

Co-author of the award-winning and ground-breaking book Team Topologies, Founder and CEO/CTO at Conflux, and director of core operations at the non-profit Team Topologies, Matthew brings a humane approach to organizational effectiveness.

LinkedIn: matthewskelton / Website: matthewskelton.com

https://confluxhq.com