Navigating the challenges of site reliability engineering
Avoid these common pitfalls when implementing site reliability engineering in your organisation
Site reliability engineering (SRE) is a specific approach to DevOps, with defined engineering skills, practices, and frameworks. It is not as flexible as DevOps, but it is highly consistent if implemented correctly. Unfortunately, that last point is where it often goes wrong.
As the Head of Cloud & DevOps Solutions at Ten10, I will discuss the challenges of SRE and how it differs from DevOps, and I’ll identify the main pitfalls I have seen limit organisations so you can avoid them.
How is SRE different to DevOps?
When talking about the people in DevOps, I like to articulate it this way: there are people who have big DEV and small ops (DEVops) skills. These are software engineers carrying out DevOps activities.
Then there are people with small dev and big OPS (devOPS) skills. These are typically people with system administration skills and expertise around configuration (YAML-based engineering).
Then there are DevOps engineers who have transitioned from one of the other two states and ended up as happy writing code as configuring networks.
With that in mind, site reliability engineers should come from the DEVops and DevOps sets of people. This is a limited pool, but those skills are essential if the team is to fulfil its purpose. When it comes to SRE, the engineers must be comfortable fixing a wide variety of issues. Take, for example, a memory leak – there are a couple of ways to solve or mitigate it. The correct solution is to fix the issue in code: this requires the ability to write and release code. To mitigate the issue, you may increase the infrastructure capacity or balance the load across more servers.
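To make that distinction concrete, here is a minimal, hypothetical Python sketch (the function names, the query stub, and the cache size are all illustrative assumptions, not from any real system). The leak comes from an unbounded cache; the root-cause fix is a code change, whereas adding capacity would only mask the symptom.

```python
from functools import lru_cache


def expensive_query(key: str) -> str:
    # Stand-in for a slow call (database, external API, etc.)
    return key.upper()


# Leaky version: a module-level dict that only ever grows, so memory
# climbs with every distinct key the service sees.
_results: dict[str, str] = {}


def lookup_leaky(key: str) -> str:
    if key not in _results:
        _results[key] = expensive_query(key)
    return _results[key]


# Root-cause fix: bound the cache so old entries are evicted and memory
# use stays flat. Scaling out to more servers would only mitigate the
# symptom; this change removes the cause.
@lru_cache(maxsize=1024)
def lookup_fixed(key: str) -> str:
    return expensive_query(key)
```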
With SRE, the focus is on improving the overall service. This means that while some of the team work on the mitigation steps, the others resolve the core issue. Compared to a typical DevOps or Platform Engineering team, SRE puts a greater focus on fixing issues at their root cause, not just working around them.
In addition to the people side of SRE, some measures and approaches are enforced at a corporate level. These ensure that the end users of the service feel as little impact as possible and that sufficient effort goes into reducing human intervention, which increases consistency.
How is SRE implemented?
There are several differing opinions on this, and I have seen as many as seven different implementation models for SRE, from embedded teams to organisation-wide ones. To avoid a ‘painting the bike shed’ conversation, let’s just agree it is a dedicated team that sits alongside development and is empowered to stop the release process.
The goal of the team is to enable the maximum velocity possible while maintaining the service level objectives (SLOs). This is key, as everything is geared towards protecting the service level agreements (SLAs) the business has signed with its customers.
The SLOs are built on service level indicators (SLIs). These indicators could be response time, memory utilisation, or other typical monitoring measures. The key is to select the SLIs that make it possible to determine whether your service is working; the targets you set on these become your SLOs.
If your SLOs are breached, releases stop until you understand the issue and resolve it (this should happen before an SLA breach). If your SLA is 99.9%, your SLOs should be stricter than this (99.95%, for example); this gives you some headroom in case your indicators are not correct or something unexpected is going on. An SLO breach is triggered by your SLIs crossing their agreed thresholds.
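As a hedged sketch of how SLIs, SLOs, and the release gate fit together, the Python below uses made-up numbers (a 99.95% SLO against a 99.9% SLA) and treats the SLI as a simple good/bad flag per request; a real implementation would pull these figures from your monitoring stack.

```python
SLO_TARGET = 0.9995  # internal objective, stricter than the SLA
SLA_TARGET = 0.999   # the contractual promise to customers


def availability(sli_samples: list[bool]) -> float:
    """SLI here is a good/bad flag per request (e.g. responded under 300 ms)."""
    return sum(sli_samples) / len(sli_samples)


def release_allowed(sli_samples: list[bool]) -> bool:
    """Releases stop as soon as the SLO is breached – ideally before the SLA is."""
    return availability(sli_samples) >= SLO_TARGET


# Example: 10,000 requests, 8 of them bad -> 99.92% availability.
samples = [True] * 9_992 + [False] * 8
print(availability(samples))     # 0.9992 -> SLO (99.95%) breached, SLA (99.9%) intact
print(release_allowed(samples))  # False: stop releasing and investigate
```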
In addition to the SLIs, SLOs, and SLAs, there are rules around technical debt. In SRE there is an easy way to identify technical debt – it isn’t just bugs or optimisations, it’s toil. Any time a human needs to get involved in day-to-day operations, that counts as a bug. This mindset and classification creates significantly more technical debt than most organisations are accustomed to dealing with. One of the key measures for an SRE team is keeping the team’s toil percentage low.
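One simple way to track that measure – illustrative only, with made-up hours – is to record how much of the team’s week went on manual, repetitive work:

```python
def toil_percentage(toil_hours: float, total_hours: float) -> float:
    """Share of the team's time spent on manual, repetitive work."""
    return 100 * toil_hours / total_hours


# Sample week for a hypothetical four-person team (160 working hours).
pct = toil_percentage(toil_hours=22, total_hours=160)
print(f"Toil this week: {pct:.1f}%")  # 13.8% – comfortably under the commonly cited 50% ceiling
```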
Finally, there is the error budget. This is the difference between your SLA and 100%, so if your SLA is 99.95% you have an error budget of 0.05%. You can spend this as you see fit, but if you exhaust your error budget, you stop releasing until the service has become more stable. For example, you may switch from one technology to another to improve stability. This will help reduce future error budget spend; then you can continue releasing.
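To put the arithmetic in concrete terms, this short sketch converts a 99.95% target into allowed downtime over a 30-day window (the window length is an assumption; some teams use 28 days or a calendar quarter):

```python
TARGET = 0.9995                # 99.95% target
error_budget = 1 - TARGET      # 0.05%

window_minutes = 30 * 24 * 60  # 43,200 minutes in a 30-day window
budget_minutes = window_minutes * error_budget

print(f"Error budget: {error_budget:.2%}")                        # 0.05%
print(f"Allowed downtime per 30 days: {budget_minutes:.1f} min")  # 21.6 minutes
```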
Avoiding the common pitfalls
A lot of organisations like to mimic others. This is natural, but mimicking something without fully understanding it results in sub-standard solutions and another failed implementation. There are three big pitfalls I regularly see when organisations implement SRE:
- Hiring the wrong people: Hiring strong DEVops and DevOps people is hard. They make up less than a third of the total market, which means they are more expensive. But without these skills, you are not able to implement change as it should be. A site reliability engineer is meant to be able to write code fixes for problems. Imagine you have a memory leak in an application – if your SRE team can’t fix that, you don’t have an SRE team. You have a DevOps team following SRE principles poorly.
- Release blocking: A lot of organisations do not allow the SRE team to block releases; the business overrules them and the release goes ahead. If you have spent your error budget, you are done. The next release must be one that increases the stability of the platform in some way. Without this, you are simply increasing risk with no strategy to resolve it, and hope is not a strategy.
- Inadequate toil reduction: Toil is the force that destroys your team’s ability to solve problems. By driving toil as low as possible and setting aside time every week to reduce it further, you free up time to focus on stability, which in turn allows more frequent releases.