SRE Flashcards
What are the responsibilities of an SRE team?
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
What is the error budget?
First, a reliability target has to be established. A target of 100% does not make sense: no typical user can tell the difference between 99.99% and 100% availability, yet achieving the remaining 0.01% takes an enormous effort.
Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.
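A minimal sketch of this arithmetic (the target and the request volume below are made-up values, not from the source):

```python
# Sketch: deriving an error budget from an availability target.
# Both numbers below are illustrative assumptions.
availability_target = 0.9999             # 99.99% availability target
error_budget = 1 - availability_target   # 0.01% permitted unavailability

# For a request-serving system, the budget translates into failed requests:
requests_per_quarter = 1_000_000_000
allowed_failures = requests_per_quarter * error_budget

print(f"Error budget: {error_budget:.4%}")                  # 0.0100%
print(f"Allowed failed requests: {allowed_failures:,.0f}")  # 100,000
```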
Why is monitoring via emails a bad idea?
Monitoring via email requires a human to read and interpret each message and decide whether action should be taken. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
What are the 3 kinds of valid monitoring output?
- Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
What are the relevant metrics to describe site reliability?
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
- MTTF is the average time a system operates before a failure occurs
- MTTR is the average time required to repair a failed service (the sketch below relates the two)
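A common way to combine these two metrics into an availability figure (not stated above, but standard in reliability engineering) is steady-state availability, MTTF / (MTTF + MTTR). A minimal sketch with made-up numbers:

```python
# Sketch: steady-state availability from MTTF and MTTR.
# The hour values below are illustrative assumptions, not measurements.
mttf_hours = 500.0  # average time the service runs before failing
mttr_hours = 0.5    # average time needed to repair a failure

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"Availability: {availability:.4%}")  # ~99.9001%
```

Shortening the MTTR (for example via a playbook, see the next card) directly raises availability.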
How can the MTTR be reduced?
By writing a playbook beforehand that defines the steps for diagnosing failures, mitigating them, and recovering the system.
How can outages / failures on live systems be reduced or prevented?
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
Why should the availability target be seen as both a minimum and a maximum?
We want to reach the availability target in order to offer a reliable service. But we don’t want to exceed it, because that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.
How can reliability be measured?
- Time-based availability
- Aggregate availability
What is time-based availability?
Time-based availability defines the acceptable level of unplanned downtime, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability.
Availability = uptime / (uptime + downtime)
What is the maximum downtime per year of a system with an availability target of 99.99%?
52.6 minutes
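The figure follows from the time-based formula above. A small sketch of the arithmetic (assuming a 365.25-day year):

```python
# Sketch: allowed downtime per year for common availability targets.
# Assumes a 365.25-day year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

for target in (0.999, 0.9999, 0.99999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.3%} -> {downtime_minutes:.1f} minutes/year")

# 99.900% -> 526.0 minutes/year
# 99.990% -> 52.6 minutes/year
# 99.999% -> 5.3 minutes/year
```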
Why is the time-based availability metric not useful in some systems?
For large globally distributed systems, it is very likely that at least a subset of traffic is served for a given service somewhere in the world at any given time, which makes the system partially “up” at all times.
What is aggregate availability?
Aggregate availability defines a yield-based metric over a rolling window.
An example would be the request success rate:
availability = successful requests / total requests
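In code this is a one-line ratio; a minimal sketch with made-up counts (real values would come from the monitoring system):

```python
# Sketch: aggregate availability as a request success rate.
# The request counts are illustrative, not real data.
successful_requests = 2_499_875_000
total_requests = 2_500_000_000

aggregate_availability = successful_requests / total_requests
print(f"Aggregate availability: {aggregate_availability:.4%}")  # 99.9950%
```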
What is the advantage of measuring aggregate availability compared to time-based availability?
Not all systems are meant to be “up” at all times, especially periodically running jobs or ETL pipelines, for which time-based uptime is not meaningful.
However, nearly all systems are based on units of work that can be counted as successful or unsuccessful, which makes aggregate availability the more broadly applicable metric.
What is a good strategy to balance taking risk (new releases, new features, infrastructural changes, …) against focusing on reliability and availability?
Forming an error budget can be a good strategy to make data-driven decisions.
The engineering and SRE teams jointly define a quarterly error budget based on the service’s service level objective (SLO). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
Google’s approach:
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
- As long as the measured uptime is above the SLO (in other words, as long as there is error budget remaining), new releases can be pushed; a minimal sketch of this check follows.
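A sketch of that gating decision (the function name, signature, and example numbers are illustrative assumptions, not Google’s actual tooling):

```python
# Sketch: gate releases on the remaining error budget for the quarter.
# All names and numbers here are illustrative assumptions.
def release_allowed(slo_target: float, measured_availability: float) -> bool:
    """Allow new releases only while error budget remains."""
    error_budget = 1 - slo_target             # e.g. 0.1% for a 99.9% SLO
    budget_spent = 1 - measured_availability  # unreliability observed so far
    return budget_spent < error_budget

# Example: 99.9% SLO; service measured at 99.95% vs. 99.85% this quarter.
print(release_allowed(0.999, 0.9995))  # True  -> budget left, can release
print(release_allowed(0.999, 0.9985))  # False -> budget spent, freeze releases
```

Once the budget is spent, new releases are held back until reliability improves or the next quarter’s budget begins.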