SRE Flashcards
What are the responsibilities of an SRE team?
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
What is the error budget?
First, a reliability target has to be established. A target of 100% does not make sense: no typical user can tell the difference between 99.99% and 100% availability, yet achieving the remaining 0.01% takes an enormous effort.
Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.
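A minimal sketch of this arithmetic (the target and the request volume below are made-up values, not from the source):

```python
# Sketch: deriving an error budget from an availability target.
# Both numbers below are illustrative assumptions.
availability_target = 0.9999             # 99.99% availability target
error_budget = 1 - availability_target   # 0.01% permitted unavailability

# For a request-serving system, the budget translates into failed requests:
requests_per_quarter = 1_000_000_000
allowed_failures = requests_per_quarter * error_budget

print(f"Error budget: {error_budget:.4%}")                  # 0.0100%
print(f"Allowed failed requests: {allowed_failures:,.0f}")  # 100,000
```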
Why is monitoring via emails a bad idea?
Monitoring via email requires a human to read and interpret each message and decide whether action should be taken. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
What are the 3 kinds of valid monitoring output?
- Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
What are the relevant metrics to describe site reliability?
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
- MTTF is the average time a system operates before a failure occurs
- MTTR is the average time required to repair a failed service (the sketch below relates the two)
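A common way to combine these two metrics into an availability figure (not stated above, but standard in reliability engineering) is steady-state availability, MTTF / (MTTF + MTTR). A minimal sketch with made-up numbers:

```python
# Sketch: steady-state availability from MTTF and MTTR.
# The hour values below are illustrative assumptions, not measurements.
mttf_hours = 500.0  # average time the service runs before failing
mttr_hours = 0.5    # average time needed to repair a failure

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"Availability: {availability:.4%}")  # ~99.9001%
```

Shortening the MTTR (for example via a playbook, see the next card) directly raises availability.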
How can the MTTR be reduced?
By writing a playbook beforehand that defines the steps for diagnosing failures, mitigating them, and recovering the system.
How can outages / failures on live systems be reduced or prevented?
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
Why should the availability target be seen as both a minimum and a maximum?
We want to reach the availability target in order to offer a reliable service. But we don’t want to exceed it, because that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.
How can reliability be measured?
- Time-based availability
- Aggregate availability
What is time-based availability?
Time-based availability defines the acceptable level of unplanned downtime, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability.
Availability = uptime / (uptime + downtime)
What is the maximum downtime per year of a system with an availability target of 99.99%?
52.6 minutes
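The figure follows from the time-based formula above. A small sketch of the arithmetic (assuming a 365.25-day year):

```python
# Sketch: allowed downtime per year for common availability targets.
# Assumes a 365.25-day year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

for target in (0.999, 0.9999, 0.99999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.3%} -> {downtime_minutes:.1f} minutes/year")

# 99.900% -> 526.0 minutes/year
# 99.990% -> 52.6 minutes/year
# 99.999% -> 5.3 minutes/year
```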
Why is the time-based availability metric not useful in some systems?
For large globally distributed systems, it is very likely that at least a subset of traffic is served for a given service somewhere in the world at any given time, which makes the system partially “up” at all times.
What is aggregate availability?
Aggregate availability defines a yield-based metric over a rolling window.
An example would be the request success rate:
availability = successful requests / total requests
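In code this is a one-line ratio; a minimal sketch with made-up counts (real values would come from the monitoring system):

```python
# Sketch: aggregate availability as a request success rate.
# The request counts are illustrative, not real data.
successful_requests = 2_499_875_000
total_requests = 2_500_000_000

aggregate_availability = successful_requests / total_requests
print(f"Aggregate availability: {aggregate_availability:.4%}")  # 99.9950%
```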
What is the advantage of measuring aggregate availability compared to time-based availability?
Not all systems are meant to be “up” at all times, especially periodically running jobs or ETL pipelines, for which time-based uptime is not meaningful.
However, nearly all systems are based on units of work that can be counted as successful or unsuccessful, which makes aggregate availability the more broadly applicable metric.
What is a good strategy to balance taking risk (new releases, new features, infrastructural changes, …) against focusing on reliability and availability?
Forming an error budget can be a good strategy to make data-driven decisions.
The engineering and SRE teams jointly define a quarterly error budget based on the service’s service level objective (SLO). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
Google’s approach:
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
- As long as the measured uptime is above the SLO (in other words, as long as there is error budget remaining), new releases can be pushed; a minimal sketch of this check follows.
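A sketch of that gating decision (the function name, signature, and example numbers are illustrative assumptions, not Google’s actual tooling):

```python
# Sketch: gate releases on the remaining error budget for the quarter.
# All names and numbers here are illustrative assumptions.
def release_allowed(slo_target: float, measured_availability: float) -> bool:
    """Allow new releases only while error budget remains."""
    error_budget = 1 - slo_target             # e.g. 0.1% for a 99.9% SLO
    budget_spent = 1 - measured_availability  # unreliability observed so far
    return budget_spent < error_budget

# Example: 99.9% SLO; service measured at 99.95% vs. 99.85% this quarter.
print(release_allowed(0.999, 0.9995))  # True  -> budget left, can release
print(release_allowed(0.999, 0.9985))  # False -> budget spent, freeze releases
```

Once the budget is spent, new releases are held back until reliability improves or the next quarter’s budget begins.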