SRE Flashcards

1
Q

What are the responsibilities of an SRE team?

A

In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

2
Q

What is the error budget?

A

First, a reliability target has to be established. 100% reliability is not a sensible target: a typical user cannot tell the difference between 99.99% and 100% availability, yet closing the remaining 0.01% takes an enormous effort.
Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.

3
Q

Why is monitoring via email a bad idea?

A

Monitoring via email requires a human to read and interpret each message and decide whether an action should be taken. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

4
Q

What are the 3 kinds of valid monitoring output?

A
  • Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
  • Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
  • Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
5
Q

What are the relevant metrics to describe site reliability?

A

Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).

  • MTTF is the average time a system operates before it fails
  • MTTR is the average time required to repair a failed service and restore it to operation
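The card does not spell out how the two combine, but a common way to relate them is the steady-state availability formula availability = MTTF / (MTTF + MTTR). A minimal Python sketch, with made-up example numbers:

# Steady-state availability from MTTF and MTTR (a standard formula,
# not quoted from this card); the numbers below are illustrative only.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A service that fails on average every 500 hours and takes 1 hour
# to repair is roughly 99.8% available.
print(f"{availability(500, 1):.4f}")  # 0.9980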
6
Q

How can the MTTR be reduced?

A

By creating a “playbook” beforehand that defines the steps for diagnosing failures, mitigating them, and recovering the system.

7
Q

How can outages and failures on live systems be reduced or prevented?

A
  • Implementing progressive rollouts
  • Quickly and accurately detecting problems
  • Rolling back changes safely when problems arise
8
Q

Why should the availability target be seen as both a minimum and a maximum?

A

We want to reach the availability target in order to offer a reliable service, but we don’t want to exceed it, because doing so would waste opportunities to add features to the system, clean up technical debt, or reduce operational costs.

9
Q

How can reliability be measured?

A
  • Time-based availability
  • Aggregate availability

10
Q

What is time-based availability?

A

Time-based availability defines the acceptable level of unplanned downtime, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability.

Availability = uptime / (uptime + downtime)

11
Q

What is the maximum downtime per year of a system with an availability target of 99.99%?

A

52.6 minutes
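Worked out: a year has roughly 365 × 24 × 60 = 525,600 minutes, and 0.01% of that is about 52.6 minutes of allowed downtime.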

12
Q

Why is the time-based availability metric not useful in some systems?

A

For large globally distributed systems, it is very likely that at least a subset of traffic is served for a given service somewhere in the world at any given time, which makes the system partially “up” at all times.

13
Q

What is aggregate availability?

A

Aggregate availability defines a yield-based metric over a rolling window.
An example would be the request success rate:

availability = successful requests / total requests
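A minimal Python sketch of this calculation; the request counts are made-up illustrative numbers:

# Aggregate availability as the request success rate over a window.
def aggregate_availability(successful: int, total: int) -> float:
    return successful / total

successful, total = 2_499_000_000, 2_500_000_000
print(f"{aggregate_availability(successful, total):.4%}")  # 99.9600%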

14
Q

What is the advantage of measuring aggregate availability compared to time-based availability?

A

Some systems are not meant to be “up” at all times, especially periodically running jobs or ETL pipelines, so time-based uptime is not a meaningful measure for them.
However, nearly all systems perform units of work that can be counted as successful or unsuccessful, which makes aggregate availability the more broadly applicable metric.

15
Q

What is a good strategy for balancing risk-taking (new releases, new features, infrastructural changes, …) with a focus on reliability and availability?

A

Forming an error budget can be a good strategy to make data-based decisions.
The engineering and the SRE teams jointly define a quarterly error budget based on the service’s service level objective. The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

Google’s approach:

  • Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
  • The actual uptime is measured by a neutral third party: our monitoring system.
  • The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
  • As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
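A minimal Python sketch of such a release gate, assuming a quarterly availability SLO and a measured availability reported by the monitoring system (both numbers below are made up):

# Error-budget check for release gating (illustrative numbers only).
def error_budget_remaining(slo: float, measured: float) -> float:
    budget = 1.0 - slo        # allowed unreliability for the quarter
    spent = 1.0 - measured    # unreliability already incurred
    return budget - spent

slo = 0.999        # quarterly target set by product management
measured = 0.9995  # uptime reported by the monitoring system

if error_budget_remaining(slo, measured) > 0:
    print("Error budget remaining: new releases can be pushed")
else:
    print("Error budget exhausted: halt launches, focus on reliability")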
16
Q

What is SLI?

A

An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

Examples include request latency, error rate, throughput, availability, and data durability.

17
Q

What is SLO?

A

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.
A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
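A hypothetical example of this structure: 99th-percentile request latency ≤ 300 ms over the measurement window.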

18
Q

What is SLA?

A

SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.

19
Q

What problems could arise when collecting system indicators?

A

Averaging metrics can hide important details such as outliers or spikes in response times: if 95% of all requests are very fast but 5% are slow, the average still looks reasonably fast while hiding the problematic responses.

Therefore most metrics are better thought of as distributions rather than averages. Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes: a high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value, while using the 50th percentile (also known as the median) emphasizes the typical case.
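A minimal Python sketch of the fast/slow example above (the latencies are made-up numbers, and the percentile function is a simple nearest-rank approximation):

import statistics

# 95% of requests are very fast, 5% are very slow (milliseconds).
latencies_ms = [10] * 95 + [2000] * 5

def percentile(data, p):
    """Nearest-rank approximation of the p-th percentile."""
    data = sorted(data)
    return data[int(p / 100 * (len(data) - 1))]

print(statistics.mean(latencies_ms))  # 109.5 -- looks "pretty fast"
print(percentile(latencies_ms, 50))   # 10    -- the typical case
print(percentile(latencies_ms, 99))   # 2000  -- a plausible worst case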

20
Q

What questions should a monitoring system address?

A

There are two questions to address:

  • What’s broken? => Symptom
  • Why is it broken? => Cause
21
Q

What is White Box & Black Box Monitoring?

A

Black-box monitoring is symptom-oriented and represents active, not predicted, problems: “The system isn’t working correctly, right now.”
White-box monitoring depends on the ability to inspect the innards of the system, such as logs or HTTP endpoints, with instrumentation. It therefore allows detection of imminent problems, failures masked by retries, and so forth.
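A minimal Python sketch of the distinction; the URL argument and metric names below are placeholders, not anything from the source:

import urllib.request

def black_box_probe(url: str, timeout: float = 5.0) -> bool:
    """Symptom-oriented: ask the same question a user would --
    is the service answering correctly, right now?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# White-box: instrument the service internals and export counters the
# monitoring system can read, e.g. failures masked by retries that a
# black-box probe would never notice.
internal_metrics = {
    "requests_total": 10_000,
    "requests_retried": 120,  # retried (masked) failures
    "queue_depth": 3,
}

def retry_ratio(metrics: dict) -> float:
    return metrics["requests_retried"] / metrics["requests_total"]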