Week 9 - Site Reliability Engineering Flashcards

1
Q

What does SRE stand for?

A

Site Reliability Engineering

2
Q

What does Site Reliability Engineering do?

A

Work that has historically been done by an operations team, carried out instead by engineers with software expertise, on the expectation that these engineers are both inclined and able to design and implement software automation that replaces human labour

3
Q

Name 5 principles of Site Reliability Engineering.

A

Embrace risk

Use service level objectives

Use error budgets

Eliminate toil

Monitoring

4
Q

What does maximising stability cause?

A

A reduction in feature deployment speed

An increase in feature deployment cost

A limit on the number of features

5
Q

What does SLO stand for?

A

Service Level Objective

6
Q

What does SLI stand for?

A

Service Level Indicator

7
Q

What are SLOs the target values for?

A

Service Level Indicators (SLIs)

8
Q

Give 3 examples of SLIs.

A

Latency, throughput, uptime
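
As a rough illustration of how SLIs relate to SLO targets, the sketch below measures the three example SLIs over a made-up window and compares each one to a target. The names `SLO_TARGETS`, `measure_slis` and `check_slos`, and every number, are hypothetical and not taken from the course material.

```python
# Hypothetical SLO targets for the three example SLIs (all values are assumptions).
SLO_TARGETS = {
    "latency_ms": ("max", 300.0),       # mean latency should stay at or below 300 ms
    "throughput_rps": ("min", 50.0),    # should serve at least 50 requests/second
    "uptime_fraction": ("min", 0.999),  # should be up at least 99.9% of the window
}


def measure_slis(latencies_ms, window_seconds, seconds_up):
    """Compute the three example SLIs from raw measurements."""
    return {
        "latency_ms": sum(latencies_ms) / len(latencies_ms),
        "throughput_rps": len(latencies_ms) / window_seconds,
        "uptime_fraction": seconds_up / window_seconds,
    }


def check_slos(slis, targets=SLO_TARGETS):
    """Return {sli_name: bool} saying whether each SLI meets its SLO target."""
    results = {}
    for name, (kind, target) in targets.items():
        value = slis[name]
        results[name] = value <= target if kind == "max" else value >= target
    return results


# Made-up measurements: 6000 requests in a 60-second window, 59.5 seconds up.
slis = measure_slis([120.0] * 6000, window_seconds=60, seconds_up=59.5)
print(check_slos(slis))
```

With these made-up numbers, latency and throughput meet their targets while uptime (about 99.2%) falls short of the 99.9% objective.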

9
Q

What does publishing SLOs as SLAs do?

A

Sets expectations about how a service will perform, reducing unfounded complaints about its performance

10
Q

What does SLA stand for?

A

Service Level Agreement

11
Q

How is the error budget calculated?

A

The expected uptime is agreed (1)

The actual uptime is measured (2)

The difference between (1) and (2) is the error budget remaining for the period
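
A worked example may make the arithmetic concrete. This is a minimal sketch assuming a 90-day quarter, an agreed uptime of 99.9% and a measured uptime of 99.95%; the function name `error_budget` and all figures are hypothetical.

```python
def error_budget(slo_target, actual_uptime, period_minutes):
    """Return (allowed, used, remaining) downtime in minutes for one period."""
    allowed = (1.0 - slo_target) * period_minutes   # downtime the target permits
    used = (1.0 - actual_uptime) * period_minutes   # downtime measured so far
    return allowed, used, allowed - used


# Hypothetical quarter of 90 days = 129,600 minutes.
QUARTER_MINUTES = 90 * 24 * 60
allowed, used, remaining = error_budget(0.999, 0.9995, QUARTER_MINUTES)
print(f"allowed={allowed:.1f} used={used:.1f} remaining={remaining:.1f}")
# prints: allowed=129.6 used=64.8 remaining=64.8
```

The remaining figure equals (actual uptime - expected uptime) multiplied by the length of the period, which is the difference between (1) and (2) expressed in minutes.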

12
Q

Why is the error budget useful?

A

It says how much risk is allowed within a quarter. Once the agreed-upon number is exceeded, the team shifts its focus from the development of updates to improving reliability
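
The decision rule on this card can be sketched as a small function. The name `next_focus` and the minute values are assumptions for illustration only.

```python
def next_focus(budget_minutes, downtime_minutes):
    """Pick the team's focus from the quarter's budget and the downtime so far."""
    if downtime_minutes > budget_minutes:
        return "reliability"  # budget exceeded: pause updates, improve reliability
    return "features"         # budget left: keep developing updates


print(next_focus(budget_minutes=129.6, downtime_minutes=64.8))   # prints: features
print(next_focus(budget_minutes=129.6, downtime_minutes=200.0))  # prints: reliability
```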

13
Q

What type of work counts as toil?

A

Manual (issuing commands, running scripts)
Repetitive (done over and over again)
Automatable (could be designed away)
Tactical (interrupt-driven and reactive)
Devoid of enduring value (nothing gets better)
Scalable only linearly (with traffic volume)

14
Q

What is a typical goal for toil?

A

To keep toil below 50% of each Site Reliability Engineer’s time - at least 50% should be spent on engineering project work that will either reduce future toil or add service features
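
One way to check the 50% goal is to total an engineer's logged hours by category. The sketch below assumes a hypothetical time log; the category names and the helper `toil_fraction` are made up for illustration.

```python
TOIL_LIMIT = 0.5  # the 50% goal from the card


def toil_fraction(hours_by_category, toil_categories):
    """Fraction of logged hours spent on categories counted as toil."""
    total = sum(hours_by_category.values())
    toil = sum(h for cat, h in hours_by_category.items() if cat in toil_categories)
    return toil / total if total else 0.0


# Hypothetical week for one engineer (category names are assumptions).
week = {
    "manual_restarts": 10,
    "ticket_queue": 12,
    "automation_project": 14,
    "feature_work": 4,
}
fraction = toil_fraction(week, {"manual_restarts", "ticket_queue"})
print(f"toil={fraction:.0%} over_limit={fraction > TOIL_LIMIT}")
# prints: toil=55% over_limit=True
```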

15
Q

Monitoring involves collecting, processing, aggregating and displaying real-time quantitative data about a system. What sort of data could this be?

A

Query counts and types

Error counts and types

Processing times

Server uptimes

16
Q

What are the four golden signals of monitoring?

A

Latency (the time for a service to respond)

Traffic (the number of requests/second)

Errors (the number of responses that are errors)

Saturation (the fraction of system capacity in use)
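
The four signals can be computed from a window of request records, as in the minimal sketch below. The `Request` type, the `golden_signals` helper and the sample data are hypothetical, and saturation is approximated here as traffic divided by an assumed capacity in requests/second; real systems often measure it from CPU, memory or queue depth instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarise one window of requests into the four golden signals."""
    count = len(requests)
    traffic = count / window_seconds
    return {
        "latency_ms": sum(r.duration_ms for r in requests) / count if count else 0.0,
        "traffic_rps": traffic,
        "errors": sum(r.status >= 500 for r in requests),  # count of 5xx responses
        "saturation": traffic / capacity_rps,              # assumed capacity model
    }


# Made-up window: 8 fast successes and 2 slow server errors in 10 seconds.
sample = [Request(100.0, 200)] * 8 + [Request(300.0, 500)] * 2
print(golden_signals(sample, window_seconds=10, capacity_rps=4.0))
```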

17
Q

Name 5 practices of SRE.

A

Blameless postmortems
Managing incidents
Emergency response
Being on-call
Practical alerting

18
Q

What does a pull model of data collection do?

A

Provides one way to monitor large systems - a monitoring device regularly requests service data, often through a “/health” endpoint. Hard to scale as the monitoring service must know about every individual service
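
A pull-based monitor might look roughly like the sketch below. It assumes each service exposes a "/health" endpoint that returns JSON; the function names and the shape of the result are hypothetical. The `fetch` parameter exists only so the poller can be exercised without a network.

```python
import json
import urllib.request


def fetch_health(base_url, timeout=2.0):
    """Request <base_url>/health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def poll_once(service_urls, fetch=fetch_health):
    """Pull health data from every known service; record failures as down."""
    results = {}
    for name, url in service_urls.items():
        try:
            results[name] = fetch(url)
        except Exception as exc:  # unreachable host, timeout, invalid JSON, ...
            results[name] = {"status": "down", "error": str(exc)}
    return results
```

The monitor has to be handed the full `service_urls` mapping, which is the scaling cost the card describes: every new service must be registered with the monitor.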

19
Q

What does a push model of data collection do?

A

Provides another way to monitor large systems - a service regularly sends data to a monitoring service (perhaps to a flat file (logs) or to a time-series database (metrics)). Easier to scale as every individual service only needs to know about the monitoring service
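
The push model can be sketched with an in-memory stand-in for the time-series database. The classes `MetricsCollector` and `Service` are hypothetical and only illustrate the direction of the data flow.

```python
import time
from collections import defaultdict


class MetricsCollector:
    """In-memory stand-in for a time-series database."""

    def __init__(self):
        # (service, metric) -> list of (timestamp, value) points
        self.series = defaultdict(list)

    def push(self, service, metric, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.series[(service, metric)].append((ts, value))


class Service:
    """A service that knows only the collector, not the other services."""

    def __init__(self, name, collector):
        self.name = name
        self.collector = collector

    def report(self, metric, value, timestamp=None):
        self.collector.push(self.name, metric, value, timestamp)


collector = MetricsCollector()
Service("checkout", collector).report("error_count", 3, timestamp=100.0)
print(collector.series[("checkout", "error_count")])  # prints: [(100.0, 3)]
```

Adding another service needs no change to the collector, which is why this model scales more easily.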

20
Q

How long should it take an SRE to respond to a user-facing incident?

A

5 minutes

21
Q

How long should it take an SRE to respond to system incidents other than user-facing ones?

A

30 minutes

22
Q

What does a proper response take?

A

A playbook, regular rehearsal and training

23
Q

What does a proactive approach to emergency response involve?

A

Deliberately breaking systems to watch how they fail, then making changes to improve reliability and prevent the failures from recurring

24
Q

What does incident management do?

A

Limits the disruption caused by an incident and restores normal business operations as quickly as possible

25
Q

What 2 things can an incident be?

A

Unmanaged - characterised by teams fixated on technical problems, with poor communication and much freelancing

Managed - characterised by teams with separate responsibilities, clear communication and a plan

26
Q

What can a postmortem be triggered by?

A

Downtime beyond a certain threshold

Data loss of any kind

Intervention by an on-call engineer

Resolution time above some threshold

Manual incident discovery implying a monitoring failure
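
The triggers above can be sketched as a simple check over an incident record. The `Incident` fields and both threshold values are hypothetical; each team would agree its own.

```python
from dataclasses import dataclass

# Hypothetical thresholds, in minutes (assumptions, not from the card).
MAX_DOWNTIME_MIN = 10
MAX_RESOLUTION_MIN = 60


@dataclass
class Incident:
    downtime_min: float
    resolution_min: float
    data_lost: bool
    on_call_intervened: bool
    found_manually: bool  # discovered by a person rather than by monitoring


def postmortem_reasons(incident):
    """Return the triggers this incident hit (an empty list means none)."""
    checks = [
        (incident.downtime_min > MAX_DOWNTIME_MIN, "downtime beyond threshold"),
        (incident.data_lost, "data loss"),
        (incident.on_call_intervened, "on-call intervention"),
        (incident.resolution_min > MAX_RESOLUTION_MIN, "resolution time above threshold"),
        (incident.found_manually, "manual discovery (monitoring failure)"),
    ]
    return [reason for hit, reason in checks if hit]


print(postmortem_reasons(Incident(5, 90, False, True, False)))
# prints: ['on-call intervention', 'resolution time above threshold']
```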

27
Q

What is a blameless postmortem?

A

A postmortem that focuses on identifying the causes of the incident without indicting any individual - there is no counter-productive finger-pointing or shaming of individuals

28
Q

Name 5 common tenets of DevOps and SRE.

A

No more silos

Accidents are normal

Change should be gradual

Tooling and culture are interrelated

Measurement is crucial

29
Q

What do silos lead to?

A

A lack of collaboration, and to incentives for purely local optimisation