Week 9 Flashcards

1
Q

What are the 5 principles of SRE?

A
  1. embrace risk
  2. use service level objectives
  3. use error budgets
  4. eliminate toil
  5. monitoring
2
Q

What do we mean by ‘embrace risk’ in SRE?

A

Increasing reliability beyond what users need may be worse for a service and its users.
Maximising stability:
reduces feature deployment speed,
increases feature deployment cost,
limits the number of features that can be offered.

3
Q

What do we mean by ‘use service level objectives’ in SRE?

A

Use of service level objectives (SLOs), which are target values for service level indicators (SLIs), such as latency, throughput, or uptime. SLOs can be backed by service level agreements (SLAs): contracts with users that state the consequences of missing the objectives.
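
The relationship between an SLI and an SLO can be sketched as a measured value compared against a target. This is a minimal illustration only; the function names and the request counts are hypothetical, and treating "no traffic" as fully available is an assumption.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the measured SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured indicator reaches the objective."""
    return sli >= slo_target


# Hypothetical numbers: 99,950 good responses out of 100,000 requests.
sli = availability_sli(99_950, 100_000)
print(sli, meets_slo(sli, slo_target=0.999))
```

The same measured SLI can pass one objective and fail a stricter one, which is why the target has to be agreed before measuring.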

4
Q

What do we mean by ‘use error budgets’ in SRE?

A

Balancing reliability and innovation:
1. the expected uptime is agreed (the SLO)
2. the actual uptime is measured
3. the error budget is the difference between 1 and 2; while budget remains, new releases can go out
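
The three steps above can be worked through as arithmetic. This is a sketch under stated assumptions: the function names, the 30-day period, and the uptime figures are hypothetical.

```python
def error_budget_remaining(agreed_uptime: float, measured_uptime: float) -> float:
    """Measured uptime minus agreed uptime.

    Positive: budget is left to spend on risky changes.
    Negative: the budget has been overspent.
    """
    return measured_uptime - agreed_uptime


def allowed_downtime_minutes(agreed_uptime: float, period_days: int = 30) -> float:
    """Total downtime the agreed uptime permits over the period."""
    return (1.0 - agreed_uptime) * period_days * 24 * 60


# Hypothetical figures: 99.9% agreed, 99.95% measured so far.
remaining = error_budget_remaining(agreed_uptime=0.999, measured_uptime=0.9995)
budget_minutes = allowed_downtime_minutes(agreed_uptime=0.999)
print(remaining, budget_minutes)
```

With these figures the service is inside its budget; a measured uptime below 99.9% would make the remaining budget negative.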

5
Q

What do we mean by ‘eliminate toil’ in SRE?

A

Eliminating toil: operational work that tends to be
manual, repetitive, automatable, tactical, devoid of enduring value, and that grows linearly as the service grows.

6
Q

What do we mean by ‘Monitoring’ in SRE?

A

Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server uptimes.

7
Q

What are the 4 golden signals of monitoring in SRE?

A

Latency - time for a service to respond
Traffic - the number of requests/second
Errors - the number of responses that are errors
Saturation - the fraction of system capacity in use
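
The four signals can be computed from a window of request records, roughly as below. This is a sketch, not a monitoring implementation: the `Request` type, the field names, the use of mean latency, and the definition of saturation as traffic divided by a known capacity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float  # time the service took to respond
    is_error: bool     # whether the response was an error


def golden_signals(requests: List[Request], window_seconds: float,
                   capacity_rps: float) -> Dict[str, float]:
    """Summarise one observation window into the four golden signals."""
    count = len(requests)
    traffic = count / window_seconds
    return {
        "latency_ms": sum(r.latency_ms for r in requests) / count if count else 0.0,
        "traffic_rps": traffic,
        "error_count": sum(1 for r in requests if r.is_error),
        # Assumption: capacity is known and expressed in requests/second.
        "saturation": traffic / capacity_rps,
    }


# Hypothetical one-second window with two requests and capacity for four.
signals = golden_signals(
    [Request(100.0, False), Request(300.0, True)],
    window_seconds=1.0,
    capacity_rps=4.0,
)
print(signals)
```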

8
Q

What are the SRE practices?

A
  1. practical alerting
  2. being on call
  3. emergency response
  4. managing incidents
  5. blameless postmortems
9
Q

What is ‘Practical alerting’ in SRE?

A

A pull model of data collection provides one way to monitor large systems - a monitoring service regularly requests service data, often through a ‘/health’ endpoint.

This is hard to scale because the monitor needs to know about every individual service.
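
A pull model can be sketched as a monitor looping over a registry of services. To keep the example runnable without a network, each service is represented by a plain function standing in for a request to its '/health' endpoint; the registry, service names, and status strings are hypothetical.

```python
from typing import Callable, Dict


def poll_health(services: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Ask every registered service for its health, one by one."""
    statuses = {}
    for name, check in services.items():
        try:
            statuses[name] = "up" if check() else "down"
        except Exception:
            # The service did not answer at all.
            statuses[name] = "unreachable"
    return statuses


def _no_response() -> bool:
    raise ConnectionError("no response")


# The monitor must hold an entry for every service it watches,
# which is the scaling problem described above.
registry = {"api": lambda: True, "db": lambda: False, "cache": _no_response}
statuses = poll_health(registry)
print(statuses)
```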

10
Q

Alternative to practical alerting in SRE?

A

Practical monitoring: rather than querying every service for its health, have the services report in, using a push model of data collection, to a monitoring service or to logs. This is easier to scale, as each service only needs to know about the monitoring service.
It is hard to keep timestamps in sync across all the services.
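
A push model can be sketched as services calling a single collector. The `Collector` class and its method are hypothetical; recording both the sender's timestamp and the collector's own receive time is one way to make clock differences between services visible, not a fix for them.

```python
import time
from collections import defaultdict
from typing import Optional


class Collector:
    """Stand-in for a monitoring service that accepts pushed reports."""

    def __init__(self) -> None:
        self.reports = defaultdict(list)

    def report(self, service: str, metric: str, value: float,
               sent_at: Optional[float] = None) -> None:
        received_at = time.time()
        self.reports[service].append({
            "metric": metric,
            "value": value,
            # sent_at comes from the service's clock, received_at from ours.
            "sent_at": received_at if sent_at is None else sent_at,
            "received_at": received_at,
        })


# Each service only needs a reference to the collector.
collector = Collector()
collector.report("api", "error_count", 3, sent_at=0.0)
collector.report("db", "query_count", 120)
print(dict(collector.reports))
```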

11
Q

What is ‘Being on-call’ in SRE?

A

Responding to incidents within 5 minutes for user-facing systems and within 30 minutes for other systems.
SREs spend at most 50% of their time on purely operational work, leaving at least 50% for engineering projects.

12
Q

What is ‘emergency response’ in SRE?

A

A proper response when systems break takes a playbook, regular rehearsal, and training. A proactive approach is to deliberately break systems, observe how they fail, and then make changes that improve their reliability and prevent these failures from recurring.

13
Q

What is ‘Managing incidents’ in SRE?

A

Incident management limits disruption caused by an incident and restores normal business operations as quickly as possible.

14
Q

Unmanaged vs managed incidents?

A

Unmanaged: characterised by teams fixated on technical problems, with poor communication and much freelancing.
Managed: characterised by teams with separate responsibilities, clear communication, and a plan.

15
Q

What is ‘Blameless Postmortems’ in SRE?

A

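A written record of an incident that focuses on causes and fixes rather than on blaming individuals. A postmortem is triggered by: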
downtime beyond a threshold
data loss of any kind
intervention by an on-call engineer
resolution time above some threshold
manual incident discovery implying a monitoring failure
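
The trigger list above can be expressed as a simple check over an incident record. This is a sketch only: the `Incident` fields and both threshold defaults are hypothetical, since the card says only "a threshold" and each team would set its own.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    downtime_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    discovered_manually: bool  # implies monitoring missed it


def needs_postmortem(incident: Incident,
                     downtime_threshold: float = 10.0,
                     resolution_threshold: float = 60.0) -> bool:
    """True if any one of the listed triggers applies."""
    return any([
        incident.downtime_minutes > downtime_threshold,
        incident.data_lost,
        incident.on_call_intervened,
        incident.resolution_minutes > resolution_threshold,
        incident.discovered_manually,
    ])


# Hypothetical incidents: a short blip, and one with data loss.
minor = Incident(2.0, False, False, 5.0, False)
serious = Incident(2.0, True, False, 5.0, False)
print(needs_postmortem(minor), needs_postmortem(serious))
```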

16
Q

What are the common beliefs of DevOps and SRE?

A
  1. no more silos (silos cause a lack of collaboration)
  2. accidents are normal (use them to grow; SRE uses error budgets, DevOps uses feedback loops)
  3. change should be gradual (less risk + automatic testing)
  4. tooling and culture are interrelated
  5. measurement is crucial