Week 9 Flashcards

Question 1

Q

What are the 5 principles of SRE?

Answer

A

embrace risk
use service level objectives
use of error budgets
eliminate toil
monitoring

Question 2

Q

What do we mean by ‘embrace risk’ in SRE?

Answer

A

Increasing reliability beyond what users need may be worse for a service and its users.
Maximising stability:
reduces feature deployment speed
increases feature deployment cost
limits the number of features

Question 3

Q

What do we mean by ‘use service level objectives’ in SRE?

Answer

A

Use of service level objectives (SLOs), which are target values for service level indicators (SLIs) such as latency, throughput, or uptime. SLOs may be published to users in service level agreements (SLAs), contracts that state the consequences of missing them.
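
A minimal sketch of how an SLI is measured and compared against an SLO target. The Request type, the sample data, the 300 ms threshold and the 0.99 target are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def availability_sli(requests: list[Request]) -> float:
    """SLI: fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """SLI: fraction of requests served within threshold_ms."""
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI reaches the target value."""
    return sli >= slo_target


# Hypothetical sample: four requests, one failed and one slow.
sample = [
    Request(80, True),
    Request(120, True),
    Request(450, True),
    Request(95, False),
]
availability = availability_sli(sample)               # 3 of 4 succeeded
fast_enough = latency_sli(sample, threshold_ms=300)   # 3 of 4 within 300 ms
print(meets_slo(availability, slo_target=0.99))       # SLO missed for this sample
```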

Question 4

Q

What do we mean by ‘use error budgets’ in SRE?

Answer

A

Balancing reliability and innovation
1. the expected uptime is agreed
2. the actual uptime is measured
3. the error budget is the difference between 1 and 2, i.e. how much unreliability is still allowed
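
A worked sketch of the three steps, expressed in minutes of downtime rather than uptime percentages. The 99.9% target, the 30-day period and the 30 minutes of measured downtime are assumed numbers, and the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, period_minutes: float) -> float:
    """Total downtime the agreed uptime target allows over the period."""
    return (1.0 - slo_target) * period_minutes


def remaining_budget_minutes(slo_target: float, period_minutes: float,
                             downtime_minutes: float) -> float:
    """Budget left after subtracting the downtime actually measured."""
    return error_budget_minutes(slo_target, period_minutes) - downtime_minutes


PERIOD = 30 * 24 * 60                      # a 30-day period in minutes
budget = error_budget_minutes(0.999, PERIOD)               # step 1: about 43.2
remaining = remaining_budget_minutes(0.999, PERIOD, 30.0)  # steps 2 and 3: about 13.2

# A team might only ship risky changes while some budget remains.
can_release = remaining > 0
print(round(budget, 1), round(remaining, 1), can_release)
```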

Question 5

Q

What do we mean by ‘eliminate toil’ in SRE?

Answer

A

Eliminating toil: work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales only linearly as the service grows.

Question 6

Q

What do we mean by ‘Monitoring’ in SRE?

Answer

A

Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server uptimes.

Question 7

Q

What are the 4 golden signals of monitoring in SRE?

Answer

A

Latency - time for a service to respond
Traffic - the number of requests/second
Errors - the rate of requests that fail
Saturation - the fraction of system capacity in use
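
A rough sketch of computing the four signals from a window of request samples. The Sample type and the numbers are hypothetical, and saturation is approximated here as traffic divided by an assumed capacity; real systems usually measure it from resource usage such as CPU or memory.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float  # time the service took to respond
    is_error: bool     # True if the response was an error


def golden_signals(samples, window_seconds, capacity_rps):
    """Summarise one observation window as the four golden signals."""
    n = len(samples)
    traffic = n / window_seconds  # requests per second
    return {
        "latency_ms": sum(s.latency_ms for s in samples) / n,  # mean latency
        "traffic_rps": traffic,
        "error_rate": sum(s.is_error for s in samples) / n,
        "saturation": traffic / capacity_rps,  # assumed capacity, see note above
    }


# Hypothetical window: four requests observed over two seconds.
window = [
    Sample(100, False),
    Sample(200, False),
    Sample(300, True),
    Sample(400, False),
]
signals = golden_signals(window, window_seconds=2, capacity_rps=10)
print(signals)
```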

Question 8

Q

What are the SRE practices?

Answer

A

practical alerting
being on call
emergency response
managing incidents
blameless postmortems

Question 9

Q

What is ‘Practical alerting’ in SRE?

Answer

A

A pull model of data collection provides one way to monitor large systems - a monitoring service regularly requests service data, often through a ‘/health’ endpoint.

Hard to scale because the monitor needs to know about every individual service.
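
A simplified sketch of the pull model. Plain Python callables stand in for each service's ‘/health’ endpoint so the example runs without a network; PullMonitor and the service names are hypothetical. Note that every service has to be registered with the monitor, which is the scaling problem described above.

```python
from typing import Callable, Dict

HealthCheck = Callable[[], bool]  # stands in for a GET on a /health endpoint


class PullMonitor:
    """Hypothetical monitor that asks each known service for its health."""

    def __init__(self) -> None:
        self.targets: Dict[str, HealthCheck] = {}

    def register(self, name: str, check: HealthCheck) -> None:
        # The monitor must be told about every individual service.
        self.targets[name] = check

    def poll(self) -> Dict[str, bool]:
        """Request health from every registered service once."""
        results = {}
        for name, check in self.targets.items():
            try:
                results[name] = check()
            except Exception:
                # An unreachable or crashing service counts as unhealthy.
                results[name] = False
        return results


def broken_db_health() -> bool:
    raise ConnectionError("database unreachable")


monitor = PullMonitor()
monitor.register("api", lambda: True)
monitor.register("db", broken_db_health)
status = monitor.poll()
print(status)
```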

Question 10

Q

Alternative to practical alerting in SRE?

Answer

A

Practical monitoring: rather than querying every service for its health, have the services report in using a push model of data collection, sending data to a monitoring service or to logs. This is easier because each service only needs to know about the monitoring service.
It is hard to synchronise timestamps across all the services.
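
A simplified sketch of the push model. PushMonitor is a hypothetical in-memory stand-in for a monitoring service, and the service and metric names are made up. Each report stores both the sender's timestamp and the monitor's receive time, because the clocks of different services may disagree.

```python
import time
from collections import defaultdict


class PushMonitor:
    """Hypothetical in-memory monitoring service that services push to."""

    def __init__(self):
        self.reports = defaultdict(list)  # service name -> list of reports

    def report(self, service, metric, value, sent_at):
        """Called by a service to push one measurement."""
        self.reports[service].append({
            "metric": metric,
            "value": value,
            "sent_at": sent_at,          # clock of the reporting service
            "received_at": time.time(),  # clock of the monitor
        })

    def latest(self, service, metric):
        """Most recently received value of a metric, or None if absent."""
        for entry in reversed(self.reports[service]):
            if entry["metric"] == metric:
                return entry["value"]
        return None


# Services only need to know the monitor, not each other.
monitor = PushMonitor()
monitor.report("checkout", "error_count", 3, sent_at=time.time())
monitor.report("checkout", "error_count", 5, sent_at=time.time())
monitor.report("search", "query_count", 120, sent_at=time.time())
print(monitor.latest("checkout", "error_count"))
```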

Question 11

Q

What is ‘Being on-call’ in SRE?

Answer

A

Responding to incidents within five minutes for user-facing systems and within 30 minutes for other systems.
SREs spend at most 50% of their time on purely operational work and at least 50% of their time on engineering projects.

Question 12

Q

What is ‘emergency response’ in SRE?

Answer

A

A proper response when systems break takes a playbook, regular rehearsal, and training. A proactive approach is to deliberately break systems to see how they fail, then make changes that improve their reliability and prevent these failures from recurring.

Question 13

Q

What is ‘Managing incidents’ in SRE?

Answer

A

Incident management limits disruption caused by an incident and restores normal business operations as quickly as possible.

Question 14

Q

Unmanaged vs managed incidents?

Answer

A

Unmanaged: characterised by teams fixated on technical problems, with poor communication and much freelancing.
Managed: characterised by teams with separate responsibilities, clear communication, and a plan.

Question 15

Q

What is ‘Blameless Postmortems’ in SRE?

Answer

A

Postmortem triggered by:
downtime beyond a threshold
data loss of any kind
intervention by an on-call engineer
resolution time above some threshold
manual incident discovery implying a monitoring failure
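
The triggers above can be sketched as a simple check. The Incident fields and the two threshold defaults (10 and 60 minutes) are assumptions for illustration; each team sets its own thresholds.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    downtime_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    discovered_manually: bool


def postmortem_reasons(incident: Incident,
                       max_downtime_minutes: float = 10,      # assumed threshold
                       max_resolution_minutes: float = 60):   # assumed threshold
    """Return the triggers an incident meets; an empty list means no postmortem."""
    reasons = []
    if incident.downtime_minutes > max_downtime_minutes:
        reasons.append("downtime beyond threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call intervention")
    if incident.resolution_minutes > max_resolution_minutes:
        reasons.append("resolution time above threshold")
    if incident.discovered_manually:
        reasons.append("manual discovery (monitoring failure)")
    return reasons


# Hypothetical incident: short outage, but data was lost and the fix was slow.
incident = Incident(downtime_minutes=5, data_lost=True, on_call_intervened=False,
                    resolution_minutes=90, discovered_manually=False)
reasons = postmortem_reasons(incident)
print(reasons)
```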

Question 16

Q

What are the common beliefs of DevOps and SRE?

Answer

A

no more silos (lack of collaboration)
accidents are normal (use them to grow; SRE uses error budgets, DevOps uses feedback loops)
change should be gradual (less risk + automatic testing)
tooling and culture are interrelated
measurement is crucial