Week 9 Flashcards
What are the 5 principles of SRE?
- embrace risk
- use service level objectives
- use error budgets
- eliminate toil
- monitoring
What do we mean by ‘embrace risk’ in SRE?
Increasing reliability is not always better for a service and its users.
Maximising stability:
- reduces feature deployment speed
- increases feature deployment cost
- limits the number of features
What do we mean by ‘use service level objectives’ in SRE?
Service level objectives (SLOs) are target values for service level indicators (SLIs), such as latency, throughput, or uptime. Service level agreements (SLAs) are contracts with users that state the consequences of meeting or missing the SLOs.
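As a rough illustration of an SLO as a target value for an SLI, the sketch below computes a latency SLI (the fraction of requests answered within a threshold) and compares it with a target. The function names, the 300 ms threshold, and the 99% target are assumptions chosen for the example, not values from the course.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests answered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli, slo_target):
    """SLO: the measured SLI must be at or above the target value."""
    return sli >= slo_target


# Hypothetical sample: 4 of the 5 requests finish within 300 ms.
sample = [120, 250, 90, 800, 300]
sli = latency_sli(sample, threshold_ms=300)
print(sli, meets_slo(sli, slo_target=0.99))
```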
What do we mean by ‘use error budgets’ in SRE?
An error budget balances reliability against innovation:
1. the expected uptime is agreed
2. the actual uptime is measured
3. the error budget is the difference between the actual uptime (2) and the agreed uptime (1)
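The three steps can be sketched as simple arithmetic. This is a minimal example that expresses uptime as a fraction of a measurement window; the function names, the 99.9% target, and the 30-day window are illustrative assumptions.

```python
def error_budget_minutes(agreed_uptime, period_minutes):
    """Downtime that the agreed uptime target allows over the period."""
    return (1 - agreed_uptime) * period_minutes


def remaining_budget_minutes(agreed_uptime, period_minutes, downtime_minutes):
    """Budget left after subtracting the downtime actually measured.

    Equivalent to (actual uptime - agreed uptime) * period_minutes.
    A negative result means the agreed uptime was missed.
    """
    return error_budget_minutes(agreed_uptime, period_minutes) - downtime_minutes


period = 30 * 24 * 60  # minutes in a 30-day window (assumed window length)
budget = error_budget_minutes(0.999, period)
left = remaining_budget_minutes(0.999, period, downtime_minutes=10)
print(round(budget, 1), round(left, 1))
```

With these assumed numbers the target allows roughly 43.2 minutes of downtime, so 10 minutes of measured downtime leaves about 33.2 minutes of budget.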
What do we mean by ‘eliminate toil’ in SRE?
Reducing toil: work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales only linearly as the service grows.
What do we mean by ‘Monitoring’ in SRE?
Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server uptimes.
What are the 4 golden signals of monitoring in SRE?
Latency - time for a service to respond
Traffic - the demand on the system, e.g. the number of requests/second
Errors - the rate of requests that fail
Saturation - the fraction of system capacity in use
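The four signals can be computed from a window of request records, as in the sketch below. The `Request` record, the choice of the median for latency, and the rule that HTTP status 500 and above counts as an error are all assumptions made for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, in_use, capacity):
    """Summarise one observation window as the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_ms_median": statistics.median(r.latency_ms for r in requests),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": in_use / capacity,
    }


# Hypothetical window: 4 requests in 2 seconds, 3 of 4 workers busy.
window = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
print(golden_signals(window, window_seconds=2, in_use=3, capacity=4))
```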
What are the SRE practices?
- practical alerting
- being on call
- emergency response
- managing incidents
- blameless postmortems
What is ‘Practical alerting’ in SRE?
A pull model of data collection provides one way to monitor large systems - a monitoring service regularly requests service data, often through a ‘/health’ endpoint.
Hard to scale because the monitor needs to know about every individual service.
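A minimal sketch of the pull model, using only the Python standard library. The helper names are hypothetical, and the example assumes each service answers HTTP 200 on `/health` when it is healthy. Note that the monitor has to be handed the full list of service URLs, which is the scaling problem described above.

```python
import urllib.request


def fetch_health(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises URLError/HTTPError, both subclasses of OSError).
        return False


def poll_once(service_urls, fetch=fetch_health):
    """One polling round: the monitor asks every service it knows about."""
    return {url: fetch(url) for url in service_urls}


# Usage with a stand-in fetch function, so no network is needed.
fake_fetch = lambda url: url != "http://orders.internal"
print(poll_once(["http://cart.internal", "http://orders.internal"], fetch=fake_fetch))
```

In a real monitor, `poll_once` would run on a timer and the results would feed the alerting rules.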
Alternative to practical alerting in SRE?
Practical monitoring: rather than querying every service for health, have the services report in, using a push model of data collection, to a monitoring service or to logs. Easier to scale, as each service only needs to know about the monitoring service.
Hard to keep timestamps synchronised across all the services.
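The push model can be sketched with an in-process stand-in for the monitoring service. The class and method names are hypothetical. Each report carries the sender's own timestamp, and the monitor adds its receive time, which makes the timestamp problem visible: the two clocks need not agree.

```python
import time


class MonitoringService:
    """Hypothetical in-process stand-in for a push-model collector."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.reports = []

    def report(self, service, metric, value, sent_at):
        """Called by a service to push one measurement."""
        self.reports.append({
            "service": service,
            "metric": metric,
            "value": value,
            "sent_at": sent_at,            # read from the sender's clock
            "received_at": self._clock(),  # read from the monitor's clock
        })

    def apparent_delay(self, index):
        """Receive time minus send time: network delay plus any clock drift."""
        r = self.reports[index]
        return r["received_at"] - r["sent_at"]


# The service only needs a reference to the monitor, not to other services.
monitor = MonitoringService()
monitor.report("checkout", "error_count", 3, sent_at=time.time())
print(monitor.reports[0]["service"], monitor.apparent_delay(0))
```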
What is ‘Being on-call’ in SRE?
Responding to incidents within 5 minutes for user-facing systems and within 30 minutes for other systems.
SREs spend at most 50% of their time on purely operational work, leaving at least 50% for engineering projects.
What is ‘emergency response’ in SRE?
A proper response when systems break takes a playbook, regular rehearsal, and training. A proactive approach is to deliberately break systems, observe how they fail, and then make changes that improve their reliability and prevent these failures from recurring.
What is ‘Managing incidents’ in SRE?
Incident management limits disruption caused by an incident and restores normal business operations as quickly as possible.
Unmanaged vs managed incidents?
Unmanaged: characterised by teams fixated on technical problems, with poor communication and much freelancing
Managed: characterised by teams with separate responsibilities, clear communication, and a plan.
What are ‘Blameless postmortems’ in SRE?
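A written record of an incident, its impact, and its causes, focused on fixing the system rather than blaming people.
A postmortem is triggered by:
- downtime beyond a threshold
- data loss of any kind
- intervention by an on-call engineer
- resolution time above some threshold
- manual incident discovery, implying a monitoring failure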
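The trigger list can be read as a set of checks on an incident record, as in the sketch below. The `Incident` fields and the two threshold defaults (30 and 120 minutes) are hypothetical; the cards only say "a threshold" and do not give values.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    downtime_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    found_manually: bool  # discovered by a person, not by monitoring


def postmortem_triggers(incident, max_downtime_minutes=30, max_resolution_minutes=120):
    """Return the names of the triggers this incident hits (empty if none)."""
    checks = {
        "downtime beyond threshold": incident.downtime_minutes > max_downtime_minutes,
        "data loss": incident.data_lost,
        "on-call intervention": incident.on_call_intervened,
        "resolution time above threshold": incident.resolution_minutes > max_resolution_minutes,
        "manual discovery": incident.found_manually,
    }
    return [name for name, hit in checks.items() if hit]


# Hypothetical incident: 45 minutes down, fixed by the on-call engineer.
incident = Incident(45, False, True, 60, False)
print(postmortem_triggers(incident))
```

Any non-empty result means a postmortem should be written.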