Week 9 - Site Reliability Engineering Flashcards
What does SRE stand for?
Site Reliability Engineering
What does Site Reliability Engineering do?
Work that has historically been done by an operations team, carried out instead by engineers with software expertise, on the expectation that these engineers are both inclined and able to design and implement software automation that replaces human labour
Name 5 principles of Site Reliability Engineering.
Embrace risk
Use service level objectives
Use of error budgets
Eliminate toil
Monitoring
What does maximising stability cause?
A reduction in feature deployment speed
An increase in feature deployment cost
A limit on the number of features
What does SLO stand for?
Service Level Objective
What does SLI stand for?
Service Level Indicator
What are SLOs the target values for?
Service Level Indicators (SLIs)
Give 3 examples of SLIs.
Latency, throughput, uptime
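The relationship between SLIs and SLOs can be sketched in Python as a measured value compared against a target. This is a minimal illustration only: the class name, the SLI names and all target and measured values are hypothetical, not figures from these notes.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A target value for one SLI (all names and numbers are hypothetical)."""
    sli_name: str
    target: float
    higher_is_better: bool  # True for uptime/throughput, False for latency

    def is_met(self, measured: float) -> bool:
        # Compare the measured SLI against its target in the right direction.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


# Illustrative targets for the three example SLIs.
slos = [
    SLO("uptime_fraction", 0.999, higher_is_better=True),
    SLO("latency_ms", 300.0, higher_is_better=False),
    SLO("throughput_rps", 500.0, higher_is_better=True),
]

# Illustrative measurements, as a monitoring system might report them.
measured = {"uptime_fraction": 0.9995, "latency_ms": 420.0, "throughput_rps": 650.0}

# Map each SLI name to whether its SLO is currently met.
report = {slo.sli_name: slo.is_met(measured[slo.sli_name]) for slo in slos}
```

With these made-up numbers, only the latency objective is missed, because lower latency is better and the measured value is above the target.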
What does publishing SLOs as SLAs do?
Sets expectations about how a service will perform, reducing unfounded complaints about its performance
What does SLA stand for?
Service Level Agreement
How is the error budget calculated?
The expected uptime is agreed (1)
The actual uptime is measured (2)
The error budget is the difference between (1) and (2): the amount of unreliability still allowed in the period
Why is the error budget useful?
It quantifies how much unreliability is allowed within a quarter. Once the budget is exhausted, the team shifts its focus from developing updates to improving reliability
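The error budget arithmetic can be sketched as follows. This is a rough illustration under stated assumptions: the function names, the 90-day quarter and the 99.9% target are hypothetical, and downtime is assumed to be the only thing that spends the budget.

```python
QUARTER_MINUTES = 90 * 24 * 60  # assumption: a quarter is 90 days


def allowed_downtime(slo: float, period: float = QUARTER_MINUTES) -> float:
    """Total minutes of downtime the agreed uptime target permits."""
    return (1.0 - slo) * period


def remaining_budget(slo: float, downtime: float,
                     period: float = QUARTER_MINUTES) -> float:
    """Measured uptime minus agreed uptime, expressed in minutes."""
    actual_uptime = 1.0 - downtime / period
    return (actual_uptime - slo) * period


def should_focus_on_reliability(slo: float, downtime: float,
                                period: float = QUARTER_MINUTES) -> bool:
    """True once the budget is spent and feature work should pause."""
    return remaining_budget(slo, downtime, period) < 0


# Example: a 99.9% target allows roughly 129.6 minutes of downtime per quarter.
budget = allowed_downtime(0.999)
left_after_100_min = remaining_budget(0.999, downtime=100)
```

A negative remaining budget means the measured uptime has fallen below the agreed uptime, which is the point at which the team would shift to reliability work.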
What type of work counts as toil?
Manual (issuing commands, running scripts)
Repetitive (done over and over again)
Automatable (could be designed away)
Tactical (interrupt-driven and reactive)
Devoid of enduring value (nothing gets better)
Scalable only linearly (with traffic volume)
What is a typical goal for toil?
To keep toil below 50% of each Site Reliability Engineer’s time - at least 50% should be spent on engineering project work that will either reduce future toil or add service features
Monitoring involves collecting, processing, aggregating and displaying real-time quantitative data about a system. What sort of data could this be?
Query counts and types
Error counts and types
Processing times
Server uptimes
What are the four golden signals of monitoring?
Latency (the time for a service to respond)
Traffic (the number of requests/second)
Errors (the number of responses that are errors)
Saturation (the fraction of system capacity in use)
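The four signals can be computed from a window of request records, as in this minimal sketch. The record fields, the use of mean latency, the rule that status codes of 500 and above count as errors, and the definition of saturation as traffic divided by an assumed capacity are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served request (hypothetical record format)."""
    latency_ms: float
    status: int


def golden_signals(requests: List[Request], window_seconds: float,
                   capacity_rps: float) -> Dict[str, float]:
    """Summarise a window of requests as the four golden signals."""
    count = len(requests)
    traffic = count / window_seconds
    # Assumption: server-side failures are responses with status >= 500.
    errors = sum(1 for r in requests if r.status >= 500)
    # Assumption: mean latency; real systems often track percentiles instead.
    latency = sum(r.latency_ms for r in requests) / count if count else 0.0
    return {
        "latency_ms": latency,
        "traffic_rps": traffic,
        "errors": errors,
        # Assumption: capacity is a known maximum request rate.
        "saturation": traffic / capacity_rps,
    }


window = [Request(100, 200), Request(300, 500), Request(200, 200), Request(200, 503)]
signals = golden_signals(window, window_seconds=2, capacity_rps=10)
```

In practice saturation is usually measured on the most constrained resource (CPU, memory, I/O) rather than derived from request rate; the ratio here is only a stand-in.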
Name 5 practices of SRE.
Blameless postmortems
Managing incidents
Emergency response
Being on-call
Practical alerting
What does a pull model of data collection do?
Provides one way to monitor large systems - a monitoring service regularly requests data from each service, often through a “/health” endpoint. Hard to scale, as the monitoring service must know about every individual service
What does a push model of data collection do?
Provides another way to monitor large systems - a service regularly sends data to a monitoring service, perhaps to a flat file (logs) or to a time-series database (metrics). Easier to scale, as every individual service only needs to know about the monitoring service
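The difference between the two models comes down to who holds the reference to whom. The sketch below simulates both in a single process with no networking; all class and method names are hypothetical, and health() merely stands in for a "/health" endpoint.

```python
class Service:
    """A service that can describe its own state."""

    def __init__(self, name):
        self.name = name
        self.requests_served = 0

    def health(self):
        # Stands in for the response of a "/health" endpoint.
        return {"service": self.name, "requests_served": self.requests_served}


class PullMonitor:
    """Pull model: the monitor must be told about every service."""

    def __init__(self, services):
        self.services = list(services)

    def scrape(self):
        # The monitor initiates collection by asking each known service.
        return [service.health() for service in self.services]


class PushMonitor:
    """Push model: the monitor only receives whatever is sent to it."""

    def __init__(self):
        self.received = []

    def receive(self, sample):
        self.received.append(sample)


class PushingService(Service):
    """Push model: each service only needs to know the monitor."""

    def __init__(self, name, monitor):
        super().__init__(name)
        self.monitor = monitor

    def report(self):
        # The service initiates collection by sending its own data.
        self.monitor.receive(self.health())


# Pull: the monitor is configured with the full list of services.
pulled = PullMonitor([Service("api"), Service("db")]).scrape()

# Push: a new service can start reporting without the monitor being reconfigured.
push_monitor = PushMonitor()
PushingService("api", push_monitor).report()
PushingService("cache", push_monitor).report()
```

Adding a service in the pull model means changing the monitor's list, whereas in the push model it only means giving the new service the monitor's address.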
How long should it take an SRE to respond to a user-facing incident?
5 minutes
How long should it take an SRE to respond to system incidents other than user-facing ones?
30 minutes
What does a proper response take?
A playbook, regular rehearsal and training
What does a proactive approach to emergency response involve?
Deliberately breaking systems to watch how they fail, then making changes to improve reliability and prevent failures from recurring
What does incident management do?
Limits the disruption caused by an incident and restores normal business operations as quickly as possible
What 2 things can an incident be?
Unmanaged - characterised by teams fixated on technical problems, with poor communication and much freelancing
Managed - characterised by teams with separate responsibilities, clear communication and a plan
What can a postmortem be triggered by?
Downtime beyond a certain threshold
Data loss of any kind
Intervention by an on-call engineer
Resolution time above some threshold
Manual incident discovery implying a monitoring failure
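The triggers above can be expressed as a simple checklist, sketched here in Python. The record fields and both threshold values are hypothetical; each team would set its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Summary of one incident (hypothetical record format)."""
    downtime_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    detected_by_monitoring: bool


def postmortem_triggers(incident: Incident,
                        max_downtime_minutes: float = 10.0,
                        max_resolution_minutes: float = 60.0) -> List[str]:
    """Return the reasons, if any, that this incident needs a postmortem."""
    reasons = []
    if incident.downtime_minutes > max_downtime_minutes:
        reasons.append("downtime beyond threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call intervention")
    if incident.resolution_minutes > max_resolution_minutes:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        # Found by a human, so monitoring missed it.
        reasons.append("manual discovery (monitoring failure)")
    return reasons


minor = Incident(3, False, False, 20, True)
major = Incident(15, True, False, 20, False)
```

An empty list means no trigger fired; any non-empty list means a postmortem is due, and the entries say why.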
What is a blameless postmortem?
A postmortem that focuses on identifying the causes of the incident without indicting any individual - there is no counter-productive finger-pointing or shaming of individuals
Name 5 common tenets of DevOps and SRE.
No more silos
Accidents are normal
Change should be gradual
Tooling and culture are interrelated
Measurement is crucial
What do silos lead to?
A lack of collaboration, and to incentives for purely local optimisation