Week 9 - Site Reliability Engineering Flashcards
What does SRE stand for?
Site Reliability Engineering
What does Site Reliability Engineering do?
It takes on work that has historically been done by an operations team, but staffs it with engineers who have software expertise, on the premise that these engineers are both inclined and able to design and implement automation that replaces human labour
Name 5 principles of Site Reliability Engineering.
Embrace risk
Use service level objectives
Use of error budgets
Eliminate toil
Monitoring
What does maximising stability cause?
A reduction in feature deployment speed
An increase in feature deployment cost
A limit on the number of features that can be offered
What does SLO stand for?
Service Level Objective
What does SLI stand for?
Service Level Indicator
What are SLOs the target values for?
Service Level Indicators (SLIs)
Give 3 examples of SLIs.
Latency, throughput, uptime
What does publishing SLOs as SLAs do?
Sets expectations about how a service will perform, reducing unfounded complaints about its performance
What does SLA stand for?
Service Level Agreement
How is the error budget calculated?
The expected uptime is agreed (1)
The actual uptime is measured (2)
The error budget is the difference between (1) and (2), i.e. how much unreliability is still allowed in the period
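A minimal sketch of the calculation above, in Python. The function name, the 90-day quarter and the choice of minutes as the unit are assumptions made for illustration; they are not part of the flashcard.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, downtime_minutes: float,
                         period_days: int = 90) -> tuple[float, float]:
    """Return (allowed_downtime, remaining_budget), both in minutes.

    slo_target       -- agreed uptime as a fraction, e.g. 0.999 (step 1)
    downtime_minutes -- measured downtime in the period (step 2)
    period_days      -- length of the period; 90 days is an assumed quarter
    """
    total_minutes = period_days * MINUTES_PER_DAY
    # Downtime that the agreed uptime still permits over the whole period.
    allowed = (1.0 - slo_target) * total_minutes
    # Difference between what was agreed and what was measured.
    remaining = allowed - downtime_minutes
    return allowed, remaining


allowed, remaining = error_budget_minutes(0.999, downtime_minutes=50)
print(f"allowed: {allowed:.1f} min, remaining: {remaining:.1f} min")
```

With a 99.9% target over 90 days, about 129.6 minutes of downtime are allowed; after 50 minutes of measured downtime roughly 79.6 minutes remain. A negative remainder would mean the budget has been overspent.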
Why is the error budget useful?
It defines how much unreliability is acceptable within a quarter. Once that budget is spent, the team shifts its focus from developing updates to improving reliability
What type of work counts as toil?
Manual (issuing commands, running scripts)
Repetitive (done over and over again)
Automatable (could be designed away)
Tactical (interrupt-driven and reactive, rather than strategy-driven)
Devoid of enduring value (nothing gets better)
Scalable only linearly (with traffic volume)
What is a typical goal for toil?
To keep toil below 50% of each Site Reliability Engineer’s time - at least 50% should be spent on engineering project work that will either reduce future toil or add service features
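One way to check the 50% goal from logged hours, sketched in Python. The function names and the idea of tracking time in hours are hypothetical; only the 50% threshold comes from the card.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of an engineer's time that went to toil (0.0 to 1.0)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def exceeds_toil_goal(toil_hours: float, total_hours: float,
                      limit: float = 0.5) -> bool:
    """True when toil takes more than `limit` of the engineer's time."""
    return toil_fraction(toil_hours, total_hours) > limit


# Example: 25 hours of toil in a 40-hour week is 62.5%, above the goal.
print(exceeds_toil_goal(25, 40))
```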
Monitoring involves collecting, processing, aggregating and displaying real-time quantitative data about a system. What sort of data could this be?
Query counts and types
Error counts and types
Processing times
Server uptimes
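A small sketch of the "processing and aggregating" step for the first three kinds of data, in Python. The event format (dicts with `query_type`, `error` and `duration_ms` keys) is an assumption made for illustration, not a real monitoring API.

```python
from collections import Counter
from statistics import mean


def summarize(events: list[dict]) -> dict:
    """Aggregate raw request events into simple monitoring figures."""
    # Query counts by type.
    query_counts = Counter(e["query_type"] for e in events)
    # Error counts by type; successful requests carry error=None.
    error_counts = Counter(e["error"] for e in events
                           if e["error"] is not None)
    # Mean processing time; 0.0 when there are no events.
    avg_ms = mean(e["duration_ms"] for e in events) if events else 0.0
    return {
        "query_counts": dict(query_counts),
        "error_counts": dict(error_counts),
        "avg_duration_ms": avg_ms,
    }


# Hypothetical sample data.
events = [
    {"query_type": "read", "error": None, "duration_ms": 12},
    {"query_type": "read", "error": "timeout", "duration_ms": 500},
    {"query_type": "write", "error": None, "duration_ms": 40},
]
print(summarize(events))
```

A real system would collect these events continuously and display the aggregates on a dashboard; this only shows the aggregation itself.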