L19 - Site Reliability Engineering Flashcards
What is the role of an SRE?
- An implementation of the DevOps principles
- Responsible for reliability, availability and performance of distributed systems.
What other roles do SREs usually collaborate closely with?
Software Developers and System Administrators.
What principles does the SRE role work by?
CALMS
- Culture
- Automation
- Lean
- Measurements
- Sharing
SRE personnel can either be independent teams or embedded into cross-functional teams. When is each appropriate?
- Independent for large firms such as Google, Microsoft etc.
- Embedded for smaller, agile organisations.
What is a benefit of having independent SRE teams?
Easier to share knowledge with other SRE teams across the organisation.
What is a benefit of having embedded SRE teams?
Less communication overhead when collaborating with the team.
Regarding the Culture principle, what are the 2 main aspects of ensuring good culture in an SRE team?
Blamelessness -> No finger pointing, culture of confidence and unity.
Shared Knowledge -> Tight communication loops and shared post mortem reports.
What is a Post-mortem report?
A log of an incident, the resulting impact, and the actions taken to resolve the issue.
Regarding the Automation principle, what are the 4 reasons this is important?
Automation helps:
- Eliminate toil tasks.
- Reduce human error.
- Complete tasks faster.
- Make operations more reliable.
Define a toil task…
- Tasks that are tedious, repetitive and manual.
What type of tasks are ideal for automation?
Toil tasks
How many incidents should an SRE deal with per shift? Give reasons…
- 2 incidents in an 8-12 hour shift.
- Prevents pager fatigue
- Ensures higher quality resolutions as opposed to rushing solutions.
- Reduces mental context shifting
How can SREs implement the Lean culture principle?
- Ensure a low backlog of tasks and work in progress.
- Use a Control Loop driven by an Error Budget to determine capacity for system downtime.
- Polarizing time -> Ensure SREs know which tasks to work on throughout the day, reducing context switching between tasks and improving productivity.
In a Control Loop driven by an Error Budget, what happens if the Error Budget is positive or negative?
Positive: Developers can release more features into production.
Negative: Developers can’t release any more features into production.
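The positive/negative rule above can be sketched as a small release gate. This is a minimal illustration, not a standard implementation: the function names, the request-based budget calculation, and the choice to block releases when the budget is exactly zero are all assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the number of failures still allowed in this window.

    Assumption: the budget is measured in failed requests, where the
    allowed failure count is (1 - SLO target) * total requests.
    A negative result means the budget has been overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def can_release(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Control-loop decision: release only while the error budget is positive.

    Assumption: a budget of exactly zero is treated the same as a negative one.
    """
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
    print(can_release(0.999, 1_000_000, 400))    # budget still positive
    print(can_release(0.999, 1_000_000, 1_500))  # budget overspent
```

With these hypothetical numbers, 400 failures leave budget to spend on new features, while 1,500 failures put the budget in the negative and freeze releases until reliability recovers.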
Why is Measurements an important SRE principle?
- Data should be collected and tracked to compare against benchmark metrics.
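As a rough sketch of comparing collected data against benchmark metrics, the example below checks a set of measured values against targets. The `Benchmark` type, the metric names, and the target values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """A benchmark target for one metric (hypothetical structure)."""
    target: float
    higher_is_better: bool  # True for availability, False for latency or error rate


def meets_benchmark(value: float, benchmark: Benchmark) -> bool:
    """Return True if the measured value satisfies the benchmark target."""
    if benchmark.higher_is_better:
        return value >= benchmark.target
    return value <= benchmark.target


def compare_to_benchmarks(measured: dict, benchmarks: dict) -> dict:
    """Map each benchmarked metric that was measured to a pass/fail result."""
    return {
        name: meets_benchmark(measured[name], benchmark)
        for name, benchmark in benchmarks.items()
        if name in measured
    }


if __name__ == "__main__":
    # Hypothetical benchmark targets and measurements.
    benchmarks = {
        "availability": Benchmark(target=0.999, higher_is_better=True),
        "p95_latency_ms": Benchmark(target=300.0, higher_is_better=False),
    }
    measured = {"availability": 0.9995, "p95_latency_ms": 340.0}
    print(compare_to_benchmarks(measured, benchmarks))
```

Tracking results like these over time shows which metrics are drifting away from their benchmarks and where effort should go next.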