Site Reliability Engineering Flashcards
What is the first principle of DevOps?
Culture: SRE implements it through a dedicated SRE team, or sometimes through SREs acting as consultants in development teams. It includes the blameless outlook and Organisational Learning.
What does the blameless outlook mean in DevOps?
It means a postmortem does not single out individuals or teams for bad behavior, avoiding finger-pointing to encourage open communication.
What is Organisational Learning in DevOps?
It means circulating postmortem reports among engineers as an opportunity to improve weaknesses and make the enterprise more resilient.
What is the second principle of DevOps?
Automation: Implemented by SRE through software to eliminate toil, focusing on engineering work. SRE teams spend significant time on engineering (at least 50%).
What is toil in DevOps?
Operations work necessary to run a service that is manual, repetitive, automatable, and tactical, and that lacks enduring value.
What is the ideal number of incidents for an SRE to deal with in a shift?
No more than two incidents per 8-12 hour shift, to avoid rushed investigations and pager fatigue.
What is the third principle of DevOps?
Lean: Implemented through a control loop driven by an error budget, which limits work in progress, and through polarising time, which separates development tasks from operations tasks.
What is an error budget in DevOps?
It is the amount of unreliability that the agreed reliability target allows. The unspent budget is the difference between observed reliability and agreed reliability, and it helps to determine when developers can release more features or must hold back.
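As a rough illustration of the arithmetic, the sketch below computes how much of an error budget is left over a measurement window. The function name, the event-count inputs, and the release rule are assumptions made for this example, not part of any standard tool.

```python
def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo is the agreed reliability as a fraction, e.g. 0.999.
    Hypothetical helper for illustration only.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # Failures the agreed reliability permits in this window.
    allowed_failures = (1.0 - slo) * total_events
    # Failures actually observed in this window.
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


def can_release(slo: float, good_events: int, total_events: int) -> bool:
    """Assumed rule: releases continue only while some budget is unspent."""
    return error_budget_remaining(slo, good_events, total_events) > 0.0
```

With an agreed reliability of 0.999 over 1,000,000 requests, about 1,000 failures are allowed; 500 observed failures leave roughly half the budget, while 1,500 overspend it and would pause releases under the assumed rule.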
What is the fourth principle of DevOps?
Measurement: Implemented by selecting key metrics to monitor obsessively, including SLI, SLO, and SLA.
What is a Service Level Indicator (SLI)?
A quantitative measure of some aspect of the level of service provided.
What is a Service Level Objective (SLO)?
A target value or range of values for a Service Level Indicator (SLI).
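To make the distinction concrete, here is a minimal sketch in which the SLI is a measured availability ratio and the SLO is the target it is compared against. The choice of availability as the indicator and both function names are assumptions for this example.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the target value that the measured SLI is expected to reach."""
    return sli >= slo_target
```

For example, 9,990 successes out of 10,000 requests meets a target of 0.995 but misses a target of 0.9995.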
What is a Service Level Agreement (SLA)?
A contract that sets out the consequences of meeting or missing an SLO.
What is the fifth principle of DevOps?
Sharing: Implemented through the sharing of knowledge, tools, and techniques between development and operations.
What does Sharing Knowledge in DevOps involve?
Development advises operations of upcoming functionality, and operations advises development of performance.
What does Sharing Tools and Techniques in DevOps involve?
Ensuring there are common tools and techniques for managing environments, so that anyone can self-service deployments.
What makes a good alert? Why might a Site Reliability Engineer (SRE) care?
A good alert is actionable and is for something that could not be fixed without a human being. If automated remediation is possible, at least try that first.
An SRE cares because they lose sleep over bad ones.
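The "try automation first" rule can be sketched as a small handler that pages a human only when the automated fix does not work. The `remediate` and `page_human` callables are hypothetical placeholders for whatever remediation script and paging system are actually in use.

```python
from typing import Callable


def handle_alert(
    remediate: Callable[[], bool],
    page_human: Callable[[str], None],
    description: str,
) -> str:
    """Attempt automated remediation; page a human only if it fails.

    remediate returns True when the problem was fixed automatically.
    Both callables are assumptions made for this illustration.
    """
    try:
        fixed = remediate()
    except Exception:
        # A crashing remediation counts as a failed remediation.
        fixed = False
    if fixed:
        return "auto-remediated"
    page_human(description)
    return "paged"
```

In this sketch the on-call engineer is woken only for problems that automation could not resolve, which is what makes the resulting page actionable.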
What is a reliability theatre? Why might a Site Reliability Engineer care?
A traditional Network Operations Centre (NOC) or War Room is seen as a reliability theatre that impresses only the general public.
An SRE cares because it may limit the effectiveness of incident response.
What is a snowflake? Why might a Site Reliability Engineer care?
A production server that is kept running through regular manual configuration tweaks made via the command line.
An SRE cares because they are hard to reproduce and debug.
What are pets, cattle, and poultry? Why might a Site Reliability Engineer care?
- Pets are virtual (snowflake) servers with names that need individual attention
- Cattle are virtual servers with numbers that need group attention
- Poultry are virtual containers with numbers that need group attention
An SRE cares because the administrative cost per unit decreases from pets to cattle to poultry.
Why is autonomous > automated? Why might a Site Reliability Engineer care?
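Because it is less work: an automated system still needs a human to trigger or supervise it, whereas an autonomous system acts on its own.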
An SRE cares because autonomous systems can take away a world of pain from the on-call rotation.
What advantages are there to embedding a Site Reliability Engineer in a development team?
It builds trust between SRE and development, and SRE gets input into system design from the very beginning.
What is the right number of nines?
The right number of nines is a decision made on the basis of how much downtime the business can tolerate.
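To see what each extra nine buys, the sketch below converts a number of nines into the downtime it permits per year. It assumes a 365-day year and that availability is measured as uptime over wall-clock time; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes_per_year(nines: int) -> float:
    """Return the yearly downtime permitted by an availability of N nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return unavailability * MINUTES_PER_YEAR
```

Three nines allow roughly 526 minutes (under nine hours) of downtime per year, while five nines allow a little over five minutes, which is why each additional nine is a business decision rather than a default.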
Why is it dangerous to improve a system without revising its Service Level Agreement (SLA)?
Because customers will consider the delivered level of reliability to be the agreed level.