Week 9 - Site Reliability Engineering Flashcards by Kate Belson

What does SRE stand for?

Site Reliability Engineering

What does Site Reliability Engineering do?

Work that has historically been done by an operations team, but carried out by engineers with software expertise, on the premise that these engineers are both inclined and able to design and implement software automation that replaces human labour

Name 5 principles of Site Reliability Engineering.

Embrace risk

Use service level objectives

Use error budgets

Eliminate toil

Monitoring

What does maximising stability cause?

A reduction in feature deployment speed

An increase in feature deployment cost

A limit on the number of features

What does SLO stand for?

Service Level Objective

What does SLI stand for?

Service Level Indicator

What are SLOs the target values for?

Service Level Indicators (SLIs)

Give 3 examples of SLIs.

Latency, throughput, uptime
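
A minimal sketch of how measured SLIs might be compared with their SLO targets. The metric names and threshold values are hypothetical, chosen only for illustration.

```python
# Hypothetical SLO targets: each one is a target value for a measured SLI.
SLO_TARGETS = {
    "latency_ms_p99": ("max", 300.0),   # latency should stay at or below this
    "throughput_rps": ("min", 500.0),   # throughput should stay at or above this
    "uptime_fraction": ("min", 0.999),  # uptime should stay at or above this
}


def slo_report(measured_slis: dict[str, float]) -> dict[str, bool]:
    """Return, per SLI, whether the measured value meets its SLO target."""
    report = {}
    for name, (kind, target) in SLO_TARGETS.items():
        value = measured_slis[name]
        report[name] = value <= target if kind == "max" else value >= target
    return report
```

Calling slo_report with a measured latency of 250 ms, throughput of 450 requests/second and uptime of 0.9995 would flag only throughput as missing its target.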

What does publishing SLOs as SLAs do?

Sets expectations about how a service will perform, reducing unfounded complaints about its performance

What does SLA stand for?

Service Level Agreement

How is the error budget calculated?

The expected uptime is agreed (1)

The actual uptime is measured (2)

The error budget is the difference between (1) and (2): the amount of unreliability still allowed in the period
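
As a worked sketch of this calculation, assume a 99.9% uptime target over a 90-day quarter. Both figures are illustrative assumptions, not values from the deck.

```python
# Illustrative figures only: the 90-day quarter is an assumption.
MINUTES_PER_QUARTER = 90 * 24 * 60  # 129,600 minutes


def allowed_downtime_minutes(slo_target: float,
                             period_minutes: int = MINUTES_PER_QUARTER) -> float:
    """Total budget: the downtime the agreed uptime target permits in the period."""
    return (1.0 - slo_target) * period_minutes


def remaining_budget_minutes(slo_target: float, measured_uptime: float,
                             period_minutes: int = MINUTES_PER_QUARTER) -> float:
    """Difference between measured uptime (2) and agreed uptime (1), in minutes.

    Positive means budget is left; negative means the budget is overspent.
    """
    return (measured_uptime - slo_target) * period_minutes
```

With a 99.9% target the quarter allows about 129.6 minutes of downtime. A measured uptime of 99.95% leaves about 64.8 minutes of budget, while 99.8% overspends it.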

Why is the error budget useful?

It says how much risk is allowed within a quarter. Once the agreed-upon number is exceeded, the team shifts its focus from the development of updates to improving reliability

What type of work counts as toil?

Manual (issuing commands, running scripts)
Repetitive (done over and over again)
Automatable (could be designed away)
Tactical (interrupt-driven and reactive)
Devoid of enduring value (nothing gets better)
Scalable only linearly (with traffic volume)

What is a typical goal for toil?

To keep toil below 50% of each Site Reliability Engineer’s time - at least 50% should be spent on engineering project work that will either reduce future toil or add service features

Monitoring involves collecting, processing, aggregating and displaying real-time quantitative data about a system. What sort of data could this be?

Query counts and types

Error counts and types

Processing times

Server uptimes

What are the four golden signals of monitoring?

Latency (the time for a service to respond)

Traffic (the number of requests/second)

Errors (the number of responses that are errors)

Saturation (the fraction of system capacity in use)
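
A rough sketch of deriving the four signals from one window of request records. The Request type, the window length and the capacity figure are hypothetical, and saturation is approximated here as traffic relative to an assumed maximum request rate.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time the service took to respond
    is_error: bool      # whether the response was an error


def golden_signals(requests: list[Request], window_seconds: float,
                   capacity_rps: float) -> dict[str, float]:
    """Summarise one monitoring window as the four golden signals."""
    if not requests:
        return {"latency_ms": 0.0, "traffic_rps": 0.0, "errors": 0, "saturation": 0.0}
    traffic = len(requests) / window_seconds
    return {
        "latency_ms": sum(r.duration_ms for r in requests) / len(requests),  # mean latency
        "traffic_rps": traffic,                        # requests per second
        "errors": sum(r.is_error for r in requests),   # count of error responses
        "saturation": traffic / capacity_rps,          # crude proxy for capacity in use
    }
```

Real systems usually track latency percentiles rather than a mean, and measure saturation on the most constrained resource (CPU, memory, I/O); the mean and the traffic ratio are simplifications.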

Name 5 practices of SRE.

Blameless postmortems
Managing incidents
Emergency response
Being on-call
Practical alerting

What does a pull model of data collection do?

Provides one way to monitor large systems - a monitoring service regularly requests data from each service, often through a “/health” endpoint. Hard to scale, as the monitoring service must know about every individual service
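
A minimal sketch of one polling round in a pull model, using only the Python standard library. The service names and URLs are hypothetical; the registry shows why the monitor has to know about every service.

```python
import urllib.request

# Hypothetical registry: in a pull model the monitor must list every service itself.
SERVICES = {
    "orders": "http://orders.internal:8080/health",
    "billing": "http://billing.internal:8080/health",
}


def http_status(url: str, timeout: float = 2.0) -> int:
    """Fetch a URL and return its HTTP status code (0 if the request fails)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except OSError:  # covers URLError, HTTPError, timeouts and refused connections
        return 0


def poll_once(services: dict[str, str], fetch=http_status) -> dict[str, bool]:
    """One polling round: ask each known service for its health."""
    return {name: fetch(url) == 200 for name, url in services.items()}
```

A monitor would call poll_once(SERVICES) on a timer. The fetch parameter exists so the polling logic can be exercised without a network.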

What does a push model of data collection do?

Provides another way to monitor large systems - each service regularly sends data to a monitoring service (perhaps to a flat file (logs) or to a time-series database (metrics)). Easier to scale, as each individual service only needs to know about the monitoring service
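
A minimal sketch of the push direction. MonitoringSink is a hypothetical in-memory stand-in for a real log file or time-series database, and the metric names are illustrative.

```python
import time


class MonitoringSink:
    """In-memory stand-in for a monitoring service (e.g. a time-series database)."""

    def __init__(self) -> None:
        self.samples: list[tuple[str, str, float, float]] = []

    def receive(self, service: str, metric: str, value: float, timestamp: float) -> None:
        self.samples.append((service, metric, value, timestamp))


class Service:
    """A service only needs a reference to the sink, not to any other service."""

    def __init__(self, name: str, sink: MonitoringSink) -> None:
        self.name = name
        self.sink = sink

    def report(self, metric: str, value: float) -> None:
        # Push: the service initiates the transfer, typically on a timer.
        self.sink.receive(self.name, metric, value, time.time())
```

Adding another service means constructing one more Service with the same sink; nothing on the monitoring side has to be reconfigured.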

How long should it take an SRE to respond to a user-facing incident?

5 minutes

How long should it take an SRE to respond to system incidents other than user-facing ones?

30 minutes

What does a proper response take?

A playbook, regular rehearsal and training

What does a proactive approach to emergency response involve?

Deliberately breaking systems to watch how they fail, then making changes to improve reliability and prevent failures from recurring

What does incident management do?

Limits the disruption caused by an incident and restores normal business operations as quickly as possible

What 2 things can an incident be?

Unmanaged - characterised by teams fixated on technical problems, with poor communication and much freelancing
Managed - characterised by teams with separate responsibilities, clear communication and a plan

What can a postmortem be triggered by?

Downtime beyond a certain threshold
Data loss of any kind
Intervention by an on-call engineer
Resolution time above some threshold
Manual incident discovery implying a monitoring failure
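
The triggers above could be checked mechanically, as in this sketch. The Incident fields and both threshold values are hypothetical; each team would agree its own.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    downtime_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    discovered_manually: bool


# Hypothetical thresholds, for illustration only.
MAX_DOWNTIME_MINUTES = 10.0
MAX_RESOLUTION_MINUTES = 60.0


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every trigger that this incident meets (empty list: no postmortem)."""
    reasons = []
    if incident.downtime_minutes > MAX_DOWNTIME_MINUTES:
        reasons.append("downtime beyond threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call intervention")
    if incident.resolution_minutes > MAX_RESOLUTION_MINUTES:
        reasons.append("resolution time above threshold")
    if incident.discovered_manually:
        reasons.append("manual discovery (monitoring failure)")
    return reasons
```

Any non-empty result would mean a postmortem is due, and the list itself records why.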

What is a blameless postmortem?

A postmortem that focuses on identifying the causes of the incident without indicting any individual - there is no counter-productive finger pointing or shaming of individuals

Name 5 common tenets of DevOps and SRE.

No more silos
Accidents are normal
Change should be gradual
Tooling and culture are interrelated
Measurement is crucial

What do silos lead to?

A lack of collaboration, and to incentives for purely local optimisation

(29 cards)