Site Reliability Engineering Flashcards

1
Q

What is the first principle of DevOps?

A

Culture: Implemented by SRE through a separate SRE team, or sometimes through SREs acting as consultants in development teams. It includes the blameless outlook and Organisational Learning.

2
Q

What does the blameless outlook mean in DevOps?

A

It means a postmortem does not single out individuals or teams for bad behavior, avoiding finger-pointing to encourage open communication.

3
Q

What is Organisational Learning in DevOps?

A

It means circulating postmortem reports among engineers as an opportunity to improve weaknesses and make the enterprise more resilient.

4
Q

What is the second principle of DevOps?

A

Automation: Implemented by SRE by writing software that eliminates toil, so the team can focus on engineering work. SRE teams aim to spend at least 50% of their time on engineering.

5
Q

What is toil in DevOps?

A

Operations work necessary to run a service that is manual, repetitive, automatable, tactical, and lacks enduring value.

6
Q

What is the ideal number of incidents for an SRE to deal with in a shift?

A

No more than two incidents per 8-12 hour shift, to avoid rushed investigations and pager fatigue.

7
Q

What is the third principle of DevOps?

A

Lean: Implemented through a control loop driven by an error budget, which limits work in progress, and through polarising time, which separates development tasks from operations tasks.

8
Q

What is an error budget in DevOps?

A

It is the difference between observed reliability and agreed reliability, helping to determine when developers can release more features or must hold back.
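
As a rough illustration of the definition above, the following sketch computes an error budget as observed reliability minus agreed reliability and uses it to decide whether releases may continue. The function names and the example figures are hypothetical, chosen only for illustration.

```python
def remaining_error_budget(observed: float, agreed: float) -> float:
    """Error budget left: observed reliability minus agreed reliability.

    Both arguments are fractions between 0 and 1 (e.g. 0.999 for 99.9%).
    A negative result means the budget is overspent.
    """
    return observed - agreed


def can_release(observed: float, agreed: float) -> bool:
    """Allow new feature releases only while some error budget remains."""
    return remaining_error_budget(observed, agreed) > 0


if __name__ == "__main__":
    # Hypothetical figures: 99.95% observed against a 99.9% agreement.
    print(can_release(0.9995, 0.999))
    # Hypothetical figures: 99.8% observed against a 99.9% agreement.
    print(can_release(0.998, 0.999))
```

In this sketch a service running above its agreed reliability has budget to spend on releases, while one running below it must hold back until reliability recovers.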

9
Q

What is the fourth principle of DevOps?

A

Measurement: Implemented by selecting key metrics to monitor obsessively, including SLI, SLO, and SLA.

10
Q

What is a Service Level Indicator (SLI)?

A

A quantitative measure of some aspect of the level of service provided.

11
Q

What is a Service Level Objective (SLO)?

A

A target value or range of values for a Service Level Indicator (SLI).
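
To make the relationship between an SLI and an SLO concrete, here is a minimal sketch that assumes availability (the fraction of successful requests) as the SLI and a single target value as the SLO. The choice of availability, the function names, and the request counts are assumptions made for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        # With no traffic there were no failures; treat as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must be at or above the target value."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical traffic: 9,990 good requests out of 10,000.
    sli = availability_sli(9_990, 10_000)
    print(sli, meets_slo(sli, slo_target=0.999))
```

The SLI is what is measured and the SLO is the target it is compared against.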

12
Q

What is a Service Level Agreement (SLA)?

A

A contract that sets out the consequences of meeting or missing an SLO.

13
Q

What is the fifth principle of DevOps?

A

Sharing: Implemented through the sharing of knowledge, tools, and techniques between development and operations.

14
Q

What does Sharing Knowledge in DevOps involve?

A

Development advises operations of upcoming functionality, and operations advises development of performance.

15
Q

What does Sharing Tools and Techniques in DevOps involve?

A

Ensuring common ways to manage environments and techniques, allowing anyone to self-service deployments.

16
Q

What makes a good alert? Why might a Site Reliability Engineer (SRE) care?

A

A good alert is actionable and concerns something that cannot be fixed without a human being. If automated remediation is possible, at least try that first.

An SRE cares because they lose sleep over bad ones.
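
The rule in the answer above (try automated remediation first, and page a human only when it fails) can be sketched as follows. The `remediate` and `page_human` callables are hypothetical placeholders, not part of any real alerting tool.

```python
from typing import Callable


def handle_alert(remediate: Callable[[], bool],
                 page_human: Callable[[], None]) -> str:
    """Try automated remediation; page a human only if it does not work.

    `remediate` returns True when the fix succeeded. Any exception it
    raises is treated as a failed remediation.
    """
    try:
        fixed = remediate()
    except Exception:
        fixed = False
    if fixed:
        return "auto-remediated"
    page_human()
    return "paged"


if __name__ == "__main__":
    pages = []
    # Hypothetical remediation that succeeds: nobody is paged.
    print(handle_alert(lambda: True, lambda: pages.append("page")))
    # Hypothetical remediation that fails: the on-call engineer is paged.
    print(handle_alert(lambda: False, lambda: pages.append("page")))
```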

17
Q

What is a reliability theatre? Why might a Site Reliability Engineer care?

A

A traditional Network Operations Centre (NOC) or war room is seen as reliability theatre: it impresses only the general public.

An SRE cares because it may limit the effectiveness of incident response.

18
Q

What is a snowflake? Why might a Site Reliability Engineer care?

A

A production server that is kept running through regular manual configuration tweaks made via the command line.

An SRE cares because they are hard to reproduce and debug.

19
Q

What are pets, cattle, and poultry? Why might a Site Reliability Engineer care?

A
  • Pets are virtual (snowflake) servers with names that need individual attention
  • Cattle are virtual servers with numbers that need group attention
  • Poultry are virtual containers with numbers that need group attention

An SRE cares because the administrative cost decreases from pets to cattle to poultry.

20
Q

Why is autonomous > automated? Why might a Site Reliability Engineer care?

A

Because an autonomous system runs without a human having to trigger or supervise it, so it is less work.

An SRE cares because autonomous systems can take away a world of pain from the on-call rotation.

21
Q

What advantages are there to embedding a Site Reliability Engineer in a development team?

A

It builds trust between SRE and development, and SRE gets input into system design from the very beginning.

22
Q

What is the right number of nines?

A

The right number of nines is a decision made on the basis of how much downtime the business can tolerate.
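
To see what each choice of nines means in terms of tolerable downtime, the sketch below converts a number of nines into the downtime it permits over a period. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime permitted by an availability target with the given nines.

    Two nines means 99%, three nines means 99.9%, and so on, so the
    permitted unavailability is 10 to the power of minus `nines`.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return unavailability * period_days * 24 * 60


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, three nines permits roughly 8.76 hours of downtime per year, and each additional nine cuts that allowance by a factor of ten.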

23
Q

Why is it dangerous to improve a system without revising its Service Level Agreement (SLA)?

A

Because customers will consider the delivered level of reliability to be the agreed level.