Site Reliability Engineering Flashcards

1
Q

What is Site Reliability Engineering?

A

SRE (Site Reliability Engineering) is a software engineering approach that combines development and operations principles to build and maintain reliable, scalable, and highly available systems. The first SRE team was founded at Google by Ben Treynor Sloss.

2
Q

What are common KPIs in SRE?

A
  • Service Level Objectives (SLOs)
  • Mean Time to Detect (MTTD)
  • Mean Time to Resolve (MTTR)
  • Availability
  • Error Rates
  • Service Level Agreements (SLAs)
  • Time to Mitigate
  • Change Success Rate
  • Resource Utilization
  • Customer Impact Metrics
3
Q

What are Service Level Objectives (SLO)?

A

SLOs define the desired level of service reliability and availability. They specify measurable targets for metrics such as uptime, response time, error rate, or throughput, which are used to evaluate system performance against user expectations.

Example:
“Achieving an average response time of less than 200 milliseconds for API requests.”
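To make the example concrete, here is a minimal sketch of checking that latency target against a batch of measurements. The sample values and the helper name `meets_latency_slo` are made up for illustration; only the 200 ms target comes from the example above.

```python
from statistics import mean

SLO_TARGET_MS = 200  # target taken from the example SLO above


def meets_latency_slo(latencies_ms, target_ms=SLO_TARGET_MS):
    """Return True if the average latency is below the target."""
    return mean(latencies_ms) < target_ms


# Hypothetical API response times in milliseconds
sample = [120, 180, 150, 310, 140]
print(mean(sample), meets_latency_slo(sample))
```

Note that the single slow request (310 ms) is hidden by the average, which is one reason many teams express latency SLOs as percentiles instead.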

4
Q

What is Mean Time to Detect (MTTD)?

A

MTTD measures the time taken to detect an incident or anomaly from the moment it occurs. A lower MTTD indicates efficient monitoring and alerting systems that enable timely incident response.

Example:
“Detecting incidents within an average of 5 minutes from the time they occur.”
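As a rough sketch, MTTD can be computed by averaging the gap between when each incident occurred and when it was detected. The incident records and the function name below are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (occurred_at, detected_at)
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    (datetime(2024, 1, 5, 14, 30), datetime(2024, 1, 5, 14, 37)),
    (datetime(2024, 1, 9, 8, 15), datetime(2024, 1, 9, 8, 19)),
]


def mean_time_to_detect(records):
    """Average delay between occurrence and detection."""
    delays = [detected - occurred for occurred, detected in records]
    return sum(delays, timedelta()) / len(delays)


mttd = mean_time_to_detect(incidents)
# Compare against the 5-minute target from the example above
print(mttd, mttd <= timedelta(minutes=5))
```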

5
Q

What is Mean Time to Resolve (MTTR)?

A

MTTR represents the average time required to resolve an incident or restore normal operations. Lower MTTR indicates efficient incident response and faster recovery times.

Example:
“Resolving incidents and restoring normal operations within an average of 30 minutes.”

6
Q

What is the difference between SLOs and SLAs?

A

SLOs are internal goals and SLAs are formal agreements with the customer. SLAs can lead to contractual penalties when violated.

7
Q

What is Availability?

A

Availability measures the proportion of time a service is accessible and functional for users. It is usually expressed as a percentage, such as “99.9% uptime.” Higher availability indicates better reliability and reduced service disruptions.
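A small sketch of the arithmetic behind a figure like 99.9%: availability is the share of a period in which the service was up, and a target implies a downtime allowance. The function names and the 30-day window are assumptions chosen for illustration.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Percentage of the period during which the service was up."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(target_percent, total_minutes):
    """Downtime that still satisfies the given availability target."""
    return total_minutes * (100.0 - target_percent) / 100.0


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

print(availability_percent(MINUTES_IN_30_DAYS, 43.2))
print(allowed_downtime_minutes(99.9, MINUTES_IN_30_DAYS))
```

Under these assumptions, 99.9% over 30 days leaves roughly 43 minutes of downtime.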

8
Q

What are Error Rates?

A

Error rates track how often errors or failures occur within a system or service. They can include metrics such as the rate of HTTP 500 errors, failed requests, or exceptions. Low and falling error rates indicate a stable, high-quality system.
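One simple way to compute such a rate is to count server errors as a fraction of all responses. The sketch below treats only HTTP 5xx codes as errors, which is an assumption; the status codes are made-up sample data.

```python
def error_rate(status_codes):
    """Fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical status codes from ten requests
codes = [200, 200, 500, 404, 200, 503, 200, 200, 200, 200]
print(f"{error_rate(codes):.1%}")
```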

9
Q

What are the risks of poor error monitoring and of leaving errors unfixed? What is a supporting theory for this?

A

Risks when not monitoring errors properly:
- Service Disruptions
- Decreased User Satisfaction
- Increased MTTD and MTTR
- Lack of Proactive Issue Resolution
- Missed Performance Optimization
- Limited Root Cause Analysis
- Decreased SLA Compliance
- Missed Opportunities for Continuous Improvement

10
Q

Which companies have applied SRE successfully?

A

Google
Netflix
Airbnb
LinkedIn

11
Q

What is the Time To Mitigate?

A

Time to mitigate measures how quickly the SRE team can mitigate the impact of an incident or take actions to limit its effects. It reflects the efficiency of incident response and the ability to contain and minimize disruptions.

Note: Unlike resolution (as measured by Mean Time to Resolve), mitigation only limits the effects of an incident; it does not fix the underlying issue.
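The difference between mitigating and resolving can be illustrated with a single incident timeline. The timestamps and the rollback scenario below are hypothetical.

```python
from datetime import datetime

# Hypothetical timeline for one incident
incident = {
    "started": datetime(2024, 3, 2, 9, 0),
    "mitigated": datetime(2024, 3, 2, 9, 20),  # e.g. a rollback stops the user impact
    "resolved": datetime(2024, 3, 2, 13, 0),   # underlying issue fixed
}

time_to_mitigate = incident["mitigated"] - incident["started"]
time_to_resolve = incident["resolved"] - incident["started"]
print(time_to_mitigate, time_to_resolve)
```

Here the impact is contained after 20 minutes, while the full fix takes four hours.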

12
Q

What is the Change Success Rate?

A

Change success rate evaluates the percentage of changes or deployments that are successfully implemented without causing incidents or service disruptions. A higher change success rate indicates effective change management processes and minimized risk.
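As a minimal sketch, the rate is the share of changes that did not cause an incident. The function name and the numbers are illustrative assumptions.

```python
def change_success_rate(total_changes, failed_changes):
    """Percentage of changes deployed without causing an incident."""
    if total_changes == 0:
        raise ValueError("no changes recorded")
    return 100.0 * (total_changes - failed_changes) / total_changes


# Hypothetical month: 40 deployments, 2 of which caused incidents
print(change_success_rate(total_changes=40, failed_changes=2))
```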

13
Q

What is Resource Utilization?

A

Resource utilization metrics monitor the efficiency and optimization of system resources such as CPU, memory, disk, and network usage. Balancing resource utilization ensures optimal performance and avoids bottlenecks or resource wastage.

14
Q

What are Customer Impact Metrics?

A

These metrics focus on the impact of incidents or service disruptions on customers, such as user satisfaction, customer support response time, or customer churn rate. Understanding and improving customer-centric metrics help prioritize user experience and customer satisfaction.

15
Q

What are common criticisms of SRE?

A
  1. Resource intensive
  2. Learning curve
  3. Organisational resistance
  4. Trade-offs and feature prioritisation
  5. Tooling and its maintenance
  6. Overemphasis on metrics
  7. Lack of standardization
16
Q

What are benefits of doing SRE?

A
  1. Improved reliability
  2. Faster incident response
  3. Proactive monitoring and alerting
  4. Continuous improvement
  5. Foster collaboration between ops and devs
  6. Cost optimisation (discover and resolve bottlenecks)
  7. Business resilience (prepare for the unexpected)
  8. Better self-confidence
17
Q

What are the core building blocks of SRE?

A
  1. SLOs
  2. Error budgets
  3. Monitoring and Alerting
  4. Incident Mgmt and Response (process)
  5. Continuous delivery (using canary, feature flags, etc.)
  6. Capacity planning
  7. Automation and tooling (alerting, monitoring and CD tools)
  8. Culture and Collaboration (blameless!)
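The first two building blocks are closely linked: an error budget is commonly defined as the amount of unreliability an SLO permits (1 minus the SLO target). The sketch below assumes a request-based SLO; the function name and the numbers are hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical: 99.9% SLO, one million requests, 250 failures
remaining = error_budget_remaining(
    slo_target=0.999, total_requests=1_000_000, failed_requests=250
)
print(f"{remaining:.0%}")
```

With these numbers the SLO allows about 1,000 failed requests, so 250 failures use up roughly a quarter of the budget.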
18
Q

What does a common incident response process look like in SRE?

A
  1. Incident Identification
  2. Incident Triage
  3. Communication and Collaboration
  4. (Incident Escalation)
  5. Incident Mitigation
  6. Post-Incident Analysis
  7. Incident Documentation
  8. Continuous Learning
19
Q

How does incident management in SRE differ from ITIL?

A
  1. Integration of Development and Operations: SRE encourages bringing Ops and Devs closer together, even merging the roles, whereas ITIL keeps them distinct.
  2. Incident Prioritization: ITIL focuses on contractual obligations, while SRE also considers customer experience.
  3. Culture: SRE emphasises blameless handling of root causes, while ITIL focuses mainly on the process.
  4. Philosophy: ITIL addresses business needs and is more reactive, while SRE is more proactive and engineering-focused (in addition to meeting SLAs).