Site Reliability Engineering Flashcards

Question 1

Q

What is Site Reliability Engineering?

Answer

A

SRE (Site Reliability Engineering) is a software engineering approach that combines development and operations principles to build and maintain reliable, scalable, and highly available systems. The first SRE team was founded at Google by Ben Treynor Sloss.

Question 2

Q

What are common KPIs in SRE?

Answer

A

Service Level Objectives (SLOs)
Mean Time to Detect (MTTD)
Mean Time to Resolve (MTTR)
Availability
Error Rates
Service Level Agreements (SLAs)
Time to Mitigate
Change Success Rate
Resource Utilization
Customer Impact Metrics

Question 3

Q

What are Service Level Objectives (SLO)?

Answer

A

SLOs define the desired level of service reliability and availability. They specify measurable targets for metrics such as uptime, response time, error rate, or throughput, which are used to evaluate system performance against user expectations.

Example:
“Achieving an average response time of less than 200 milliseconds for API requests.”
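
A minimal sketch of how the example target above could be checked. The sample values and the function name are hypothetical and not taken from any particular monitoring tool:

```python
from statistics import mean

# Hypothetical sample of API response times in milliseconds.
response_times_ms = [120, 180, 150, 240, 160]

SLO_TARGET_MS = 200  # target taken from the example SLO


def meets_latency_slo(samples_ms, target_ms):
    """Return True if the average response time is below the target."""
    return mean(samples_ms) < target_ms


average_ms = mean(response_times_ms)
slo_met = meets_latency_slo(response_times_ms, SLO_TARGET_MS)
print(f"average={average_ms:.0f} ms, SLO met: {slo_met}")
```

Note that an average can hide slow outliers (the 240 ms request above still passes), which is why latency SLOs are often stated on percentiles instead.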

Question 4

Q

What is Mean Time to Detect (MTTD)?

Answer

A

MTTD measures the time taken to detect an incident or anomaly from the moment it occurs. A lower MTTD indicates efficient monitoring and alerting systems that enable timely incident response.

Example:
“Detecting incidents within an average of 5 minutes from the time they occur.”
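
A small sketch of the calculation, assuming each incident record holds the time the problem occurred and the time it was detected. The records and the function name are made up for illustration:

```python
from datetime import datetime

# Hypothetical incident records: (occurred_at, detected_at).
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 36)),
    (datetime(2024, 1, 20, 8, 15), datetime(2024, 1, 20, 8, 20)),
]


def mean_time_to_detect_minutes(records):
    """Average minutes between an incident occurring and being detected."""
    total_seconds = sum(
        (detected - occurred).total_seconds() for occurred, detected in records
    )
    return total_seconds / len(records) / 60


mttd = mean_time_to_detect_minutes(incidents)
print(f"MTTD: {mttd:.1f} minutes")
```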

Question 5

Q

What is Mean Time to Resolve (MTTR)?

Answer

A

MTTR represents the average time required to resolve an incident or restore normal operations. Lower MTTR indicates efficient incident response and faster recovery times.

Example:
“Resolving incidents and restoring normal operations within an average of 30 minutes.”

Question 6

Q

What is the difference between SLOs and SLAs?

Answer

A

SLOs are internal goals and SLAs are formal agreements with the customer. SLAs can lead to contractual penalties when violated.

Question 7

Q

What is Availability?

Answer

A

Availability measures the proportion of time a service is accessible and functional for users. It is usually expressed as a percentage, such as “99.9% uptime.” Higher availability indicates better reliability and reduced service disruptions.
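
As a rough sketch, the percentage can be computed from uptime, and a target such as 99.9% can be translated into allowed downtime. The 30-day window, the uptime figure, and the function names are assumptions for illustration:

```python
MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def availability_percent(uptime_minutes, total_minutes):
    """Share of the window in which the service was up, as a percentage."""
    return 100 * uptime_minutes / total_minutes


def allowed_downtime_minutes(target_percent, total_minutes):
    """Downtime that still satisfies the availability target."""
    return total_minutes * (1 - target_percent / 100)


# Hypothetical month with 30 minutes of downtime.
availability = availability_percent(43_170, MINUTES_IN_30_DAYS)
budget = allowed_downtime_minutes(99.9, MINUTES_IN_30_DAYS)
print(f"availability: {availability:.2f}%")
print(f"99.9% allows about {budget:.1f} minutes of downtime in 30 days")
```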

Question 8

Q

What are Error Rates?

Answer

A

Error rates track the occurrence of errors or failures within a system or service. They can include metrics such as the rate of HTTP 500 errors, failed requests, or exceptions. Error rates indicate the system’s stability and quality, so monitoring and reducing them is a core SRE task.
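
One possible way to compute such a rate, assuming a hypothetical list of HTTP status codes collected from request logs and counting only 5xx responses as errors:

```python
# Hypothetical status codes from ten requests.
status_codes = [200, 200, 500, 200, 404, 503, 200, 200, 200, 200]


def server_error_rate(codes):
    """Fraction of requests that returned a 5xx status code."""
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


rate = server_error_rate(status_codes)
print(f"server error rate: {rate:.0%}")
```

Whether client errors such as 404 count toward the error rate is a design choice; this sketch leaves them out.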

Question 9

Q

What are the risks of not monitoring and fixing errors properly? What is a supporting theory for this?

Answer

A

Risks when not monitoring errors properly:
- Service Disruptions
- Decreased User Satisfaction
- Increased MTTD and MTTR
- Lack of Proactive Issue Resolution
- Missed Performance Optimization
- Limited Root Cause Analysis
- Decreased SLA Compliance
- Missed Opportunities for Continuous Improvement

Question 10

Q

Which companies have applied SRE successfully?

Answer

A

Google
Netflix
Airbnb
LinkedIn

Question 11

Q

What is the Time To Mitigate?

Answer

A

Time to mitigate measures how quickly the SRE team can mitigate the impact of an incident or take actions to limit its effects. It reflects the efficiency of incident response and the ability to contain and minimize disruptions.

Note: Unlike resolution (as measured by Mean Time to Resolve), mitigation only limits the effect of an incident; it does not resolve the underlying issue.
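
To make the difference between mitigating and resolving concrete, here is a sketch that computes both averages from the same incidents. The timestamps, field names, and function name are hypothetical:

```python
from datetime import datetime

# Hypothetical incidents: when impact started, when it was mitigated
# (for example by a rollback), and when the underlying issue was resolved.
incidents = [
    {
        "started": datetime(2024, 3, 1, 9, 0),
        "mitigated": datetime(2024, 3, 1, 9, 10),
        "resolved": datetime(2024, 3, 1, 9, 40),
    },
    {
        "started": datetime(2024, 3, 7, 16, 0),
        "mitigated": datetime(2024, 3, 7, 16, 20),
        "resolved": datetime(2024, 3, 7, 16, 50),
    },
]


def mean_minutes(records, end_key):
    """Average minutes from incident start to the given milestone."""
    total = sum((r[end_key] - r["started"]).total_seconds() for r in records)
    return total / len(records) / 60


mean_time_to_mitigate = mean_minutes(incidents, "mitigated")
mean_time_to_resolve = mean_minutes(incidents, "resolved")
print(f"time to mitigate: {mean_time_to_mitigate:.0f} min")
print(f"time to resolve:  {mean_time_to_resolve:.0f} min")
```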

Question 12

Q

What is the Change Success Rate?

Answer

A

Change success rate evaluates the percentage of changes or deployments that are successfully implemented without causing incidents or service disruptions. A higher change success rate indicates effective change management processes and minimized risk.
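
A minimal sketch of the calculation, assuming a hypothetical deployment log in which each entry records whether the change caused an incident:

```python
# Hypothetical deployment log: True means the change caused an incident.
caused_incident = [False, False, True, False, False,
                   False, False, True, False, False]


def change_success_rate(flags):
    """Percentage of changes that did not cause an incident."""
    if not flags:
        return None  # no changes, so the rate is undefined
    successful = sum(1 for flag in flags if not flag)
    return 100 * successful / len(flags)


success_rate = change_success_rate(caused_incident)
print(f"change success rate: {success_rate:.0f}%")
```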

Question 13

Q

What is Resource Utilization?

Answer

A

Resource utilization metrics monitor the efficiency and optimization of system resources such as CPU, memory, disk, and network usage. Balancing resource utilization ensures optimal performance and avoids bottlenecks or resource wastage.

Question 14

Q

What are Customer Impact Metrics?

Answer

A

These metrics focus on the impact of incidents or service disruptions on customers, such as user satisfaction, customer support response time, or customer churn rate. Understanding and improving customer-centric metrics help prioritize user experience and customer satisfaction.

Question 15

Q

What are common criticisms of SRE?

Answer

A

Resource intensive
Learning curve
Organisational resistance
Trade-offs and feature prioritisation
Tooling and its maintenance
Overemphasis on metrics
Lack of standardization

Question 16

Q

What are benefits of doing SRE?

Answer

A

Improved reliability
Faster incident response
Proactive monitoring and alerting
Continuous improvement
Closer collaboration between ops and devs
Cost optimisation (discover and resolve bottlenecks)
Business resilience (prepare for the unexpected)
Better self-confidence

Question 17

Q

What are the core building blocks of SRE?

Answer

A

SLOs
Error budgets
Monitoring and Alerting
Incident Management and Response (process)
Continuous delivery (using canary, feature flags, etc.)
Capacity planning
Automation and tooling (alerting, monitoring and CD tools)
Culture and Collaboration (blameless!)
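
Of these building blocks, the error budget lends itself to a short worked sketch: it is the amount of unreliability an SLO still permits. The request counts and the function name below are hypothetical, and the SLO target is assumed to be below 1.0:

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    Assumes slo_target < 1.0, otherwise no failures are allowed at all.
    """
    allowed_failures = total_requests * (1 - slo_target)
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 99.9% SLO, one million requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

A team might use a figure like this to decide whether to keep shipping features or to slow down and focus on reliability work.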

Question 18

Q

What does a common incident response process look like in SRE?

Answer

A

Incident Identification
Incident Triage
Communication and Collaboration
(Incident Escalation)
Incident Mitigation
Post-Incident Analysis
Incident Documentation
Continuous Learning

Question 19

Q

How does incident management in SRE differ from ITIL?

Answer

A

Integration of Development and Operations: SRE encourages bringing Ops and Devs closer together, or even merging the roles, while ITIL keeps them distinct.
Incident Prioritization: ITIL focuses on contractual obligations, while SRE also focuses on customer experience.
Culture: SRE emphasizes blameless handling of root causes, while ITIL attends only to the process.
Philosophy: ITIL addresses business needs and is more reactive, while SRE is more proactive and focused on engineering (in addition to SLAs).

(19 cards)