SRE concepts Flashcards

Question 1

Q

Can you explain the concept of “Error Budget”?

Answer

A

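An error budget is the amount of unreliability or downtime a service can accumulate while still meeting its service level objectives (SLOs).
It helps teams balance innovation and reliability: while budget remains, changes can ship; once it is spent, reliability work takes priority.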
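
As a rough illustration, an availability SLO can be converted into the downtime it permits over a window. This is only a sketch: the function name `allowed_downtime`, the 99.9% target, and the 30-day window are assumptions chosen for the example, not values from the card.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    # The error budget is the gap between 100% and the SLO target.
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window * budget_fraction


# Hypothetical example: a 99.9% SLO measured over 30 days.
budget = allowed_downtime(99.9, timedelta(days=30))
print(budget)
```

With these example values the budget works out to roughly 43 minutes of downtime per 30 days.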

Question 2

Q

How would you handle a critical incident?

Answer

A

Immediate Response: Identify the issue’s scope and impact. Assemble an incident response team. Focus on restoring service quickly.
Mitigation: Isolate the problem, roll back changes if needed, and implement temporary fixes to restore service stability.
Investigation: Analyse logs, metrics, and system behaviour to determine the root cause. Share information among teams.
Resolution and Post-Incident Analysis: Implement a permanent fix. Conduct a post-incident analysis to identify contributing factors and preventive measures.

Question 3

Q

How would you approach designing and implementing a monitoring and alerting system for a complex distributed application?

Answer

A

Design a monitoring system with relevant metrics such as response time, error rates, resource utilisation, and custom application-specific metrics.
Use tools like Prometheus, Grafana, or ELK stack.
Set up meaningful alerts with proper thresholds and aggregation.
Establish different alerting channels for different severity levels.
Continuously refine alerts based on false positives and false negatives.
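
The idea of a threshold combined with aggregation can be sketched as an error-rate check over a sliding window of recent requests, so that a single failure does not page anyone. The class name `ErrorRateAlert` and its defaults (window size, 5% threshold, minimum sample count) are hypothetical; real systems such as Prometheus express this with their own rule syntax.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the most recent requests exceeds a threshold."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.results = deque(maxlen=window_size)  # True means the request failed
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, is_error: bool) -> bool:
        """Record one request outcome and return True if the alert should fire."""
        self.results.append(is_error)
        # Too few samples would make the rate noisy, so stay quiet until then.
        if len(self.results) < self.min_samples:
            return False
        error_rate = sum(self.results) / len(self.results)
        return error_rate > self.threshold


# Hypothetical usage: small window so the behaviour is easy to follow.
alert = ErrorRateAlert(window_size=10, threshold=0.2, min_samples=5)
outcomes = [False, False, False, False, True, True]
fired = [alert.record(outcome) for outcome in outcomes]
print(fired)
```

Raising `min_samples` or the window size trades detection speed for fewer false positives, which is the kind of tuning the last point on this card refers to.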

Question 4

Q

Explain the differences between horizontal and vertical scaling.

Answer

A

Horizontal scaling involves adding more instances of the same component, distributing the load.
Vertical scaling involves upgrading the resources of an existing instance.
Horizontal scaling is suitable for distributing high traffic and ensuring availability.
Vertical scaling is useful for improving performance of individual components.

Question 5

Q

What challenges might arise with horizontal and vertical scaling?

Answer

A

Horizontal scaling: synchronisation issues and data consistency across instances.
Vertical scaling: resource limitations on a single machine.

Question 6

Q

How would you describe a robust CI/CD pipeline?

Answer

A

A robust CI/CD pipeline includes stages for building, testing, deploying, and monitoring.
Automated tests cover unit, integration, and end-to-end scenarios.
Such a pipeline helps catch bugs early and ensures consistent deployments.

Question 7

Q

Discuss the importance of automated testing and continuous integration/continuous deployment (CI/CD).

Answer

A

Automated testing ensures code quality and reliability.
CI/CD automates build and deployment pipelines for faster and safer releases.

Question 8

Q

How would you troubleshoot the issue and identify the root cause of performance degradation in production?

Answer

A

Analyse performance metrics, logs, and resource utilisation.
Correlate the degradation with recent deployments to identify the specific code changes causing the issue.
If necessary, roll back the update by redeploying the last known-good version from version control.
To prevent future occurrences, improve testing practices and consider canary deployments or feature flags.
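
A percentage-based feature flag, which also underlies many canary rollouts, can be sketched as follows. The function `in_rollout`, the flag name, and the user IDs are hypothetical; the point is only that hashing gives each user a stable bucket, so the exposed group does not change between requests.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage for the flag."""
    # Hash flag and user together so different flags expose different users.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Hypothetical usage: expose a new code path to about 10% of users.
exposed = sum(in_rollout(f"user-{i}", "new-checkout", 10) for i in range(10_000))
print(exposed)
```

Setting the percentage back to 0 acts as a fast rollback that needs no redeploy.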

Question 9

Q

How do you ensure that incidents are thoroughly investigated and learnings are applied to prevent future occurrences?

Answer

A

Incident response involves immediate action to restore service.
Post-incident analysis includes a thorough review of what happened, why it happened, and how to prevent it.
Ensure all contributing factors are addressed.
Learnings are shared across teams, leading to process improvements and preventing similar incidents.

Question 10

Q

What disaster recovery strategies and techniques would you employ to ensure high availability and data integrity for a critical application?

Answer

A

Redundancy: Deploy across multiple availability zones/regions.
Backup and Restore: Regularly back up data and test restoration procedures.
Failover: Automatically switch to standby systems in case of failure.
Chaos Engineering: Intentionally introduce failures to test the system’s resilience.
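
The failover step can be sketched as picking the first healthy endpoint from an ordered list of primary and standby systems. The function `pick_active`, the endpoint names, and the injected `is_healthy` check are assumptions for illustration; a production setup would normally rely on a load balancer or DNS health checks instead.

```python
from typing import Callable, Sequence


def pick_active(endpoints: Sequence[str],
                is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint, in priority order."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


# Hypothetical usage: the primary is down, so traffic moves to the standby.
status = {"primary.example.internal": False, "standby.example.internal": True}
active = pick_active(list(status), lambda endpoint: status[endpoint])
print(active)
```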

Question 11

Q

Explain the concept of “circuit breakers” and their role in preventing cascading failures in a microservices architecture.

Answer

A

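Circuit breakers detect service degradation by tracking errors or latency on calls to a dependency.
When a threshold is crossed, the breaker opens and blocks requests to the failing service.
This isolates the failing component so the failure does not cascade to its callers.
It also gives the failing service time to recover before traffic is sent to it again.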

Question 12

Q

How would you implement and manage circuit breakers effectively?

Answer

A

Set thresholds for errors or latency that decide when the breaker opens.
Manage thresholds dynamically based on real-time performance.
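
A minimal circuit breaker might look like the sketch below. It uses a fixed count of consecutive failures and a fixed recovery timeout, which are assumptions made for brevity; dynamic thresholds and latency-based tripping are left out. The names `CircuitBreaker` and `CircuitOpenError` are hypothetical and not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures to trip
        self.recovery_timeout = recovery_timeout    # seconds before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial request through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit open; request blocked")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast.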

Question 13

Q

How do you navigate collaborating with development teams to achieve both reliability and innovation?

Answer

A

Emphasise the common goal of reliability and innovation.
Engage in open communication, share data and insights, and involve both teams in decision-making.
Use incident learnings to advocate for reliability improvements without hindering innovation.
Align on shared metrics and incentives to foster a culture of collaboration.

Question 14

Q

How is an error budget calculated and how would you prioritise between new features and reliability?

Answer

A

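The error budget is the difference between 100% and the SLO target; for example, a 99.9% SLO leaves a 0.1% budget.
Prioritise between new features and reliability according to the remaining budget: ship features while budget remains, and shift to reliability work when it is nearly or fully spent.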
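
The prioritisation rule can be sketched with request counts. The function names, the request numbers, and the 25% caution threshold are all hypothetical; actual policies are set per team.

```python
def remaining_error_budget(slo_percent: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        return 0.0  # a 100% SLO, or no traffic, leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def release_decision(remaining: float, caution_below: float = 0.25) -> str:
    """Map the remaining budget to a (hypothetical) release policy."""
    if remaining <= 0:
        return "freeze features"
    if remaining < caution_below:
        return "slow releases"
    return "ship features"


# Hypothetical example: 99.9% SLO, one million requests, 400 failures.
remaining = remaining_error_budget(99.9, 1_000_000, 400)
print(round(remaining, 2), release_decision(remaining))
```

Here about 1,000 failures are allowed and 400 have occurred, so around 60% of the budget remains and feature work can continue.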

Error budgets, SLXs, monitoring, observability, incident management (14 cards)