SRE concepts Flashcards
Error budgets, SLIs/SLOs/SLAs, monitoring, observability, incident management
Can you explain the concept of “Error Budget”?
- Error budget represents the acceptable level of unreliability or downtime for a service while still meeting the service level objectives (SLOs).
- It helps teams balance innovation and reliability.
How would you handle a critical incident?
- Immediate Response: Identify the issue’s scope and impact. Assemble an incident response team. Focus on restoring service quickly.
- Mitigation: Isolate the problem, roll back changes if needed, and implement temporary fixes to restore service stability.
- Investigation: Analyse logs, metrics, and system behaviour to determine the root cause. Share information among teams.
- Resolution and Post-Incident Analysis: Implement a permanent fix. Conduct a post-incident analysis to identify contributing factors and preventive measures.
How would you approach designing and implementing a monitoring and alerting system for a complex distributed application?
- Design a monitoring system with relevant metrics such as response time, error rates, resource utilisation, and custom application-specific metrics.
- Use tools like Prometheus, Grafana, or ELK stack.
- Set up meaningful alerts with proper thresholds and aggregation.
- Establish different alerting channels for different severity levels.
- Continuously refine alerts based on false positives and false negatives.
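The points about thresholds, aggregation, and severity channels can be sketched in a few lines. This is a minimal illustration, not how Prometheus or any other tool implements alerting: the class `ErrorRateAlert`, the function `route`, the default values, and the channel names are all hypothetical.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_samples: int = 20):
        # Each sample is True if the request failed, False otherwise.
        self.samples = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed: bool) -> None:
        self.samples.append(failed)

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def should_alert(self) -> bool:
        # Aggregating over a window, with a minimum sample count, avoids
        # alerting on a single early failure (a common false positive).
        return len(self.samples) >= self.min_samples and self.error_rate() > self.threshold


def route(severity: str) -> str:
    """Map a severity level to an alerting channel (hypothetical channel names)."""
    channels = {"critical": "pager", "warning": "chat", "info": "email"}
    return channels.get(severity, "email")
```

Tuning `window`, `threshold`, and `min_samples` is where the refinement against false positives and negatives would happen in this sketch.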
Explain the differences between horizontal and vertical scaling.
- Horizontal scaling involves adding more instances of the same component, distributing the load.
- Vertical scaling involves upgrading the resources of an existing instance.
- Horizontal scaling is suitable for distributing high traffic and ensuring availability.
- Vertical scaling is useful for improving performance of individual components.
What challenges might arise with horizontal and vertical scaling?
- Horizontal scaling: synchronisation issues and data consistency across instances.
- Vertical scaling: resource limitations on a single machine.
How would you describe a robust CI/CD pipeline?
- A robust CI/CD pipeline includes stages for building, testing, deploying, and monitoring.
- Automated tests cover unit, integration, and end-to-end scenarios.
- The pipeline helps catch bugs early and ensures consistent deployments.
Discuss the importance of automated testing and continuous integration/continuous deployment (CI/CD).
- Automated testing ensures code quality and reliability.
- CI/CD automates deployment pipelines for faster and safer releases.
How would you troubleshoot the issue and identify the root cause of performance degradation in production?
- Analyse performance metrics, logs, and resource utilisation.
- Identify the specific code changes causing the issue.
- If necessary, roll back the update, for example by reverting the change in version control and redeploying.
- To prevent future occurrences, improve testing practices and consider canary deployments or feature flags.
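One way to picture a canary rollout or percentage-based feature flag is deterministic bucketing: each user is hashed into one of 100 buckets, and only buckets below the rollout percentage receive the new code path. The function name `in_canary` and its parameters are hypothetical; real feature-flag systems add targeting rules, kill switches, and auditing on top of this idea.

```python
import hashlib


def in_canary(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the rollout for `feature`.

    Hashing feature and user together keeps the decision stable across
    requests, and gives each feature an independent set of canary users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < rollout_percent
```

Because the bucket for a given user never changes, raising `rollout_percent` only adds users to the canary group, and setting it to 0 acts as an immediate rollback.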
How do you ensure that incidents are thoroughly investigated and learnings are applied to prevent future occurrences?
- Incident response involves immediate action to restore service.
- Post-incident analysis includes a thorough review of what happened, why it happened, and how to prevent it.
- Ensure all contributing factors are addressed.
- Learnings are shared across teams, leading to process improvements and preventing similar incidents.
What disaster recovery strategies and techniques would you employ to ensure high availability and data integrity for a critical application?
- Redundancy: Deploy across multiple availability zones/regions.
- Backup and Restore: Regularly back up data and test restoration procedures.
- Failover: Automatically switch to standby systems in case of failure.
- Chaos Engineering: Intentionally introduce failures to test the system’s resilience.
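The failover idea can be illustrated with a small selection routine that prefers the primary and falls back to standbys in order. This is a sketch under simplifying assumptions: `pick_endpoint` and the `is_healthy` callback are hypothetical, and production failover usually lives in load balancers, DNS, or database tooling rather than application code.

```python
from typing import Callable, Sequence


def pick_endpoint(endpoints: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint, in priority order (primary first)."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    # Every endpoint failed its health check: surface this loudly
    # instead of silently routing traffic to a broken system.
    raise RuntimeError("no healthy endpoint available")
```

Passing a health check that deliberately reports the primary as down is a small-scale version of the chaos engineering bullet: it exercises the fallback path before a real failure does.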
Explain the concept of “circuit breakers” and their role in preventing cascading failures in a microservices architecture.
- Detect service degradation.
- Block requests to a failing service.
- Isolate failing components.
- Give them time to recover.
How would you implement and manage circuit breakers effectively?
- Set thresholds for errors or latency that trip the breaker.
- Manage thresholds dynamically based on real-time performance.
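A minimal circuit breaker can be sketched as a small state machine with closed, open, and half-open states. The class below is illustrative only: `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, the thresholds are fixed rather than adjusted dynamically, only errors (not latency) are counted, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Block the request so the failing service gets time to recover.
            raise CircuitOpenError("circuit is open; request blocked")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success resets the breaker to the closed state.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping an outbound call as `breaker.call(fetch_profile, user_id)` (with `fetch_profile` standing in for any remote call) means callers fail fast while the dependency is down, which is what stops a single failing service from tying up resources upstream.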
How do you navigate collaborating with development teams to achieve both reliability and innovation?
- Emphasise the common goal of reliability and innovation.
- Engage in open communication, share data and insights, and involve both teams in decision-making.
- Use incident learnings to advocate for reliability improvements without hindering innovation.
- Align on shared metrics and incentives to foster a culture of collaboration.
How is an error budget calculated and how would you prioritise between new features and reliability?
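- The error budget is calculated as the difference between 100% and the desired SLO percentage; for example, a 99.9% SLO leaves a 0.1% error budget.
- Prioritise between new features and reliability by considering the remaining error budget: with budget to spare, ship features; when it is nearly exhausted, prioritise reliability work.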
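The arithmetic can be made concrete with a short sketch. The function names and the 30-day window are assumptions chosen for illustration, and the rule in `next_priority` is one possible policy rather than a standard one.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Error budget as a fraction: 100% minus the SLO."""
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window, for a time-based SLO."""
    return error_budget_fraction(slo_percent) * window_days * 24 * 60


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent, for a request-based SLO."""
    allowed_failures = error_budget_fraction(slo_percent) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return 1.0 - failed_requests / allowed_failures


def next_priority(remaining: float) -> str:
    """Hypothetical policy: ship features while budget remains, else stabilise."""
    return "features" if remaining > 0 else "reliability"
```

For a 99.9% SLO over 30 days, `allowed_downtime_minutes(99.9)` comes to roughly 43.2 minutes. A negative value from `budget_remaining` means the budget has been overspent.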