SRE Interview Questions Flashcards
What is SLO
SLO stands for service level objective. Essentially, it’s an agreement between a service provider and the user to measure performance. An SLO may measure factors like availability or response time. For example, an SLO might state that teams resolve any problems a customer reports with the software within 24 hours. An SLO is important because it helps developers and IT teams meet customer expectations for performance, which leads to improved customer satisfaction. It also can help teams understand the goals for the software and take steps to meet them.
What is error budget?
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
Error Budgets first depend on establishing an SLO (Service Level Objective). SLOs are made up of an objective, a SLI (Service Level Indicator), and a timeframe.
Objective: The desired level of succcess, noted as a percentage
SLI: an evaluation used to distinguish number of failed events
Timeframe: enforcing a recency bias to the SLI
Here is an example of these elements:
Objective: 99.95%
SLI: 95th percentile latency of api requests over 5 mins is < 100ms
Timeframe: previous 28 days
Taken all together, the above example SLO would be: 99.95% of the 95th percentile latency of api requests over 5 mins is < 100ms over the previous 28 days
The Error Budget is then 1 - Objective of the SLO, in this case (1 - .9995 = .0005). Using our 28 day timeframe, the “budget” for errors is 20.16 minutes (.0005 * (28 * 24 * 60))
While the above example shows the SLI as a latency measurement, it is important to note that other measurements (such as % errors) are also good elements to use for SLIs.
What is the difference between site reliability engineers and development operations?
In a nutshell, DevOps Engineers are ops-focused engineers who solve development pipeline problems. Site Reliability Engineers are development-focused engineers who solve operational/scale/reliability problems.
DevOps focuses more on the merging and removal of the silos, Site Reliability Engineering focuses on the craft of reliability.`
Question 1: How do you decide if the team should work on new features or paying down technical debt?
For SRE candidates, this topic is a chance to show how you approach seemingly insurmountable conflicts. Everyone thinks their goal or issue is the most important; how do you actually set priorities that people can (mostly) agree on and work on? When is technical debt acceptable (or inevitable)? How do you pay it down?
Technical dept strategies
- Low handing fruit
- Measure dept in terms of payment, like human hours, number of error
- Plan side by side solving technical dept in every sprint
How do you go about setting SLOs and SLIs and how do you make adjustments when necessary?
Answer with What is SLO? What is SLI and related to SLO.
Hiring managers should dig into how the candidate identifies and defines SLOs and SLIs; if you’re the candidate, you should be prepared to speak about how you approach these metrics. Moreover, make sure you can discuss a thoughtful process for reevaluating and optimizing those measurements over time.
TODO: Read more about identifying SLO and SLI.
Which of the three pillars of observability is most important to you? Which one do you feel you need to get more exposure in?
The three pillars here are logging, metrics, and tracing. Observability as a whole is intrinsic to the SRE field.
Which pillar would help you determine those [signals] the best?
These will eventually lead to your SLO/SLI measurements. Showing interest in one or more of the pillars shows you are ready to grow into your role.
Measuring Golden signals metrics, e.g. error could come from logs.
What are four golden signal
If you can only measure four metrics of your user-facing system, focus on these four.
The Four Golden Signals
- Latency
- Traffic
- Error rate
- Saturation (How much service is full, can it handle twice the traffic, can it handle 10% per cent more)
How have you implemented process improvements and other changes in the past?
As a hiring manager, I would ask for examples where the individual has shown such qualities, how they go about it, and what has been achieved
Process improvements
- linting was not part of the standard process, added the standard CI pipeline using super-linter
- Improve the change management for the Github access, create tickets, manager approval and pipeline to add to Github
How do you balance the wishes and needs of different authorities on the team?
As a hiring manager, I would probe how willing the individual is to work between the two authorities of development and operations.
Thorough technical knowledge, a sincere willingness to improve, a dedicated focus on the task instead of the ego – [all are] essential for SREs
- Create a weekly huddle with development and ops
- Track the work items for improvement
How does customer experience and/or employee experience inform your SRE strategy?
You can talk metrics all day but can’t connect those metrics to outcomes in terms of customers or internal users.
- Expertise in CX and EX management could be a great idea
SRE could contribute value throughout the entire product cycle to achieve a better customer experience outcome:
-
Collaborate with product owners - get involved in early design phases and contribute value before features are implemented
Evaluate existing systems to detect flawed design patterns and customer experience gaps in the product - Evaluate architecture e.g., if too many layers are in the system to work efficiently or reduce complexity for maintainability and scale
- Provide industry standard recommendations for increased reliability before product launch, such as caching, deploying load balancers, and other proven methodologies to implement these
-
Engage engineering and help develop software post incidents to improve reliability thinking across the organization - note, however, that SRE is and should be separate from feature development to enable good tension and good conflict; conflict is a good thing as it helps propel better products and policies forward in the pursuit of outage avoidance
This is in addition to managing incidents and the on-call process.
How do you learn and keep up with industry trends and toolchains?
- Read weekly articles for SRE focus
- Attended in personal and virtual events.
- There are a bunch of SRE-focused meetups. The Awesome Site Reliability Engineering repository on GitHub provides a curated list of resources.