Risk Flashcards

1
Q

The cost of redundant machine/compute resources

A

The cost associated with redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or provides space for us to store parity code blocks that provide a minimum data durability guarantee.

2
Q

The opportunity cost

A

The cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.

3
Q

Time-based availability

A

uptime / (uptime + downtime)
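As a minimal sketch of the formula above (the function name and the 30-day window are illustrative assumptions, not part of the card):

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Return uptime / (uptime + downtime) as a fraction between 0 and 1."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("the measurement window must be longer than zero seconds")
    return uptime_seconds / total


# Hypothetical 30-day window with 2,592 seconds (43.2 minutes) of downtime.
window_seconds = 30 * 24 * 3600
downtime_seconds = 2592
availability = time_based_availability(window_seconds - downtime_seconds, downtime_seconds)
print(f"{availability:.3%}")
```

With these assumed numbers the downtime is 0.1% of the window, so the service sits at three nines.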

4
Q

How is service availability expressed?

A

As the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine cuts the allowed unavailability by a factor of ten, an order of magnitude improvement toward 100% availability.
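To make the "order of magnitude" concrete, the sketch below converts a number of nines into allowed downtime per period. The helper name and the 365-day year are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime in minutes for an availability of 1 - 10**-nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return unavailability * period_days * 24 * 60


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Each step from three to four to five nines shrinks the downtime allowance tenfold.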

5
Q

Aggregate availability

A

successful requests / total requests
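A small sketch of the request-based ratio, using made-up request counts (the function name and the numbers are illustrative assumptions):

```python
def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Return successful requests / total requests as a fraction."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests


# Hypothetical day of traffic: 2,500,000 requests, 250 of which failed.
total = 2_500_000
failed = 250
availability = aggregate_availability(total - failed, total)
print(f"{availability:.4%}")
```

Unlike the time-based formula, this ratio still works for services that are always partially up, because it counts outcomes rather than minutes.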

6
Q

How is risk tolerance defined?

A

At the level of each individual service, by working with the product owners.

7
Q

What are the factors considered when defining risk tolerance?

A

What level of availability is required?

Do different types of failures have different effects on the service?

How can we use the service cost to help locate a service on the risk continuum?

What other service metrics are important to take into account?

8
Q

What are the factors considered when defining service availability goals?

A

What level of service will the users expect?

Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?

Is this a paid service, or is it free?

If there are competitors in the marketplace, what level of service do those competitors provide?

Is this service targeted at consumers, or at enterprises?

9
Q

What questions should we ask when factoring cost into downtime risk?

A

If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?

Does this additional revenue offset the cost of reaching that level of reliability?
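The two questions can be worked through as simple arithmetic. The sketch below assumes, as a simplification, that revenue scales linearly with availability; the revenue figure, the cost figure, and the function name are all hypothetical.

```python
def incremental_revenue(annual_revenue: float, current: float, proposed: float) -> float:
    """Extra revenue from raising availability, assuming revenue is proportional to it."""
    return annual_revenue * (proposed - current)


# Hypothetical service: $1,000,000 per year, moving from 99.9% to 99.99%.
gain = incremental_revenue(1_000_000, current=0.999, proposed=0.9999)
cost_of_extra_nine = 5_000  # assumed engineering and hardware cost, for illustration only
worth_it = gain > cost_of_extra_nine
print(f"gain: ${gain:.2f}, worth it: {worth_it}")
```

Under these assumed numbers the extra nine is worth about $900, so it pays off only if reaching it costs less than that.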

10
Q

What is the importance of software fault tolerance?

A

How hardened do we make the software to unexpected events? Too little, and we have a brittle, unusable product. Too much, and we have a product no one wants to use (but that runs very stably).

11
Q

What is the importance of testing?

A

Not enough testing, and you have embarrassing outages, privacy data leaks, or a number of other press-worthy events. Too much testing, and you might lose your market or create brittle test flows.

12
Q

What is the importance of push frequency?

A

Every push is risky. How much should we work on reducing that risk, versus doing other work?

13
Q

What is the importance of canary duration and size?

A

It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?

14
Q

What is the error budget?

A

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

15
Q

How is the error budget calculated?

A

Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.

The actual uptime is measured by a neutral third party: our monitoring system.

The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.

As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
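The steps above can be sketched as follows. The function names and the quarterly figures are hypothetical; a real setup would read the measured uptime from the monitoring system rather than hard-code it.

```python
def error_budget_remaining(slo: float, measured_uptime: float) -> float:
    """Remaining budget: the difference between measured uptime and the SLO."""
    for value in (slo, measured_uptime):
        if not 0.0 <= value <= 1.0:
            raise ValueError("slo and measured_uptime must be fractions between 0 and 1")
    return measured_uptime - slo


def can_push_release(slo: float, measured_uptime: float) -> bool:
    """Releases are allowed only while measured uptime is above the SLO."""
    return error_budget_remaining(slo, measured_uptime) > 0


# Hypothetical quarter: SLO of 99.9%, measured uptime of 99.95%.
remaining = error_budget_remaining(slo=0.999, measured_uptime=0.9995)
print(f"remaining budget: {remaining:.4%}, can push: {can_push_release(0.999, 0.9995)}")
```

In this example half of the 0.1% budget is still unspent, so releases continue. If measured uptime dropped to 99.8%, the check would return False.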
