Risk Flashcards

Question 1

Q

The cost of redundant machine/compute resources

Answer

A

The cost associated with redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or provides space for us to store parity code blocks that provide a minimum data durability guarantee.

Question 2

Q

The opportunity cost

Answer

A

The cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.

Question 3

Q

Time-based availability

Answer

A

uptime / (uptime + downtime)
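
A minimal sketch of the formula above, assuming uptime and downtime are measured in the same unit (seconds here). The function name is hypothetical and chosen only for illustration.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a fraction: uptime / (uptime + downtime)."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_seconds / total


# Example: 99 hours up and 1 hour down over a 100-hour window.
print(time_based_availability(99 * 3600, 1 * 3600))
```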

Question 4

Q

How is service availability expressed?

Answer

A

As the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine corresponds to an order of magnitude improvement toward 100% availability.
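
To make the order-of-magnitude step concrete, the sketch below converts an availability target into the downtime it permits. It assumes a 365-day period and time-based availability; the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 365) -> float:
    """Minutes of downtime permitted by an availability target over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% allows {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the permitted downtime by a factor of ten: roughly 8.76 hours per year at 99.9%, about 52.6 minutes at 99.99%, and about 5.26 minutes at 99.999%.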

Question 5

Q

Aggregate availability

Answer

A

successful requests / total requests
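
A minimal sketch of the request-based formula above. The request counts are made-up illustration values, and the function name is hypothetical.

```python
def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Return availability as a fraction: successful requests / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests


# Example: 2,500,000 requests in a day, 250 of which failed.
print(aggregate_availability(2_500_000 - 250, 2_500_000))
```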

Question 6

Q

How is risk tolerance defined?

Answer

A

At the service level, by working with product owners.

Question 7

Q

What are the factors considered when defining risk tolerance?

Answer

A

What level of availability is required?

Do different types of failures have different effects on the service?

How can we use the service cost to help locate a service on the risk continuum?

What other service metrics are important to take into account?

Question 8

Q

What are the factors considered when defining service availability goals?

Answer

A

What level of service will the users expect?

Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?

Is this a paid service, or is it free?

If there are competitors in the marketplace, what level of service do those competitors provide?

Is this service targeted at consumers, or at enterprises?

Question 9

Q

What questions should we ask when factoring cost into downtime risk?

Answer

A

If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?

Does this additional revenue offset the cost of reaching that level of reliability?
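
The two questions above can be sketched as a simple comparison. This assumes, as a simplification, that revenue scales linearly with availability; the revenue and cost figures are hypothetical illustration values, as are the function names.

```python
def value_of_improved_availability(current: float, proposed: float,
                                   service_revenue: float) -> float:
    """Incremental revenue from raising availability (targets given as fractions)."""
    return service_revenue * (proposed - current)


def worth_the_investment(current: float, proposed: float,
                         service_revenue: float, cost: float) -> bool:
    """True if the incremental revenue exceeds the cost of the extra reliability."""
    return value_of_improved_availability(current, proposed, service_revenue) > cost


# Hypothetical service earning 1,000,000 per year, moving from 99.9% to 99.99%.
gain = value_of_improved_availability(0.999, 0.9999, 1_000_000)
print(f"incremental revenue: {gain:.2f}")
print(worth_the_investment(0.999, 0.9999, 1_000_000, cost=5_000))
```

Under these assumptions the extra nine is worth about 900 per year, so it only pays off if reaching it costs less than that.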

Question 10

Q

What is the importance of software fault tolerance?

Answer

A

How hardened do we make the software to unexpected events? Too little, and we have a brittle, unusable product. Too much, and we have a product no one wants to use (but that runs very stably).

Question 11

Q

What is the importance of testing?

Answer

A

Not enough testing, and you have embarrassing outages, private data leaks, or a number of other press-worthy events. Too much testing, and you might lose your market or create brittle test flows.

Question 12

Q

What is the importance of push frequency?

Answer

A

Every push is risky. How much should we work on reducing that risk, versus doing other work?

Question 13

Q

What is the importance of canary duration and size?

Answer

A

It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?

Question 14

Q

What is the error budget?

Answer

A

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

Question 15

Q

How is the error budget calculated?

Answer

A

Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.

The actual uptime is measured by a neutral third party: our monitoring system.

The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.

As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
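
The steps above can be sketched as follows, assuming both the SLO and the measured uptime are expressed as fractions for the same quarter. The function names and the example numbers are hypothetical.

```python
def error_budget_remaining(slo: float, measured_uptime: float) -> float:
    """Remaining budget: measured uptime minus the SLO (negative means overspent)."""
    for value in (slo, measured_uptime):
        if not 0 <= value <= 1:
            raise ValueError("slo and measured_uptime must be fractions in [0, 1]")
    return measured_uptime - slo


def can_push_release(slo: float, measured_uptime: float) -> bool:
    """New releases may be pushed while the measured uptime is above the SLO."""
    return error_budget_remaining(slo, measured_uptime) > 0


# Example: a 99.9% SLO with 99.95% uptime measured so far this quarter.
print(error_budget_remaining(0.999, 0.9995))
print(can_push_release(0.999, 0.9995))
```

A negative result means the budget for the quarter is already spent, so pushes would pause until uptime recovers above the SLO.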