Module 10a - Fault Tolerance Flashcards

Question 1

Q

Describe the concept of Availability for fault tolerance in distributed computing

Answer

A

The system should operate correctly at any given instant in time.
Ex: a real system may be 99% available
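
As a rough worked example of what "99% available" means in practice, the sketch below converts an availability percentage into expected downtime per year. The function name downtime_per_year_hours and the 365-day year are assumptions made for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_per_year_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_per_year_hours(pct)
        print(f"{pct}% available -> about {hours:.2f} hours down per year")
```

Under these assumptions, a 99% available system is down for roughly 87.6 hours (about 3.65 days) per year.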

Question 2

Q

Describe the concept of Reliability in distributed computing

Answer

A

The system should run continuously without interruption.

Ex: A real system may have a mean time between failures (MTBF) of one month

Question 3

Q

Describe the concept of Safety in distributed computing

Answer

A

Failure of the system should not have catastrophic consequences.
Ex: your car can still come to a complete stop if the ABS fails

Question 4

Q

Describe the concept of Maintainability in distributed computing

Answer

A

A failed system should be easy to repair.

Ex: disks can be replaced easily in a RAID

Question 5

Q

Define the term “error” in distributed computing

Answer

A

Error: A part of a system’s state that might lead to a failure.
Ex: dropped or damaged network packet

Question 6

Q

A ____ may lead to an _____ which may lead to a _____

Answer

A

fault
error
failure

Question 7

Q

Define the term “fault” in distributed computing

Answer

A

Fault: The cause of an error
Ex: When a person talking on the phone walks into an elevator

Question 8

Q

What are the 3 types of faults?

Answer

A

Transient faults
Intermittent faults
Permanent faults

Question 9

Q

What is a transient fault?

Answer

A

Transient faults occur once and then disappear.

Ex: a bird flies in front of a microwave receiver

Question 10

Q

What is an intermittent fault?

Answer

A

Intermittent faults occur, vanish, then reappear. They are difficult to debug.

Question 11

Q

What is a permanent fault?

Answer

A

Permanent faults occur and will continue to exist until a faulty component is replaced.

Ex: burnt out power supply in a server

Question 12

Q

What are the 5 types of failures in distributed systems?

Answer

A

Crash failure
Omission failure
Timing failure
Response failure
Arbitrary failure

Question 13

Q

What is a Crash failure?

Answer

A

A server halts, but is working correctly until it halts

Question 14

Q

What is an Omission failure?

Answer

A

A server fails to respond to incoming requests

Question 15

Q

What is a Timing failure?

Answer

A

A server’s response lies outside the specified time interval

Question 16

Q

What is a Response failure?

Answer

A

A server’s response is incorrect
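
The omission, timing, and response failures described in the last three cards can be told apart from a single request/reply observation. The sketch below is one hypothetical way to do that; the function name classify_response and its parameters are assumptions, and it only checks for late replies, not early ones.

```python
from typing import Optional


def classify_response(reply: Optional[str], elapsed_s: float,
                      expected: str, deadline_s: float) -> str:
    """Label one request/reply observation with a failure type.

    Crash and arbitrary failures cannot be identified from one observation,
    so this sketch only distinguishes omission, timing, and response failures.
    """
    if reply is None:
        return "omission failure"   # the server never responded
    if elapsed_s > deadline_s:
        return "timing failure"     # the reply arrived after the deadline
    if reply != expected:
        return "response failure"   # the reply arrived in time but is incorrect
    return "ok"


if __name__ == "__main__":
    print(classify_response(None, 0.0, "42", 1.0))
    print(classify_response("42", 2.5, "42", 1.0))
    print(classify_response("41", 0.2, "42", 1.0))
```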

Question 17

Q

What is an Arbitrary Failure?

Answer

A

A server may produce arbitrary responses at arbitrary times

Question 18

Q

In Distributed Systems, we mask failures using _______. One example of this is _______ computing units

Answer

A

redundancy

replicating
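
One common way replicated computing units can mask a failure is to run the same computation on every replica and take a majority vote over the outputs. The card does not name a specific technique, so the majority_vote helper below is an illustrative assumption rather than the course's method.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


if __name__ == "__main__":
    # Three replicas, one of which returns a wrong answer.
    print(majority_vote([4, 4, 5]))
```

With three replicas, a single faulty output is outvoted by the two correct ones, so the caller never observes the failure.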

Question 19

Q

A software technique for providing redundancy is to create a group of redundant identical processes. They can be classified as a Flat group, or a Hierarchical group.

Describe both of these paradigms.

Answer

A

Flat Group:
All processes behave in the same way. They are simply replicas of each other.

Hierarchical Group:
There is one coordinator process and numerous worker processes.

Question 20

Q

What are Flat groups? How can you implement them?

Answer

A

All processes play an equal role; there is no concept of a primary or a backup.

Implemented using quorum-based replication.
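
In the classic quorum scheme, N replicas use a read quorum R and a write quorum W chosen so that any read set overlaps any write set, and any two write sets overlap. The check below is a minimal sketch of those two conditions; the function name is_valid_quorum is an assumption, and the card itself does not spell out the conditions.

```python
def is_valid_quorum(n: int, read_q: int, write_q: int) -> bool:
    """Check the usual quorum constraints for n replicas."""
    if not (1 <= read_q <= n and 1 <= write_q <= n):
        return False
    reads_see_latest_write = read_q + write_q > n  # read set overlaps write set
    writes_cannot_conflict = 2 * write_q > n       # any two write sets overlap
    return reads_see_latest_write and writes_cannot_conflict


if __name__ == "__main__":
    print(is_valid_quorum(5, 3, 3))
    print(is_valid_quorum(5, 2, 3))
```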

Question 21

Q

What are Hierarchical groups?

Answer

A

There is a distinguished primary/coordinator node, which coordinates the actions of the other nodes which are backup/worker nodes.

Implemented using primary-backup replication. This requires solving the consensus problem.

Question 22

Q

What is the consensus problem?

Answer

A

We assume that each process has a procedure propose(val) and a procedure decide().
First, each process proposes a value by calling propose(val) once, passing the value it holds in its initial state.
Next, each process learns the agreed-upon value by calling decide().
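
To make the propose/decide interface concrete, here is a toy single-machine stand-in in which the first proposal to arrive becomes the decision. It is only a sketch of the interface: the class name ToyConsensus is an assumption, and there is no network and no failure handling, so it is not a distributed consensus protocol.

```python
import threading
from typing import Optional


class ToyConsensus:
    """Single-machine stand-in for a consensus object (no failures, no network)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._decided: Optional[str] = None

    def propose(self, val: str) -> None:
        # The first proposal to arrive is fixed as the decision; later ones are ignored.
        with self._lock:
            if self._decided is None:
                self._decided = val

    def decide(self) -> str:
        # Every caller learns the same value, and it is one that was proposed.
        with self._lock:
            if self._decided is None:
                raise RuntimeError("decide() called before any propose()")
            return self._decided


if __name__ == "__main__":
    c = ToyConsensus()
    for value in ("movie", "restaurant", "home"):
        c.propose(value)
    print(c.decide())
```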

Question 23

Q

3 friends are trying to figure out what to do on a Friday night. Their proposals are all different activities:
Friend #1: proposes to go see a movie
Friend #2: proposes to go to a restaurant
Friend #3: proposes to sit at home

Next, the 3 friends all decide and go to the movie.

This is an analogy of what type of problem?

Answer

A

The Consensus Problem

Question 24

Q

In the consensus problem, there are 2 safety properties: Agreement, and Validity. Describe them.

Answer

A

Agreement: Two calls to decide() never return different values

Validity: If a process calls decide() with response x, then some process must have invoked a call to propose(x)
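
Both safety properties can be checked mechanically against a record of one run. The sketch below assumes the run is summarized as two dictionaries mapping a process name to the value it proposed and to the value its decide() call returned; the function names are hypothetical.

```python
from typing import Dict, Hashable


def satisfies_agreement(decisions: Dict[str, Hashable]) -> bool:
    """Agreement: no two decide() calls returned different values."""
    return len(set(decisions.values())) <= 1


def satisfies_validity(proposals: Dict[str, Hashable],
                       decisions: Dict[str, Hashable]) -> bool:
    """Validity: every decided value was proposed by some process."""
    proposed = set(proposals.values())
    return all(value in proposed for value in decisions.values())


if __name__ == "__main__":
    proposals = {"p1": "movie", "p2": "restaurant", "p3": "home"}
    decisions = {"p1": "movie", "p2": "movie", "p3": "movie"}
    print(satisfies_agreement(decisions), satisfies_validity(proposals, decisions))
```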

Question 25

Q

In the consensus problem, what is the liveness property?

Answer

A

Liveness: if a process calls propose(x) or decide() and does NOT fail, then that call must eventually terminate.

Question 26

Q

What are the names of the 3 properties of the consensus problem?

Answer

A

Agreement (safety)
Validity (safety)
Liveness

Question 27

Q

Solving consensus in a failure-prone distributed environment depends on 4 design factors. What are they?

Note: these 4 factors influence whether consensus is solvable

Answer

A

Synchronous vs. asynchronous processes (is the time for an execution step bounded?)
Communication delays (is there a bound on network delay for message delivery?)
Message delivery order (ordered vs. unordered)
Message transmission (unicast vs. multicast)

Question 28

Q

RPC systems may exhibit 5 classes of failure scenarios. What are they?

Answer

A

Client is unable to locate the server (the URL doesn’t resolve to a network address, or the network address doesn’t accept a connection)
The request message from the client to the server is lost
The server crashes after receiving a request
The reply message from the server to the client is lost
The client crashes after sending a request

Question 29

Q

What does it mean for a request to be “idempotent”?

Answer

A

Repeated executions have the same effect as one execution.
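
A small contrast may help here. In the sketch below, setting a balance to a fixed value is idempotent, while depositing an amount is not. The account dictionary and both function names are made up for illustration.

```python
def set_balance(account: dict, amount: int) -> None:
    """Idempotent: running it twice leaves the same state as running it once."""
    account["balance"] = amount


def deposit(account: dict, amount: int) -> None:
    """Not idempotent: each repeated execution changes the state again."""
    account["balance"] += amount


if __name__ == "__main__":
    a = {"balance": 0}
    set_balance(a, 100)
    set_balance(a, 100)   # repeated request, same final state
    print(a["balance"])

    b = {"balance": 0}
    deposit(b, 100)
    deposit(b, 100)       # repeated request, balance is doubled
    print(b["balance"])
```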

Question 30

Q

When an RPC server crashes after receiving a request, the request can often be reissued. What are the effects of this? In what case is this fine?

Answer

A

The request may be processed multiple times by the service handler. The client may not know how much of the execution transpired. This is fine if the execution is idempotent.

Question 31

Q

Sometimes when an RPC server crashes after receiving a request, a strategy can be to ______ and report a failure. There is no ______ that the request has been processed.

Answer

A

give up

guarantee

Question 32

Q

Suppose an RPC server crashes after receiving a request. What do the following terms correspond to:

at-least-once semantics
at-most-once semantics
exactly-once semantics

Answer

A

at-least-once semantics is when the request is reissued
at-most-once semantics is when the client gives up
exactly-once semantics is when the client determines if the request is processed and reissues accordingly
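
The first two strategies can be sketched as thin client-side wrappers around a send function. Everything named here is an assumption for illustration: send stands for any callable that raises TimeoutError when no reply arrives, and max_attempts is a practical cap (a real client might retry indefinitely).

```python
from typing import Callable, Optional


def call_at_most_once(send: Callable[[], str]) -> Optional[str]:
    """Try once; on a missing reply, give up and report failure as None."""
    try:
        return send()
    except TimeoutError:
        return None  # the request ran zero times or once; the client cannot tell


def call_at_least_once(send: Callable[[], str], max_attempts: int = 5) -> str:
    """Reissue the request until a reply arrives (capped at max_attempts)."""
    for _ in range(max_attempts):
        try:
            return send()
        except TimeoutError:
            continue  # no reply: reissue the same request
    raise TimeoutError("no reply after retrying")


def make_flaky_send(failures: int) -> Callable[[], str]:
    """Build a fake send() that times out a fixed number of times first."""
    state = {"left": failures}

    def send() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("no reply")
        return "reply"

    return send
```

Exactly-once is not shown because it needs the client to find out whether the request was processed, which neither wrapper can do on its own.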

Question 33

Q

What does “exactly-once semantics” mean in the context of an RPC server crashing upon receiving a request?
Why is this scheme difficult to implement?

Answer

A

The client determines if the request is processed and reissues accordingly.

It is difficult to implement because the server may not have a way of knowing whether a particular action has been completed.

Question 34

Q

Suppose we have a client-server RPC setup. The client follows an always-reissue strategy if it does not receive a response.

In what case would this lead to duplicated computation from the server?

Answer

A

This happens if the server executes and completes the request, then crashes right before sending the response. The client reissues the request, and the server computes it a second time.
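
This scenario can be reproduced with a small in-process simulation. The Server class and always_reissue function below are hypothetical: the simulated crash is just an exception raised after the work is done, and the "restarted" server is the same object with its crash flag cleared.

```python
class Server:
    """Toy server that crashes once, after executing but before replying."""

    def __init__(self) -> None:
        self.executions = 0
        self.crash_before_reply = True

    def handle(self, request: str) -> str:
        self.executions += 1                 # the request is fully executed
        if self.crash_before_reply:
            self.crash_before_reply = False  # the restarted server behaves normally
            raise ConnectionError("server crashed before sending the reply")
        return f"done: {request}"


def always_reissue(server: Server, request: str) -> str:
    """Client strategy: resend the request until some reply arrives."""
    while True:
        try:
            return server.handle(request)
        except ConnectionError:
            continue  # no reply was received, so the request is sent again


if __name__ == "__main__":
    server = Server()
    print(always_reissue(server, "print report"))
    print("executions:", server.executions)
```

The client sees a single successful reply, yet the server's execution counter shows the work was done twice.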

Question 35

Q

What is the definition of fault tolerance?

Answer

A

The characteristic by which a system can mask the occurrence of, and recovery from, failures.