Module 10a - Fault Tolerance Flashcards

1
Q

Describe the concept of Availability for fault tolerance in distributed computing

A

The system should operate correctly at any given instant in time.
Ex: a real system may be 99% available

2
Q

Describe the concept of Reliability in distributed computing

A

The system should run continuously without interruption.

Ex: A real system may have a mean time between failures (MTBF) of one month
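Reliability (MTBF) and availability (percentage of uptime) are related but distinct. As a rough sketch, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). MTTR (mean time to repair) and the numbers below are assumptions for illustration; they do not come from the cards.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical numbers: an MTBF of one month (~720 hours) and 8 hours to repair.
a = availability(720.0, 8.0)
print(f"{a:.4f}")  # roughly 98.9% available
```

A system with a long MTBF can still have poor availability if each repair takes a long time, and a system that fails often can be highly available if it recovers almost instantly.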

3
Q

Describe the concept of Safety in distributed computing

A

Failure of the system should not have catastrophic consequences.
Ex: your car can still come to a complete stop if the ABS fails

4
Q

Describe the concept of Maintainability in distributed computing

A

A failed system should be easy to repair.

Ex: disks can be replaced easily in a RAID

5
Q

Define the term “error” in distributed computing

A

Error: A part of a system’s state that might lead to a failure.
Ex: dropped or damaged network packet

6
Q

A ____ may lead to an _____ which may lead to a _____

A

fault
error
failure

7
Q

Define the term “fault” in distributed computing

A

Fault: The cause of an error
Ex: When a person talking on the phone walks into an elevator

8
Q

What are the 3 types of faults?

A

Transient faults
Intermittent faults
Permanent faults

9
Q

What is a transient fault?

A

A transient fault occurs once and then disappears.

Ex: a bird flies in front of a microwave receiver

10
Q

What is an intermittent fault?

A

Intermittent faults occur, vanish, then reappear. They are difficult to debug.

11
Q

What is a permanent fault?

A

Permanent faults occur and will continue to exist until a faulty component is replaced.

Ex: burnt out power supply in a server

12
Q

What are the 5 types of failures in distributed systems?

A
Crash failure
Omission failure
Timing failure
Response failure
Arbitrary failure
13
Q

What is a Crash failure?

A

A server halts, but is working correctly until it halts

14
Q

What is an Omission failure?

A

A server fails to respond to incoming requests

15
Q

What is a Timing failure?

A

A server’s response lies outside the specified time interval

16
Q

What is a Response failure?

A

A server’s response is incorrect

17
Q

What is an Arbitrary Failure?

A

A server may produce arbitrary responses at arbitrary times
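The omission, timing, and response failures defined in the cards above can be told apart from the client's side with a simple check. The sketch below is illustrative only: the names `Failure` and `classify`, and the single upper deadline, are assumptions. A timing failure can also mean a reply that arrives too early, and crash or arbitrary failures cannot be identified from a single observation.

```python
from enum import Enum
from typing import Optional

class Failure(Enum):
    NONE = "no failure"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"

def classify(reply: Optional[str], elapsed_s: float,
             expected: str, deadline_s: float) -> Failure:
    """Classify one request/reply observation as seen by the client."""
    if reply is None:
        return Failure.OMISSION   # no reply ever arrived
    if elapsed_s > deadline_s:
        return Failure.TIMING     # reply arrived outside the allowed interval
    if reply != expected:
        return Failure.RESPONSE   # reply arrived in time but is incorrect
    return Failure.NONE

print(classify("41", 0.2, expected="42", deadline_s=1.0).value)
```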

18
Q

In Distributed Systems, we mask failures using _______. One example of this is _______ computing units

A

redundancy

replicating
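One common way replicated computing units mask a failure is majority voting over their outputs. The sketch below assumes three replicas that compute the same function; `majority_vote` is a hypothetical helper, not something named in the cards.

```python
from collections import Counter
from typing import Hashable, Sequence

def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value

# Two healthy replicas outvote one faulty replica.
print(majority_vote([42, 42, 7]))
```

With three replicas this masks one faulty output; if all three disagree, the voter reports the failure instead of guessing.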

19
Q

A software technique for providing redundancy is to create a group of redundant identical processes. They can be classified as a Flat group, or a Hierarchical group.

Describe both of these paradigms.

A

Flat Group:
All processes behave in the same way. They are simply replicas of each other.

Hierarchical Group:
There is one coordinator process and numerous worker processes.

20
Q

What are Flat groups? How can you implement them?

A

All processes play an equal role; there is no concept of a primary or a backup.

Implemented using quorum-based replication.
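A minimal sketch of quorum-based replication, under stated assumptions: N replicas, a write quorum of W and a read quorum of R with R + W > N and 2W > N, so every read quorum overlaps the latest write quorum. The `Replica` class, the version counter, and the helper names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Replica:
    version: int = 0
    value: str = ""

def quorum_write(replicas: List[Replica], targets: List[int], value: str) -> None:
    """Write `value` to the replicas in `targets` (the write quorum).

    Because 2W > N, this quorum overlaps the previous write quorum, so the
    highest version seen among the targets is the latest one.
    """
    new_version = max(replicas[i].version for i in targets) + 1
    for i in targets:
        replicas[i].version = new_version
        replicas[i].value = value

def quorum_read(replicas: List[Replica], targets: List[int]) -> str:
    """Read from the replicas in `targets` and return the newest value seen."""
    newest = max((replicas[i] for i in targets), key=lambda r: r.version)
    return newest.value

N, W, R = 5, 3, 3
assert R + W > N and 2 * W > N
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, [0, 1, 2], "x=1")
quorum_write(replicas, [2, 3, 4], "x=2")
print(quorum_read(replicas, [0, 1, 2]))  # replica 2 holds the newest version
```

No replica is special here, which matches the flat-group idea: any W replicas can accept a write and any R replicas can serve a read.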

21
Q

What are Hierarchical groups?

A

There is a distinguished primary/coordinator node, which coordinates the actions of the other nodes which are backup/worker nodes.

Implemented using a primary-backup protocol. This requires solving the consensus problem.

22
Q

What is the consensus problem?

A
  • We assume that each process has a procedure propose(val) and a procedure decide()
  • First, each process proposes a value by calling propose(val) exactly once, passing a value determined by its initial state
  • Next, each process learns the agreed-upon value by calling decide()
23
Q

3 friends are trying to figure out what to do on a Friday night. Their proposals are all different activities:
Friend #1: proposes to go see a movie
Friend #2: proposes to go to a restaurant
Friend #3: proposes to sit at home

Next, the 3 friends all decide and go to the movie.

This is an analogy of what type of problem?

A

The Consensus Problem

24
Q

In the consensus problem, there are 2 safety properties: Agreement, and Validity. Describe them.

A

Agreement: Two calls to decide() never return different values

Validity: If a process calls decide() with response x, then some process must have invoked a call to propose(x)
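The propose/decide interface and the two safety properties can be illustrated with a toy, single-machine object in which the first proposal wins. This is only a sketch: `ToyConsensus` is a hypothetical name, it relies on shared memory and a lock, and it tolerates no failures, so it is not a solution to distributed consensus.

```python
import threading

class ToyConsensus:
    """First proposal wins. Illustrates the interface, not a real protocol."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._has_decision = False
        self._decided = None

    def propose(self, val) -> None:
        with self._lock:
            if not self._has_decision:
                self._decided = val
                self._has_decision = True

    def decide(self):
        with self._lock:
            if not self._has_decision:
                raise RuntimeError("decide() called before any propose()")
            return self._decided

proposals = ["movie", "restaurant", "stay home"]
c = ToyConsensus()
for p in proposals:
    c.propose(p)

decisions = [c.decide() for _ in proposals]
assert len(set(decisions)) == 1      # Agreement: all decide() calls match
assert decisions[0] in proposals     # Validity: the value was proposed
print(decisions[0])
```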

25
Q

In the consensus problem, what is the liveness property?

A

Liveness: if a process calls propose(x) or decide() and does NOT fail, then that call must eventually terminate

26
Q

What are the names of the 3 properties of the consensus problem?

A

Agreement (safety)
Validity (safety)
Liveness

27
Q

Solving consensus in a failure-prone distributed environment depends on 4 design factors. What are they?

Note: these 4 factors influence if a consensus problem is solvable

A
  1. Synchronous vs asynchronous processes (is the time for an execution step bounded?)
  2. Communication delays (is there a bound on the network delay for message delivery?)
  3. Message delivery order (ordered vs unordered)
  4. Message transmission (unicast vs multicast)
28
Q

RPC systems may exhibit 5 classes of failure scenarios. What are they?

A
  1. Client is unable to locate the server (the URL doesn’t resolve to a network address, or the network address doesn’t accept a connection)
  2. The request message from the client to the server is lost
  3. The server crashes after receiving a request
  4. The reply message from the server to the client is lost
  5. The client crashes after sending a request
29
Q

What does it mean for a request to be “idempotent”?

A

Repeated executions have the same effect as one execution.
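A short sketch of the difference, using a hypothetical account dictionary: setting a balance is idempotent, while depositing is not.

```python
def set_balance(account: dict, amount: int) -> None:
    """Idempotent: running it twice leaves the same state as running it once."""
    account["balance"] = amount

def deposit(account: dict, amount: int) -> None:
    """Not idempotent: every execution changes the state again."""
    account["balance"] += amount

a = {"balance": 100}
set_balance(a, 150)
set_balance(a, 150)   # a repeated request is harmless
print(a["balance"])   # 150

b = {"balance": 100}
deposit(b, 50)
deposit(b, 50)        # a repeated request deposits twice
print(b["balance"])   # 200
```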

30
Q

When an RPC server crashes after receiving a request, the request can often be reissued. What are the effects of this? In what case is this fine?

A

The request may be processed multiple times by the service handler, and the client may not know how much of the execution took place. This is fine if the execution is idempotent.

31
Q

Sometimes when an RPC server crashes after receiving a request, a strategy can be to ______ and report a failure. There is no ______ that the request has been processed.

A

give up

guarantee

32
Q

Suppose an RPC server crashes after receiving a request. What do the following terms correspond to:

  1. at-least-once semantics
  2. at-most-once semantics
  3. exactly-once semantics
A
  1. at-least-once semantics is when the request is reissued
  2. at-most-once semantics is when the client gives up
  3. exactly-once semantics is when the client determines whether the request was processed and reissues only if it was not
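The first two strategies can be sketched as client-side wrappers. Everything here is a hypothetical illustration: `NoReply` stands in for whatever the transport raises when no reply arrives, `send` is any callable that performs the remote call, and the retry loop is bounded only so the example terminates.

```python
from typing import Callable, Optional

class NoReply(Exception):
    """Raised by the (hypothetical) transport when no reply arrives."""

def call_at_least_once(send: Callable[[], str], max_attempts: int = 5) -> str:
    """Reissue until a reply arrives; the server may execute more than once."""
    for _ in range(max_attempts):
        try:
            return send()
        except NoReply:
            continue
    raise NoReply(f"no reply after {max_attempts} attempts")

def call_at_most_once(send: Callable[[], str]) -> Optional[str]:
    """Send once and give up on failure; the server executes zero or one time."""
    try:
        return send()
    except NoReply:
        return None

def make_flaky_send(failures: int):
    """Build a fake remote call that fails `failures` times, then succeeds."""
    state = {"calls": 0}
    def send() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise NoReply()
        return "done"
    return send, state

send, state = make_flaky_send(failures=2)
print(call_at_least_once(send), state["calls"])   # done 3

send, state = make_flaky_send(failures=2)
print(call_at_most_once(send), state["calls"])    # None 1
```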
33
Q

What does “exactly-once semantics” mean in the context of an RPC server crashing after receiving a request?
Why is this scheme difficult to implement?

A

The client determines whether the request was processed and reissues only if it was not.

It is difficult to implement because the server may not have a way of knowing whether a particular action has been completed.

34
Q

Suppose we have a client-server RPC setup. The client follows an always-reissue strategy if it does not receive a response.

In what case would this lead to duplicated computation from the server?

A

When the server executes and completes the request, then crashes right before sending the response. The client reissues the request, and the server computes it a second time.
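This scenario can be reproduced with a small simulation. The `Server` class, the `ServerCrashed` exception, and the assumption that the server restarts immediately are all hypothetical and exist only to make the duplicate execution visible.

```python
class ServerCrashed(Exception):
    """Stands in for the client never receiving a reply."""

class Server:
    def __init__(self, crashes_before_reply: int) -> None:
        self.executions = 0
        self._crashes_left = crashes_before_reply

    def handle(self, request: str) -> str:
        self.executions += 1            # the work is fully done here
        if self._crashes_left > 0:
            self._crashes_left -= 1
            raise ServerCrashed()       # crash after executing, before replying
        return f"result of {request}"

def always_reissue(server: Server, request: str) -> str:
    """Client strategy: keep reissuing until a reply arrives."""
    while True:
        try:
            return server.handle(request)
        except ServerCrashed:
            pass  # no reply seen; assume the server restarts, then reissue

server = Server(crashes_before_reply=1)
reply = always_reissue(server, "print report")
print(reply, server.executions)   # the request was executed twice
```

The client sees exactly one reply, yet the server performed the work twice, which is harmless only when the request is idempotent.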

35
Q

What is the definition of fault tolerance?

A

The characteristic by which a system can mask the occurrence of, and recovery from, failures.