7 - Fault Tolerance Flashcards by Josh Davidson

Dependability

Correctness of one component depends on another component

How well did you know this?

Not at all

Perfectly

Dependability Example

Web server W requires database D.
D fails, so does W.

How well did you know this?

Not at all

Perfectly

Dependability requires

Availability
Reliability
Safety
Maintainability

How well did you know this?

Not at all

Perfectly

Availability

Readiness for usage
A(t) = probability that component is immediately usable

How well did you know this?

Not at all

Perfectly

Reliability

Continuity of service delivery
R(t) = probability component works over time period [0,t]

How well did you know this?

Not at all

Perfectly

Safety

Low risk of component failure leading to system failure.

How well did you know this?

Not at all

Perfectly

Maintainability

Easy and quick to repair

How well did you know this?

Not at all

Perfectly

When is a system more available but less reliable

more available meaning generally online but less reliable meaning scattered downtime

How well did you know this?

Not at all

Perfectly

MTTF

Mean Time to Failure
Average time between start and failure

How well did you know this?

Not at all

Perfectly

MTTR

Mean Time to Repair
Average time of repair

How well did you know this?

Not at all

Perfectly

MTBF

Mean Time between Failures
MTBF = MTTF + MTTR

How well did you know this?

Not at all

Perfectly

Availability (expression in relation to MTTF MTBF etc)

time available/whole time
MTTF/MTBF

How well did you know this?

Not at all

Perfectly

Failure Rate z(t) of component C

z(t) is the (conditional) probability that C initially fails at t.

How well did you know this?

Not at all

Perfectly

Reliability R(t) of component C

Declines exponentially even if z(t) is constant.
R(t) = e^(-z*t)

How well did you know this?

Not at all

Perfectly

Density Function

f(t)
Probability of fail at

How well did you know this?

Not at all

Perfectly

CDF

Cumulative Distribution Function
Probability to fail by t

How well did you know this?

Not at all

Perfectly

Fault classifications

Permanent: exists until repaired
Intermittent: reappearing
Transient: occurs just once and disappears

How well did you know this?

Not at all

Perfectly

Fault

Cause of error

How well did you know this?

Not at all

Perfectly

Error

Part of component that can lead to failure

How well did you know this?

Not at all

Perfectly

Failure

Component is not running as expected

How well did you know this?

Not at all

Perfectly

Dealing with faults

Tolerant system
Good fault removal
Fault forecasting

How well did you know this?

Not at all

Perfectly

Are parallel programs fault tolerant?

Often not fault-tolerant.
Wastes resources
Crashes as soon as one component fails

How well did you know this?

Not at all

Perfectly

Checkpointing

Study These Flashcards

Save program state in regular intervals and pick up from last one after a crash.

Avoid unneccessary checkpoints - predict crash

Failure Types

(one word names)

Study These Flashcards

Halting
Omission
Timing
Response
Arbitrary/byzantine

Halting failure

Component stops

Omission failure

Failure to respond/receive/send messages

Timing failure

Response outside of specified time interval

Response failures

Incorrect response. Value or state-transition

Arbitrary/byzantine failure

Arbitrary responses at arbitrary times

Async or sync: Reliable crash failure detection

Async - cannot be reliably detected Sync - can be reliably detected, detect by timeout but NO CERTAINTY

Halt fail types

stop - reliably detectable noisy - eventually reliably detectable silent - client cannot distinguish crash/omission safe - benign arbitrary - unobservable and harmful

Information Redundancy

Add extra error detection bits, eg checksum

Time redundancy

Designed to repeat action if required

Physical Redundancy

Add replicas of components

Triple Modular Redundancy

Every device replicated 3 times Each replicate should produce same result. Majority value is used

Process Groups

Dynamic group of (mostly) identical processes to achieve fault tolerance via physical redundancy. All members receive all messages

Process Group examples

Replicated data store Master-worker scheme

Hierarchical Group

One member is coordinator. If coordinator fails, new one must be elected.

Flat Group

Members coordinate together Majority decisions Robust against crashes

Process Groups: Centralised membership

Group server manages. - Efficient, Simple removal, Message sync easier - Single point of failure, group must be reconstructed if server ceases

Process Groups: Distributed membership

Processes inform all of join/leave - More fault tolerant, Many processes must fail before rebuild - Less efficient, crashed processes must be detected, message sync harder

k-fault-tolerant group

A group can mask any k concurrent member failures k is called degree of fault tolerance

k-fault-tolerant group needs to be how large for halting failures...?

k+1 because one member is enough

k-fault-tolerant groups needs to be how large for arbitrary/byzantine failures

2*k +1 are needed because a majority vote is required

Consensus

All processes in a process group agree

Flooding based consensus: Algorithm for each process

Pi sends its set of commands to all others. For each pair (pi,pj), Pi detects either a Pj message or crash. Pi merges all received commands If Pi detects a crash. - Pi does not exec a command. - Pi runs error routine in next round Else: - Pi selects next command and exec

Flooding based consensus: Algorithm after failure

If Pi received all messages, but others did not: - Pi sends msgs and decision from previous round to other processes Elseif Pi did not: - Pi signlas that it did not receive all - Pi runs command from last round

Byzantine Consensus: Context

Some generals in the Byzantine Empire may be traitors (faulty processes) and deliberately send false messages

Byzantine Consensus: Attack or retreat?

If all (honest) retreat, it is fine. If all (honest) attack, it will succeed. If a subset attacks, it fails.

Generally, Byzantine Consensus can be reached when...

if k processes are faulty, then at least n = 3*k+1 are required

Failure Detection Types/Methods

Passive Active Gossiping

How is a byzantine failure caused?

By a faulty/malicious process producing a false value

How many byzantine failures can be tolerated in Triple Module Redundancy (3 voting devices)?

1. If there's 3 devices and 1 gives false results, the other 2 will still agree and create a majority

How many crash failures can Triple Modular Redundancy tolerate? and why?

It can tolerate 2. Crashed devices don't return ANY results so it leaves 1 functioning voter and hence, a majority.

How many byzantine failures can a system of n voters tolerate?

(n-1)/2 failures

How many crash failures can be tolerated by a system of n voters?

n-1 crashes

7 - Fault Tolerance Flashcards

(56 cards)