Chapter 7 Flashcards

1
Q

What is the dependability of components?

A

A component C depends on C∗ if the correctness of C’s behavior depends on the correctness of C∗’s behavior.

2
Q

What are the requirements for dependability? List and describe them.

A

Availability - readiness to be used
Reliability - Continuity of service delivery
Safety - Very low probability of catastrophes
Maintainability - How easily a failed system can be repaired

3
Q

What is reliability R(t)?

A

The probability that a component has been up and running continuously in the time interval [0, t).

4
Q

What are the traditional metrics for measuring reliability?

A
  • Mean Time To Failure (MTTF): the average time until a component fails.
  • Mean Time To Repair (MTTR): the average time it takes to repair a failed component.
  • Mean Time Between Failures (MTBF): MTTF + MTTR.
5
Q

What is availability A(t)?

A

The average fraction of time that a component has been up and running in the interval [0, t).

6
Q

How can we calculate availability A?

A

A = MTTF / MTBF = MTTF / (MTTF + MTTR)
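
A small worked example may make the formula concrete. The function name `availability` and the figures used (999 hours between failures, 1 hour to repair) are made up for illustration, not taken from the course material.

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability A = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    mtbf = mttf + mttr
    return mttf / mtbf


# Hypothetical component: runs 999 hours on average, then needs 1 hour of repair.
print(availability(999.0, 1.0))  # 0.999
```

Note that a component can be highly available and still unreliable: one that is down for a millisecond every hour has an availability above 0.999999, yet never runs continuously for more than an hour.
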

7
Q

Describe a failure and give an example.

A
  • A failure occurs when a component does not live up to its specifications.
    – A crashed program
8
Q

Describe an error and give an example.

A
  • Part of a component that may lead to a failure
    – A programming bug
9
Q

Describe a fault and give an example.

A
  • The cause of an error
    – A sloppy programmer
10
Q

Describe fault prevention and give an example.

A
  • Prevent the occurrence of a fault
    – Don’t hire sloppy programmers
11
Q

Describe fault tolerance and give an example.

A
  • Build a component that can mask the occurrence of a fault
    – Have two independent programmers build each component
12
Q

Describe fault removal and give an example.

A
  • Reduce the presence, number, or seriousness of a fault
    – Get rid of sloppy programmers
13
Q

Describe fault forecasting and give an example.

A
  • Estimate current presence, future incidence, and consequences of faults
    – Estimate how a recruiter is doing when it comes to hiring sloppy programmers
14
Q

What is a crash failure?

A

The component halts, but behaves correctly before halting.

15
Q

What is an omission failure?

A
  • A failure in sending or receiving messages
    – Receive omissions: sent messages are not received
    – Send omissions: messages that should have been sent are not sent
16
Q

What is a timing failure?

A
  • The output (response) is correct, but lies outside a specified time interval.
    – Performance failure: the component is too slow
17
Q

What is a response failure?

A
  • The component's response is incorrect
    – Value failure: the value of the response is wrong
    – State-transition failure: the component deviates from the correct flow of control and ends up in a wrong state
18
Q

What is an arbitrary failure?

A

The component may produce arbitrary output and be subject to arbitrary timing failures.

19
Q

What is a commission failure?

A

A component takes an action that it should not have taken

20
Q

What is a deliberate failure?

A

An omission or commission failure that is caused on purpose; such failures extend into the field of security.

21
Q

Describe, where possible, how we can distinguish between a crash failure and an omission or timing failure.

A
  1. Asynchronous system: no assumptions are made about process execution speeds or message delivery times → we cannot reliably detect crash failures.
  2. Synchronous system: process execution speeds and message delivery times are bounded → we can reliably detect omission and timing failures.
  3. Partially synchronous system: most of the time we can assume the system to be synchronous, yet there is no bound on how long it may behave asynchronously → we can normally reliably detect crash failures.
22
Q

What assumptions can we make about halting failures?

A
  • Fail-stop: the component exhibits crash failures, and the failure can be reliably detected (either through announcement or timeouts).
  • Fail-noisy: the component exhibits crash failures that are eventually reliably detectable.
  • Fail-silent: the component exhibits omission or crash failures; clients cannot tell what went wrong.
  • Fail-safe: the component exhibits arbitrary but benign failures (they cannot do any harm).
  • Fail-arbitrary: the component exhibits arbitrary failures, including malicious ones.
23
Q

What is the basic approach to process resilience?

A

Replicate a process and organize the replicas into a group; if a process in the group fails, the others take over.

24
Q

What are the two ways of organizing a process group to achieve process resilience?

A

Flat groups
Hierarchical groups

25
Q

Describe a flat group.

A
  • All processes are equal
  • Good for fault tolerance, since information is exchanged directly with all group members
  • Imposes higher overhead, as control is completely distributed
  • Hard to implement
26
Q

Describe a hierarchical group.

A
  • All communication goes through a single coordinator
  • Loss of the coordinator brings the entire group to a halt
  • Not really fault tolerant or scalable
  • Easier to implement
27
Q

What is a k-fault tolerant group?

A

A group that can mask any k concurrent member failures (k is called the degree of fault tolerance).

28
Q

How large does a k-fault tolerant group need to be?

A
  • Assume crash/performance failure semantics ⇒ a total of k + 1 members are needed to survive k member failures.
  • Assume arbitrary failure semantics, and group output defined by voting ⇒ a total of 2k + 1 members are needed to survive k member failures.
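
The two group sizes can be checked with a short sketch. The helper names `min_group_size` and `vote` are hypothetical, and majority voting is only one possible way to define the group output.

```python
from collections import Counter


def min_group_size(k: int, arbitrary: bool) -> int:
    """Members needed to survive k failures under the given failure semantics."""
    return 2 * k + 1 if arbitrary else k + 1


def vote(outputs):
    """Return the value reported by a strict majority of members, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# k = 2 arbitrary failures: 5 members, of which the 3 correct ones outvote the 2 faulty ones.
print(min_group_size(2, arbitrary=True))
print(vote(["ok", "ok", "ok", "bogus", "junk"]))
```

With only 2k members, the k faulty members could tie with the k correct ones, which is why voting needs 2k + 1.
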
29
Q

Describe the assumptions and basic idea of flooding-based consensus.

A

Assume:
- A process group P = {P1,…,Pn}
- Fail-stop semantics: when a process crashes, this can be reliably detected.
- Reliable failure detection: a process P can indeed reliably detect that Q has crashed.
- Unreliable communication

Basic idea:
- A client contacts a process Pi, requesting it to execute a command.
- Every Pi maintains a list of proposed commands.
- In round r, Pi multicasts its known set of commands C to all other processes.
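
One failure-free round can be sketched as follows. The card only states the multicast step; the merge at the end of the round and the shared deterministic selection rule (here simply `min`) are assumptions added to make the sketch complete, and the names `flooding_round` and `choose` are hypothetical.

```python
def flooding_round(known):
    """Simulate one round in which no process crashes.

    `known` maps a process id to the set of commands it currently knows.
    Every process multicasts its set, so afterwards each one holds the union.
    """
    merged = set().union(*known.values())
    return {pid: set(merged) for pid in known}


def choose(commands):
    # Assumption: all processes apply the same deterministic rule to pick a command.
    return min(commands)


known = {"P1": {"write x"}, "P2": {"read y"}, "P3": set()}
after = flooding_round(known)
print({pid: choose(cmds) for pid, cmds in after.items()})
```

Because every process ends the round with the same set and applies the same rule, they all select the same command. A crash in the middle of a multicast breaks this, which is where reliable failure detection comes in.
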

30
Q

What assumptions does Paxos consensus make?

A
  • An asynchronous system
  • Communication may be unreliable (meaning that messages may be lost, duplicated, or reordered)
  • Corrupted messages are detectable (and can thus be discarded)
  • All operations are deterministic (once an execution is started, it is known exactly what it will do)
  • Processes may exhibit halting failures, but not arbitrary failures, and they do not collude.
31
Q

What are the essential roles in Paxos consensus?

A
  1. client - a thread that requests to have an operation performed
  2. proposer - a thread that takes a client’s request and attempts to have the requested operation accepted for execution
  3. acceptor - a thread that operates in a quorum to vote for the execution of an operation
  4. learner - a thread that eventually performs an operation
32
Q

What are the guarantees of Paxos consensus?

A
  1. Safety (nothing bad will happen):
    - Only proposed operations will be learned
    - At most one operation will be learned (and subsequently executed before a next operation is learned)
  2. Liveness (something good will eventually happen):
    - If sufficient processes remain nonfaulty, then a proposed operation will eventually be learned
33
Q

How can we reliably detect that a process has actually crashed, using the general model?

A
  • Each process is equipped with a failure detection module
  • A process p probes another process q for a reaction:
    – q reacts → q is alive
    – q does not react within t time units → q is suspected to have crashed

Note: in a synchronous system:
- a suspected crash is a known crash
- such a detector is referred to as a perfect failure detector

34
Q

What do we call the failure detector that is used in practice instead of a perfect one, and what are its two important properties?

A
  • The eventually perfect failure detector
  1. Strong completeness: every crashed process is eventually suspected to have crashed by every correct process.
  2. Eventual strong accuracy: eventually, no correct process is suspected by any other correct process to have crashed.
35
Q

How is the eventually perfect failure detector implemented?

A
  • If p did not receive a heartbeat from q within time t → p suspects q.
  • If q later sends a message (received by p):
    – p stops suspecting q
    – p increases the timeout value t
  • Note: if q does crash, p will keep suspecting q.
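
The heartbeat rules above can be sketched as a small class. The class and method names are hypothetical, time is passed in explicitly to keep the sketch deterministic, and the fixed `increment` is an assumed policy for growing the timeout.

```python
class EventuallyPerfectDetector:
    """Failure detection module of a process p (illustrative sketch)."""

    def __init__(self, timeout: float, increment: float = 1.0):
        self.timeout = timeout        # the timeout value t
        self.increment = increment    # assumed amount by which t grows
        self.last_heard = {}          # process id -> time of last heartbeat
        self.suspected = set()

    def heartbeat(self, q, now: float) -> None:
        """Record a message from q; a false suspicion is retracted and t is increased."""
        self.last_heard[q] = now
        if q in self.suspected:
            self.suspected.discard(q)
            self.timeout += self.increment

    def check(self, q, now: float) -> bool:
        """Suspect q if nothing was heard within t time units (unknown q counts from time 0)."""
        if now - self.last_heard.get(q, 0.0) > self.timeout:
            self.suspected.add(q)
        return q in self.suspected


d = EventuallyPerfectDetector(timeout=2.0)
d.heartbeat("q", now=0.0)
print(d.check("q", now=3.5))   # q is slow, so p suspects it
d.heartbeat("q", now=4.0)      # q was alive after all: stop suspecting, raise t
print(d.check("q", now=6.5))
```

Increasing t after each false suspicion is what makes the detector eventually accurate: once t exceeds the real (unknown) delay bound, a correct q is no longer suspected.
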
36
Q

What can go wrong during RPC communication?

A

1: Client cannot locate server
2: Client request is lost
3: Server crashes
4: Server response is lost
5: Client crashes

37
Q

What are the solutions to RPC communication problems 1 and 2?

A

1: Report the failure back to the client.
2: Just resend the message.

38
Q

What is the solution to RPC communication problem 3 (server crashes)?

A

3: We need to decide what we expect from the server:

A. At-least-once semantics: the server guarantees it will carry out an operation at least once, no matter what. [Example: read]

B. At-most-once semantics: the server guarantees it will carry out an operation at most once. [Examples: write, transfer 10k]
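
One common way to approximate at-most-once behavior is for the server to remember which requests it has already executed. The sketch below assumes that clients attach a unique request id, and the class `AtMostOnceServer` is hypothetical. The table lives in memory only, so it covers retransmitted requests but not a server crash; surviving a crash would require keeping it on stable storage.

```python
class AtMostOnceServer:
    def __init__(self):
        self.balance = 0
        self.replies = {}  # request id -> reply already sent for that request

    def deposit(self, request_id: str, amount: int) -> int:
        # A retransmitted request gets the cached reply and is not executed again.
        if request_id in self.replies:
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


server = AtMostOnceServer()
server.deposit("req-1", 10)
server.deposit("req-1", 10)  # duplicate of the same request
print(server.balance)        # 10
```
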

39
Q

What is the solution to RPC communication problem 4 (server response is lost)?

A

4: Detecting lost replies can be hard, because the server may also have crashed. You do not know whether the server has carried out the operation.

Solution: none, except that you can try to make your operations:

  • idempotent: repeatable without any harm done if the operation happened to be carried out before.
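
The difference shows up as soon as a client resends a request whose reply was lost. In this sketch the account dictionary and the functions `set_balance` and `withdraw` are made up for illustration.

```python
def set_balance(account: dict, value: int) -> None:
    """Idempotent: executing it twice leaves the same state as executing it once."""
    account["balance"] = value


def withdraw(account: dict, amount: int) -> None:
    """Not idempotent: every repeated execution changes the state again."""
    account["balance"] -= amount


a = {"balance": 100}
set_balance(a, 90)
set_balance(a, 90)   # resent after a lost reply: harmless
print(a["balance"])  # 90

b = {"balance": 100}
withdraw(b, 10)
withdraw(b, 10)      # resent after a lost reply: money is taken twice
print(b["balance"])  # 80
```
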
40
Q

What is an orphan computation?

A

A computation left behind when the client crashes: the server is still doing work and holding resources for nothing.

41
Q

What are the solutions to RPC communication problem 5 (client crashes)?

A
  • The orphan is killed (or rolled back) by the client when it reboots.
  • The client broadcasts a new epoch number when recovering ⇒ servers kill the client's orphans.
  • Computations are required to complete within T time units; old ones are simply removed.
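
The epoch-based approach can be sketched as follows. The `Server` class, its job list, and the method names are hypothetical, and "killing" a computation is modeled simply as removing it from the list.

```python
class Server:
    def __init__(self):
        self.jobs = []  # (client id, epoch, job name) for each running computation

    def start(self, client: str, epoch: int, job: str) -> None:
        self.jobs.append((client, epoch, job))

    def announce_epoch(self, client: str, epoch: int) -> list:
        """Handle a recovering client's broadcast: kill its computations from older epochs."""
        killed = [j for j in self.jobs if j[0] == client and j[1] < epoch]
        self.jobs = [j for j in self.jobs if j not in killed]
        return killed


server = Server()
server.start("c1", 1, "scan")
server.start("c2", 1, "sort")
# c1 crashes, reboots, and broadcasts epoch 2: its old computation is an orphan.
print(server.announce_epoch("c1", 2))
print(server.jobs)
```

Computations belonging to other clients, or started in the new epoch, are left untouched.
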