chapter 7 Flashcards
what is Dependability of components?
A component C depends on C∗ if the correctness of C’s behavior depends on the correctness of C∗’s behavior.
what are the requirements for dependability? list and describe them.
Availability - readiness to be used
Reliability - Continuity of service delivery
Safety - Very low probability of catastrophes
Maintainability - How easily a failed system can be repaired
what is Reliability R(t)?
probability that a component has been up and running continuously in the time interval [0,t)
what are the traditional metrics to measure reliability?
- Mean Time To Failure (MTTF): Average time until a component fails
- Mean Time To Repair(MTTR): Average time it takes to repair a failed component.
- Mean Time Between Failures(MTBF): MTTF + MTTR
what is Availability A(t)?
Average fraction of time that a component has been up and running in the interval [0,t)
how can we calculate Availability A(t)?
A = MTTF /MTBF = MTTF /(MTTF + MTTR )
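Worked example (a sketch only; the MTTF and MTTR figures are made-up numbers, not from the chapter). The formula above translates directly into code:

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean time between failures: time until failure plus time to repair."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the component is up: MTTF / (MTTF + MTTR)."""
    return mttf / mtbf(mttf, mttr)


# Hypothetical component: fails on average every 999 hours, takes 1 hour to repair.
print(mtbf(999.0, 1.0))          # 1000.0
print(availability(999.0, 1.0))  # 0.999
```

Note that a short MTTR raises availability just as effectively as a long MTTF does.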
describe failure and give example
- May occur when a component is not living up to its specifications.
– A crashed program
describe error and give example
- Part of a component that may lead to a failure
– A programming bug
describe fault and give example
- The cause of an error
– A sloppy programmer
describe fault prevention and give example
- Prevent the occurrence of a fault
– Don’t hire sloppy programmers
describe fault tolerance and give example
- Build a component that can mask the occurrence of a fault
– Build each component by two independent programmers
describe fault removal and give example
- Reduce the presence, number, or seriousness of a fault
– Get rid of sloppy programmers
describe Fault forecasting and give example
- Estimate current presence, future incidence, and consequences of faults
– Estimate how a recruiter is doing when it comes to hiring sloppy programmers
what is a Crash failure?
Component halts, but behaves correctly before halting
what is an Omission failure?
- Failure in sending or receiving messages
– Receiving omissions: sent messages are not received
– Send omissions: messages are not sent that should have
what is a Timing failure?
- Output [ response ] is correct, but lies outside a specified interval.
– Performance failures: the component is too slow
what is a Response failure?
- The component’s response is incorrect
– Value failure : The value of the response is wrong
– State transition failure : The server deviates from the correct flow of control and into a wrong state
what is an Arbitrary failure?
Component may produce arbitrary output and be subject to arbitrary timing failures
what is a Commission failure?
A component takes an action that it should not have taken
what is a Deliberate failure?
A failure caused on purpose; it can be an omission or a commission failure, and it extends into the field of security
describe, if possible, how we can distinguish between a crash and an omission/timing failure.
- Asynchronous system: no assumptions about process execution speeds or message delivery times → cannot reliably detect crash failures.
- Synchronous system: process execution speeds and message delivery times are bounded → we can reliably detect omission and timing failures.
- Partially synchronous systems: most of the time, we can assume the system to be synchronous, yet there is no bound on the time that a system is asynchronous → can normally reliably detect crash failures.
what assumptions can we make about crash failures?
- Fail-stop: The component exhibits crash failures, but its failure can be detected (either through announcement or timeouts)
- Fail-noisy: Crash failures, eventually reliably detectable
- Fail-silent: The component exhibits omission or crash failures; clients cannot tell what went wrong
- Fail-safe: The component exhibits arbitrary, but benign failures (they can’t do any harm)
- Fail-arbitrary: Arbitrary, with malicious failures
what is the Basic approach to Process Resilience?
replicate a process and organize the replicas into a group; if a process in the group fails, the others take over.
what are the 2 techniques to achieve process resilience?
flat groups
hierarchical groups
describe a flat group
- all processes are equal
- good for fault tolerance since information exchange immediately occurs with all group members
- imposes higher overhead as control is completely distributed
- hard to implement
describe a Hierarchical group.
- All communication goes through a single coordinator
- Loss of the coordinator brings the entire group to a halt
- not really fault tolerant or scalable
- easier to implement
what is a K-fault tolerant group?
When a group can mask any k concurrent member failures (k is called degree of fault tolerance).
How large does a k-fault tolerant group need to be?
- Assume crash/performance failure semantics ⇒ a total of k + 1 members are needed to survive k member failures.
- Assume arbitrary failure semantics, and group output defined by voting ⇒ a total of 2k + 1 members are needed to survive k member failures.
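Worked example (a sketch; the function names are made up for illustration). The sizing rule above, plus the majority vote that makes 2k + 1 members sufficient under arbitrary failures:

```python
from collections import Counter


def min_group_size(k: int, arbitrary_failures: bool) -> int:
    """Smallest group that can mask k faulty members.

    Crash/performance failures: k + 1 (one surviving member is enough).
    Arbitrary failures with voting: 2k + 1 (the k + 1 correct members
    must outvote the k faulty ones).
    """
    return 2 * k + 1 if arbitrary_failures else k + 1


def vote(outputs: list):
    """Return the output reported by a strict majority, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


print(min_group_size(2, arbitrary_failures=False))  # 3
print(min_group_size(2, arbitrary_failures=True))   # 5
# Five members, two of them faulty: the three correct replies still win.
print(vote([42, 42, 42, 7, 13]))                    # 42
```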
describe the assumptions and basic idea of Flooding-based consensus
Assume:
- Fail-stop semantics - when a process crashes, this can be reliably detected.
- Reliable failure detection - a process P can indeed reliably detect that Q crashed
- Unreliable communication
Basic idea:
- A process group P = {P1,…,Pn}
- A client contacts a Pi requesting it to execute a command
- Every Pi maintains a list of proposed commands
- In round r, Pi multicasts its known set of commands C to all other processes
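Sketch of one flooding round (assumptions, none of them stated in the card above: each process merges whatever it receives into its own set, and all processes share a deterministic `select` function, here simply the smallest command; the `delivered` map is a made-up way to model a sender that crashes mid-multicast and reaches only some receivers):

```python
def flooding_round(known: dict, delivered: dict) -> dict:
    """Simulate one round of flooding.

    known:     process -> set of commands it knows at the start of the round
    delivered: sender  -> set of receivers that actually got the sender's multicast
    Returns process -> merged set of commands at the end of the round.
    """
    merged = {p: set(cmds) for p, cmds in known.items()}
    for sender, receivers in delivered.items():
        for receiver in receivers:
            merged[receiver] |= known[sender]
    return merged


def select(commands: set) -> str:
    """Deterministic choice shared by all processes (assumed: smallest command)."""
    return min(commands)


known = {"P1": {"write"}, "P2": {"read"}, "P3": {"delete"}}
# No crashes: every multicast reaches every other process.
delivered = {p: set(known) - {p} for p in known}
after = flooding_round(known, delivered)
# Every process now holds the same set, so every process selects the same command.
print({p: select(cmds) for p, cmds in after.items()})
```

If a sender reaches only some receivers, the merged sets differ, which is why a process should decide only once it knows it has the same information as everyone else.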
what are the assumptions made by Paxos consensus?
- An asynchronous system
- Communication may be unreliable (meaning that messages may be lost, duplicated, or reordered)
- Corrupted messages are detectable (and can thus be discarded)
- All operations are deterministic (once an execution is started, it is known exactly what it will do)
- Processes may exhibit halting failures, but not arbitrary failures, nor do they collude.
what are the Essentials of a Paxos consensus?
- client - a thread that requests to have an operation performed
- proposer - a thread that takes a client’s request and attempts to have the requested operation accepted for execution
- acceptor - a thread that operates in a quorum to vote for the execution of an operation
- learner - a thread that eventually performs an operation
what are the guarantees of a Paxos consensus?
- Safety (nothing bad will happen):
– Only proposed operations will be learned
– At most one operation will be learned (and subsequently executed before a next operation is learned)
- Liveness (something good will eventually happen):
– If sufficient processes remain nonfaulty, then a proposed operation will eventually be learned
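Sketch of the safety guarantee in action. The cards list roles and guarantees but not the message flow, so this follows the commonly described two-phase (prepare, then accept) exchange between one proposer and a set of acceptors. All class and function names are illustrative; there is no networking, no failure handling, and learners are left out:

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (number, value) last accepted, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Try to get `value` chosen with proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [accepted for ok, accepted in replies if ok]
    if len(granted) < majority:
        return None
    # If some acceptor already accepted a value, adopt the highest-numbered one.
    prior = [accepted for accepted in granted if accepted is not None]
    if prior:
        value = max(prior, key=lambda pair: pair[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "op-A"))  # op-A
print(propose(acceptors, 2, "op-B"))  # op-A: the earlier choice is preserved
print(propose(acceptors, 1, "op-C"))  # None: stale proposal number
```

The second call shows "at most one operation will be learned": a later proposer must adopt the value that a quorum already accepted.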
How can we reliably detect that a process has actually crashed using the general model?
- Each process is equipped with a failure detection module
- A process p probes another process q for a reaction:
– q reacts → q is alive
– q does not react within t time units → q is suspected to have crashed
Note: in a synchronous system:
- a suspected crash is a known crash
- referred to as a perfect failure detector
what do we call a perfect failure detector in practice? and what are its two important properties?
- the eventually perfect failure detector
- Strong completeness : every crashed process is eventually suspected to have crashed by every correct process.
- Eventual strong accuracy : eventually, no correct process is suspected by any other correct process to have crashed.
what is the implementation of the eventually perfect failure detector?
- If p did not receive a heartbeat from q within time t → p suspects q.
- If q later sends a message (received by p):
– p stops suspecting q
– p increases timeout value t
- Note: if q does crash, p will keep suspecting q.
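Sketch of the detector p runs for a single monitored process q (assumptions: timestamps are passed in explicitly instead of read from a clock, and the timeout grows by a fixed `increment`, since the card only says it increases; the class name is made up):

```python
class FailureDetector:
    """Heartbeat-based failure detector that p keeps for one process q."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout       # current timeout value t
        self.last_heartbeat = now    # time of q's most recent heartbeat
        self.suspected = False

    def check(self, now):
        """Suspect q if no heartbeat arrived within the timeout."""
        if now - self.last_heartbeat > self.timeout:
            self.suspected = True
        return self.suspected

    def on_heartbeat(self, now, increment=1.0):
        """A message from q arrived: stop suspecting q. If q was wrongly
        suspected, the timeout was too short, so increase it."""
        if self.suspected:
            self.suspected = False
            self.timeout += increment
        self.last_heartbeat = now


fd = FailureDetector(timeout=2.0)
print(fd.check(now=3.0))         # True: q is late, so p suspects it
fd.on_heartbeat(now=3.5)         # q was only slow
print(fd.suspected, fd.timeout)  # False 3.0
```

If q really has crashed, `on_heartbeat` is never called again and `check` keeps returning True, which gives strong completeness.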
What can go wrong during RPC communication?
1: Client cannot locate server
2: Client request is lost
3: Server crashes
4: Server response is lost
5: Client crashes
RPC communication: what are the solutions to problems 1 and 2?
1: report back to client
2: Just resend message
RPC communication: what is the solution to problem 3?
3: We need to decide on what we expect from the server:
A. At-least-once-semantics: The server guarantees it will carry out an operation at least once, no matter what. [ read ]
B. At-most-once-semantics: The server guarantees it will carry out an operation at most once. [ write, transfer 10k ]
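Sketch of the two choices as seen from the client stub (assumptions: `send`, `ServerCrashed`, and the flaky server are made up for illustration, and retries are capped at `max_tries` so the demo terminates):

```python
class ServerCrashed(Exception):
    """Raised by the hypothetical transport when no reply arrives."""


def call_at_least_once(send, request, max_tries=5):
    """Keep retrying until a reply arrives; the operation may run more than once."""
    for _ in range(max_tries):
        try:
            return send(request)
        except ServerCrashed:
            continue
    raise ServerCrashed("no reply after retries")


def call_at_most_once(send, request):
    """Try exactly once; on failure, report it instead of retrying."""
    return send(request)


executed = []  # log of operations the hypothetical server actually carried out


def flaky_send(request):
    """Hypothetical server: runs the operation, but the first reply never arrives."""
    executed.append(request)
    if len(executed) == 1:
        raise ServerCrashed("crashed after executing, before replying")
    return "ok"


print(call_at_least_once(flaky_send, "transfer 10k"))  # ok
print(len(executed))                                   # 2: the transfer ran twice
```

This is why at-least-once suits reads, while a transfer calls for at-most-once: the retry that rescued the call also repeated the operation.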
RPC communication: what is the solution to problem 4?
4: Detecting lost replies can be hard, because it may also be that the server has crashed. You don’t know whether the server has carried out the operation.
Solution: None, except that you can try to make your operations:
- idempotent: repeatable without any harm done if it happened to be carried out before.
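Example of the difference (a sketch; the account and both operations are made up for illustration):

```python
account = {"balance": 100}


def set_balance(amount):
    """Idempotent: repeating the call leaves the same state."""
    account["balance"] = amount


def deposit(amount):
    """Not idempotent: every repeat changes the state again."""
    account["balance"] += amount


# A resent request is harmless when the operation is idempotent...
set_balance(150)
set_balance(150)
print(account["balance"])  # 150

# ...but a resent deposit is applied twice.
deposit(50)
deposit(50)
print(account["balance"])  # 250
```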
what is an orphan computation?
The client crashes while the server is still doing work and holding resources for nothing
RPC communication: what are the solutions to problem 5?
- Orphan is killed (or rolled back) by client when it reboots
- Broadcast new epoch number when recovering ⇒ servers kill orphans
- Require computations to complete within T time units. Old ones are simply removed.
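Sketch of the epoch-number approach from the server's side (assumptions: the `Server` class, its method names, and the in-memory job list are made up for illustration; "killing" an orphan is modelled as dropping it from the list):

```python
class Server:
    def __init__(self):
        self.epoch = {}     # client -> latest epoch number seen
        self.running = []   # (client, epoch, job) computations in progress

    def start(self, client, epoch, job):
        """Accept a request only if it does not come from a stale epoch."""
        if epoch < self.epoch.get(client, 0):
            return False
        self.epoch[client] = epoch
        self.running.append((client, epoch, job))
        return True

    def on_new_epoch(self, client, epoch):
        """The client rebooted and broadcast a new epoch: kill its older computations."""
        self.epoch[client] = epoch
        self.running = [(c, e, j) for c, e, j in self.running
                        if c != client or e >= epoch]


server = Server()
server.start("client-1", 1, "long report")
server.start("client-2", 1, "backup")
server.on_new_epoch("client-1", 2)   # client-1 crashed and rebooted
print(server.running)                # [('client-2', 1, 'backup')]
```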