CS7210 Flashcards
What is a distributed system?
A collection of computing units that interact by exchanging messages via an interconnection network and appear to external users as a single coherent computing facility
What is the CAP theorem?
A distributed system can be two of:
- Consistent
- Available
- Partitioned
What are some difficulties in client-server systems?
- Discovery and bundling
- Identifying the interface and parameter types
- Agreeing on the data representation
- Explicit data management
- Unpredictable delays
- Unknown cause of failures
- Must be explicitly handled
What are the goals of an RPC system?
Hide complexity of distributed programming, and make it look more similar to local node programming.
When we can look at two timestamps from a logical clock, and the timestamp of one event is greater than another event, if we can assume that the event with a greater timestamp happened later, we say that logical clock has ____ ____ ____
Strong clock consistency
What type of value do Lamport clocks for their timestamp?
A scalar value
Are Lamport clocks strongly consistent?
No, it is only consistent
In the context of logical clocks, what is the difference between consistency and strong consistency?
Consistent: Difference in the order of two events implies difference in timestamps, but not vice versa
Strongly Consistent: Difference in timestamps implies the order of two events
How does a vector clock work?
Each process maintains its own view of time at other nodes.
It keeps a vector (array) of values, which represent Lamport clocks for each node.
When a vector clock receives a message, what does it do with the passed in vector value?
For each value in the vector, it updates its own with the maximum of the value in its own or the message vector. Then, after triggering an event, it increments its own value in its own vector.
Suppose vector clock 0 has clock values [2,1,1] and receives a message with clock values [2,2,0]. What will its vector values be after processing the message?
[3,2,1]
Are vector clocks consistent?
Yes
Are vector clocks strongly consistent?
Yes
Does a Lamport clock require more space as more nodes are added?
No
Does a vector clock require more space as more nodes are added?
Yes, each node has its own element in the vector (array)
If a scalar clock represents a node’s belief about the current time, and a vector clock represents a node’s belief about each node’s belief about the current time, what does a matrix clock represent?
A node’s belief, about each node’s belief, about each node’s belief about the current time.
What benefit does a matrix clock provide over a vector clock?
It allows for garbage collection. If every node knows that every node has passed a time t , then every node knows about everything that has happened prior to t
What is the Global State of a distributed system comprised of?
- Processes and Channels
- Process state defined by most recent event
- Channel state defined by inflight messages
Each event in a distributed system changes the state of at least one of the entities, therefore changing the ____ of the distributed system.
State
In the context of distributed system state, what is a “run”?
A sequence of events. It can be actual, or observed.
In the context of state in a distributed system, what is a cut?
The state of the system at a point in time.
In the context of state in a distributed system, is a cut at the same point in time for all nodes?
No, it might be curved like this:
In the context of state in a distributed system, what is a consistent cut?
A snapshot that corresponds to some possible point in the execution which could have been represented by a consistent ordering of the events.
In the context of state in a distributed system, what is a prerecording event?
Events that occurred before a cut
In the context of state in a distributed system, what is a postrecording event?
Events that happen after a cut
Why is it difficult to capture state in a distributed system?
- Instant recording is impossible
- We can’t assume a global clock
- Networks have delays
- They’re not deterministic
How does the Chandy and Lamport snapshot algorithm to capture distributed system state work?
Does the Chandy and Lamport algorithm promise to give a consistent state?
Yes
In the context of distributed system state, what is a “stable property”?
Iff when it becomes true in a state S , it remains true for all states S’ reachable from S
What are some examples of Stable Properties?
Deadlock
Termination
In the context of distributed system state, what is an “unstable property”?
If when it becomes true in a state S , it is possibly (but not necessarily) true for a later state reachable from S
What is consensus?
Agreement among distributed processes about something, e.g. the outcome of a transaction
What are the key properties of a consensus?
All non-faulty processes eventually decide on a value, all processes decide on a single value, and the value that’s decided on must have been proposed by some process
In the context of consensus in distributed systems, what is an Admissible Run?
Run with 1 faulty processor and all messages eventually delivered (matches system model)
In the context of consensus in distributed systems, what is a Deciding Run?
An admissible run where some non faulty processors reach a decision
In the context of consensus in distributed systems, what is a Totally Correct Consensus Protocol?
When all admissible runs are also deciding runs
In the context of consensus in distributed systems, what is a uni-valent configuration?
When only one decision is possible
In the context of consensus in distributed systems, what is a bi-valent configuration?
When more than one decision is possible
What is the FLP theorem?
The Fischer/Lynch/Patterson theorem says that in a system with one fault, no consensus protocol can be totally correct.
How do existing consensus systems not violate the FLP theorem?
They change some of the assumptions and system properties
In distributed systems, what is the goal of replication?
To make state available at more than one node
What is the difference between active and standby (primary-backup) replication?
In active, both replicas can handle requests. In stand-by, one replica takes requests and others are kept consistent in preparation for a failover.
What are the two main techniques to implement replication?
State replication, and replicated state machine
What is the difference between state replication, and a replicated state machine?
State replication is where one replica processes a request, and then copies the new resulting state to other replicas.
Replicated state machine is where a copy of each operation is sent to and executed at each replica, to produce the same state update.
What are the pros and cons of the state replication approach to replication?
Pro: no need to re-execute multiple times
Con: state may be large or hard to identify where all updates are
What are the pros and cons of the replicated state machine approach to replication?
Pro: no need to send large state, operation logs may be smaller
Con: must re-execute, and it requires that the operations be deterministic
How does chain replication work?
The head node receives a write request, and forwards it down a chain of replicas. This prevents the head replica from having to interact with every other replica.
With chain replication, which replica serves read requests?
The tail replica (last one in the chain), because it ensures that all of the replicas have received the updates.
What are the pros and cons of chain replication?
Pros: the leader write node is scalable, it can handle a lot of writes, and it makes strong consistency possible
Cons: many workloads are read heavy, and inefficient in that middle nodes may be underutilized
How does CRAQ improve on chain replication?
It keeps both the old and new versions of data at each replica while the update propagates through the chain. If both versions are present, it checks with the tail to see if the new version has finished propagating. If it hasn’t, it returns the old version. This makes it possible to send reads to the replicas in the middle of the chain.
When a fault is ____ it leads to an error
activated
When an error ____ it becomes a failure
propogates
What is a transient fault?
A fault that appears and then disappears
What is an intermittent fault?
A fault that appears on and off
What is a permanent fault?
A fault that stays once it occurs
What is a fail-stop failure?
One or more components of the distributed system stop working/responding, like a crash
What is a timing failure?
The system components behave outside of some timing expectations, which can interfere with things like timeouts
What is an omission failure?
When some action is missing
What are some practical ways of managing failures?
Detection, Removal, and Recovery
What are a couple basic techniques for tracking state change so that a node can roll back to a known good state?
Checkpoints and logging
What is the checkpointing technique for tracking state change so that a node can roll back to a known good state?
Where the node periodically saves its state, so that it can reload from there
What is the logging technique for tracking state change so that a node can roll back to a known good state?
Where the node logs information about the operations performed, so that it can undo or redo changes. The log is kept on persistent storage, but the changes are smaller so less I/O time is needed. The downside is that recovery takes longer than it does in checkpointing.
How can we combine the checkpointing and logging techniques for recovering a failed node?
Checkpoint every so often, and keep logs since the last checkpoint
What are the three different approaches to determining when to capture a checkpoint?
Uncoordinated, coordinated, and communication-induced
What is uncoordinated checkpointing?
When processes take checkpoints independently
What are the risks/downsides of uncoordinated checkpointing?
You might have to go very far back before finding a combination of processes’ checkpoints that make up a consistent cut. aka domino effect
Processes may capture checkpoints that can’t be part of a consistent cut
May need to keep more than the most recent checkpoints
Creates the need to identify obsolete checkpoints
What is coordinated checkpointing?
When the processes coordinate their checkpoints so that they get a consistent cut.
What are the two kinds of communicaton-induced checkpoints?
Blocking: where an initiator uses 2-phase-commit or some other consensus algorithm to stop underlying application work and take a snapshot
Non blocking: Using the global snapshot algorithm, piggybacking the info instead of using a marker to eliminate the need for FIFO networking
What is the fundamental tradeoff of logging vs checkpointing?
Spending compute vs I/O
What is pessimistic logging technique for failure recovery?
Logging everything to persistent storage before allowing an event to propagate and commit
What is optimistic logging technique for failure recovery?
Assuming that log will be persisted before a failure occurs, but making it possible to remove effects if abort needed
What is causality-tracking logging technique for failure recovery?
Ensuring causally related events are deterministically recorded