Midterm Review Flashcards
What are the defining characteristics of a distributed system?
A collection of computing units that interact by exchanging messages over an interconnection network and that appear to external users as a single coherent computing facility (i.e., there is a common goal the system must accomplish, and all system components contribute to it).
What are some simple models to represent a DS? Why would you choose to use a system model?
A simple model is composed of two nodes that exchange messages.
A system model is composed of:
- two or more nodes
- connected via communication channels
- send and receive messages
- each node contributes to the overall system state
A system model allows us to analyze and reason about system behavior without having to build prototypes and perform experimental validation under every scenario.
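A minimal sketch of such a model in Python (the class and names here are my own illustration, not from the course materials): two nodes connected by channels, sending and receiving messages that update each node's local state.

```python
import queue

class Node:
    # One node in the model: local state plus an inbox (its end of a channel).
    def __init__(self, name):
        self.name = name
        self.state = 0              # each node contributes to the system state
        self.inbox = queue.Queue()  # communication channel into this node

    def send(self, other, msg):
        other.inbox.put((self.name, msg))

    def receive(self):
        sender, msg = self.inbox.get()
        self.state += msg           # receiving a message updates local state
        return sender, msg

a, b = Node("A"), Node("B")
a.send(b, 5)            # A sends a message over the channel
print(b.receive())      # ('A', 5); b.state is now 5
```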
What do we mean by asynchronous? What about asynchrony makes distributed computing hard?
Asynchronous means not coordinated relative to time. In a synchronous system we can assume messages are delivered instantly or within a known, bounded amount of time. In an asynchronous system, message delivery delay is unbounded and unpredictable.
Asynchrony makes distributed computing hard because we have to design to withstand message loss, reordering, and unpredictable delay.
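A toy illustration of one consequence (my own sketch, not from the course): with unbounded delay, a timeout cannot distinguish a crashed sender from a merely slow one.

```python
import queue
import random
import threading
import time

channel = queue.Queue()

def sender():
    # Asynchronous system: delivery delay is unpredictable (here, 0-2 s).
    time.sleep(random.uniform(0, 2))
    channel.put("hello")

threading.Thread(target=sender, daemon=True).start()
try:
    # Wait up to 1 s. A timeout proves nothing: the sender may be slow, not dead.
    print("got:", channel.get(timeout=1.0))
except queue.Empty:
    print("timeout: crashed sender or delayed message? we cannot tell")
```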
What about failures makes distributed computing hard?
There are different types of failures:
- Fail-stop failure - a component halts and does nothing further (and other components can detect that it halted)
- Transient failure - a component fails temporarily and then resumes correct operation
- Byzantine failure - a component misbehaves: it hasn't stopped, but it performs incorrect actions (e.g., sending wrong or conflicting messages)
It's hard to determine whether a failure has occurred at all, what type of failure it is, and which component has failed, because no single component has a global view of the system or of what "correct" behavior looks like.
Why is consistency a hard problem in distributed computing?
Consistency means the system behaves as if there is a single, up-to-date copy of any data or state in the distributed system, and that all nodes agree on what that data/state is.
Factors that influence consistency:
- concurrency, ordering of operations
- data replication, caching
These factors introduce tradeoffs between consistency, system performance, and failure tolerance.
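A tiny illustration (my own) of how replication without coordination yields inconsistent reads:

```python
# Two replicas of the same data item, updated without coordination.
replica1 = {"x": 0}
replica2 = {"x": 0}

replica1["x"] = 42   # a write lands on replica1...
# ...but replication to replica2 is delayed (or the update message was lost).

print(replica1["x"])  # 42
print(replica2["x"])  # 0: a client reading replica2 sees stale state
```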
Pick one of the Fallacies. Can you explain why it’s not true and how does it make distributed computing more challenging?
Example: "The network is reliable." In reality, links and routers fail, packets are dropped, duplicated, or reordered, and partitions happen. This makes distributed computing harder because protocols must assume messages can be lost: clients need timeouts and retries, servers must tolerate duplicate requests, and the system needs a strategy for operating through partitions.
Can you explain “The system gives correct answers always, regardless of failures”? What does it mean with respect to the required properties of the system?
We want the system to have consistency, availability and fault tolerance.
Required properties of the system:
- Fault-tolerant: it can recover from component failures without performing incorrect actions.
- Highly available: it can restore operations, permitting it to resume providing services even when some components have failed.
- Recoverable: failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
- Consistent: the system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
- Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size.
- Predictable performance: the ability to provide desired responsiveness in a timely manner.
- Secure: the system authenticates access to data and services.
How does consideration of Latency relate to the observations made by CAP?
When a Network Partition is present, we have to choose between Availability vs Consistency.
If we choose Availability, we return the local state of a node, which gives a fast response (low latency) but one that may be stale or incorrect.
If we choose Consistency, we wait for the nodes to agree, which gives a slow response (high latency) but one that is correct and up to date.
Even when there is no partition (normal operation), we still face a tradeoff between Latency and Consistency: low latency favors answering from local, possibly stale state, while consistency requires waiting for coordination. (This extension of CAP is known as PACELC.)
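A hedged sketch of that choice (illustrative only; the quorum logic is simplified): return the local replica's value immediately (available, low latency, possibly stale) or wait to hear from a majority (consistent, higher latency).

```python
import time

def local_read(replica):
    return replica["x"]                 # fast, but may be stale

def quorum_read(replicas, delay_per_node=0.1):
    votes = []
    for r in replicas[: len(replicas) // 2 + 1]:   # contact a majority
        time.sleep(delay_per_node)      # coordination costs latency
        votes.append(r["x"])
    return max(votes)                   # crude stand-in for "most recent value"

replicas = [{"x": 42}, {"x": 42}, {"x": 0}]   # the last replica is stale
print(local_read(replicas[2]))   # 0: fast but stale
print(quorum_read(replicas))     # 42: slower but up to date
```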
Can you identify and describe the main elements and steps involved in a distributed RPC system?
API - the programming interface that clients and servers use to interact with the system
Stub - when the client makes a request, it calls something that looks like a local procedure; but instead of the program counter jumping to the procedure's implementation in the local address space, the RPC call jumps into the stub layer. The stub knows about the remote procedure, its arguments, and its results, and performs all the steps required to marshal and unmarshal data (see the sketch after this list).
RPC Runtime - responsible for tasks like connection management, sending and receiving data, dealing with failures, etc.
Interface Definition Language (IDL) - used to create an interface specification
RPC Compiler - takes the IDL and generates code that’s used by the stubs and the runtime.
Service Registry - rules for how servers announce their services and become discoverable
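A minimal, local-only sketch of these roles (my own illustration; real systems use an IDL compiler and a network transport): the client stub marshals the call to JSON, a stand-in runtime carries the bytes, and the server side unmarshals, dispatches, and marshals the reply.

```python
import json

def add(a, b):                  # the remote procedure's actual implementation
    return a + b

PROCEDURES = {"add": add}       # what the server's dispatch table might hold

def rpc_runtime(request_bytes):
    # Server side: unmarshal, dispatch to the procedure, marshal the reply.
    req = json.loads(request_bytes)
    result = PROCEDURES[req["proc"]](*req["args"])
    return json.dumps({"result": result})

def client_stub(proc, *args):
    request = json.dumps({"proc": proc, "args": args})  # marshal
    response = rpc_runtime(request)     # stand-in for sending over the network
    return json.loads(response)["result"]               # unmarshal

print(client_stub("add", 2, 3))  # looks like a local call; prints 5
```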
Contrast exactly once, at most once, at least once, invocation semantics – what would be required from the RPC runtime, under which assumptions would such invocation semantics be feasible (or not)…
At Most Once semantics mean a message is guaranteed to be processed at most one time, but possibly not at all. This is the cheapest to support: the client sends each request once, does not retry, and may never hear back. (If the client does retransmit, the server must filter duplicates to preserve at-most-once.)
At Least Once semantics mean a message is guaranteed to be processed at least one time, but possibly multiple times. To support this, the client must keep retrying a message until it receives an acknowledgment (ack) from the server; this is only reasonable if the server's operations are safe to repeat (idempotent) or duplicates are otherwise acceptable.
Exactly Once semantics mean a message is guaranteed to be processed exactly one time. To support this, the client retries requests that haven't been acked, and the server keeps state recording which requests it has already processed so it can resend the ack (and saved result) instead of re-executing. True exactly-once is infeasible under arbitrary failures (e.g., if the server can lose that state or fail permanently), so in practice the server must persist its duplicate-detection state.
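A sketch of those mechanics (my own illustration; it assumes the server's dedup table survives, which a real system would have to persist): the client retries until acked (at-least-once) and the server remembers request IDs and replays saved replies (dedup), which together approximate exactly-once.

```python
import random

processed = {}   # server state: request_id -> saved reply (dedup table)

def server_handle(request_id, value):
    if request_id in processed:       # duplicate: do NOT re-execute,
        return processed[request_id]  # just resend the saved reply
    reply = value * 2                 # process the request (once)
    processed[request_id] = reply
    return reply

def unreliable_send(request_id, value):
    if random.random() < 0.3:         # request lost on the way in
        return None
    reply = server_handle(request_id, value)
    if random.random() < 0.3:         # reply lost on the way back:
        return None                   # the server DID process it already
    return reply

def client_call(request_id, value):
    while True:                       # at-least-once: retry until acked
        reply = unreliable_send(request_id, value)
        if reply is not None:
            return reply

print(client_call("req-1", 21))       # always 42, and processed exactly once
```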
Why is time important and hard in distributed systems?
Time is needed to order events, which allows us to determine causality.
On a single node, we can rely on timestamps. But across multiple nodes, we can't reliably order events by timestamps:
- Messages can be delayed or lost, and a timestamp doesn't tell a receiving node that it is missing a message.
- Physical clocks on different nodes are not guaranteed to be in sync.
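A toy demonstration (my own) of the clock-sync problem: events stamped by skewed local clocks can appear in the wrong order.

```python
import time

SKEW_A = 0.0    # node A's clock is accurate
SKEW_B = -5.0   # node B's clock runs 5 seconds behind

def local_now(skew):
    return time.time() + skew

t_send = local_now(SKEW_A)   # event 1: A sends a message
t_recv = local_now(SKEW_B)   # event 2: B receives it (causally later)

# Causally, send happens before receive, yet the timestamps say the opposite:
print(t_send > t_recv)       # True: ordering by physical timestamps fails
```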
What is a logical clock? What are logical clocks used for? How is a logical clock related to physical clock?
Similar to a physical clock, a logical clock can:
- Generate timestamps
- Advance in some manner
- Be used to order events
Unlike a physical clock, a logical clock doesn't track wall-clock time; it captures only the ordering of events, which is what makes it usable for ordering events across nodes in a distributed system.
What is required to implement a logical clock? What properties must be followed for this function to behave like a clock? Why/What’s the meaning of the Clock Conditions?
A monotonically increasing sequence generator: every time the clock function is invoked, it must return a higher value than it did before.
Clock Consistency Condition: e_i -> e_j => C(e_i) < C(e_j). Unidirectional: the "happened before" relation must be reflected in the clock values, but you can't go the other way (clock values alone don't let you conclude anything about event order). This condition is satisfied by a monotonically increasing clock that advances on every event and on every message receipt (see the sketch below).
Strong Clock Condition: e_i -> e_j <=> C(e_i) < C(e_j). Bidirectional: the timestamps alone determine which events happened before/after one another.
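A minimal scalar (Lamport) clock sketch, assuming the standard update rules: increment before each local event or send, and on receive set the clock to max(local, received) + 1. This is enough to satisfy the clock consistency condition.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or send: the counter advances monotonically.
        self.time += 1
        return self.time

    def on_receive(self, msg_time):
        # Jump past the sender's timestamp so that send -> receive implies
        # C(send) < C(receive), preserving the clock consistency condition.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.tick()            # A: local event, C = 1; timestamp rides on the message
print(b.on_receive(t))  # B: receive, C = max(0, 1) + 1 = 2
```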
Contrast Scalar, Vector, Matrix clocks? What is the cost or overhead associated with each? What is the benefit from each of those/what is the main challenge addressed by introducing each of these/what are the properties provided by each clock?
Scalar - same as Lamport's clock. Overhead is low: only one value needs to be transmitted with each message. A scalar clock provides a partial order consistent with causality; with a deterministic "tie-breaker" (e.g., node ID), the timestamps can be extended to a total order that all viewers agree on. It cannot tell whether two events are concurrent.
Vector - the vector represents a node's view of the logical clocks of the entire system. Overhead is higher than scalar because each message must carry a vector as large as the number of nodes in the system. A vector clock provides strong consistency (the strong clock condition): if the vector clock of event a is strictly smaller than that of event b, then a happened before b; if neither vector is smaller, the events are concurrent. This lets vector clocks identify concurrent events, which scalar clocks cannot (see the sketch after this list).
Matrix - each node keeps a matrix: its own vector clock plus its view of every other node's vector clock. Overhead is higher still (n x n values per message). This is useful when a node needs to know what every other node has already seen, e.g., to determine when some value is no longer needed anywhere in the system (garbage collection): the minimum over the matrix gives a lower bound on what all nodes know.
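A vector clock sketch (my own illustration; assumes a fixed set of n nodes known up front) showing the comparison scalar clocks cannot make: one vector strictly smaller than another means "happened before"; otherwise the events are concurrent.

```python
class VectorClock:
    def __init__(self, n, node_id):
        self.v = [0] * n
        self.id = node_id

    def tick(self):                     # local event or send
        self.v[self.id] += 1
        return list(self.v)

    def on_receive(self, msg_v):        # element-wise max, then advance
        self.v = [max(x, y) for x, y in zip(self.v, msg_v)]
        return self.tick()

def happened_before(va, vb):
    # va -> vb iff va <= vb element-wise and va != vb
    return all(x <= y for x, y in zip(va, vb)) and va != vb

def concurrent(va, vb):
    return not happened_before(va, vb) and not happened_before(vb, va)

a, b = VectorClock(2, node_id=0), VectorClock(2, node_id=1)
ta = a.tick()                  # A: [1, 0]
tb = b.tick()                  # B: [0, 1], no messages exchanged yet
print(concurrent(ta, tb))      # True: neither event happened before the other
tb2 = b.on_receive(ta)         # B receives A's message: [1, 2]
print(happened_before(ta, tb2))  # True: A's send precedes B's receive
```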
Can you look at the example execution shown in the papers/video and explain how these clocks are getting updated? Can you point out what you are able to tell about some specific events based on the clock values alone?
TODO