Midterm Review Flashcards
What are the defining characteristics of a distributed system?
A collection of computing units that interact by exchanging messages over an interconnection network and that appear to external users as a single coherent computing facility (i.e., there is a common goal the system must accomplish, and all system components contribute to it).
What are some simple models to represent a DS? Why would you choose to use a system model?
A simple model is composed of two nodes that exchange messages.
A system model is composed of:
- two or more nodes
- connected via communication channels
- send and receive messages
- each node contributes to the overall system state
A system model allows us to analyze and reason about system behavior without having to build prototypes and perform experimental validation under every scenario.
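A minimal sketch of such a model in Python (the class and names here are my own illustration, not from the course materials): two nodes connected by channels, sending and receiving messages that update each node's local state.

```python
import queue

class Node:
    # One node in the model: local state plus an inbox (its end of a channel).
    def __init__(self, name):
        self.name = name
        self.state = 0              # each node contributes to the system state
        self.inbox = queue.Queue()  # communication channel into this node

    def send(self, other, msg):
        other.inbox.put((self.name, msg))

    def receive(self):
        sender, msg = self.inbox.get()
        self.state += msg           # receiving a message updates local state
        return sender, msg

a, b = Node("A"), Node("B")
a.send(b, 5)            # A sends a message over the channel
print(b.receive())      # ('A', 5); b.state is now 5
```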
What do we mean by asynchronous? What about asynchrony makes distributed computing hard?
Asynchronous means not coordinated relative to time. In a synchronous system we can assume messages are delivered instantly or within a known, bounded amount of time. In an asynchronous system, message delivery delay is unbounded and unpredictable.
Asynchrony makes distributed computing hard because we have to design to withstand message loss, reordering, and unpredictable delay.
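A toy illustration of one consequence (my own sketch, not from the course): with unbounded delay, a timeout cannot distinguish a crashed sender from a merely slow one.

```python
import queue
import random
import threading
import time

channel = queue.Queue()

def sender():
    # Asynchronous system: delivery delay is unpredictable (here, 0-2 s).
    time.sleep(random.uniform(0, 2))
    channel.put("hello")

threading.Thread(target=sender, daemon=True).start()
try:
    # Wait up to 1 s. A timeout proves nothing: the sender may be slow, not dead.
    print("got:", channel.get(timeout=1.0))
except queue.Empty:
    print("timeout: crashed sender or delayed message? we cannot tell")
```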
What about failures makes distributed computing hard?
There are different types of failures:
- Fail-stop failure - a component halts and does nothing further (and other components can detect that it halted)
- Transient failure - a component fails temporarily and then resumes correct operation
- Byzantine failure - a component misbehaves: it hasn't stopped, but it performs incorrect actions (e.g., sending wrong or conflicting messages)
It's hard to determine whether a failure has occurred at all, what type of failure it is, and which component has failed, because no single component has a global view of the system or of what "correct" behavior looks like.
Why is consistency a hard problem in distributed computing?
Consistency means the system behaves as if there is a single, up-to-date copy of any data or state in the distributed system, and that all nodes agree on what that data/state is.
Factors that influence consistency:
- concurrency, ordering of operations
- data replication, caching
These factors introduce tradeoffs between consistency, system performance, and failure tolerance.
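A tiny illustration (my own) of how replication without coordination yields inconsistent reads:

```python
# Two replicas of the same data item, updated without coordination.
replica1 = {"x": 0}
replica2 = {"x": 0}

replica1["x"] = 42   # a write lands on replica1...
# ...but replication to replica2 is delayed (or the update message was lost).

print(replica1["x"])  # 42
print(replica2["x"])  # 0: a client reading replica2 sees stale state
```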
Pick one of the Fallacies. Can you explain why it’s not true and how does it make distributed computing more challenging?
Example: "The network is reliable." In reality, links and routers fail, packets are dropped, duplicated, or reordered, and partitions happen. This makes distributed computing harder because protocols must assume messages can be lost: clients need timeouts and retries, servers must tolerate duplicate requests, and the system needs a strategy for operating through partitions.
Can you explain “The system gives correct answers always, regardless of failures”? What does it mean with respect to the required properties of the system?
We want the system to have consistency, availability and fault tolerance.
Required properties of the system:
- Fault-tolerant: it can recover from component failures without performing incorrect actions.
- Highly available: it can restore operations, permitting it to resume providing services even when some components have failed.
- Recoverable: failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
- Consistent: the system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
- Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size.
- Predictable performance: the ability to provide desired responsiveness in a timely manner.
- Secure: the system authenticates access to data and services.
How does consideration of Latency relate to the observations made by CAP?
When a Network Partition is present, we have to choose between Availability vs Consistency.
If we choose Availability, we return the local state of a node, which gives a fast response (low latency) but one that may be stale or incorrect.
If we choose Consistency, we wait for the nodes to agree, which gives a slow response (high latency) but one that is correct and up to date.
Even when there is no partition (normal operation), we still face a tradeoff between Latency and Consistency: low latency favors answering from local, possibly stale state, while consistency requires waiting for coordination. (This extension of CAP is known as PACELC.)
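A hedged sketch of that choice (illustrative only; the quorum logic is simplified): return the local replica's value immediately (available, low latency, possibly stale) or wait to hear from a majority (consistent, higher latency).

```python
import time

def local_read(replica):
    return replica["x"]                 # fast, but may be stale

def quorum_read(replicas, delay_per_node=0.1):
    votes = []
    for r in replicas[: len(replicas) // 2 + 1]:   # contact a majority
        time.sleep(delay_per_node)      # coordination costs latency
        votes.append(r["x"])
    return max(votes)                   # crude stand-in for "most recent value"

replicas = [{"x": 42}, {"x": 42}, {"x": 0}]   # the last replica is stale
print(local_read(replicas[2]))   # 0: fast but stale
print(quorum_read(replicas))     # 42: slower but up to date
```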
Can you identify and describe the main elements and steps involved in a distributed RPC system?
API - the programming interface that clients and servers use to interact with the system
Stub - when the client makes a request, it calls something that looks like a local procedure; but instead of the program counter jumping to the procedure's implementation in the local address space, the RPC call jumps into the stub layer. The stub knows about the remote procedure, its arguments, and its results, and performs all the steps required to marshal and unmarshal data (see the sketch after this list).
RPC Runtime - responsible for tasks like connection management, sending and receiving data, dealing with failures, etc.
Interface Definition Language (IDL) - used to create an interface specification
RPC Compiler - takes the IDL and generates code that’s used by the stubs and the runtime.
Service Registry - rules for how servers announce their services and become discoverable
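A minimal, local-only sketch of these roles (my own illustration; real systems use an IDL compiler and a network transport): the client stub marshals the call to JSON, a stand-in runtime carries the bytes, and the server side unmarshals, dispatches, and marshals the reply.

```python
import json

def add(a, b):                  # the remote procedure's actual implementation
    return a + b

PROCEDURES = {"add": add}       # what the server's dispatch table might hold

def rpc_runtime(request_bytes):
    # Server side: unmarshal, dispatch to the procedure, marshal the reply.
    req = json.loads(request_bytes)
    result = PROCEDURES[req["proc"]](*req["args"])
    return json.dumps({"result": result})

def client_stub(proc, *args):
    request = json.dumps({"proc": proc, "args": args})  # marshal
    response = rpc_runtime(request)     # stand-in for sending over the network
    return json.loads(response)["result"]               # unmarshal

print(client_stub("add", 2, 3))  # looks like a local call; prints 5
```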
Contrast exactly once, at most once, at least once, invocation semantics – what would be required from the RPC runtime, under which assumptions would such invocation semantics be feasible (or not)…
At Most Once semantics mean a message is guaranteed to be processed at most one time, but possibly not at all. This is the cheapest to support: the client sends each request once, does not retry, and may never hear back. (If the client does retransmit, the server must filter duplicates to preserve at-most-once.)
At Least Once semantics mean a message is guaranteed to be processed at least one time, but possibly multiple times. To support this, the client must keep retrying a message until it receives an acknowledgment (ack) from the server; this is only reasonable if the server's operations are safe to repeat (idempotent) or duplicates are otherwise acceptable.
Exactly Once semantics mean a message is guaranteed to be processed exactly one time. To support this, the client retries requests that haven't been acked, and the server keeps state recording which requests it has already processed so it can resend the ack (and saved result) instead of re-executing. True exactly-once is infeasible under arbitrary failures (e.g., if the server can lose that state or fail permanently), so in practice the server must persist its duplicate-detection state.
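A sketch of those mechanics (my own illustration; it assumes the server's dedup table survives, which a real system would have to persist): the client retries until acked (at-least-once) and the server remembers request IDs and replays saved replies (dedup), which together approximate exactly-once.

```python
import random

processed = {}   # server state: request_id -> saved reply (dedup table)

def server_handle(request_id, value):
    if request_id in processed:       # duplicate: do NOT re-execute,
        return processed[request_id]  # just resend the saved reply
    reply = value * 2                 # process the request (once)
    processed[request_id] = reply
    return reply

def unreliable_send(request_id, value):
    if random.random() < 0.3:         # request lost on the way in
        return None
    reply = server_handle(request_id, value)
    if random.random() < 0.3:         # reply lost on the way back:
        return None                   # the server DID process it already
    return reply

def client_call(request_id, value):
    while True:                       # at-least-once: retry until acked
        reply = unreliable_send(request_id, value)
        if reply is not None:
            return reply

print(client_call("req-1", 21))       # always 42, and processed exactly once
```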
Why is time important and hard in distributed systems?
Time is needed to order events, which allows us to determine causality.
On a single node, we can rely on timestamps. But across multiple nodes, we can't reliably order events by timestamps:
- Messages can be delayed or lost, and a timestamp doesn't tell a receiving node that it is missing a message.
- Physical clocks on different nodes are not guaranteed to be in sync.
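A toy demonstration (my own) of the clock-sync problem: events stamped by skewed local clocks can appear in the wrong order.

```python
import time

SKEW_A = 0.0    # node A's clock is accurate
SKEW_B = -5.0   # node B's clock runs 5 seconds behind

def local_now(skew):
    return time.time() + skew

t_send = local_now(SKEW_A)   # event 1: A sends a message
t_recv = local_now(SKEW_B)   # event 2: B receives it (causally later)

# Causally, send happens before receive, yet the timestamps say the opposite:
print(t_send > t_recv)       # True: ordering by physical timestamps fails
```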
What is a logical clock? What are logical clocks used for? How is a logical clock related to physical clock?
Similar to a physical clock, a logical clock can:
- Generate timestamps
- Advance in some manner
- Be used to order events
Unlike a physical clock, a logical clock doesn't track wall-clock time; it captures only the ordering of events, which is what makes it usable for ordering events across nodes in a distributed system.
What is required to implement a logical clock? What properties must be followed for this function to behave like a clock? Why/What’s the meaning of the Clock Conditions?
A monotonically increasing sequence generator: every time the clock function is invoked, it must return a higher value than it did before.
Clock Consistency Condition: e_i -> e_j => C(e_i) < C(e_j). Unidirectional: the "happened before" relation must be reflected in the clock values, but you can't go the other way (clock values alone don't let you conclude anything about event order). This condition is satisfied by a monotonically increasing clock that advances on every event and on every message receipt (see the sketch below).
Strong Clock Condition: e_i -> e_j <=> C(e_i) < C(e_j). Bidirectional: the timestamps alone determine which events happened before/after one another.
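A minimal scalar (Lamport) clock sketch, assuming the standard update rules: increment before each local event or send, and on receive set the clock to max(local, received) + 1. This is enough to satisfy the clock consistency condition.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or send: the counter advances monotonically.
        self.time += 1
        return self.time

    def on_receive(self, msg_time):
        # Jump past the sender's timestamp so that send -> receive implies
        # C(send) < C(receive), preserving the clock consistency condition.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.tick()            # A: local event, C = 1; timestamp rides on the message
print(b.on_receive(t))  # B: receive, C = max(0, 1) + 1 = 2
```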
Contrast Scalar, Vector, Matrix clocks? What is the cost or overhead associated with each? What is the benefit from each of those/what is the main challenge addressed by introducing each of these/what are the properties provided by each clock?
Scalar - same as Lamport's clock. Overhead is low: only one value needs to be transmitted with each message. A scalar clock provides a partial order consistent with causality; with a deterministic "tie-breaker" (e.g., node ID), the timestamps can be extended to a total order that all viewers agree on. It cannot tell whether two events are concurrent.
Vector - the vector represents a node's view of the logical clocks of the entire system. Overhead is higher than scalar because each message must carry a vector as large as the number of nodes in the system. A vector clock provides strong consistency (the strong clock condition): if the vector clock of event a is strictly smaller than that of event b, then a happened before b; if neither vector is smaller, the events are concurrent. This lets vector clocks identify concurrent events, which scalar clocks cannot (see the sketch after this list).
Matrix - each node keeps a matrix: its own vector clock plus its view of every other node's vector clock. Overhead is higher still (n x n values per message). This is useful when a node needs to know what every other node has already seen, e.g., to determine when some value is no longer needed anywhere in the system (garbage collection): the minimum over the matrix gives a lower bound on what all nodes know.
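A vector clock sketch (my own illustration; assumes a fixed set of n nodes known up front) showing the comparison scalar clocks cannot make: one vector strictly smaller than another means "happened before"; otherwise the events are concurrent.

```python
class VectorClock:
    def __init__(self, n, node_id):
        self.v = [0] * n
        self.id = node_id

    def tick(self):                     # local event or send
        self.v[self.id] += 1
        return list(self.v)

    def on_receive(self, msg_v):        # element-wise max, then advance
        self.v = [max(x, y) for x, y in zip(self.v, msg_v)]
        return self.tick()

def happened_before(va, vb):
    # va -> vb iff va <= vb element-wise and va != vb
    return all(x <= y for x, y in zip(va, vb)) and va != vb

def concurrent(va, vb):
    return not happened_before(va, vb) and not happened_before(vb, va)

a, b = VectorClock(2, node_id=0), VectorClock(2, node_id=1)
ta = a.tick()                  # A: [1, 0]
tb = b.tick()                  # B: [0, 1], no messages exchanged yet
print(concurrent(ta, tb))      # True: neither event happened before the other
tb2 = b.on_receive(ta)         # B receives A's message: [1, 2]
print(happened_before(ta, tb2))  # True: A's send precedes B's receive
```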
Can you look at the example execution shown in the papers/video and explain how these clocks are getting updated? Can you point out what you are able to tell about some specific events based on the clock values alone?
TODO