Distributed Systems Flashcards

Question

What are the 2 approaches for providing redundancy?

Answer 1

Architectural approach - implement passive or active replication.
Operational approach - add redundancy to the system to avoid some faults.
Information redundancy - use ECC or checksums.
Time redundancy - perform the same operation several times, then vote on the results. This tolerates minor faults but not large-scale ones, and it carries a large performance cost.
Component redundancy - have different implementations of components that provide the same functionality, run operations on them in parallel and vote on the result. N-version programming can be used here: it achieves design diversity by using multiple versions of the same program. It can tolerate hardware and software faults but not correlated faults.
Communication redundancy - make recipients acknowledge receipt of each message. The sender can then resend a lost message if no ACK is received. Exception handling can be implemented for the case where a server cannot be reached at all.
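The voting step used by time redundancy and component redundancy can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `majority_vote` and `run_with_time_redundancy` are hypothetical, and results are assumed to be hashable values.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of runs, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Accept the value only if more than half of the runs agree on it.
    return value if count > len(results) / 2 else None


def run_with_time_redundancy(operation, attempts=3):
    """Time redundancy: repeat the same operation, then vote on the outcomes."""
    return majority_vote([operation() for _ in range(attempts)])
```

With three runs, one faulty result is outvoted by the other two. If all runs share the same fault, the vote still returns the wrong value, which is why correlated faults are not tolerated.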

Answer 2

Error detection - detecting whether an error has occurred in the system.
Failure isolation - detecting where in the system the error arose and which components are affected.
Error containment - containing the faulty components to prevent further damage and error propagation.
Recovery - applying an error recovery process to the components to return them to a fit state (a state in which they should not produce any errors).

Answer 3

In backward recovery the goal is to revert the system to a state with no errors. To implement this, each component must create local checkpoints: stored copies of the component's state, so that the component can restore that state and resume execution from there. The system also needs a global checkpoint, which consists of local checkpoints from each component of the distributed system that together form a consistent state. The most recent consistent global checkpoint is called the recovery line.
A global state is consistent when every message that one checkpoint records as received is also recorded as sent in the sender's checkpoint. Because m1 and m2 were sent after P0 and P1 took their checkpoints, those checkpoints have no record of sending them. Hence the three checkpoints are not consistent with each other.
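The consistency condition can be checked mechanically. The sketch below assumes a hypothetical representation in which each local checkpoint records the identifiers of the messages it has sent and received. The function name `is_consistent` and the dictionary layout are illustrative only.

```python
def is_consistent(checkpoints):
    """Check a candidate global checkpoint for orphan messages.

    checkpoints maps a process name to a dict with two sets of message
    ids: "sent" and "received" (a hypothetical layout for illustration).
    """
    all_sent = set()
    for state in checkpoints.values():
        all_sent |= state["sent"]
    # An orphan message is recorded as received but never recorded as sent.
    return all(state["received"] <= all_sent for state in checkpoints.values())


# Assumed scenario: P2 recorded receiving m1 and m2, but the checkpoints
# of P0 and P1 were taken before those messages were sent.
candidate = {
    "P0": {"sent": set(), "received": set()},
    "P1": {"sent": set(), "received": set()},
    "P2": {"sent": set(), "received": {"m1", "m2"}},
}
```

Here `is_consistent(candidate)` is false, so recovery would have to roll back to earlier checkpoints until the check passes.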

Answer 4

Uncoordinated checkpointing - each process takes its checkpoints independently. This is automatic and convenient, but it can produce useless checkpoints or lead to a domino effect.
Coordinated checkpointing - processes coordinate their checkpointing to save a consistent global state at certain intervals. This saves space and prevents the domino effect, but it is bad for performance.
Communication-induced checkpointing - information is added to application messages that forces a component to create a local checkpoint. This avoids the domino effect while still allowing local checkpointing.

Answer 5

The domino effect is a major problem with backward recovery. It occurs when many of the stored checkpoints are inconsistent with each other, so many of them have to be discarded and the recovery line lies further back in the past. When a fault occurs, the rollbacks cascade and revert the system much too far back in the computation.

Answer 6

Fault masking - using redundancy or replication with a voting system to mask faults such as Byzantine faults.
Self-checking components - components check themselves for errors. If an error is detected, the component swaps itself out for one that is working.
Data prediction - predict data by interpolating received data, as in lag compensation.
Error compensation - use an algorithm based on redundancy to compensate for small, uncoordinated errors.

Answer 7

Backward recovery reverts the system to a state previously stored in a checkpoint, whereas forward recovery tries to move the system into a new state from which it can continue. Backward recovery is more expensive in storage, whereas forward recovery requires less time and memory. Backward recovery needs no knowledge of the error, whereas forward recovery must be able to inspect the error. Backward recovery is application independent, whereas forward recovery depends on the application (and its functionality may change from program to program). Forward recovery is therefore the more appropriate choice for a system such as the stock market, where significant delay is not acceptable.

Answer 8

The mean time between failures (MTBF) is the average time for which the system keeps returning expected results before it fails. The mean time to repair (MTTR) is the average time taken for the system to become available again after a failure (e.g. by performing backward or forward recovery).

Answer 9

Availability - describes the fraction of time the system yields expected results. A = (MTBF)/(MTBF + MTTR)
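As a worked example of the formula, the sketch below computes availability from MTBF and MTTR. The function name and the choice of hours as the unit are assumptions. Any unit works as long as both values use the same one.

```python
def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR), with both times in the same unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# A system that runs 99 hours between failures and takes 1 hour to repair
# is available 99% of the time.
print(availability(99.0, 1.0))  # 0.99
```

Halving the repair time raises availability without changing how often the system fails, which is why faster recovery matters as much as fewer failures.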

Answer 10

Load sharing gives requests to idle servers so that no capacity is wasted, whereas load balancing tries to equalize a chosen load measure across all servers.

Answer 11

Servers can use the number of tasks waiting in their queue, their CPU utilization or their bandwidth utilization to decide when to transfer requests to other servers.

Answer 12

Non-preemptive refers to the transfer of tasks before execution. Hence only the request/task is transferred to another machine (good for load sharing but difficult for load balancing). Preemptive refers to the transfer of tasks that are partially executed. This is expensive and involves collection and transmission of task states (IO buffers, file pointers, timers, virtual memory image e.g. data already computed for that task).

Answer 13

A static approach is one in which decisions are hard-coded into the algorithm. A dynamic approach is one in which decisions are made at runtime based on system state. The correctness of load distribution depends on how up to date the collected parameters are (a local coordinator has lower latency than a global coordinator). An adaptive approach changes the frequency of collection based on system state.

Answer 14

Information policy - determines when, where and what information to collect. It can be:
○ Demand driven - a server only collects information when it becomes a sender or receiver.
○ Periodic - servers exchange load information periodically.
○ State change driven - servers send information when their state changes by a certain amount.
Transfer policy - decides whether a server needs to transfer tasks. It sets thresholds for how busy a server needs to be, such as queue length, and defines roles: servers are senders when they are overloaded and receivers when they are underloaded.
Selection policy - decides which tasks should be sent from an overloaded server to another server. It can use estimates of task execution time or of the improvement in server response time.
Location policy - decides which server to send the selected tasks to. It can use polling, which can be done in parallel by multicasting.

Answer 15

It uses thresholds as a transfer policy. If a server's utilization is above the threshold it is overloaded and becomes a sender. If it is below the threshold it becomes a receiver. A server only accepts a task if doing so does not push it over its own threshold.
Senders select the newest tasks sent to the server (the ones at the back of the queue) to send to receivers.
A sender can simply send tasks to random servers. This is quick, as there is no need to collect the state of other servers, but the task may land on an already overloaded server, which then has to send it on to another server.
Another approach is to poll a server to see whether it is a receiver (whether its queue length is below the threshold). Once it receives the task, it executes it regardless of how many tasks it has in its queue. In practice there is a limit on how many servers the sender can poll.
Another way is to poll servers for their queue length and send the task to the server with the shortest queue.
The information policy for this approach is demand driven.
This approach becomes unstable at high loads. It becomes difficult for senders to find receivers: as more servers are overloaded, polling goes on for longer, adding activity to an already busy network. This can make the system unusable at very high global loads.
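The two polling-based location policies can be sketched as follows. This is a simplified, single-process illustration: `queue_lengths` stands in for the answers servers would give when polled, and all function and parameter names are hypothetical.

```python
import random


def find_receiver(queue_lengths, sender, threshold, poll_limit, rng=random):
    """Threshold policy: poll random servers, stop at the first receiver."""
    candidates = [s for s in queue_lengths if s != sender]
    for server in rng.sample(candidates, min(poll_limit, len(candidates))):
        # A polled server is a receiver if its queue is below the threshold.
        if queue_lengths[server] < threshold:
            return server
    # No receiver found within the poll limit. What happens next (for
    # example, keeping the task locally) is left to the caller.
    return None


def find_shortest_queue(queue_lengths, sender, poll_limit, rng=random):
    """Shortest policy: poll random servers, pick the smallest queue."""
    candidates = [s for s in queue_lengths if s != sender]
    polled = rng.sample(candidates, min(poll_limit, len(candidates)))
    return min(polled, key=queue_lengths.get) if polled else None
```

When every other server is above the threshold, `find_receiver` uses up its whole poll limit and still returns nothing. That wasted polling is the source of the instability at high loads.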

Answer 16

Uses thresholds as its transfer policy, where the length of a server's task queue determines whether it is classified as a receiver or a sender. E.g. if it holds fewer than T tasks it is classified as a receiver, otherwise as a sender.
Senders select the newest tasks sent to the server (the ones at the back of the queue) to send to receivers.
A receiver polls a random server and asks whether it has any tasks to share (which depends on whether that server is a sender or a receiver).
The information policy is demand driven.
The drawback is that in most cases this results in preemptive transfers (transferring partially completed tasks) and hence costs more. This is because systems schedule tasks as and when they arrive.
This approach does not suffer from instability at high loads, as it only becomes harder for receivers to find senders when the global load is low, because polling goes on for longer when most servers have queues below the threshold. The system can handle this increased network activity at low levels of load.

Answer 17

Senders search for receivers and receivers search for senders. At low loads, senders can find receivers easily, and at high loads receivers can find senders easily. It can have the disadvantages of both approaches: polling at high loads can make the system unstable, and receiver-initiated task transfers can be preemptive and hence more expensive. It can be implemented with a simple algorithm that combines the previous two approaches and alternates between them via a threshold. The solution can be optimized by altering the scope of the server search.

Answer 18

Each server maintains three lists: “Receiver list”, “Sender list” and “Ok list”. Each server classifies the other servers based on collected information and polls adaptively. The location policy at a sender is as follows. The sender polls the head of the receiver list. The polled server puts the sender at the head of its sender list and tells the sender its own current classification. If the polled server is still a receiver, the new task is transferred. Otherwise the sender updates its lists and polls the next potential receiver. If this polling process fails to identify a receiver, the task can be transferred using a receiver-initiated method instead.

Answer 19

- When a client invokes a method that accepts parameters on a remote object, the parameters are bundled into a message before being sent over the network.
- These parameters may be of primitive types or objects.
○ In case of primitive types, the parameters are put together and a header is attached.
○ In case of objects, they are serialized.
○ This process is known as marshalling.
- At the server side, the packed parameters are unbundled and then the required method is invoked. This process is known as unmarshalling.
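The bundling and unbundling steps can be illustrated with a toy wire format. This is a sketch under stated assumptions, not the format of any real RMI or RPC system: the JSON body, the 4-byte length header and the names `marshal` and `unmarshal` are all hypothetical choices.

```python
import json
import struct


def marshal(method, params):
    """Client side: bundle the method name and parameters into one message."""
    body = json.dumps({"method": method, "params": params}).encode("utf-8")
    header = struct.pack("!I", len(body))  # 4-byte big-endian body length
    return header + body


def unmarshal(message):
    """Server side: unbundle the message before invoking the method."""
    (length,) = struct.unpack("!I", message[:4])
    payload = json.loads(message[4:4 + length].decode("utf-8"))
    return payload["method"], payload["params"]


# Round trip: the server recovers what the client packed.
method, params = unmarshal(marshal("add", [2, 3]))
```

The header lets the receiver know how many bytes belong to this call, which matters when several messages share one network stream.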

Answer 20

A stub acts as a link between the client and remote methods. It marshals arguments from the client and transfers them over to the server in a format that can be understood. A stub can either be created at a remote server or on demand. When a client invokes a remote function, the corresponding stub is made available to the client from the server.

Answer 21

A stub isn’t essential in message-oriented middleware (MOM) as it decouples the client and server by making them communicate via message servers. Hence remote functions don’t need to be known by the client.

Answer 22

An active replication system can support up to f failures when there are 2f+1 machines. To tolerate two failed servers, there need to be 5 machines in total. As there are currently four machines, the system cannot support this many failures. To resolve this, simply add another machine.
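The arithmetic can be written out directly, taking the 2f+1 rule from this card as given. The function names are hypothetical.

```python
def machines_needed(failures):
    """Machines required to tolerate the given number of failures (2f+1)."""
    return 2 * failures + 1


def max_failures(machines):
    """Largest f such that 2f+1 does not exceed the number of machines."""
    return (machines - 1) // 2


print(machines_needed(2))  # 5
print(max_failures(4))     # 1
```

Four machines tolerate only one failure, so a fifth machine is needed to tolerate two.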

Answer 23

Linearizability holds when a set of operations, including overlapping ones, can be arranged in a single sequential order that respects the order in which they actually occurred. Active replication requires globally synchronized servers that process requests at the same time in order to get correct results. It can be implemented by sequential consistency control.

Answer 24

Accurate deadlock detection cannot always be guaranteed in a distributed system because of network delays. By the time a wait-for graph has been constructed, the deadlock may already have been resolved, or it may be a phantom deadlock that never existed. A sender-initiated load-distributing algorithm would not help a network that is under heavy load, as senders have to actively search for receivers, which can make the system even more unstable. A receiver-initiated approach would be better in this case.

(48 cards)