Distributed Systems Flashcards

1
Q

What is a stub? And how do RMI and RPC differ?

A

A stub supports communication between two distributed components. It acts as a local proxy: the developer calls procedures/methods on the stub just like any other local procedure, and the stub marshals the arguments (and return data) into a machine-independent form, sends them over the network, and lets the stub on the other end invoke the operation on the server.

The stub also starts a timer for error-handling purposes (e.g. to detect lost requests).

RPC is procedural: clients call remote procedures. RMI is the object-oriented equivalent in Java: clients invoke methods on remote objects, and objects can be passed (serialized) as arguments.
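A minimal sketch of what marshalling in a stub looks like, using JSON as the machine-independent form. The function names and wire format are illustrative assumptions, not a real RPC/RMI API:

```python
import json

# Toy stub marshalling: bundle the method name and arguments into a
# machine-independent message, then unpack them on the other side.
def marshal_call(method, args):
    return json.dumps({"method": method, "args": args})

def unmarshal_call(message):
    call = json.loads(message)
    return call["method"], call["args"]

wire_message = marshal_call("add", [2, 3])
method, args = unmarshal_call(wire_message)  # ("add", [2, 3])
```

A real stub would additionally send the message over a socket and start its error-handling timer; this only shows the marshal/unmarshal round trip.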

2
Q

What are the components of RPC?

A

• Server: a remote machine that provides the client with an interface allowing methods to be called remotely.
• Stub compiler: uses an IDL (interface definition language) to compile the stubs for the client and server.
• Linking: the client developer implements client functions that call methods on the stub, and the server developer implements the server functions specified by the stub.
• Operation: stubs manage all remote communication between the client and server.

3
Q

What is an object broker in the context of Java RMI?

A

In Java RMI, the object broker is called the RMI registry.

The registry is a process that runs on a host machine. The object broker allows for the retrieval and distribution of objects: servers register their objects with the registry by name, and to access an object a client looks it up by name and receives a stub for that remote object. Clients can also request a list of all registered remote objects.

4
Q

Despite requiring a location to a physical server, why does the remote object broker in Java’s RMI still respect location transparency?

A

The location of the physical server is provided as a string representation, and the protocol handles the actual mapping to addresses, so location transparency is respected.

5
Q

What is the difference(s) between RMI and MOM

A

RMI is synchronous and tightly couples client and server: the client invokes a method on a remote object and blocks for the result, so both parties must be available at the same time. MOM is asynchronous: components communicate by sending messages via intermediary message servers (queues/topics), which decouples them in time, so sender and receiver need not be online simultaneously.

6
Q

What are the modes of operation for JMS (Java message service)?

A

There are two modes of operation: point-to-point and publish/subscribe.

7
Q

What are the attributes of web services?

A

• Communication can pass through firewalls by default, as the protocols use HTTP.
• Because communication uses web-based protocols, components work in a heterogeneous environment: web communication is machine independent, so it supports interoperability.
• XML-based: machine-readable documents are used for messages.

8
Q

What is REST?

A

REST (representational state transfer) is an architectural style.

It treats the web as resource-centric: in a REST application, URLs correspond to resources.

REST APIs commonly use JSON rather than XML, where objects are sets of key/value pairs.
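A tiny sketch of the JSON key/value idea. The resource URL and fields are made up for illustration:

```python
import json

# A JSON body representing a resource, e.g. what GET /users/7 might return.
# The object is just a set of key/value pairs.
body = json.dumps({"id": 7, "name": "Ada"})
json.loads(body)["name"]  # "Ada"
```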

9
Q

Give 4 examples for why replication is used in a distributed system.

A

• Increased performance: users can access data in a region closer to them, reducing network communication overhead.
• Increased availability: data or service availability is maintained automatically despite server failures.
• Fault tolerance: guarantees strictly correct behaviour despite a certain number of faults.
• Online maintenance: allows systems to be updated whilst remaining online.
• Load distribution: at busy periods, load can be spread across many servers so no single server is overloaded.

10
Q

What are the types of replication?

A

• Computation replication: multiple instances of the same computing task are executed.
• Data replication: the same data is stored on multiple devices.

11
Q

What is needed for replication to be good?

A

• Replication transparency: hide from the client that there are physical replicas, so the system appears to be a single logical service.
• Data consistency: the same request receives the same result regardless of which replica processes it.

12
Q

Describe the workflow of the system model.

A

A client accesses a front-end server, which coordinates a set of replica servers and acts as a proxy between the client and an available replica. The client invokes a method on the stub of the front end, which in turn invokes a remote method on an available replica server. Once the replica has processed the request, it may propagate any updates it applied to the other replicas in order to maintain consistency. The result of the request is sent back to the front end, which relays the response to the client.

13
Q

Define a fault-tolerant service.

A

A service is fault tolerant if it can provide correct service according to a specification despite up to f process failures.

(A failure in a distributed system refers not only to a server going down, but also to it returning results inconsistent with those expected.)

14
Q

How many faults can a passive replication system tolerate vs an active replication system?

A

Passive replication can tolerate f faults if there are f+1 servers.

Active replication can tolerate f faults if there are 2f+1 servers: the front end waits until it has received f+1 matching responses before sending a response to the client.
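The two formulas can be captured in a small helper (an illustrative function, not from any library):

```python
# Minimum number of servers needed to tolerate f faults under each scheme.
def min_servers(f, scheme):
    if scheme == "passive":
        return f + 1      # one surviving backup can take over as primary
    if scheme == "active":
        return 2 * f + 1  # front end needs f+1 matching responses
    raise ValueError(scheme)

min_servers(2, "passive")  # 3
min_servers(2, "active")   # 5
```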

15
Q

Describe how passive replication works.

A

There is a primary replication manager (RM), which handles requests from the FE/client. Messages from the client carry a unique id, so an RM can check whether it has already performed an operation.

The primary RM routinely sends updates to the backup RMs; when a backup receives the data, it sends an acknowledgement to the FE. If the primary RM fails, the front end selects one of the backup RMs as the new primary, and this is relayed to all the other RMs.

16
Q

Describe how active replication works.

A

The FE broadcasts the request to every RM, and each carries out the operation. Once complete, each RM returns its result to the FE, where the most popular response is used.
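The front end's voting step can be sketched in a few lines (a toy illustration; real systems compare f+1 responses):

```python
from collections import Counter

# Take the most common response among the replicas, so a minority of
# faulty replies is masked by the majority.
def vote(responses):
    return Counter(responses).most_common(1)[0][0]

vote([42, 42, 41])  # 42: one faulty replica is outvoted
```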

17
Q

What are the benefits of each type of replication?

A

Passive replication uses fewer resources (only the primary executes requests) and does not require replica operations to be deterministic, but failover is slower because a backup must be promoted to primary. Active replication gives fast, transparent failover and, by voting on responses, can mask Byzantine faults, but every replica executes every request and operations must be deterministic.

18
Q

What is a Byzantine fault in the context of a distributed system and give two types of faults.

A

A Byzantine fault is one where the object in question appears different to two observers; in this context that can be the data on the replica versus the data the client expects, or the replica's availability. An omission failure is one where a request fails to be sent. A commission failure is processing a request incorrectly, which can corrupt internal state.

19
Q

Explain why a distributed system cannot easily maintain globally unique timestamps. How to overcome this?

A

Maintaining globally unique timestamps requires a centralised measurement of time within the system. Due to propagation delay, packet loss and latency, transferring the value of this global time to other servers can leave the system unsynchronised, meaning duplicate timestamps could easily be generated. NTP (Network Time Protocol) is used to overcome this.

20
Q

Explain the difference between a deadlock and a phantom deadlock.

A

A deadlock occurs when transaction A locks object X and transaction B locks object Y. If A then tries to access Y, A waits for Y to be unlocked by B; if B tries to access X, B waits for X to be unlocked by A.

As objects are not unlocked until the locking transaction is finished and the transactions are waiting for each other, the system is halted and A and B can never complete. This situation is a deadlock.

A phantom deadlock is when a deadlock is "detected" but is not really a deadlock. A global deadlock detector needs time to receive the local wait-for graphs before constructing the global one, so the global wait-for graph may no longer be valid by the time it is built.

21
Q

What are the two naive approaches for deadlock detection/resolution? Describe why each is bad.

A

Timeouts - unlock a lock a certain time after it was taken. Setting this time is hard, because a transaction locking a shared object may simply be taking a long time over heavy computation, and there actually isn't a deadlock.

Centralised detection - a central server collects local wait-for graphs and constructs a global one. Because of network latency the global graph can be out of date, producing phantom deadlocks, and the server is a single point of failure.

22
Q

Describe how edge chasing works? Why is it the best deadlock detection approach?

A

Edge chasing works by processes exchanging probe messages. It does not rely on a centralised server to construct a global wait-for graph, so network latency or lost messages will not cause a phantom deadlock.

When a process is blocked on a shared object, it sends a probe message to the process holding the lock on that object. The probe message consists of:
the id of the process that is blocked
the id of the sender of the probe
the id of the recipient of the probe

When receiving a probe, a server checks whether it is itself waiting for resources:
if it is waiting for a resource, it forwards the probe to the processes locking that resource (putting its own process id as the message sender);
if it is not waiting, it holds the lock while carrying out a transaction and will release the objects when it is done.

If a process receives a probe whose blocked-process id is its own, the probe has come full circle: there is a cycle in the wait-for relationships, i.e. a deadlock.
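The probe-chasing idea can be sketched with a simplified wait-for map. This is an illustrative toy, assuming each blocked process waits on exactly one other process (real edge chasing handles multiple holders and runs as message passing):

```python
# wait_for maps each blocked process to the process holding the lock
# it wants. A probe that makes its way back to its initiator signals
# a cycle, i.e. a deadlock.
def probe_finds_deadlock(wait_for, initiator):
    current = wait_for.get(initiator)
    visited = set()
    while current is not None and current not in visited:
        if current == initiator:
            return True  # the probe returned to the blocked process
        visited.add(current)
        current = wait_for.get(current)
    return False

probe_finds_deadlock({"A": "B", "B": "C", "C": "A"}, "A")  # True
probe_finds_deadlock({"A": "B", "B": "C"}, "A")            # False
```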

23
Q

What is the definition of linearizability?

A

Operations are linearizable if they can be interleaved in a way that is consistent with the specification of a single correct copy of the shared objects, and the interleaving is consistent with the real times at which the operations occurred in the execution.
I.e. can each operation be assigned an instant in time such that, executed in that order, the operations would produce the results the system actually produced? Each instant must fall within the start and finish times of its operation.

24
Q

What are the 3 consistency models, explain them?

A

Strict consistency - the strongest form of consistency, where every query is up to date, e.g. a query on x always sees the latest version of x. The issue is that it needs an absolute global time, hence it is only practical within a single machine.

Sequential consistency - operations may be interleaved, but object values must be seen in the same order by all processors, and the interleaving must be consistent with each process's program order. (It is weaker than linearizability: the order need not respect real time.)

Causal consistency - updates that are causally related (e.g. they come from the same process) must be seen in the same order. If they are not related (e.g. they come from different processes), the order in which they are seen can be relaxed.

25
Q

What are the 2 approaches for providing redundancy?

A

Architectural approach - implement passive or active replication.

Operational approach - add redundancy in the system to avoid some faults:
• Information redundancy - use error-correcting codes or checksums.
• Time redundancy - perform the same operation a repeated number of times, then vote on the results. This tolerates small transient faults but not large-scale ones, and carries a large performance hit.
• Component redundancy - have different implementations of components providing the same functionality, run operations on them in parallel and vote on the result. N-version programming applies design diversity by using multiple versions of the same program; it can tolerate hardware and software faults but not correlated faults.
• Communication redundancy - make recipients acknowledge receipt of messages, so the sender can resend lost messages when no ACK arrives. Exception handling can be implemented if a server cannot be reached at all.

26
Q

What are the four steps of general workflow towards fault tolerance?

A

Error detection - detecting whether an error has occurred in the system.
Failure isolation - locating where the error arose in the system. Which components are affected?
Error containment - containing the faulty components to prevent further damage and error propagation.
Recovery - applying an error-recovery process to the components to bring them to a fit state (a state in which they should not produce any errors).

27
Q

Describe backward recovery and its components.

A

In backward recovery the goal is to revert the system to a state with no errors. To implement this, each component must create local checkpoints: stores of the component's state from which it can restore and resume execution.

The system needs a global checkpoint, consisting of local checkpoints from each component that together form a consistent state. The most recent consistent global checkpoint is called the recovery line.

A global state is consistent when every message recorded as received in one component's checkpoint is also recorded as sent in the sender's checkpoint. For example, if messages m1 and m2 were delivered after processes P0 and P1 saved their states, the saved states have no record of sending them, so those saved states are not consistent with one another.
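The local-checkpoint half of this can be sketched as a toy component (the class and field names are made up for illustration; real systems checkpoint to stable storage):

```python
import copy

# Toy local checkpointing for backward recovery: save a deep copy of
# the state, then restore the most recent checkpoint after a fault.
class Component:
    def __init__(self):
        self.state = {"balance": 100}
        self._checkpoints = []

    def save_checkpoint(self):
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        self.state = self._checkpoints.pop()

c = Component()
c.save_checkpoint()
c.state["balance"] = -1  # a faulty update occurs
c.rollback()             # revert to the error-free state
c.state["balance"]       # 100
```

Coordinating such local checkpoints across components into a consistent global checkpoint is the hard part described above.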

28
Q

Describe the 3 ways of collecting checkpoints for backward recovery.

A

Uncoordinated checkpointing - each process collects its checkpoints individually. This is automatic and convenient but can produce useless checkpoints, or lead to a domino effect.

Coordinated checkpointing - processes coordinate the checkpointing to save a consistent global state at certain intervals. This saves space and prevents the domino effect but is bad for performance.

Communication-induced checkpointing - add information to application messages that forces a component to create a local checkpoint. Avoids the domino effect while allowing local checkpointing.

29
Q

What is the domino effect?

A

The domino effect is a major problem with backward recovery. It occurs when there are many inconsistent checkpoints, so many saved states must be discarded and the recovery line moves further back in history. If a fault occurs there are cascaded rollbacks, reverting the system far too far back in the computation.

30
Q

What is forward recovery?

A

Forward recovery moves the system from an erroneous state into a new correct state from which it can continue, rather than rolling back. Techniques include:

Fault masking - using redundancy or replication with a voting system to mask small faults such as Byzantine faults.

Self-checking components - components check themselves for errors; if an error is detected, they swap themselves out for a working replacement.

Data prediction - predict data by interpolating received data, as in lag compensation.

Error compensation - use an algorithm based on redundancy to compensate for small, uncoordinated errors.

31
Q

Compare and contrast backward recovery with forward recovery. Which would be better to use for applications with high-frequency transactions such as the stock market?

A

Backward recovery attempts to revert the system to a previously stored state via a checkpoint, whereas forward recovery attempts to repair the state into a form from which the system can continue.
Backward recovery is more expensive in storage; forward recovery requires less time and memory.
Backward recovery needs no knowledge of the error, whereas forward recovery requires the error to be inspected.
Backward recovery is application independent, whereas forward recovery depends on the application (its functionality may change from program to program).

Forward recovery is therefore the more appropriate recovery type for the stock market, where significant system delay is not acceptable.

32
Q

Explain the difference between mean-time-between-failures and mean-time-to-repair.

A

Mean-time-between-failures (MTBF) is the average time for which a system returns expected results between failures. Mean-time-to-repair (MTTR) is the average time taken for the system to become available again (e.g. by performing backward or forward recovery).

33
Q

Describe availability in terms of mean-time-between-failures and mean-time-to-repair.

A

Availability - describes the fraction of time the system yields expected results:
A = MTBF / (MTBF + MTTR)
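Plugging in hypothetical figures (MTBF = 990 hours, MTTR = 10 hours, values made up for illustration):

```python
# Availability from the formula above.
def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

availability(990.0, 10.0)  # 0.99, i.e. the system is up 99% of the time
```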

34
Q

What is the difference between load sharing and load balancing?

A

Load sharing is when idle servers are given requests to process (to reduce waste), whereas load balancing aims to reduce a certain metric (e.g. queue length) by equalising it across all servers.

35
Q

What factors can servers use to distribute load?

A

Servers can use the length of waiting tasks in the queue, their CPU utilization or their bandwidth utilization to choose when they should transfer requests to other servers.

36
Q

What is the difference between Non-preemptive task transfers and preemptive task transfers?

A

Non-preemptive refers to the transfer of tasks before execution. Hence only the request/task is transferred to another machine (good for load sharing but difficult for load balancing). Preemptive refers to the transfer of tasks that are partially executed. This is expensive and involves collection and transmission of task states (IO buffers, file pointers, timers, virtual memory image e.g. data already computed for that task).

37
Q

What are the three different approaches for Load Distribution?

A

A static approach is one in which decisions are hard-coded into an algorithm. A dynamic approach is one where decisions are made at runtime based on system state; the correctness of load distribution then depends on the timeliness of the parameters collected (a local coordinator has less latency than a global coordinator). An adaptive approach additionally changes the frequency of information collection based on system state.

38
Q

What are the components of a load distribution algorithm?

A

Information policy - determines when, where and what information to collect. Can be:
Demand driven - information is collected only when a server becomes a sender or receiver.
Periodic - servers exchange load information periodically.
State-change driven - servers send information when their state changes by a certain amount.

Transfer policy - decides whether a server needs to transfer tasks. It determines thresholds for how busy a server can be, such as queue length, and defines roles: servers are senders when they are overloaded and receivers when they are underloaded.

Selection policy - decides which tasks should be sent from an overloaded server to another server. Can use estimates of task execution time, or of server response-time improvements.

Location policy - decides which server to send selected tasks to. Can use polling, which can be done in parallel by multicasting.

39
Q

What is a sender-initiated load distribution algorithm?

A

It uses thresholds as a transfer policy. If a server has utilization above the threshold it is overloaded and becomes a sender; if it is below the threshold it becomes a receiver. A receiver will not accept a task that would push it over its own threshold and make it overloaded.

Senders will select the newest tasks sent to the server to send to receivers. (ones at the back of the queue)

The sender can simply send tasks to random servers. This is quick, as there is no need to collect the state of other servers, but the task may be sent to an already-overloaded server, which then has to forward it to yet another server.
Another approach is to poll servers to find a receiver (one whose queue length is below the threshold); the server that accepts the task executes it regardless of how many tasks are then in its queue. In practice there is a limit to how many servers the sender can poll.

Another way is to poll servers for their queue lengths and send the task to the server with the smallest queue.

The information policy for this approach is demand driven.

This approach becomes unstable at high loads. It becomes difficult for senders to find receivers: as more servers become overloaded, polling goes on for longer, adding activity to an already busy network. This can make the system unusable at very high global loads.
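The polling variant of the location policy can be sketched as follows. The threshold value and the queue lengths are made-up illustrative numbers:

```python
# Toy sender-initiated location policy: poll servers' queue lengths and
# transfer the task to the first server found below the threshold.
THRESHOLD = 3

def find_receiver(queue_lengths):
    for server, length in enumerate(queue_lengths):
        if length < THRESHOLD:
            return server  # first polled server that is a receiver
    return None  # all polled servers are busy: keep the task locally

find_receiver([5, 6, 2])  # 2
find_receiver([5, 6, 7])  # None: at high global load no receiver is found
```

The `None` case is exactly the instability described above: at high load, every poll fails and the polling traffic itself adds to the congestion.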

40
Q

What is a receiver-initiated load distribution algorithm?

A

Uses thresholds as its transfer policy, where the length of a server's task queue determines whether it is classified as a receiver or a sender: e.g. if it holds fewer than T tasks it is a receiver, otherwise a sender.

Senders will select the newest tasks sent to the server to send to receivers. (ones at the back of the queue)

A receiver polls a random server and asks whether it has any tasks to share.

The information policy for this approach is demand driven.

However the drawback of this is that in most cases this will result in preemptive transfers (transferring partially completed tasks) and hence costs more. This is due to the fact that systems schedule tasks as and when they arrive.

This approach doesn't suffer from instability at high loads: it only becomes harder for receivers to find senders at globally low levels of load, since polling goes on for longer when servers' queues are below the threshold, and the system can handle this increased network activity when load is low.

41
Q

What is a symmetric load distribution algorithm?

A

Senders search for receivers and receivers search for senders. At low loads senders can find receivers easily, and at high loads receivers can find senders easily. It can have the disadvantages of both: polling at high loads can make the system unstable, and receiver-initiated task transfers can be preemptive and hence more expensive. It can be implemented with a simple algorithm that combines the previous two approaches, alternating via a threshold. The solution can be optimised by altering the scope of the server search.

42
Q

What is an adaptive load distribution algorithm?

A

Each server maintains three lists: a "receiver list", a "sender list" and an "OK list". Each server classifies the others based on collected information and polls adaptively. The location policy at a sender is as follows:
The sender polls the head of its receiver list. The polled server puts the sender at the head of its own sender list and informs the sender of its current classification. If the polled server is still a receiver, the new task is transferred; otherwise the sender updates its lists and polls the next potential receiver. If this polling process fails to identify a receiver, the task can be transferred using a receiver-initiated method instead.

43
Q

Describe Marshalling and Unmarshalling.

A
• When a client invokes a method that accepts parameters on a remote object, the parameters are bundled into a message before being sent over the network.
• These parameters may be of primitive types or objects.
  ○ In the case of primitive types, the parameters are put together and a header is attached.
  ○ In the case of objects, they are serialized.
• This process is known as marshalling.
• At the server side, the packed parameters are unbundled and then the required method is invoked. This process is known as unmarshalling.
44
Q

Describe the main functionality of a stub in a remote method invocation. Explain how a stub is created and executed.

A

A stub acts as the link between the client and the remote methods. It marshals arguments from the client and transfers them to the server in a format that can be understood. A stub can be created either at the remote server or on demand: when a client invokes a remote function, the corresponding stub is made available to the client from the server.

45
Q

Discuss whether the stub technology is an essential component of the message oriented middleware.

A

A stub isn't essential in message-oriented middleware (MOM), as MOM decouples the client and server by making them communicate via message servers; remote functions therefore don't need to be known to the client.

46
Q

Suppose an active replication system comprises four servers to provide the same remote services. The system occasionally has two servers failing at the same time.

Explain whether this server failing situation imposes any fatal problem to the system. If yes, suggest a solution. Otherwise, describe how the remote services keep running correctly when two servers fail.

A

An active replication system can tolerate up to f failures when there are 2f+1 machines. To tolerate two servers failing at the same time, there need to be 5 machines in total; as there are currently four, the system cannot tolerate this many failures, so the situation is fatal. To resolve it, simply add another machine.

47
Q

Define the term "linearizability", and justify its importance to an active replication system. Suggest how linearizability can be practically implemented in such a system.

A

Linearizability is when a sequential order can be formulated over a set of operations that is consistent with their real-time (possibly overlapping) execution. Active replication requires replicas to process requests in the same order so that their results agree and voting produces correct results. It can be implemented by sequential consistency control over the replicas.

48
Q

Suppose a centralized deadlock detection server is running on a highly-loaded distributed system to discover distributed deadlocks. Explain why accurate deadlock detection cannot be always guaranteed in such a distributed system.

Justify whether the sender-initiated load distributing algorithm can solve this problem.

A

Accurate deadlock detection cannot always be guaranteed in a distributed system due to network delays. Once a wait-for graph has been constructed the deadlock may have been resolved or could simply just be a phantom deadlock. A sender-initiated load distributing algorithm wouldn’t solve a network that is under-load as it has to actively search for receivers which can make the system even more unstable. Instead a receiver-initiated approach would be better in this example.