Part 1: Introduction to Distributed Systems Flashcards
What are the consequences of distributed systems?
- Concurrency (concurrent access to a shared resource -> problems of inconsistency)
- Absence of a global clock
- Independent failures
What is a distributed system?
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. (definition adopted by SRC)
What are the challenges in distributed systems?
- Heterogeneity
- Openness
- Scalability
- Transparency
- Failure handling
- Concurrency
- Security
Explain openness
Another important goal of distributed systems is openness. An open distributed system is essentially a system that offers components that can easily be used by, or integrated into other systems. At the same time, an open distributed system itself will often consist of components that originate from elsewhere.
To be open means that components should adhere to standard rules that describe the syntax and semantics of what those components have to offer (i.e., which service they provide). A general approach is to define services through interfaces using an Interface Definition Language (IDL). Interface definitions written in an IDL nearly always capture only the syntax of services.
This explanation emphasizes the importance of standardization and interoperability in open distributed systems. Openness ensures that components of the system can be easily integrated with other systems and that they adhere to standard rules and interfaces. This facilitates flexibility and extensibility in the design and implementation of distributed systems.
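As a loose analogy (not an actual IDL), here is a minimal Python sketch of the same idea: an abstract base class plays the role of the interface, fixing only the syntax of the service, while any conforming component, local or remote, can supply the implementation. The names `FileService` and `InMemoryFileService` are hypothetical.

```python
# Minimal sketch (hypothetical names): an abstract interface standing in for
# an IDL definition. It fixes only the syntax of the service (method names,
# parameters, return types); the semantics and implementation live elsewhere.
from abc import ABC, abstractmethod

class FileService(ABC):
    """Interface a client can program against without knowing the implementation."""

    @abstractmethod
    def read(self, path: str, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, path: str, offset: int, data: bytes) -> int: ...

# Any component that implements this interface can be plugged into a system
# that depends only on FileService.
class InMemoryFileService(FileService):
    def __init__(self) -> None:
        self._files: dict[str, bytearray] = {}

    def read(self, path: str, offset: int, length: int) -> bytes:
        return bytes(self._files.get(path, bytearray())[offset:offset + length])

    def write(self, path: str, offset: int, data: bytes) -> int:
        buf = self._files.setdefault(path, bytearray())
        buf[offset:offset + len(data)] = data
        return len(data)
```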
Explain scalability
Scalability has become one of the most important design goals for developers of distributed systems due to the increasing connectivity through the Internet and the shift towards cloud-based services.
- Size Scalability: A system is scalable with respect to its size if it can easily accommodate more users and resources without a noticeable loss of performance. Centralized services often face limitations when scaling in size.
- Geographical Scalability: A system is geographically scalable if users and resources can be far apart, but significant communication delays are hardly noticeable. Many distributed systems designed for local-area networks face challenges in geographical scalability, especially those based on synchronous communication.
- Administrative Scalability: A system is administratively scalable if it can be easily managed even when spanning multiple independent administrative organizations. Peer-to-peer technology has demonstrated the potential when end users are in control, but it’s not a universal solution.
How can we hide communication latency?
Hiding communication latencies is particularly applicable in the case of geographical scalability. The basic principle is straightforward: try to minimize waiting for responses to remote-service requests as much as possible.
For instance, when a service is requested from a remote machine, instead of waiting for a reply from the server, an alternative approach is to perform other useful work on the requester’s side. This essentially means constructing the requesting application in such a way that it uses only asynchronous communication.
When a reply is received, the application is interrupted, and a special handler is called to complete the previously issued request. Asynchronous communication can often be used in batch-processing systems and parallel applications where independent tasks can be scheduled for execution while another task is waiting for communication to complete.
Another example is in Web browsers. A Web document often consists of an HTML file with plain text and a collection of images, icons, etc. To fetch each element, the browser sets up a TCP/IP connection, reads the incoming data, and passes it to a display component. These operations are inherently blocking. When dealing with long-haul communication, the time for each operation to complete can be significant. The usual way to hide these latencies is to initiate communication and immediately proceed with another task.
One common scenario in distributed systems is the use of forms, especially in database applications. When a user fills out a form, each field can be sent to the server for validation, and the client then waits for an acknowledgment from the server. This approach can introduce significant communication latencies, especially if the server checks each field entry for errors or inconsistencies.
A more efficient approach is to move the code responsible for filling out and validating the form to the client side. The client can then complete the form, possibly checking the entries locally, and only send the completed form to the server. This reduces the number of messages exchanged between the client and the server, thereby hiding the communication latencies.
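A minimal sketch of the latency-hiding idea, using Python's asyncio and a simulated remote fetch (the resource names and delays are made up): the blocking version waits for each reply in turn, while the asynchronous version issues all requests and overlaps the waiting.

```python
# Minimal sketch of latency hiding with asynchronous communication.
# fetch() simulates a remote request with a fixed delay.
import asyncio

async def fetch(resource: str, delay: float = 0.2) -> str:
    await asyncio.sleep(delay)          # stands in for the network round trip
    return f"contents of {resource}"

async def load_page_blocking(resources: list[str]) -> list[str]:
    # One request at a time: total latency is the sum of all round trips.
    return [await fetch(r) for r in resources]

async def load_page_async(resources: list[str]) -> list[str]:
    # Issue all requests at once, then wait: latency is roughly one round trip.
    return list(await asyncio.gather(*(fetch(r) for r in resources)))

if __name__ == "__main__":
    parts = ["index.html", "logo.png", "icon1.png", "icon2.png"]
    print(asyncio.run(load_page_blocking(parts)))  # ~4 x 0.2 s
    print(asyncio.run(load_page_async(parts)))     # ~0.2 s total
```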
What are the scaling techniques?
Scalability problems in distributed systems often manifest as performance issues due to limited server and network capacity. Solutions include:
- Scaling Up: Improving the capacity of existing components.
- Scaling Out: Expanding the system by adding more machines.
- Hiding Communication Latencies: Techniques to mask the delay in communication.
- Distribution of Work: Dividing tasks among multiple entities.
- Replication: Creating copies of data or services to improve access speed and reliability. However, replication can introduce challenges, especially when ensuring consistency across replicas.
Distribution (for scalability)
Another important scaling technique is partitioning and distribution. This involves taking a component, splitting it into smaller parts, and subsequently spreading those parts across the system. A good example of this is the Internet Domain Name System (DNS), where the DNS name space is hierarchically organized into a tree of domains, which are divided into nonoverlapping zones.
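A small sketch, with made-up zone data, of the underlying idea: the name space is split into non-overlapping zones, each handled by a different server, and a lookup is routed to the server responsible for the longest matching zone. This only illustrates partitioning; it is not how a real DNS resolver works.

```python
# Minimal sketch (hypothetical data) of partitioning a hierarchical name space
# into non-overlapping zones, each owned by a different server, in the spirit
# of DNS zones.
ZONES = {
    "example.org": "ns1.example.org",        # authoritative for *.example.org
    "cs.example.org": "ns.cs.example.org",   # delegated sub-zone
    "org": "ns.org-registry.net",
}

def responsible_server(name: str) -> str:
    """Route a lookup to the server owning the longest matching zone suffix."""
    labels = name.split(".")
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in ZONES:
            return ZONES[zone]
    raise LookupError(f"no zone found for {name}")

print(responsible_server("www.cs.example.org"))  # -> ns.cs.example.org
print(responsible_server("mail.example.org"))    # -> ns1.example.org
```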
Replication (for scalability)
The main reasons for replicating data are:
- Reliability/availability: Data is replicated to increase the system’s reliability. If one replica crashes, the system can continue working by switching to another replica. Multiple copies also offer better protection against corrupted data.
- Performance: Replication for performance becomes crucial when a distributed system needs to scale in terms of size or geographical area. For instance, as more processes need to access data managed by a single server, performance can be improved by replicating the server and dividing the workload among the processes accessing the data.
- Latency reduction: e.g., reading from a nearby replica.
Caching is a special form of replication.
The main cost of replication is keeping the copies consistent.
While replication can improve reliability and performance, it introduces challenges related to consistency. Whenever a copy is modified, it becomes different from the rest. To ensure consistency, modifications must be carried out on all copies. The timing and method of these modifications determine the complexity of replication.
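A minimal sketch (hypothetical classes and names) of this trade-off: reads can be served by the nearest replica, but every write has to reach all replicas to keep them identical, which is exactly where the consistency cost comes from.

```python
# Minimal sketch of replication for performance: reads go to the "nearest"
# replica, while every write is propagated to all replicas to keep them
# consistent -- the cost of replication.
class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.store: dict[str, str] = {}

class ReplicatedKV:
    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = replicas

    def write(self, key: str, value: str) -> None:
        # Synchronously update every copy; slower, but the replicas stay identical.
        for r in self.replicas:
            r.store[key] = value

    def read(self, key: str, nearest: Replica) -> str:
        # Any replica can answer, so clients read from the closest one.
        return nearest.store[key]

eu, us = Replica("eu"), Replica("us")
kv = ReplicatedKV([eu, us])
kv.write("page", "v1")
print(kv.read("page", nearest=eu))  # "v1", served by the local replica
```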
Transparency
Transparency in distributed systems refers to the system's ability to hide its internal structure and processes from users and applications, making it appear as a single, coherent system. This concept is crucial in distributed systems to provide a simplified experience despite the complex nature of the system's operations and structure.
Here are key points about transparency in distributed systems, as discussed in the book:
Types of Transparency:
- Access Transparency: Allows resources to be accessed without knowledge of their physical or logical representation, hiding differences in data representation and how an object is accessed.
- Location Transparency: Hides the physical location of resources, meaning that users do not need to know where resources are located to access them.
- Migration Transparency: Ensures that resources and components can be moved without impacting how users interact with them.
- Relocation Transparency: Hides that a resource may be moved to another location while it is in use.
- Replication Transparency: Hides that a resource is replicated in multiple locations, useful for providing fault tolerance and increasing availability.
- Concurrency Transparency: Ensures that the user is not aware of other users accessing the same resource concurrently.
- Failure Transparency: The system hides the failure and recovery of resources, making sure that even when components fail, the operations are not visibly affected, though this is challenging to implement perfectly.
Types of distributed systems
- Distributed Computing Systems:
  - Cluster Computing: Systems that link together multiple computers using a local area network (LAN) to work as a single system. Often used for high-performance computing tasks.
  - Grid Computing: Focuses on providing access to resources from different administrative domains. Designed for sharing resources within a virtual organization. The architecture often consists of layers, including a fabric layer for local resources, a connectivity layer for communication, and higher layers for resource management.
- Distributed Information Systems:
  - Transaction Processing Systems: Systems that handle a large number of small requests. These systems are primarily used in applications where response time is critical, such as banking or airline reservation systems.
  - Enterprise Application Integration: Systems that connect different software applications and databases to simplify and automate business processes, while ensuring that data remains consistent across all systems.
- Distributed Pervasive Systems: Systems that are pervasive and continuously present. Users often interact with these systems without being aware of the interaction. Core requirements include distribution, interaction, context awareness, and autonomy. Examples include ubiquitous computing environments and Internet of Things (IoT) devices.
Why is it sometimes so hard to hide the occurrence and recovery from failures in a distributed system?
It is generally impossible to detect whether a server is actually down, or that it is simply slow in responding. Consequently, a system may have to report that a service is not available, although, in fact, the server is just slow.
In addition, a distributed system has multiple points of failure, which makes detecting a failure even more difficult.
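A small sketch of the problem using a timeout (the delays are invented): from the client's point of view, a request that times out looks the same whether the server has crashed or is merely slow, so the system may have to report the service as unavailable.

```python
# Minimal sketch of why failures are hard to hide: a timeout-based detector
# cannot tell a crashed server from a slow one.
import asyncio

async def remote_call(server_delay: float) -> str:
    await asyncio.sleep(server_delay)   # stands in for server processing + network time
    return "reply"

async def call_with_timeout(server_delay: float, timeout: float = 1.0) -> str:
    try:
        return await asyncio.wait_for(remote_call(server_delay), timeout)
    except asyncio.TimeoutError:
        # We must report failure, even though the server may only be slow.
        return "server unavailable?"

print(asyncio.run(call_with_timeout(server_delay=0.2)))  # fast server: "reply"
print(asyncio.run(call_with_timeout(server_delay=5.0)))  # slow or crashed? we cannot tell
```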
When a transaction is aborted, we have said that the world is restored to its previous state, as though the transaction had never happened. We lied. Give an example where resetting the world is impossible.
Any situation in which physical I/O has occurred cannot be reset. For example, if the process has printed some output, the ink cannot be removed from the paper. Also, in a system that controls any kind of industrial process, it is usually impossible to undo work that has been done.