Lecture 1: Introduction Flashcards
Distributed Systems
A set of cooperating computers communicating over a network to achieve a coherent task. Examples include storage for big websites, big data computations like MapReduce, and peer-to-peer file sharing.
Decentralization
The absence of a central server in a system, where each node acts as both a client and a server. This can lead to increased robustness and fault tolerance.
Fault Tolerance
The ability of a system to continue operating properly in the event of failure of one or more components.
Registers
The fastest form of storage directly built into the CPU. Registers store small amounts of data and instructions being processed by the CPU, enabling nearly instantaneous access.
Nonvolatile Memory
Storage medium that retains data even when power is turned off, such as the NAND flash memory used in SSDs and USB drives. Flash-based nonvolatile memory generally offers faster reads and writes than traditional magnetic and optical storage.
Symmetric Multiprocessing (SMP)
Multiprocessor system architecture where all processors are treated equally and share access to all system resources, including memory and I/O devices.
Asymmetric Multiprocessing (AMP)
Multiprocessor system architecture where one processor is designated as the master and controls the system, while the other processors act as subordinate processors and perform specialized tasks.
Observer Pattern
Design pattern where an object, known as the subject, maintains a list of its dependents, called observers, and notifies them of any state changes, enabling them to react accordingly.
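As an illustration of the subject/observer relationship described above, here is a minimal sketch. The class name `Subject` and the methods `attach`, `detach`, and `set_state` are hypothetical names chosen for this example, not part of any particular library.

```python
from typing import Callable, List


class Subject:
    """Holds a value and notifies registered observers whenever it changes."""

    def __init__(self) -> None:
        self._observers: List[Callable[[int], None]] = []
        self._state = 0

    def attach(self, observer: Callable[[int], None]) -> None:
        # Register a callable to be invoked on every state change.
        self._observers.append(observer)

    def detach(self, observer: Callable[[int], None]) -> None:
        # Stop notifying this observer.
        self._observers.remove(observer)

    def set_state(self, value: int) -> None:
        self._state = value
        # Iterate over a copy so observers may detach themselves safely.
        for observer in list(self._observers):
            observer(value)


seen: List[int] = []
record = seen.append  # the observer: records every value it is told about

subject = Subject()
subject.attach(record)
subject.set_state(1)
subject.set_state(2)
subject.detach(record)
subject.set_state(3)  # no longer observed, so nothing is recorded
```

After this runs, `seen` holds only the two values set while the observer was attached.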
Name two common approaches to achieving fault tolerance in distributed systems.
Two common approaches are availability, where the system continues to operate despite certain failures, and recoverability, where the system can recover from failures and resume normal operation after repairs.
What are some tools used for achieving fault tolerance in distributed systems?
Two common tools are non-volatile storage, such as hard drives or flash drives, which holds checkpoints or logs of system state for recovery after a failure, and replication, which keeps copies of data on more than one machine.
How does scalability relate to fault tolerance in distributed systems?
A system that scales by adding nodes can also use spare or additional nodes to take over the work of failed ones, so performance is maintained despite disruptions.
What is fault tolerance, and why is it important in distributed systems?
Fault tolerance is the ability of a system to continue operating properly in the event of failure of one or more components. It’s crucial in distributed systems due to the inherent complexity and potential for failures in large-scale deployments.
What are some implementation topics commonly encountered in distributed systems?
Common implementation topics include remote procedure call (RPC), threads, and concurrency control mechanisms like locks.
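To make the role of locks concrete, the following sketch shows several threads updating one shared counter. The `Counter` class and `worker` function are hypothetical names for this example; the lock ensures that each read-modify-write of `value` happens as one uninterrupted step.

```python
import threading


class Counter:
    """A shared counter whose updates are protected by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def increment(self) -> None:
        # Only one thread at a time may execute the body of this block.
        with self._lock:
            self.value += 1


def worker(counter: Counter, times: int) -> None:
    for _ in range(times):
        counter.increment()


counter = Counter()
threads = [
    threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker to finish before reading the result
```

With the lock in place, the final `counter.value` is the total number of increments (4 threads times 1000 each).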
What are the main infrastructure components in distributed systems?
The main infrastructure components are storage, communication, and computation.
What are some examples of distributed systems applications?
Examples include storage for big websites, big data computations such as MapReduce, and peer-to-peer file sharing.
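For a feel of the MapReduce style of computation mentioned above, here is a single-process word-count sketch. It only illustrates the map, group, and reduce steps; a real MapReduce system runs these phases across many machines. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    # Emit a (word, 1) pair for every word in one input document.
    return [(word, 1) for word in document.split()]


def group_by_key(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Collect all values emitted for the same key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> int:
    # Combine all values for one key into a single result.
    return sum(counts)


documents = ["a b a", "b c"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = {
    word: reduce_phase(word, counts)
    for word, counts in group_by_key(pairs).items()
}
```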
What is the core concept of distributed systems?
The core concept of distributed systems is a set of cooperating computers communicating over a network to achieve a coherent task.
What are some reasons for building distributed systems?
Reasons include achieving high performance through parallelism, fault tolerance, handling naturally distributed problems, and achieving security through isolation.
Describe the scalability challenge in distributed systems.
Scalability refers to the ability to handle increasing workload or data by adding resources. The challenge lies in ensuring that adding resources results in proportional performance improvements.
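One way to check whether added resources give proportional improvement is to compare measured speedup with the number of nodes. The sketch below uses made-up timings purely for illustration; `scaling_efficiency` is a hypothetical helper, not a standard function.

```python
def scaling_efficiency(time_one_node: float, time_n_nodes: float, n: int) -> float:
    """Return speedup divided by node count; 1.0 means perfectly proportional."""
    speedup = time_one_node / time_n_nodes
    return speedup / n


# Illustrative numbers only: a job that takes 100 s on one node.
ideal = scaling_efficiency(100.0, 25.0, 4)      # 4x faster on 4 nodes
sublinear = scaling_efficiency(100.0, 40.0, 4)  # only 2.5x faster on 4 nodes
```

An efficiency well below 1.0 suggests that some shared bottleneck is limiting the benefit of the extra nodes.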
How do partial failures differ from complete failures in distributed systems?
Partial failures occur when some components of the system stop working while others continue, whereas complete failures involve the entire system becoming inoperative.
What is the role of threads in distributed systems?
Threads let a program do several things at once. In distributed systems they are used to exploit multi-core CPUs, to overlap waiting on network or disk I/O, and to give concurrent operations a simpler program structure.
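As a small sketch of threads structuring concurrent work, the example below sends several simulated requests at once instead of one after another. `fake_request` is a hypothetical stand-in for a network call, with `time.sleep` modelling the wait. Note that in standard CPython the gain in a case like this comes from overlapping waits rather than from running Python code on several cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(server_id: int) -> str:
    # Stand-in for a network round trip to one server.
    time.sleep(0.05)
    return f"reply from server {server_id}"


# Issue all four requests concurrently; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    replies = list(pool.map(fake_request, range(4)))
```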
Explain the concept of recoverability in fault tolerance.
Recoverability refers to the system’s ability to resume normal operation after a failure, typically by saving state information and restoring it when the failure is resolved.
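The save-and-restore idea can be sketched with a checkpoint file. This is a minimal illustration under simple assumptions (a single process, JSON-serializable state); `save_checkpoint` and `load_checkpoint` are hypothetical helper names.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    # Write to a temporary file first, then rename it over the old
    # checkpoint, so a crash mid-write does not corrupt the saved state.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to storage
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    # Return the last saved state, or the default if none exists yet.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


checkpoint_path = os.path.join(tempfile.mkdtemp(), "state.json")

state = load_checkpoint(checkpoint_path, {"processed": 0})
state["processed"] += 3
save_checkpoint(checkpoint_path, state)

# Simulate a restart: in-memory state is gone, so reload from the checkpoint.
restored = load_checkpoint(checkpoint_path, {"processed": 0})
```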
What is the significance of non-volatile storage in fault tolerance?
Non-volatile storage allows systems to store critical state information persistently, enabling them to recover from failures by restoring the system’s previous state.
How do distributed systems address the challenge of network failures?
Distributed systems often build in redundancy and error handling to limit the impact of network failures, for example by replicating data on several nodes, retrying requests that time out, or using routing protocols that steer traffic around failed links.
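A minimal sketch of how replicated data helps a client survive an unreachable node: try each replica in turn and return the first successful answer. The replicas here are plain functions standing in for remote servers, and `ReplicaUnavailable` and `read_with_failover` are hypothetical names for this example.

```python
from typing import Callable, Optional, Sequence


class ReplicaUnavailable(Exception):
    """Raised when a replica cannot be reached."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica(key)  # first replica to answer wins
        except ReplicaUnavailable as err:
            last_error = err     # remember the failure and try the next one
    raise ReplicaUnavailable("all replicas failed") from last_error


def down_replica(key: str) -> str:
    # Simulates a node that cannot be reached over the network.
    raise ReplicaUnavailable("timeout")


def healthy_replica(key: str) -> str:
    return {"x": "42"}[key]


value = read_with_failover([down_replica, healthy_replica], "x")
```

This sketch only covers reads; keeping replicas consistent when data is written is a separate problem.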
What are the primary goals of building abstractions in distributed systems?
The goals are to simplify the interface for applications, hide the distributed nature of the underlying infrastructure, and enable easier development of distributed applications.