Introduction Flashcards
What is a distributed system?
A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
Definition of a Distributed System
Formal Definition:
A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
Attributes:
Comprises multiple computing nodes.
Nodes interact autonomously to achieve a common goal.
Users perceive the system as unified despite its distributed nature.
Computing Elements in a Distributed System
Definition:
In a distributed system, computing elements are autonomous entities capable of computation.
In context, these elements are physical machines (nodes) interacting to serve users or other systems.
Characteristics:
Nodes operate independently and can run distinct programs.
They communicate and coordinate to achieve system goals.
Coherent Operation in a Distributed System
User Perspective:
Users interacting with a distributed system experience a unified and coherent system.
The system’s operation appears seamless, and users’ expectations are met consistently.
Example:
Instagram functions as a distributed system, providing a smooth experience across its features despite running on multiple nodes.
Key Aspects of a Distributed System
Node Functionality:
Nodes interact to achieve shared objectives.
Each node performs autonomous tasks.
User Experience:
A distributed system presents a unified and consistent interface to users.
Users interact with the system without noticing its distributed nature.
Achieving Coherence
Objective:
Distributed systems aim to avoid inconsistencies in user experience.
Efforts focus on maintaining coherence across system functionalities.
Challenge:
Achieving coherence in a distributed system is complex due to its decentralized nature.
Most Obvious Distributed System
Example:
The internet serves as a widespread distributed system.
Networked nodes globally interact to facilitate data exchange and user interaction.
Characteristics:
Composed of numerous interconnected nodes.
Seamlessly serves users’ requests across multiple servers.
Simple Example: My Cool App
Scenario:
Development of a mobile app named “My Cool App” for user interaction.
Requires server interaction and data retrieval from a database.
Initial Setup:
Single node hosts both the server program and the database.
Potential risks include data loss on node failure.
Loneliness Mitigation
Strategy:
Separate the server and database programs onto distinct nodes.
Mitigates risks associated with single-node failure.
Enhancement:
Increases system resilience by distributing server and database functionalities across separate nodes.
Strengthening the System
Enhancement Approach:
Expand the fleet by adding more server and database nodes.
Load balancing ensures even distribution of user requests across multiple server nodes.
Replication:
Database nodes hold replicated copies of the data for redundancy; the copies must be kept in sync to stay consistent.
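The load balancing and replication ideas above can be sketched in a few lines. This is a minimal illustration, not a production design: the class names `RoundRobinBalancer` and `ReplicatedStore`, the node names, and the choice of round-robin rotation and write-to-every-replica are all assumptions made for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out server nodes in rotation so requests spread evenly."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server node is required")
        self._rotation = cycle(list(servers))

    def pick(self):
        # Each call returns the next node in the rotation.
        return next(self._rotation)


class ReplicatedStore:
    """Keep one copy of every key on each database replica."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        # Write to every replica so any one of them can answer reads.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key, replica_index=0):
        return self.replicas[replica_index][key]


# Hypothetical node names and data, for illustration only.
balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.pick() for _ in range(6)]

store = ReplicatedStore(replica_count=3)
store.write("user:1", "Ada")
copies = [store.read("user:1", i) for i in range(3)]
```

Six requests land twice on each of the three server nodes, and every replica returns the same value. Real systems also have to handle a replica being unreachable during a write, which this sketch ignores.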
Transition to a Distributed System
Characteristics:
Transition from single-node to multi-node architecture.
System operates coherently and autonomously despite multiple nodes.
Goal:
Prevent system failure due to single-node issues.
Scale resources to handle increased user load.
Key Takeaways
Load Handling:
Increased resources required to handle higher system loads.
Multi-node architecture mitigates single-node failure risks.
Autonomy and Coherence:
Nodes operate independently yet coherently to serve user needs.
Ensures a seamless experience despite the distributed architecture.
Hardware Limitations and Failures
Single nodes can crash due to hardware failures.
Scaling with a single node has limitations in terms of performance and capacity.
No hardware is infallible; everything degrades over time.
Consistency and Redundancy
Running multiple copies of a program on different nodes provides redundancy, as long as the copies are kept consistent with one another.
Redundancy prevents a total system failure if one or more nodes go down.
Scalability
As the user base grows, the system needs to scale to handle increased load.
Distributing server programs across multiple nodes helps manage the increased load efficiently.
System Reliability and Load Balancing
Distributing programs across nodes prevents overwhelming a single node with excessive requests.
Redundancy ensures that even if some nodes fail, the system still functions, albeit partially.
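A small sketch of how redundancy keeps the system serving requests when a node fails. The helper names (`make_node`, `send_with_failover`, `NodeDown`) and the node names are hypothetical, and trying nodes in a fixed order is only one possible policy.

```python
class NodeDown(Exception):
    """Raised by a node that cannot serve the request."""


def make_node(name, healthy=True):
    """Build a stand-in for a server node (illustrative only)."""
    def handle(request):
        if not healthy:
            raise NodeDown(name)
        return f"{name} handled {request}"
    return handle


def send_with_failover(nodes, request):
    """Try each redundant node in order; fail only if all are down."""
    failures = []
    for node in nodes:
        try:
            return node(request)
        except NodeDown as exc:
            failures.append(str(exc))
    raise RuntimeError(f"all nodes failed: {failures}")


nodes = [make_node("node-a", healthy=False), make_node("node-b")]
result = send_with_failover(nodes, "GET /feed")
```

Here `node-a` is down, so the request is served by `node-b`. Only when every node raises `NodeDown` does the caller see a failure.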
Distributed Systems: Precautions and Considerations
Evaluating the need for a distributed system is crucial.
It’s essential to weigh the benefits against the complexity and costs.
Business requirements and user needs should guide the decision to adopt a distributed system.
Decision-Making for Distributed Systems
Critical questions about scalability, downtime, and resource utilization need to be addressed.
Careful analysis helps determine if the complexity of a distributed system aligns with business needs.
Cautionary Approach
While distributed systems may seem attractive, careful evaluation is necessary before implementation.
Understanding the business requirements and use-case scenarios helps in making informed decisions.
What does fault tolerance mean in the context of distributed systems?
Fault tolerance in distributed systems refers to the system’s ability to continue operating and providing its services even in the presence of faults or failures in different components or nodes within the system. It involves designing the system to handle and recover from faults to maintain functionality.
Why do we need fault tolerance?
Other Fault Tolerance
Examples of Faults in Distributed Systems
Faults in distributed systems can come from hardware failures, software bugs, network disruptions, and sudden traffic spikes.
Importance of Fault Tolerance
Fault tolerance is essential in distributed systems because it prevents system-wide failures and keeps the system operating despite faults.
Strategies for Achieving Fault Tolerance
Strategies used to achieve fault tolerance in distributed systems include redundancy, replication, error handling, and graceful degradation.
Challenges in Implementing Fault Tolerance
Building fault-tolerant distributed systems is difficult because the system must maintain consistency and performance while tolerating faults.
What is the definition of reliability in the context of distributed systems?
Reliability in distributed systems refers to the system’s capability to function correctly even when facing faults or failures in various components or nodes within the system. It involves ensuring that the system continues to meet users’ expectations, handles errors gracefully, remains performant, and withstands malicious attacks or abuse.
Fault Tolerance:
Definition: Fault tolerance refers to a system’s ability to continue functioning, albeit in a degraded mode or reduced capacity, in the presence of faults or failures within its components. These faults could be hardware failures, software bugs, or network disruptions.
Focus: It primarily emphasizes the system’s ability to handle and recover from faults without causing complete system failure or severe disruption to its operations.
Strategies: Fault tolerance involves implementing mechanisms and strategies that enable a system to detect, isolate, and recover from faults. These may include redundancy, replication, error detection and correction codes, self-healing mechanisms, and graceful degradation.
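Graceful degradation, one of the strategies listed above, can be illustrated with a short sketch. The function `fetch_feed`, the in-memory `cache` dictionary, and the two stand-in primaries are hypothetical; the point is only that a failed primary leads to a reduced answer rather than an error.

```python
def fetch_feed(user_id, primary, cache):
    """Return (items, mode) where mode is 'fresh', 'stale', or 'empty'."""
    try:
        items = primary(user_id)
    except ConnectionError:
        # Primary node is unreachable: degrade instead of failing outright.
        if user_id in cache:
            return cache[user_id], "stale"
        return [], "empty"
    cache[user_id] = items  # remember the last good answer
    return items, "fresh"


def healthy_primary(user_id):
    return [f"post-for-{user_id}"]


def broken_primary(user_id):
    raise ConnectionError("primary node unreachable")


cache = {}
first = fetch_feed("u1", healthy_primary, cache)
second = fetch_feed("u1", broken_primary, cache)
third = fetch_feed("u2", broken_primary, cache)
```

The first call returns fresh data, the second falls back to the cached copy, and the third returns an empty feed because nothing was cached for that user. In all three cases the caller gets a usable response.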
Reliability:
Definition: Reliability in a distributed system refers to the system’s capability to consistently provide correct and expected functionality, delivering the intended service or output even in the face of faults or failures.
Focus: It emphasizes the overall dependability and consistency of the system in fulfilling its intended purpose or service, regardless of occasional faults that may occur within its components.
Scope: Reliability encompasses not only fault tolerance but also the system’s ability to maintain correctness, availability, and performance, meeting users’ expectations under normal and fault scenarios.
Fault Tolerance, Reliability Differences
Emphasis: Fault tolerance focuses on the system’s ability to handle faults and continue operating despite the presence of failures, whereas reliability focuses on consistently delivering correct and expected functionality regardless of faults.
Approach: Fault tolerance involves implementing specific mechanisms and strategies to address faults and mitigate their impact on the system. Reliability, on the other hand, is a broader concept encompassing the overall trustworthiness and dependability of the system.
Outcome: The goal of fault tolerance is to minimize the impact of faults by providing mechanisms for fault recovery and resilience. Reliability aims to ensure consistent and correct system behavior, maintaining service quality and user expectations.
Fault Tolerance:
Definition: System’s ability to continue operating despite component faults or failures.
Focus: Handling faults without causing complete system failure.
Strategies: Redundancy, replication, error detection, self-healing mechanisms.
Reliability:
Definition: System’s consistency in providing correct functionality even with faults.
Focus: Consistently delivering expected services despite occasional faults.
Scope: Encompasses correctness, availability, and meeting user expectations.
Handling Hardware Faults:
RAID: A Redundant Array of Independent Disks stores data redundantly across multiple drives.
Purpose: Ensures data recovery in case of disk failures, reducing downtime.
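To make the recovery idea concrete, here is a sketch of the XOR parity scheme used by parity-based RAID levels such as RAID 5. It is a simplification under stated assumptions: the three byte strings stand in for blocks on three drives, and real arrays stripe data and rotate parity across drives, which is not modeled here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Illustrative data blocks, one per drive (all the same length).
data_blocks = [b"node", b"disk", b"raid"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive and rebuilding it from the rest.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block, so `rebuilt` equals `b"disk"`.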
Handling Software Faults:
Testing: Writing unit, integration, and end-to-end tests to ensure code correctness and interaction reliability.
Error Handling: Implementing robust error handling across the codebase to manage unexpected behaviors.
Monitoring: Continuous monitoring of critical system parts, adding logs, and using third-party services for immediate issue detection.
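The three practices above can be seen together in one small sketch: a function that handles bad input, a log line that monitoring could pick up, and unit tests that check both paths. The function `parse_quantity` and the logger name `my_cool_app` are hypothetical names chosen for the example.

```python
import logging
import unittest

logger = logging.getLogger("my_cool_app")


def parse_quantity(raw):
    """Convert user input to a positive int; return None on bad input."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Log the rejection so monitoring can surface it.
        logger.warning("rejected non-numeric quantity: %r", raw)
        return None
    if value <= 0:
        logger.warning("rejected non-positive quantity: %r", raw)
        return None
    return value


class ParseQuantityTest(unittest.TestCase):
    def test_valid_input(self):
        self.assertEqual(parse_quantity("3"), 3)

    def test_bad_input_is_handled_and_logged(self):
        with self.assertLogs("my_cool_app", level="WARNING"):
            self.assertIsNone(parse_quantity("three"))
```

Saved to a file, the tests can be run with `python -m unittest`. The second test checks that bad input is both handled and logged, which is the hook a monitoring service would rely on.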
Summary:
Hardware Faults: Mitigated using RAID for redundancy, ensuring data retrieval and system recovery.
Software Faults: Managed through rigorous testing, error handling, and active system monitoring for fault detection and response.
Availability Definition:
Property: Ensures the system is ready to serve users when needed.
Calculation: Availability A = U / (U + D), where U is the time the system was up and D is the time the system was down.
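A worked instance of the formula, as a minimal sketch. The downtime figure of 8.76 hours per year is an assumed example value, not a measurement from any real system.

```python
def availability(uptime, downtime):
    """Compute A = U / (U + D) for uptime U and downtime D in the same unit."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("observed time must be positive")
    return uptime / total


HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year
downtime_hours = 8.76      # assumed example value
a = availability(HOURS_PER_YEAR - downtime_hours, downtime_hours)
print(f"availability: {a:.3%}")
```

With 8.76 hours of downtime in a year, availability comes to about 0.999, often described as "three nines".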
Motivation for Availability:
Business Impact: Downtime equals potential monetary loss and user dissatisfaction.
Use Cases: Critical for e-commerce, learning platforms, and any service dependent on user accessibility.