Introduction Flashcards
What is a distributed system?
A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
Definition of a Distributed System
Formal Definition:
A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
Attributes:
Comprises multiple computing nodes.
Nodes interact autonomously to achieve a common goal.
Users perceive the system as unified despite its distributed nature.
Computing Elements in a Distributed System
Definition:
In a distributed system, computing elements are autonomous entities capable of computation.
In context, these elements are physical machines (nodes) interacting to serve users or other systems.
Characteristics:
Nodes operate independently and can run distinct programs.
They communicate and coordinate to achieve system goals.
Coherent Operation in a Distributed System
User Perspective:
Users interacting with a distributed system experience a unified and coherent system.
The system’s operation appears seamless, and users’ expectations are met consistently.
Example:
Instagram functions as a distributed system, providing a smooth experience across its features despite running on multiple nodes.
Key Aspects of a Distributed System
Node Functionality:
Nodes interact to achieve shared objectives.
Each node performs autonomous tasks.
User Experience:
A distributed system presents a unified and consistent interface to users.
Users interact with the system without noticing its distributed nature.
Achieving Coherence
Objective:
Distributed systems aim to avoid inconsistencies in user experience.
Efforts focus on maintaining coherence across system functionalities.
Challenge:
Achieving coherence in a distributed system is complex due to its decentralized nature.
Most Obvious Distributed System
Example:
The internet serves as a widespread distributed system.
Networked nodes globally interact to facilitate data exchange and user interaction.
Characteristics:
Comprised of numerous interconnected nodes.
Seamlessly serves users’ requests across multiple servers.
Simple Example: My Cool App
Scenario:
Development of a mobile app named “My Cool App” for user interaction.
Requires server interaction and data retrieval from a database.
Initial Setup:
Single node hosts both the server program and the database.
Potential risks include data loss on node failure.
Single-Node Risk Mitigation
Strategy:
Separate the server and database programs onto distinct nodes.
Mitigates risks associated with single-node failure.
Enhancement:
Increases system resilience by distributing server and database functionalities across separate nodes.
Strengthening the System
Enhancement Approach:
Expand the fleet by adding more server and database nodes.
Load balancing ensures even distribution of user requests across multiple server nodes.
Replication:
Database nodes contain replicated data for redundancy and data consistency.
Transition to a Distributed System
Characteristics:
Transition from single-node to multi-node architecture.
System operates coherently and autonomously despite multiple nodes.
Goal:
Prevent system failure due to single-node issues.
Scale resources to handle increased user load.
Key Takeaways
Load Handling:
Increased resources required to handle higher system loads.
Multi-node architecture mitigates single-node failure risks.
Autonomy and Coherence:
Nodes operate independently yet coherently to serve user needs.
Ensures a seamless experience despite the distributed architecture.
Hardware Limitations and Failures
Single nodes can crash due to hardware failures.
Scaling with a single node has limitations in terms of performance and capacity.
No hardware is infallible; everything degrades over time.
Consistency and Redundancy
Running multiple copies of programs and data on different nodes provides redundancy, provided those copies are kept consistent with one another.
Redundancy prevents a total system failure if one or more nodes go down.
Scalability
As the user base grows, the system needs to scale to handle increased load.
Distributing server programs across multiple nodes helps manage the increased load efficiently.
System Reliability and Load Balancing
Distributing programs across nodes prevents overwhelming a single node with excessive requests.
Redundancy ensures that even if some nodes fail, the system still functions, albeit partially.
Distributed Systems: Precautions and Considerations
Evaluating the need for a distributed system is crucial.
It’s essential to weigh the benefits against the complexity and costs.
Business requirements and user needs should guide the decision to adopt a distributed system.
Decision-Making for Distributed Systems
Critical questions about scalability, downtime, and resource utilization need to be addressed.
Careful analysis helps determine if the complexity of a distributed system aligns with business needs.
Cautionary Approach
While distributed systems may seem attractive, careful evaluation is necessary before implementation.
Understanding the business requirements and use-case scenarios helps in making informed decisions.
What does fault tolerance mean in the context of distributed systems?
Fault tolerance in distributed systems refers to the system’s ability to continue operating and providing its services even in the presence of faults or failures in different components or nodes within the system. It involves designing the system to handle and recover from faults to maintain functionality.
Why do we need fault tolerance?
Faults are inevitable in a distributed system, so it must be designed to keep operating despite them. The following cards cover examples of faults, why tolerance matters, common strategies, and the challenges involved.
Examples of Faults in Distributed Systems
Highlight various scenarios like hardware failures, software bugs, network disruptions, and sudden traffic spikes that can cause faults in distributed systems.
Importance of Fault Tolerance
Why is fault tolerance essential in distributed systems? Discuss how it prevents system-wide failures and ensures continuous operation despite faults.
Strategies for Achieving Fault Tolerance
Explain different strategies like redundancy, replication, error handling, and graceful degradation used to achieve fault tolerance in distributed systems.
Challenges in Implementing Fault Tolerance
Explore the difficulties and complexities involved in building fault-tolerant distributed systems, such as maintaining consistency and performance while tolerating faults.
What is the definition of reliability in the context of distributed systems?
Reliability in distributed systems refers to the system’s capability to function correctly even when facing faults or failures in various components or nodes within the system. It involves ensuring that the system continues to serve users’ expectations, handles errors gracefully, remains performant, and endures malicious attacks or abuses.
Fault Tolerance:
Definition: Fault tolerance refers to a system’s ability to continue functioning, albeit in a degraded mode or reduced capacity, in the presence of faults or failures within its components. These faults could be hardware failures, software bugs, or network disruptions.
Focus: It primarily emphasizes the system’s ability to handle and recover from faults without causing complete system failure or severe disruption to its operations.
Strategies: Fault tolerance involves implementing mechanisms and strategies that enable a system to detect, isolate, and recover from faults. These may include redundancy, replication, error detection and correction codes, self-healing mechanisms, and graceful degradation.
Reliability:
Definition: Reliability in a distributed system refers to the system’s capability to consistently provide correct and expected functionality, delivering the intended service or output even in the face of faults or failures.
Focus: It emphasizes the overall dependability and consistency of the system in fulfilling its intended purpose or service, regardless of occasional faults that may occur within its components.
Scope: Reliability encompasses not only fault tolerance but also the system’s ability to maintain correctness, availability, and performance, meeting users’ expectations under normal and fault scenarios.
Fault Tolerance vs. Reliability: Key Differences
Emphasis: Fault tolerance focuses on the system’s ability to handle faults and continue operating despite the presence of failures, whereas reliability focuses on consistently delivering correct and expected functionality regardless of faults.
Approach: Fault tolerance involves implementing specific mechanisms and strategies to address faults and mitigate their impact on the system. Reliability, on the other hand, is a broader concept encompassing the overall trustworthiness and dependability of the system.
Outcome: The goal of fault tolerance is to minimize the impact of faults by providing mechanisms for fault recovery and resilience. Reliability aims to ensure consistent and correct system behavior, maintaining service quality and user expectations.
Fault Tolerance:
Definition: System’s ability to continue operating despite component faults or failures.
Focus: Handling faults without causing complete system failure.
Strategies: Redundancy, replication, error detection, self-healing mechanisms.
Reliability:
Definition: System’s consistency in providing correct functionality even with faults.
Focus: Consistently delivering expected services despite occasional faults.
Scope: Encompasses correctness, availability, and meeting user expectations.
Handling Hardware Faults:
RAID: Redundant Array of Independent Disks to store data redundantly across multiple drives.
Purpose: Ensures data recovery in case of disk failures, reducing downtime.
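The mirroring idea behind RAID can be pictured with a toy example. The sketch below is a hypothetical in-memory illustration of RAID 1-style mirroring (real RAID operates at the block-device level): every write goes to two "drives", so data is still readable if one of them fails.

```python
# Toy illustration of RAID 1-style mirroring (hypothetical in-memory "drives");
# real RAID works on disk blocks, not Python dicts.

class MirroredStore:
    def __init__(self):
        self.drives = [{}, {}]          # two redundant copies of the data
        self.failed = [False, False]    # simulated drive health

    def write(self, key, value):
        # Every write lands on all healthy drives.
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                drive[key] = value

    def read(self, key):
        # Read from any healthy drive that holds the key.
        for i, drive in enumerate(self.drives):
            if not self.failed[i] and key in drive:
                return drive[key]
        raise KeyError(f"{key} unavailable: all drives failed or missing data")

store = MirroredStore()
store.write("user:1", "Alice")
store.failed[0] = True            # simulate one disk failure
print(store.read("user:1"))       # data survives: "Alice"
```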
Handling Software Faults:
Testing: Writing unit, integration, and end-to-end tests to ensure code correctness and interaction reliability.
Error Handling: Implementing robust error handling across the codebase to manage unexpected behaviors.
Monitoring: Continuous monitoring of critical system parts, adding logs, and using third-party services for immediate issue detection.
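As a rough illustration of the error-handling and monitoring points above, here is a minimal sketch assuming a hypothetical fetch_profile call that fails transiently; a real system would use structured logging and an external monitoring/alerting service.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("my-cool-app")

def fetch_profile(user_id: int) -> dict:
    # Hypothetical downstream call that fails transiently.
    if random.random() < 0.3:
        raise ConnectionError("database node unreachable")
    return {"id": user_id, "name": "Alice"}

def fetch_profile_with_retries(user_id: int, attempts: int = 3) -> dict:
    for attempt in range(1, attempts + 1):
        try:
            return fetch_profile(user_id)
        except ConnectionError as exc:
            # Log every failure so monitoring can alert on rising error rates.
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            time.sleep(0.1 * attempt)   # simple backoff
    raise RuntimeError(f"could not fetch profile for user {user_id}")

# May still raise RuntimeError if every attempt fails.
print(fetch_profile_with_retries(1))
```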
Summary:
Hardware Faults: Mitigated using RAID for redundancy, ensuring data retrieval and system recovery.
Software Faults: Managed through rigorous testing, error handling, and active system monitoring for fault detection and response.
Availability Definition:
Property: Ensures the system is ready to serve users when needed.
Calculation: Availability A = U / (U + D), where U is the time the system was up and D is the time it was down.
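A minimal sketch of this calculation, with made-up uptime and downtime figures:

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability A = U / (U + D), expressed as a percentage."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)

# Hypothetical year: up for 8,750 hours, down for 10 hours.
print(round(availability(8750, 10), 3))   # 99.886 (% availability)
```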
Motivation for Availability:
Business Impact: Downtime equals potential monetary loss and user dissatisfaction.
Use Cases: Critical for e-commerce, learning platforms, and any service dependent on user accessibility.
Availability Chart:
Availability percentage : allowed downtime per year
90% : 36.5 days
95% : 18.25 days
99% : 3.65 days
99.5% : 1.83 days
99.9% : 8.76 hours
99.95% : 4.38 hours
99.99% : 52.6 minutes
99.999% : 5.26 minutes
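The chart follows directly from the definition; the sketch below derives the yearly downtime budget for a few common targets (figures assume a 365-day year):

```python
HOURS_PER_YEAR = 365 * 24   # 8,760

def allowed_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for target in (99, 99.9, 99.99, 99.999):
    print(f"{target}%: {allowed_downtime_hours(target):.2f} hours/year")
# 99%     -> 87.60 hours  (~3.65 days)
# 99.9%   -> 8.76 hours
# 99.99%  -> 0.88 hours   (~52.6 minutes)
# 99.999% -> 0.09 hours   (~5.26 minutes)
```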
Practices and Key Considerations:
System’s Expectation: Aim for higher availability percentages to reduce downtime.
Denoting Availability: Represented by “nines” (e.g., 99.99% is 4-nines).
SPOF
Definition: A component whose failure can bring down the entire system.
Example: A home router; if it fails, the whole home loses internet access.
Achieving Availability:
Redundancy: Add backup resources to handle system failures and avoid Single Points of Failure (SPoF).
Expectations: Set realistic availability expectations; even the best systems can face rare, simultaneous failures.
SLA, SLO, and SLI:
Service Level Agreement (SLA): Agreement between system and clients outlining requirements, e.g., uptime or response time.
Service Level Objective (SLO): Specific promise in an SLA, e.g., loading feeds within 200ms.
Service Level Indicator (SLI): Actual measured metrics compared to the SLA/SLO, e.g., promised uptime vs. actual uptime.
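A small illustration of how the three relate, assuming a hypothetical SLO of 99.9% monthly uptime and made-up downtime figures; the measured value plays the role of the SLI:

```python
SLO_UPTIME_PCT = 99.9   # hypothetical promise written into the SLA

def measured_uptime_pct(total_minutes: int, downtime_minutes: int) -> float:
    """The SLI: uptime actually observed over the period."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes

# A 30-day month has 43,200 minutes; suppose 50 minutes of downtime were recorded.
sli = measured_uptime_pct(43_200, 50)
print(f"SLI = {sli:.3f}% vs SLO = {SLO_UPTIME_PCT}% -> met: {sli >= SLO_UPTIME_PCT}")
```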
Measurement and Improvement:
Continuous Tuning: SLA, SLO, and SLI measurements tune expectations over time for better system improvements.
Focus on Improvements: These metrics guide system-level improvements and focus areas for better service delivery.
Scalability Definition:
Capability: It’s the system’s ability to handle increased usage or load efficiently.
Increased Load: Addressing system capability when user count or activity surges.
Scalability Questions:
System Impact: How does the user base increase affect the system’s performance?
Handling Increased Load: Can the system manage the amplified load without issues?
Resource Addition: Strategies for seamlessly adding computing resources to accommodate the load surge.
Vertical Scaling (Scaling Up):
Upgrade Node: Replace existing machines with more powerful ones.
Resource Enhancement: Add more RAM, CPU, storage, etc., to handle increased demands.
Limitations: Costly, hardware limitations, single point of failure.
Vertical Scaling Limitations:
Costly Approach: Continuous investment in larger machines.
Hardware Limitations: Eventually reaches a point of infeasibility with system growth.
Single Point of Failure: Relying on one machine’s resources.
Horizontal Scaling (Scaling Out):
Add More Nodes: Distribute load among multiple nodes to handle increased demands.
Cost Effectiveness: Utilize cheaper hardware, scalability, and redundancy.
Infinite Scaling Capacity: Potential for seemingly unlimited scalability.
Benefits of Horizontal Scaling:
Cost-Effectiveness: Running on cheaper hardware with added redundancy.
Infinite Scaling: If managed well, offers seemingly unlimited scaling capacity.
Avoid Single Points of Failure: Load distributed among multiple nodes, avoiding a single point of failure.
Scalability Conclusion:
Necessity in Growth: Crucial for systems as they evolve and face increased user demands.
Scaling Techniques: Vertical scaling becomes limited, while horizontal scaling is the industry standard for large systems.
What is a load balancer?
A load balancer (LB) is a node in a distributed system that aims to distribute incoming traffic evenly among the actual server nodes.
Unhealthy nodes can be detected by the load balancer
Each server node can expose a /status endpoint that the LB hits periodically. Based on the response code, the LB marks the node as healthy or unhealthy and takes further action (such as removing it from rotation) to keep the system highly available; a minimal sketch follows below.
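A minimal sketch of such a health check, assuming each node exposes a /status endpoint that returns HTTP 200 when healthy (the node URLs and the use of the requests library are illustrative):

```python
import requests

def is_healthy(node_url: str) -> bool:
    """Probe the node's /status endpoint; treat anything but HTTP 200 as unhealthy."""
    try:
        resp = requests.get(f"{node_url}/status", timeout=2)
        return resp.status_code == 200
    except requests.RequestException:
        # Timeouts and connection errors also mean the node is unreachable.
        return False

# Hypothetical server pool; the LB only routes to nodes that pass the check.
nodes = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
healthy_nodes = [n for n in nodes if is_healthy(n)]
print("routing traffic to:", healthy_nodes)
```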
Load Balancer Definition:
Purpose: Distributes incoming traffic across multiple server nodes.
Even Load Distribution: Ensures no server node is overwhelmed with requests.
Load Balancer Role:
Traffic Management: Sits between clients and server nodes, routing requests.
Distribution Algorithm: Routes requests based on predefined or combined algorithms.
Health Checks and High Availability:
Monitoring Nodes: Regularly checks server node health to avoid routing to unhealthy nodes.
Actions on Unhealthy Nodes: Stops routing requests, triggers alerts, initiates new node setup for high availability.
System Scaling:
Load-Triggered Scaling: Initiates node creation based on increased load to maintain system balance.
Regular Node Status Checks: Periodically verifies the status of seemingly faulty nodes for reintegration.
Load Balancer Single Point of Failure:
SPoF Risk: A single load balancer can lead to complete system downtime if it fails.
Redundancy Solution: Implementing multiple load balancers in a cluster with propagated IP addresses via DNS for failover.
Key Significance:
Critical Component: Vital for system resilience in distributed setups.
SPoF Mitigation: Conscious development and redundancy measures prevent load balancer-induced failures.
Application-layer Algorithms:
Hashing:
Utilizes predefined attributes (e.g., user_ID) to generate a hash value mapped to server nodes.
Endpoint Evaluation:
Routes requests based on different endpoint targets (e.g., /photos, /videos) to respective server sets.
Network-layer Algorithms:
Random Selection:
Routes requests randomly to any server node based on probability calculations.
Round Robin:
Distributes requests to servers in sequence (e.g., Server 1, Server 2, Server 3, repeat).
Least Connection:
Routes requests to the server with the fewest active connections.
IP Hashing:
Uses hashed source IP addresses to route requests to specific servers.
Least Pending Requests:
Routes incoming requests to the server with the fewest pending requests to optimize response times.
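A minimal sketch of three of these algorithms (round robin, least connection, IP hashing) over a hypothetical pool of three servers; a production load balancer would additionally track connection lifetimes, weights, and node health:

```python
import itertools
import hashlib

servers = ["server-1", "server-2", "server-3"]   # hypothetical backend pool

# Round robin: hand out servers in a fixed rotation.
_rr = itertools.cycle(servers)
def round_robin() -> str:
    return next(_rr)

# Least connection: pick the server currently holding the fewest active connections.
active_connections = {s: 0 for s in servers}
def least_connection() -> str:
    server = min(active_connections, key=active_connections.get)
    active_connections[server] += 1    # caller should decrement when the request finishes
    return server

# IP hashing: the same client IP always maps to the same server.
def ip_hash(client_ip: str) -> str:
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print([round_robin() for _ in range(4)])               # server-1, server-2, server-3, server-1
print(least_connection(), least_connection())          # picks the least-loaded servers in turn
print(ip_hash("203.0.113.7"), ip_hash("203.0.113.7"))  # same IP -> same server
```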
Choosing the Right Algorithm:
System-specific Selection:
Decide between network-layer or application-layer load balancing based on system needs.
Consider Request Patterns:
Short request-response patterns may suit round-robin or random algorithms.
System Response Times:
Systems with varied response times benefit from algorithms like least connection or least response time.
Client Behavior Impact:
Algorithms relying on certain attributes (e.g., IP hashing) may face imbalance due to client behavior, necessitating careful consideration.
Endpoint-based load balancing: routes requests to different server sets based on the endpoint targeted (e.g., /photos vs. /videos).
Measuring load in distributed systems
Queries per Second (QPS):
Calculates the rate of requests a node handles per second.
Offers a simple metric, but may not accurately represent variations in load distribution across time periods.
Read-to-Write Ratio (r/w):
Determines if a system is read-heavy or write-heavy based on the proportion of read and write operations.
Read-heavy systems might need read replicas, while scaling write-heavy systems requires careful synchronization among nodes.
Additional Load Metrics:
Data size, response-request size, number of concurrent users, etc., are system-specific metrics that can also gauge load.
Measuring Performance
Response Time:
Represents the time taken for a client to receive a response after sending a request.
Average response time gives an overall view but might not reflect delays for specific users.
Percentiles:
Percentiles, like p50, p95, p99, indicate the response time for a certain percentage of requests.
Higher percentiles (e.g., p99) expose the slower response times experienced by a small proportion of requests.
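A small sketch of a nearest-rank percentile calculation over made-up response times, showing how p95/p99 expose a slow tail that the average hides:

```python
# Hypothetical response times (ms) collected from 20 recent requests, already sorted.
response_times_ms = [12, 13, 14, 15, 15, 16, 17, 18, 19, 20,
                     21, 22, 24, 26, 28, 30, 35, 45, 120, 400]

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
# p50 = 20 ms, p95 = 120 ms, p99 = 400 ms, while the average (~45 ms) hides the slow tail.
```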
Deciding When to Scale:
Monitoring load metrics and performance indicators informs decisions on scaling requirements.
Higher QPS or response times beyond acceptable percentiles signal the need for scaling to meet increased demand.
Business Considerations:
Beyond a certain point, optimizing performance might incur high costs without significant benefit.
Ensuring resources align with the system’s needs is crucial for efficient operations.
QPS
How many queries or requests a node needs to handle each second. For example, a server that receives 10M requests per day handles on average 10,000,000 / 86,400 ≈ 116 requests per second.
Read-to-write ratio
An r/w value of 10 means that for each write, the database serves 10 read operations.
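A tiny sketch putting these two load metrics together, using hypothetical request counts:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

def average_qps(requests_per_day: int) -> float:
    return requests_per_day / SECONDS_PER_DAY

def read_write_ratio(reads: int, writes: int) -> float:
    return reads / writes

print(round(average_qps(10_000_000)))        # ~116 QPS for 10M requests/day
print(read_write_ratio(5_000_000, 500_000))  # 10.0 -> a read-heavy system
```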
Average response time
The mean time a client waits for a response; it gives an overall picture but can hide long-tail delays, which percentiles expose.
Maintainability of Distributed Systems
Maintainability Aspects:
Ease of Operation
Monitoring and Logging
Automation Support
Thorough Documentation
Self-Healing Mechanisms
Surprise Minimization
Maintainability Aspects:
Ease of Extension
Good Coding Practices
Refactoring Habits
Defined Deployment Procedures
Simple flow of data in a system: data is collected, stored, processed, and turned into insights (detailed in the following cards).
Data-Driven Modern Systems
Characteristics:
Data-Centric Systems
Intensified Data Handling
Data as a Business Driver
Data Handling in Modern Systems
Data Collection:
Personal Data (Sensitive)
User-Interaction Data (Logged Events)
Data Storage:
Various Storage Options
Choice Based on Data Volume & Usage
Data Processing in Modern Systems
Processing Stage:
Data Transformation for Usability
Challenge with Large Data Volume
Insights Generation
Insights from Data:
Useful Information for Business Growth
Predictive Analytics & Behavior Modeling
Key Takeaways
Data Intensity Impact:
Critical for Business Success
Requires Proficiency in Data Handling
Careful Collection, Processing, and Use of Data
Replication
Data Replication Essentials
Definition:
Storing Identical Data Copies Across Multiple Machines
Purpose:
Enhancing Read/Write Capacity
Increasing System Reliability
Importance of Network Connectivity
Network Connection:
Essential for Data Sync Across Machines
Communication for Data Changes
Network Reliability:
Critical for Data Replication
Challenges in Replication
Complexities:
Network Unreliability
Machine Failures
Storage Capacity Limitations
Dynamic Data Handling:
Challenges in Handling Changing Data
Replication Process
Scenarios:
Large Data Volumes
Multiple Machines Handling Data
Continuous Data Changes
Objective:
Synchronized Updates Across All Nodes
Consistent Data Availability
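A toy sketch of the idea, assuming a single-leader setup where the leader forwards every write to its followers; real replication must also cope with network failures, lag, and conflicts:

```python
# Toy in-memory sketch of single-leader replication: writes go to the leader,
# which forwards them to every follower so all nodes converge on the same data.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.data: dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value

class Leader(Node):
    def __init__(self, name: str, followers: list[Node]):
        super().__init__(name)
        self.followers = followers

    def write(self, key: str, value: str) -> None:
        self.apply(key, value)               # commit locally first
        for follower in self.followers:      # then propagate the change
            follower.apply(key, value)

followers = [Node("replica-1"), Node("replica-2")]
leader = Leader("primary", followers)
leader.write("user:1", "Alice")

# Any replica can now serve the read, even if the leader becomes unavailable.
print([f.data["user:1"] for f in followers])   # ['Alice', 'Alice']
```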
Replication Benefits:
Ensures Availability Despite Node Failures
Enhances System Scalability
Critical for Consistent Data Responses
Complexity with Volume:
Increased Data Volume Adds Complexity to Replication