Introduction Flashcards

1
Q

What is a distributed system?

A

A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.

2
Q

Definition of a Distributed System

A

Formal Definition:

A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system.
Attributes:

Comprises multiple computing nodes.
Nodes interact autonomously to achieve a common goal.
Users perceive the system as unified despite its distributed nature.

3
Q

Computing Elements in a Distributed System

A

Definition:

In a distributed system, computing elements are autonomous entities capable of computation.
In context, these elements are physical machines (nodes) interacting to serve users or other systems.
Characteristics:

Nodes operate independently and can run distinct programs.
They communicate and coordinate to achieve system goals.

4
Q

Coherent Operation in a Distributed System

A

User Perspective:

Users interacting with a distributed system experience a unified and coherent system.
The system’s operation appears seamless, and users’ expectations are met consistently.
Example:

Instagram functions as a distributed system, providing a smooth experience across its features despite running on multiple nodes.

5
Q

Key Aspects of a Distributed System

A

Node Functionality:

Nodes interact to achieve shared objectives.
Each node performs autonomous tasks.
User Experience:

A distributed system presents a unified and consistent interface to users.
Users interact with the system without noticing its distributed nature.

6
Q

Achieving Coherence

A

Objective:

Distributed systems aim to avoid inconsistencies in user experience.
Efforts focus on maintaining coherence across system functionalities.
Challenge:

Achieving coherence in a distributed system is complex due to its decentralized nature.

7
Q

Most Obvious Distributed System

A

Example:

The Internet is the most obvious example: a globally distributed system.
Networked nodes across the globe interact to facilitate data exchange and user interaction.
Characteristics:

Composed of numerous interconnected nodes.
Seamlessly serves users’ requests across multiple servers.

9
Q

Simple Example: My Cool App

A

Scenario:

Development of a mobile app named “My Cool App” for user interaction.
Requires server interaction and data retrieval from a database.
Initial Setup:

Single node hosts both the server program and the database.
Potential risks include data loss on node failure.

10
Q

Single-Node Risk Mitigation

A

Strategy:

Separate the server and database programs onto distinct nodes.
Mitigates risks associated with single-node failure.
Enhancement:

Increases system resilience by distributing server and database functionalities across separate nodes.

11
Q

Strengthening the System

A

Enhancement Approach:

Expand the fleet by adding more server and database nodes.
Load balancing ensures even distribution of user requests across multiple server nodes.
Replication:

Database nodes contain replicated data for redundancy and data consistency.

12
Q

Transition to a Distributed System

A

Characteristics:

Transition from single-node to multi-node architecture.
System operates coherently and autonomously despite multiple nodes.
Goal:

Prevent system failure due to single-node issues.
Scale resources to handle increased user load.

13
Q

Key Takeaways

A

Load Handling:

Increased resources required to handle higher system loads.
Multi-node architecture mitigates single-node failure risks.
Autonomy and Coherence:

Nodes operate independently yet coherently to serve user needs.
Ensures a seamless experience despite the distributed architecture.

14
Q

Hardware Limitations and Failures

A

Single nodes can crash due to hardware failures.
Scaling with a single node has limitations in terms of performance and capacity.
No hardware is infallible; everything degrades over time.

15
Q

Consistency and Redundancy

A

Running multiple copies of programs on different nodes ensures consistency and redundancy.
Redundancy prevents a total system failure if one or more nodes go down.

16
Q

Scalability

A

As the user base grows, the system needs to scale to handle increased load.
Distributing server programs across multiple nodes helps manage the increased load efficiently.

17
Q

System Reliability and Load Balancing

A

Distributing programs across nodes prevents overwhelming a single node with excessive requests.
Redundancy ensures that even if some nodes fail, the system still functions, albeit partially.

18
Q

Distributed Systems: Precautions and Considerations

A

Evaluating the need for a distributed system is crucial.
It’s essential to weigh the benefits against the complexity and costs.
Business requirements and user needs should guide the decision to adopt a distributed system.

19
Q

Decision-Making for Distributed Systems

A

Critical questions about scalability, downtime, and resource utilization need to be addressed.
Careful analysis helps determine if the complexity of a distributed system aligns with business needs.

20
Q

Cautionary Approach

A

While distributed systems may seem attractive, careful evaluation is necessary before implementation.
Understanding the business requirements and use-case scenarios helps in making informed decisions.

21
Q

What does fault tolerance mean in the context of distributed systems?

A

Fault tolerance in distributed systems refers to the system’s ability to continue operating and providing its services even in the presence of faults or failures in different components or nodes within the system. It involves designing the system to handle and recover from faults to maintain functionality.

22
Q

Why do we need fault tolerance?

A

Without fault tolerance, a fault in a single component can cascade into a system-wide failure. Fault tolerance lets the system keep serving users despite hardware failures, software bugs, network disruptions, or sudden traffic spikes.
23
Q

Fault Tolerance: Key Points

A

Examples of Faults in Distributed Systems

Hardware failures, software bugs, network disruptions, and sudden traffic spikes can all cause faults in distributed systems.
Importance of Fault Tolerance

Fault tolerance prevents system-wide failures and ensures continuous operation despite faults.
Strategies for Achieving Fault Tolerance

Common strategies include redundancy, replication, error handling, and graceful degradation.
Challenges in Implementing Fault Tolerance

Building fault-tolerant systems is complex: consistency and performance must be maintained while tolerating faults.

24
Q

What is the definition of reliability in the context of distributed systems?

A

Reliability in distributed systems refers to the system’s capability to function correctly even when facing faults or failures in various components or nodes within the system. It involves ensuring that the system continues to serve users’ expectations, handles errors gracefully, remains performant, and endures malicious attacks or abuses.

25
Q

Fault Tolerance:

A

Definition: Fault tolerance refers to a system’s ability to continue functioning, albeit in a degraded mode or reduced capacity, in the presence of faults or failures within its components. These faults could be hardware failures, software bugs, or network disruptions.

Focus: It primarily emphasizes the system’s ability to handle and recover from faults without causing complete system failure or severe disruption to its operations.

Strategies: Fault tolerance involves implementing mechanisms and strategies that enable a system to detect, isolate, and recover from faults. These may include redundancy, replication, error detection and correction codes, self-healing mechanisms, and graceful degradation.

26
Q

Reliability:

A

Definition: Reliability in a distributed system refers to the system’s capability to consistently provide correct and expected functionality, delivering the intended service or output even in the face of faults or failures.

Focus: It emphasizes the overall dependability and consistency of the system in fulfilling its intended purpose or service, regardless of occasional faults that may occur within its components.

Scope: Reliability encompasses not only fault tolerance but also the system’s ability to maintain correctness, availability, and performance, meeting users’ expectations under normal and fault scenarios.

27
Q

Fault Tolerance, Reliability Differences

A

Emphasis: Fault tolerance focuses on the system’s ability to handle faults and continue operating despite the presence of failures, whereas reliability focuses on consistently delivering correct and expected functionality regardless of faults.

Approach: Fault tolerance involves implementing specific mechanisms and strategies to address faults and mitigate their impact on the system. Reliability, on the other hand, is a broader concept encompassing the overall trustworthiness and dependability of the system.

Outcome: The goal of fault tolerance is to minimize the impact of faults by providing mechanisms for fault recovery and resilience. Reliability aims to ensure consistent and correct system behavior, maintaining service quality and user expectations.

28
Q

Fault Tolerance:

A

Definition: System’s ability to continue operating despite component faults or failures.

Focus: Handling faults without causing complete system failure.

Strategies: Redundancy, replication, error detection, self-healing mechanisms.

29
Q

Reliability:

A

Definition: System’s consistency in providing correct functionality even with faults.

Focus: Consistently delivering expected services despite occasional faults.

Scope: Encompasses correctness, availability, and meeting user expectations.

30
Q

Handling Hardware Faults:

A

RAID: Redundant Array of Independent Disks to store data redundantly across multiple drives.

Purpose: Ensures data recovery in case of disk failures, reducing downtime.

31
Q

Handling Software Faults:

A

Testing: Writing unit, integration, and end-to-end tests to ensure code correctness and interaction reliability.

Error Handling: Implementing robust error handling across the codebase to manage unexpected behaviors.

Monitoring: Continuous monitoring of critical system parts, adding logs, and using third-party services for immediate issue detection.

32
Q

Summary:

A

Hardware Faults: Mitigated using RAID for redundancy, ensuring data retrieval and system recovery.

Software Faults: Managed through rigorous testing, error handling, and active system monitoring for fault detection and response.

33
Q

Availability Definition:

A

Property: Ensures the system is ready to serve users when needed.

Calculation: Availability (A) = (U / (U + D)), where U is the time system was up, and D is the time system was down.
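A quick sketch of the formula in code (the uptime and downtime figures below are hypothetical):

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """A = U / (U + D): the fraction of time the system was up."""
    return uptime_hours / (uptime_hours + downtime_hours)

# Hypothetical year: up for 8,756 of 8,760 hours (4 hours of downtime).
a = availability(8756, 4)
print(f"{a:.4%}")  # roughly 99.95% availability
```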

34
Q

Motivation for Availability:

A

Business Impact: Downtime equals potential monetary loss and user dissatisfaction.

Use Cases: Critical for e-commerce, learning platforms, and any service dependent on user accessibility.

35
Q

Availability Chart:

A

Availability : Allowed Downtime (per year)
90% : 36.5 days
95% : 18.25 days
99% : 3.65 days
99.5% : 1.83 days
99.9% : 8.76 hours
99.95% : 4.38 hours
99.99% : 52.6 minutes
99.999% : 5.26 minutes
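The chart values follow directly from allowed downtime = (1 - A) x 1 year. A small sketch that reproduces them (the unit thresholds are my own choice):

```python
def allowed_downtime_per_year(availability_pct: float) -> str:
    """Convert an availability percentage into allowed downtime per year."""
    minutes = (1 - availability_pct / 100) * 365 * 24 * 60
    if minutes >= 2 * 24 * 60:          # two days or more: report in days
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 120:                  # two hours or more: report in hours
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"

for pct in (90, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_per_year(pct)}")
```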

36
Q

Practices and Key Considerations:

A

System’s Expectation: Aim for higher availability percentages to reduce downtime.

Denoting Availability: Represented by “nines” (e.g., 99.99% is 4-nines).

37
Q

SPOF

A

Definition: A component whose failure can bring down the entire system.

Example: A home router; if it fails, the whole home loses internet access.

38
Q

Achieving Availability:

A

Redundancy: Add backup resources to handle system failures and avoid Single Points of Failure (SPoF).

Expectations: Set realistic availability expectations; even the best systems can face rare, simultaneous failures.

39
Q

SLA, SLO, and SLI:

A

Service Level Agreement (SLA): An agreement between the service provider and its clients outlining requirements, e.g., uptime or response time.

Service Level Objective (SLO): Specific promise in an SLA, e.g., loading feeds within 200ms.

Service Level Indicator (SLI): Actual measured metrics compared to the SLA/SLO, e.g., promised uptime vs. actual uptime.

40
Q

Measurement and Improvement:

A

Continuous Tuning: SLA, SLO, and SLI measurements tune expectations over time for better system improvements.

Focus on Improvements: These metrics guide system-level improvements and focus areas for better service delivery.

41
Q

Scalability Definition:

A

Capability: It’s the system’s ability to handle increased usage or load efficiently.

Increased Load: Addressing system capability when user count or activity surges.

42
Q

Scalability Questions:

A

System Impact: How does the user base increase affect the system’s performance?

Handling Increased Load: Can the system manage the amplified load without issues?

Resource Addition: Strategies for seamlessly adding computing resources to accommodate the load surge.

43
Q

Vertical Scaling (Scaling Up):

A

Upgrade Node: Replace existing machines with more powerful ones.

Resource Enhancement: Add more RAM, CPU, storage, etc., to handle increased demands.

Limitations: Costly, hardware limitations, single point of failure.

44
Q

Vertical Scaling Limitations:

A

Costly Approach: Continuous investment in larger machines.

Hardware Limitations: Eventually reaches a point of infeasibility with system growth.

Single Point of Failure: Relying on one machine’s resources.

45
Q

Horizontal Scaling (Scaling Out):

A

Add More Nodes: Distribute load among multiple nodes to handle increased demands.

Cost Effectiveness: Utilize cheaper hardware, scalability, and redundancy.

Infinite Scaling Capacity: Potential for seemingly unlimited scalability.

46
Q

Benefits of Horizontal Scaling:

A

Cost-Effectiveness: Running on cheaper hardware with added redundancy.

Infinite Scaling: If managed well, offers seemingly unlimited scaling capacity.

Avoid Single Points of Failure: Load distributed among multiple nodes, avoiding a single point of failure.

47
Q

Scalability Conclusion:

A

Necessity in Growth: Crucial for systems as they evolve and face increased user demands.

Scaling Techniques: Vertical scaling becomes limited, while horizontal scaling is the industry standard for large systems.

48
Q

What is a load balancer?

A

A load balancer (LB) is a node in a distributed system that distributes incoming traffic evenly among the actual server nodes.

49
Q

Unhealthy nodes can be detected by the load balancer

A

Each node can expose a /status endpoint that the LB periodically hits to check the response code. Based on the response, the LB declares a node healthy or unhealthy and takes further action to make sure the system remains highly available.

51
Q

Load Balancer Definition:

A

Purpose: Distributes incoming traffic across multiple server nodes.

Even Load Distribution: Ensures no server node is overwhelmed with requests.

52
Q

Load Balancer Role:

A

Traffic Management: Sits between clients and server nodes, routing requests.

Distribution Algorithm: Routes requests based on predefined or combined algorithms.

53
Q

Health Checks and High Availability:

A

Monitoring Nodes: Regularly checks server node health to avoid routing to unhealthy nodes.

Actions on Unhealthy Nodes: Stops routing requests, triggers alerts, initiates new node setup for high availability.

54
Q

System Scaling:

A

Load-Triggered Scaling: Initiates node creation based on increased load to maintain system balance.

Regular Node Status Checks: Periodically verifies the status of seemingly faulty nodes for reintegration.

55
Q

Load Balancer Single Point of Failure:

A

SPoF Risk: A single load balancer can lead to complete system downtime if it fails.

Redundancy Solution: Implementing multiple load balancers in a cluster with propagated IP addresses via DNS for failover.

56
Q

Key Significance:

A

Critical Component: Vital for system resilience in distributed setups.

SPoF Mitigation: Conscious development and redundancy measures prevent load balancer-induced failures.

57
Q

Application-layer Algorithms:

A

Hashing:
Utilizes predefined attributes (e.g., user_ID) to generate a hash value mapped to server nodes.
Endpoint Evaluation:
Routes requests based on different endpoint targets (e.g., /photos, /videos) to respective server sets.
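The hashing approach above can be sketched as follows (the server pool and user IDs are hypothetical; using a stable hash such as SHA-256 is one possible choice, since Python's built-in `hash()` is randomized between runs):

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical pool

def route_by_user(user_id: str) -> str:
    """Hash the user_id and map it onto one of the server nodes."""
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]

# The same user always lands on the same server.
assert route_by_user("alice") == route_by_user("alice")
```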

58
Q

Network-layer Algorithms:

A

Random Selection:

Routes requests randomly to any server node based on probability calculations.
Round Robin:

Distributes requests to servers in sequence (e.g., Server 1, Server 2, Server 3, repeat).
Least Connection:

Routes requests to the server with the fewest active connections.
IP Hashing:

Uses hashed source IP addresses to route requests to specific servers.
Least Pending Requests:

Routes incoming requests to the server with the fewest pending requests to optimize response times.
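Two of these algorithms are small enough to sketch directly (server names and connection counts are hypothetical):

```python
import itertools

servers = ["s1", "s2", "s3"]

# Round robin: hand out servers in a fixed repeating order.
rr = itertools.cycle(servers)
order = [next(rr) for _ in range(5)]
print(order)  # ['s1', 's2', 's3', 's1', 's2']

# Least connection: pick the server with the fewest active connections.
active = {"s1": 12, "s2": 3, "s3": 7}  # hypothetical counts
target = min(active, key=active.get)
print(target)  # 's2'
```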

59
Q

Choosing the Right Algorithm:

A

System-specific Selection:

Decide between network-layer or application-layer load balancing based on system needs.
Consider Request Patterns:

Short request-response patterns may suit round-robin or random algorithms.
System Response Times:

Systems with varied response times benefit from algorithms like least connection or least response time.
Client Behavior Impact:

Algorithms relying on certain attributes (e.g., IP hashing) may face imbalance due to client behavior, necessitating careful consideration.

60
Q

Endpoint Load Balancing

A

Routes requests based on the endpoint they target (e.g., /photos, /videos), sending each category of request to a dedicated set of server nodes.
61
Q

Measuring load in distributed systems

Queries per Second (QPS):

A

Calculates the rate of requests a node handles per second.
Offers a simple metric, but may not accurately represent variations in load distribution across time periods.

62
Q

Read-to-Write Ratio (r/w):

A

Determines if a system is read-heavy or write-heavy based on the proportion of read and write operations.
Read-heavy systems might need read replicas, while scaling write-heavy systems requires careful synchronization among nodes.

63
Q

Additional Load Metrics:

A

Data size, response-request size, number of concurrent users, etc., are system-specific metrics that can also gauge load.

64
Q

Measuring Performance

Response Time:

A

Represents the time taken for a client to receive a response after sending a request.
Average response time gives an overall view but might not reflect delays for specific users.

65
Q

Percentiles:

A

Percentiles, like p50, p95, p99, indicate the response time for a certain percentage of requests.
Higher percentile values imply a slower response time for a smaller proportion of requests.
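A rough sketch of how such percentiles can be computed, using the nearest-rank method (the response times below are made up; in practice monitoring tools compute these for you):

```python
def percentile(samples, p):
    """Return the response time below which p% of requests fall
    (nearest-rank method)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical response times in ms for 10 requests.
times = [80, 90, 95, 100, 110, 120, 130, 150, 400, 900]
print(percentile(times, 50))  # 110 -> half the requests finish within 110 ms
print(percentile(times, 99))  # 900 -> the slowest requests take up to 900 ms
```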

66
Q

Deciding When to Scale:

A

Monitoring load metrics and performance indicators informs decisions on scaling requirements.
Higher QPS or response times beyond acceptable percentiles signal the need for scaling to meet increased demand.

67
Q

Business Considerations:

A

Beyond a certain point, optimizing performance might incur high costs without significant benefit.
Ensuring resources align with the system’s needs is crucial for efficient operations.

68
Q

QPS

A

This means how many queries or requests a node needs to handle each second. For example, if a server receives 10M requests per day, that is 10,000,000 / 86,400 ≈ 116 QPS on average.
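As a sanity check, the conversion from a daily request count to average QPS is a single division (a sketch):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_qps(requests_per_day: int) -> float:
    """Average queries per second for a given daily request count."""
    return requests_per_day / SECONDS_PER_DAY

print(round(average_qps(10_000_000)))  # ~116 QPS for 10M requests/day
```

Note this is an average: real traffic has peaks, so peak QPS can be far higher.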

69
Q

Read-to-write ratio

A

An r/w value of 10 means that for each write, the database serves 10 read operations.

70
Q

Average response time

A

The mean of all response times: total response time divided by the number of requests. Useful as an overall view, but it can hide long delays experienced by specific users.
71
Q

Maintainability of Distributed Systems

A

Maintainability Aspects:

Ease of Operation
Monitoring and Logging
Automation Support
Thorough Documentation
Self-Healing Mechanisms
Surprise Minimization

71
Q

Maintainability Aspects:

A

Ease of Extension
Good Coding Practices
Refactoring Habits
Defined Deployment Procedures

72
Q

Simple flow of data in a system

A

Collect → Store → Process → Generate insights: data is collected from users, stored, transformed into a usable form, and turned into insights that drive the business.
73
Q

Data-Driven Modern Systems

A

Characteristics:

Data-Centric Systems
Intensified Data Handling
Data as a Business Driver

74
Q

Data Handling in Modern Systems

A

Data Collection:

Personal Data (Sensitive)
User-Interaction Data (Logged Events)
Data Storage:

Various Storage Options
Choice Based on Data Volume & Usage

75
Q

Data Processing in Modern Systems

A

Processing Stage:

Data Transformation for Usability
Challenge with Large Data Volume

76
Q

Insights Generation

A

Insights from Data:

Useful Information for Business Growth
Predictive Analytics & Behavior Modeling

77
Q

Key Takeaways

A

Data Intensity Impact:

Critical for Business Success
Requires Proficiency in Data Handling
Careful Collection, Processing, and Use of Data

78
Q

Replication

A

Storing identical copies of the same data across multiple machines, to increase read/write capacity and system reliability.
79
Q

Data Replication Essentials

A

Definition:

Storing Identical Data Copies Across Multiple Machines
Purpose:

Enhancing Read/Write Capacity
Increasing System Reliability

80
Q

Importance of Network Connectivity

A

Network Connection:

Essential for Data Sync Across Machines
Communication for Data Changes
Network Reliability:

Critical for Data Replication

81
Q

Challenges in Replication

A

Complexities:

Network Unreliability
Machine Failures
Storage Capacity Limitations
Dynamic Data Handling:

Challenges in Handling Changing Data

82
Q

Replication Process

A

Scenarios:

Large Data Volumes
Multiple Machines Handling Data
Continuous Data Changes
Objective:

Synchronized Updates Across All Nodes
Consistent Data Availability

83
Q

Replication Benefits:

A

Ensures Availability Despite Node Failures
Enhances System Scalability
Critical for Consistent Data Responses
Complexity with Volume:

Increased Data Volume Adds Complexity to Replication
