Distributed Data Flashcards
What does system reliability mean?
The ability to carry out its functions consistently and without the overall system failing
What needs to be combined to make a system resilient?
Different hardware and software solutions
What are the hardware and software solutions used in system resilience?
- Redundant Hardware
- Data replication
- Load Balancing
- Data Backup and Recovery
- Error Handling
- Monitoring and maintenance
What is the role of redundant hardware in system resilience?
Uses multiple devices (such as disks, power supplies or network interfaces) to carry out the same task, so that a failed device can be covered by another
What is the role of data replication in system resilience?
- Enables parallel processing and lower latency (when a copy of the data is stored geographically close to its users)
What is the role of load balancing in system resilience?
Distributes the workload across different components to improve system availability
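Example sketch: one simple way to distribute workload is to rotate requests across a pool of servers and skip any that are marked as down. The class `RoundRobinBalancer`, its methods, and the server names are hypothetical and chosen only for illustration; production load balancers typically use richer policies and real health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer that hands out servers in rotation."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        # Assumption: something else (e.g. monitoring) tells us a server failed.
        self.healthy.discard(server)

    def next_server(self):
        # Try each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
first_round = [balancer.next_server() for _ in range(3)]

# After a failure, traffic keeps flowing to the remaining servers.
balancer.mark_down("node-b")
after_failure = [balancer.next_server() for _ in range(4)]
```

Because the failed server is simply skipped, the system stays available as long as at least one server remains healthy.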
What is the role of data backup and recovery in system resilience?
- Managing backups and restoration
- Backups should be regular, and stored separately and securely
- A recovery plan should be in place to outline how backups are restored
What is the role of error handling in system resilience?
Automatic detection and management of errors
What is the role of monitoring and maintenance in system resilience?
Reviewing system performance to prevent future incidents
What are the three aspects of the CAP Theorem?
- Consistency
- Availability
- Partition tolerance
What does the CAP theorem state?
A distributed data system can guarantee at most two of the three aspects at the same time
What is the CAP Theorem’s consistency aspect?
Ensuring data stored in different locations is always the same even after an update
What is the CAP Theorem’s Availability aspect?
Data systems are always operational and responsive
What is the CAP Theorem’s partition tolerance aspect?
Data systems remain functional even if nodes crash or lose communication
Why can we only choose between availability and consistency in distributed data systems?
- Distributed data systems must be partition tolerant by definition
- This only leaves a choice between availability and consistency
What is data replication?
Storing copies of the same data on multiple nodes; vital for ensuring the reliability of data-intensive systems
What benefits does data replication provide?
- Increased system availability
- Reduced risk of data loss
- Enables disaster recovery
- Improved performance
What are the advantages of using data replication?
- Availability
- Data backup and system recovery
- Load balancing
- Performance improvement
What are the different data replication strategies?
- Master-slave replication
- Multi-leader replication
- Leaderless replication
What is master-slave replication?
Master node receives all updates and replicates the data to other nodes
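Example sketch: the master-slave flow can be modelled with a master that applies each write locally and then forwards it to every replica. The `Node` and `Master` classes are hypothetical, and replication is done synchronously in memory purely to keep the sketch short; real systems replicate over a network and often asynchronously.

```python
class Node:
    """Hypothetical in-memory replica holding key-value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Master(Node):
    """Hypothetical master: the only node that accepts writes."""

    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas

    def write(self, key, value):
        # The master applies the update first...
        self.data[key] = value
        # ...then forwards it to every replica (synchronously, for simplicity).
        for replica in self.replicas:
            replica.data[key] = value


replicas = [Node("replica-1"), Node("replica-2")]
master = Master("master", replicas)
master.write("user:1", "Ada")

# Reads can now be served by any replica.
values_seen = [replica.data["user:1"] for replica in replicas]
```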
What is multi-leader replication?
- Multiple master nodes, each of which also acts as a slave node to the other master nodes
- More resilient to master node failure
What is leaderless replication?
- Each node acts as a master and slave simultaneously
- Writes accepted by all nodes and replicated to other nodes
- Presents challenges with data consistency
What topologies does multi-leader replication use?
- Circular
- Star
- All-to-all
What criteria should be used when choosing a data replication strategy?
- Size and complexity of data
- Acceptable latency between updates
- Required availability or consistency
- Disaster recovery capacity
What is data replication in the cloud?
Distributing data across nodes as well as geographically spread locations
What are the different types of cloud data replication?
- Geographic replication
- Cross-region replication
- Zone-redundant replication
What is geographic replication?
- Creating multiple data copies in geographically dispersed locations
- Provides robustness against disasters affecting a broad geographic location (for example, natural disasters or military attacks)
What is cross-region replication?
- Distributes data copies across wider geographic areas such as continents and sub-continents
- Provides low latency access from different global regions
- Provides robustness against regional failures
What is zone-redundant replication?
- Multiple data copies stored across different availability zones within a single cloud region
- Provides robustness against zone failures
What are examples of cross-region replication solutions?
- AWS: Amazon S3 Cross-region replication
- Azure: Geo-Redundant Storage (GRS)
What is data partitioning?
- Dividing large datasets into smaller parts (called partitions)
- Partitions are distributed across nodes
Why is data partitioning used?
- Reliability
- Better availability
- Improved processing performance / parallel processing
What are the two types of data partitioning?
- Vertical
- Horizontal
What is vertical partitioning?
Splitting a table into multiple tables by columns
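Example sketch: splitting by columns can be shown with rows held as dictionaries, where each resulting table keeps the primary key so the rows can be joined again. The helper `split_columns` and the sample `customers` data are hypothetical.

```python
def split_columns(rows, key, column_groups):
    """Split each row into one table per column group, keeping the key in each."""
    tables = [[] for _ in column_groups]
    for row in rows:
        for table, columns in zip(tables, column_groups):
            table.append({key: row[key], **{c: row[c] for c in columns}})
    return tables


customers = [
    {"id": 1, "name": "Ada", "email": "ada@example.com", "bio": "long text"},
    {"id": 2, "name": "Lin", "email": "lin@example.com", "bio": "more text"},
]

# Frequently read columns go in one table, the rarely read column in another.
core, details = split_columns(customers, "id", [["name", "email"], ["bio"]])
```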
What is horizontal partitioning?
- Known as “sharding”
- Splits up tables by row
- Rows are stored in different clusters
What are the disadvantages of data partitioning?
- Requires additional computation and network resources
- More complex than single partition strategies
What are the different sharding strategies?
- Round-robin
- Hash
- Range-based
- Composite
What is round-robin partitioning?
Assigning records to partitions in turn, so that each partition receives an equal share of the data
What are the advantages of round-robin partitioning?
- Straightforward
- Appropriate for evenly distributed data
- No additional information needed to create partitions
What are the disadvantages of round-robin partitioning?
Unsuited for skewed data distributions
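Example sketch: round-robin assignment needs only the position of each record, not its content. The function name `round_robin_partition` is hypothetical.

```python
def round_robin_partition(records, num_partitions):
    """Assign the i-th record to partition i mod num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


parts = round_robin_partition(list(range(10)), 3)
sizes = [len(p) for p in parts]
```

Partition sizes differ by at most one record, but related records end up scattered, so a lookup by value may have to check every partition.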
What is hash partitioning?
- Also called “key-based partitioning”
- Calculates a hash value from a data attribute (the key) and uses it to choose the partition
What are the advantages of hash partitioning?
- Records with the same key value are stored in the same partition
- Can be used with skewed data distributions as partitions can be controlled
What are the disadvantages of hash partitioning?
- Requires additional information to be able to define the partition
- Hash collisions can map records with different attributes to the same partition
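Example sketch: a partition can be chosen by hashing the key and taking the result modulo the number of partitions. The function `hash_partition` and the sample keys are hypothetical. SHA-256 from `hashlib` is used here as an assumption, because Python's built-in `hash()` for strings varies between processes; real systems choose their own hash function.

```python
import hashlib


def hash_partition(key, num_partitions):
    """Map a key to a partition index using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


keys = ["alice", "bob", "carol", "alice"]
assignments = [hash_partition(k, 4) for k in keys]

# The same key always lands in the same partition.
same_key_same_partition = assignments[0] == assignments[3]
```

With more distinct keys than partitions, different keys necessarily share partitions, which is the collision effect noted above.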
What is range-based partitioning?
- Based on particular attributes
- Uses sequential keys with equal intervals
What are the advantages of range-based partitioning?
- Appropriate for attributes with a natural range of values
- Partitions are a meaningful division of the records
What are the disadvantages of range-based partitioning?
- Imbalanced partitions if the values are unevenly distributed
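Example sketch: range-based assignment can be done with a sorted list of boundaries and a binary search. The function `range_partition`, the boundaries, and the sample values are hypothetical; the sample is deliberately skewed to show the imbalance.

```python
from bisect import bisect_right


def range_partition(value, boundaries):
    """Return the index of the range containing value.

    boundaries is a sorted list of lower bounds for partitions 1..n;
    values below the first boundary go to partition 0.
    """
    return bisect_right(boundaries, value)


# Equal intervals: partition 0 is <100, 1 is 100-199, 2 is 200-299, 3 is >=300.
boundaries = [100, 200, 300]
values = [5, 20, 45, 60, 150, 320]

sizes = [0] * (len(boundaries) + 1)
for v in values:
    sizes[range_partition(v, boundaries)] += 1
```

Most of the sample values fall below 100, so the first partition holds most of the records while another stays empty.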