Distributed Data Flashcards

1
Q

What does system reliability mean?

A

The ability to carry out its functions consistently and without the overall system failing

2
Q

What needs to be combined to make a system resilient?

A

Different hardware and software solutions

3
Q

What are the hardware and software solutions used in system resilience?

A
  • Redundant Hardware
  • Data replication
  • Load Balancing
  • Data Backup and Recovery
  • Error Handling
  • Monitoring and maintenance
4
Q

What is the role of redundant hardware in system resilience?

A

Uses multiple devices to carry out the same tasks, such as disks, power supplies or network interfaces

5
Q

What is the role of data replication in system resilience?

A
  • Maintains copies of data on multiple nodes
  • Enables parallel processing and lower latency (when replicas are geographically close)
6
Q

What is the role of load balancing in system resilience?

A

Distributes the workload across different components to improve system availability

7
Q

What is the role of data backup and recovery in system resilience?

A
  • Managing backups and restoration
  • Backups should be regular and stored separately and securely
  • A recovery plan should be in place to outline how backups are restored
8
Q

What is the role of error management in system resilience?

A

Automatic detection and management of errors
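Automatic error management can be sketched as a small retry helper: the error is detected (the exception is caught) and managed (the call is retried after a backoff delay). This is an illustrative sketch; the `with_retries` helper and the `flaky` function are hypothetical names, not from any specific system.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff on transient failures."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                        # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # back off, then retry

calls = {"n": 0}
def flaky():
    """Simulated operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(with_retries(flaky))  # recovers after two transient failures: ok
```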

9
Q

What is the role of monitoring and maintenance in system resilience?

A

Reviewing system performance to prevent future incidents

10
Q

What are the three aspects of the CAP Theorem?

A
  • Consistency
  • Availability
  • Partition tolerance
11
Q

What does the CAP theorem state?

A

Only two out of the three aspects can be effective in a data system

12
Q

What is the CAP Theorem’s consistency aspect?

A

Ensuring data stored in different locations is always the same even after an update

13
Q

What is the CAP Theorem’s Availability aspect?

A

Data systems are always operational and responsive

14
Q

What is the CAP Theorem’s partition tolerance aspect?

A

Data systems remain functional even if nodes crash or lose communication

15
Q

Why can we only choose between availability and consistency in distributed data systems?

A
  • Distributed systems are partition tolerant by definition
  • This only leaves a choice between availability and consistency
16
Q

What is data replication?

A

Storing copies of data on multiple nodes; vital for ensuring the reliability of data-intensive systems

17
Q

What benefits does data replication provide?

A
  • Increased system availability
  • Reduced risk of data loss
  • Enables disaster recovery
  • Improved performance
18
Q

What are the advantages of using data replication?

A
  • Availability
  • Data backup and system recovery
  • Load balancing
  • Performance improvement
19
Q

What are the different data replication strategies?

A
  • Master-slave replication
  • Multi-leader replication
  • Leaderless replication
20
Q

What is master-slave replication?

A

Master node receives all updates and replicates the data to other nodes
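A minimal sketch of this arrangement, with hypothetical `Master` and `Node` classes standing in for real database nodes:

```python
class Node:
    """A replica node holding a copy of the data."""
    def __init__(self):
        self.data = {}

class Master(Node):
    """All writes go to the master, which replicates them to its slaves."""
    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value
        for slave in self.slaves:      # replicate the update to every slave
            slave.data[key] = value

slaves = [Node(), Node()]
master = Master(slaves)
master.write("user:1", "Alice")

# Reads can be served by any node, since every write is replicated.
print(slaves[0].data["user:1"])  # Alice
```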

21
Q

What is multi-leader replication?

A
  • Multiple master nodes which are simultaneously slave nodes to other master nodes
  • More resilient to master node failure
22
Q

What is leaderless replication?

A
  • Each node acts as a master and slave simultaneously
  • Writes accepted by all nodes and replicated to other nodes
  • Presents challenges with data consistency
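One common way leaderless systems address the consistency challenge is quorum reads and writes: with N replicas, a write succeeds on W nodes and a read consults R nodes, and choosing R + W > N guarantees the read set overlaps the latest write. A toy sketch (the `write`/`read` helpers and the N/W/R values are illustrative, not from any particular system):

```python
N, W, R = 3, 2, 2                      # R + W > N, so reads see the last write
replicas = [dict() for _ in range(N)]

def write(key, value, version):
    for node in replicas[:W]:          # write succeeds on W replicas
        node[key] = (version, value)

def read(key):
    # Consult R replicas and keep the value with the highest version.
    seen = [node[key] for node in replicas[-R:] if key in node]
    return max(seen)[1] if seen else None

write("k", "v1", version=1)
print(read("k"))  # v1 -- the read quorum overlaps the write quorum
```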
23
Q

What topologies does leaderless replication use?

A
  • Circular
  • Star
  • All-to-all
24
Q

What criteria should be used when choosing a data replication strategy?

A
  • Size and complexity of data
  • Acceptable latency between updates
  • Required availability or consistency
  • Disaster recovery capacity
25
Q

What is data replication in the cloud?

A

Distributing data across nodes as well as geographically spread locations

26
Q

What are the different types of cloud data replication?

A
  • Geographic replication
  • Cross-region replication
  • Zone-redundant replication
27
Q

What is geographic replication?

A
  • Creating multiple data copies in geographically dispersed locations
  • Provides robustness against disasters affecting a broad geographic location (natural disasters/military attacks)
28
Q

What is cross-region replication?

A
  • Distributes data copies across wider geographic areas such as continents and sub-continents
  • Provides low latency access from different global regions
  • Provides robustness against regional failures
29
Q

What is zone-redundant replication?

A
  • Multiple data copies stored across different availability zones within a single cloud region
  • Provides robustness against zone failures
30
Q

What are examples of cloud replication solutions?

A
  • AWS: Amazon S3 Cross-region replication
  • Azure: Geo-Redundant Storage (GRS)
31
Q

What is data partitioning?

A
  • Dividing large datasets into smaller parts (called partitions)
  • Partitions are distributed across nodes
32
Q

Why is data partitioning used?

A
  • Reliability
  • Better availability
  • Improved processing performance / parallel processing
33
Q

What are the two types of data partitioning?

A
  • Vertical
  • Horizontal
34
Q

What is vertical partitioning?

A

Splitting a table into multiple tables by columns

35
Q

What is horizontal partitioning?

A
  • Known as “sharding”
  • Splits up tables by row
  • Rows are stored in different clusters
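Both splits can be illustrated on a small in-memory table; the column grouping and the even/odd shard rule below are arbitrary examples:

```python
rows = [
    {"id": 1, "name": "Ana",  "email": "ana@example.com"},
    {"id": 2, "name": "Ben",  "email": "ben@example.com"},
    {"id": 3, "name": "Cara", "email": "cara@example.com"},
]

# Vertical partitioning: split by column into two narrower tables.
names  = [{"id": r["id"], "name": r["name"]}   for r in rows]
emails = [{"id": r["id"], "email": r["email"]} for r in rows]

# Horizontal partitioning (sharding): split by row across two shards.
shard_a = [r for r in rows if r["id"] % 2 == 0]
shard_b = [r for r in rows if r["id"] % 2 == 1]

print(len(shard_a), len(shard_b))  # 1 2
```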
36
Q

What are the disadvantages of data partitioning?

A
  • Requires additional computation and network resources
  • More complex than single partition strategies
37
Q

What are the different sharding strategies?

A
  • Round-robin
  • Hash
  • Range-based
  • Composite
38
Q

What is round-robin partitioning?

A

Distributing data between partitions in equal proportions
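The idea can be sketched as dealing records out one at a time, like playing cards; `round_robin_partition` is an illustrative helper, not a library function:

```python
def round_robin_partition(records, n_partitions):
    """Deal records out in turn, cycling through the partitions."""
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        partitions[i % n_partitions].append(record)
    return partitions

parts = round_robin_partition(list(range(10)), 3)
print([len(p) for p in parts])  # [4, 3, 3] -- near-equal proportions
```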

39
Q

What are the advantages of round-robin partitioning?

A
  • Straightforward
  • Appropriate for evenly distributed data
  • No additional information needed to create partitions
40
Q

What are the disadvantages of round-robin partitioning?

A

Unsuited for skewed data distributions

41
Q

What is hash partitioning?

A
  • Also called “key-based partitioning”
  • Calculates hash values based on data attributes
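A sketch of the idea using Python's built-in `hash` on a chosen attribute (a real system would use a stable hash such as MD5, since Python's string hashing is randomised per process); `hash_partition` is an illustrative helper:

```python
def hash_partition(records, key, n_partitions):
    """Route each record by the hash of one attribute, so records with
    the same key value always land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        idx = hash(record[key]) % n_partitions
        partitions[idx].append(record)
    return partitions

orders = [{"customer": c, "total": t}
          for c, t in [("ana", 10), ("ben", 5), ("ana", 7)]]
parts = hash_partition(orders, key="customer", n_partitions=4)

# Both "ana" orders share one partition, whichever one the hash selects.
```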
42
Q

What are the advantages of hash partitioning?

A
  • Records with the same attribute values are stored in the same partition
  • Can be used with skewed data distributions as partitions can be controlled
43
Q

What are the disadvantages of hash partitioning?

A
  • Requires additional information to be able to define the partition
  • Hash collisions can map records with different attributes to the same partition
44
Q

What is range-based partitioning?

A
  • Based on particular attributes
  • Uses sequential keys with equal intervals
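A sketch with hypothetical boundary values; a real system would choose boundaries matching the attribute's natural range:

```python
import bisect

def range_partition(records, key, boundaries):
    """Assign each record to the interval its key falls into, e.g.
    boundaries [100, 200] give the ranges <100, 100-199, >=200."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        idx = bisect.bisect_right(boundaries, record[key])
        partitions[idx].append(record)
    return partitions

sales = [{"amount": a} for a in (30, 150, 220, 90)]
parts = range_partition(sales, key="amount", boundaries=[100, 200])
print([len(p) for p in parts])  # [2, 1, 1]
```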
45
Q

What are the advantages of range-based partitioning?

A
  • Appropriate for attributes with a natural range of values
  • Partitions are a meaningful division of the records
46
Q

What are the disadvantages of range-based partitioning?

A
  • Imbalanced partitions if the values are unevenly distributed
47
Q

How do cloud solutions use horizontal partitioning?

A
  • Amazon Dynamo and Aurora spread partitions across cluster nodes
  • Azure Cosmos DB spreads partitions across different geographic regions
48
Q

What are partitioning strategies available for non-database use?

A
  • Directory-based partitioning
  • Geospatial partitioning
49
Q

What is directory-based partitioning?

A
  • Divides data into folders in a file hierarchy
  • Division is based on attributes, with date of data creation also considered
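Such a folder hierarchy might be derived as below; the `region=.../year=...` path convention is a common (Hive-style) illustration, not mandated, and `partition_path` is a hypothetical helper:

```python
from datetime import date

def partition_path(root, region, created):
    """Build a folder path from an attribute (region) plus the
    creation date -- the hierarchy directory-based partitioning uses."""
    return (f"{root}/region={region}"
            f"/year={created.year}/month={created.month:02d}")

print(partition_path("/data/events", "eu-west", date(2024, 3, 9)))
# /data/events/region=eu-west/year=2024/month=03
```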
50
Q

What is geospatial partitioning?

A
  • Considers geographic locations
  • Used when low latency is needed or data needs to be processed close to the origin
  • Used to comply with data protection regulations or data sovereignty requirements
51
Q

What are examples of non-database cloud partitioning solutions?

A
  • Azure Blob Storage & Azure Data Lake Storage - partition using containers and folders which may be located in different regions
  • Amazon S3 storage uses buckets and prefixes for partitions
52
Q

What components make up distributed data processing?

A
  • Data partitioning
  • Data shuffling
  • Task scheduling
  • Data-based code execution
  • Data Storage
  • Fault tolerance
  • Performance optimisation
53
Q

How is data shuffling used in distributed data processing?

A

Reviews data distributions to manage load balancing and ensure efficient processing

54
Q

How is task scheduling used in distributed data processing?

A
  • Assigns tasks to nodes
  • Supervises execution
  • Handles execution failures
55
Q

How is data-based code execution used in distributed data processing?

A
  • Brings the code to the data
  • Contrasts with the traditional approach of loading data into a centralised processing environment
56
Q

How is data storage used in distributed data processing?

A

Ensures fast access to data for processing, regardless of whether data is stored on disk or in memory

57
Q

How is fault tolerance used in distributed data processing?

A
  • Detects and manages failures
  • Ensures continuity of processing if a node fails
58
Q

How is performance optimisation used in distributed data processing?

A

Focuses on reducing data movements, intermediate steps and overhead communication

59
Q

What is the Hadoop Distributed File System (HDFS)?

A
  • Distributed data storage
  • Data storage foundation for the Hadoop Ecosystem
  • Implements redundant copies of data to provide high availability and fault tolerance
  • Unified view of data at the logical level (files/folders)
  • Resources are split internally and stored on different nodes
60
Q

What is the storage capacity of HDFS?

A

Petabytes of data

61
Q

How can HDFS be scaled?

A
  • Vertically: increasing node capacity
  • Horizontally: adding nodes to the cluster
62
Q

What is the configuration of HDFS?

A

Master-Slave configuration

63
Q

What are the components of HDFS?

A
  • NameNodes
  • DataNodes
64
Q

What does a HDFS NameNode do?

A
  • Maintains access to resources
  • Keeps system metadata
  • Maintains a table to map data blocks to DataNodes
65
Q

What does a HDFS DataNode do?

A
  • Responsible for data storage
  • Made up of commodity hardware with several disks for a large storage capacity
66
Q

How is HDFS fault tolerant?

A
  • High tolerance for failures
  • High availability
  • Addresses failover management at the application level
67
Q

Besides fault tolerance, what is another advantage of HDFS?

A

Parallelisable: able to process tasks simultaneously on several machines

68
Q

What is the typical size of an HDFS data block?

A

128 MB

69
Q

How many nodes is data replicated to in HDFS?

A

At least 3 nodes
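Putting the last two cards together, a file's storage footprint follows from the block size and the replication factor; a back-of-the-envelope sketch (`hdfs_footprint` is an illustrative helper, not an HDFS API):

```python
import math

BLOCK_SIZE_MB = 128   # typical HDFS block size
REPLICATION   = 3     # default replication factor

def hdfs_footprint(file_size_mb):
    """Number of blocks a file splits into, and total stored copies."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

blocks, copies = hdfs_footprint(1000)   # a 1000 MB file
print(blocks, copies)  # 8 blocks, stored as 24 replicated block copies
```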

70
Q

What happens if a DataNode fails?

A

The request is redirected to another DataNode

71
Q

What DataNode state is monitored by a NameNode?

A

Failure state, so that re-replication can be scheduled if the NameNode detects a failure

72
Q

Who originally created the Hadoop Ecosystem?

A

Doug Cutting and Mike Cafarella, building on Google’s GFS and MapReduce papers

73
Q

What is the aim of MapReduce?

A

To provide large scale distributed computing

74
Q

What are the two main functions of MapReduce?

A
  • Map
  • Reduce
75
Q

How does the map function work in MapReduce?

A
  • Takes an input value
  • Carries out stateless computation
  • Outputs a key-value pair
76
Q

Why is MapReduce code copied to machines holding data?

A

To keep data transfers to a minimum

77
Q

What is the shuffle phase of MapReduce?

A
  • An intermediate step between Map and Reduce
  • Sorts all of the keys
78
Q

What is the reduce function of MapReduce?

A

The values of the sorted keys are aggregated by key
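The map, shuffle and reduce steps from the last three cards can be sketched with the classic word-count example, here as plain single-machine Python (real MapReduce distributes each phase across nodes):

```python
from collections import defaultdict

def map_fn(line):
    """Map: stateless computation emitting (key, value) pairs."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values under their keys, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Reduce: aggregate the values of one key."""
    return key, sum(values)

lines = ["to be or not to be"]
pairs = [p for line in lines for p in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs))
print(counts["to"], counts["be"])  # 2 2
```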

79
Q

What are typical applications for MapReduce?

A
  • Rule based filtering
  • Aggregation
  • Counting
  • Descriptive statistics
80
Q

Which algorithm originally used MapReduce?

A

Google’s PageRank algorithm

81
Q

In what way was MapReduce built to be fault tolerant?

A

Expects frequent node failures and failovers

82
Q

What are the disadvantages of MapReduce?

A
  • Frequent read and writes to disk due to repeatable checkpoints to provide fault tolerance
  • Programming complex computations is an intricate and difficult task
83
Q

How does Apache Pig work?

A
  • Has a high-level abstraction of the data processing
  • Develops programs with a simplified scripting language which uses MapReduce under the hood
84
Q

How does Apache Hive work?

A
  • Adds higher level abstraction
  • Provides a SQL-like interface for grouping, querying and joining data
85
Q

What advantage does Spark have over MapReduce?

A
  • Provides a high level abstraction of data storage and processing
  • Faster
  • Can utilise a number of libraries
  • Groups intermediate steps and keeps them in memory as a Directed Acyclic Graph
86
Q

How can Spark be run?

A
  • On a single machine
  • Within a cluster of nodes
87
Q

What can Spark be used for?

A
  • Batch processing
  • Stream processing
88
Q

What programming languages does Spark provide APIs for?

A
  • Python
  • R
  • Scala
  • Java
89
Q

What is Apache Spark Core?

A
  • The foundation of Spark
  • Uses in-memory computation based on Resilient Distributed Datasets
90
Q

What are Resilient Distributed Datasets?

A

An immutable collection of objects distributed across multiple cluster nodes

91
Q

How do Resilient Distributed Datasets work?

A
  • Each RDD holds a collection of data objects split across nodes
  • Resilient refers to data replication and being able to avoid data loss by recovering from node failures
92
Q

What data sources can Spark be used with?

A
  • HDFS
  • S3
  • Relational Databases
  • NoSQL databases
93
Q

What are some of the different libraries available for Spark?

A
  • Spark SQL
  • Spark Streaming
  • Spark MLlib
  • Spark GraphX
  • PySpark
94
Q

What is Spark SQL?

A
  • Framework for processing structured data
  • SQL and DataFrames can be used to query/work with various data sources
95
Q

What is Spark Streaming?

A
  • Used for batch and stream processing
  • Provides scalable, high-throughput and fault-tolerant stream processing from different streaming services such as Kafka
  • Uses mini-batches for stream processing
96
Q

What is the interval size for a mini-batch used in Spark Streaming?

A

Batch interval limited to seconds or less
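The micro-batching idea can be sketched by bucketing timestamped events into fixed-width intervals; this is a single-machine illustration, not Spark API code, and `mini_batches` is a hypothetical helper:

```python
def mini_batches(events, batch_interval=1.0):
    """Group (timestamp, value) events into consecutive fixed-width
    batches, the way Spark Streaming's micro-batching slices a stream."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.4, "c"), (2.7, "d")]
print(mini_batches(events))  # [['a', 'b'], ['c'], ['d']]
```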

97
Q

What is Spark MLlib?

A
  • Machine Learning library
  • Implements a range of algorithms
  • Used in large scale machine learning
98
Q

What is Spark GraphX?

A

Used for distributed graph processing

99
Q

What is PySpark?

A
  • Python library
  • Used for writing parallelisable code for data processing and machine learning
  • Provides data processing across multiple cluster nodes
100
Q

What data processing can PySpark carry out?

A
  • Mapping
  • Filtering
  • Joining
  • Group-by-key operations
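These operations can be illustrated with pure-Python stand-ins (in real PySpark they would be `rdd.map`, `rdd.filter`, `rdd.join` and `rdd.groupByKey`, executed in parallel across cluster nodes; the data here is arbitrary):

```python
from itertools import groupby

data = [("a", 1), ("b", 2), ("a", 3)]

mapped   = [(k, v * 10) for k, v in data]             # map
filtered = [(k, v) for k, v in mapped if v > 10]      # filter
grouped  = {k: [v for _, v in g]                      # group-by-key
            for k, g in groupby(sorted(mapped), key=lambda kv: kv[0])}

print(grouped)  # {'a': [10, 30], 'b': [20]}
```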
101
Q

What is Apache Storm?

A

A framework for processing distributed data streams

102
Q

How does Apache Storm work?

A

Uses Directed Acyclic Graphs made up of small, discrete operations that make up the data transformation process

103
Q

How does Apache Storm process data?

A
  • Streaming data are transported between nodes, travelling along edges
  • A particular data transformation takes place at each node
104
Q

What is Apache Samza?

A

A framework for processing distributed data streams

105
Q

How does Apache Samza work?

A

Allows for stateful applications processing data from streaming resources such as Kafka

106
Q

What are the components of Apache Samza?

A
  • Producers: write to a topic
  • Consumers: read data from a topic
  • Brokers: handle the data processing in a cluster of machines
107
Q

How does Apache Samza process data?

A

Incoming data are sorted into topics

108
Q

What is Apache Flink?

A
  • A framework for processing distributed data streams
  • High performance solution
109
Q

How does Apache Flink work?

A
  • Implements batch and stream processing
  • Executes stateful processing such as windowed processing
110
Q

What are the typical processing steps in Apache Flink?

A
  • Input data is read from streams and partitioned for parallel processing
  • Data transformed using map, filter or reduce operations
  • Data is aggregated, has windowed processing applied and is then output to a data sink
111
Q

How is fault tolerance implemented in Apache Flink?

A

Uses checkpoints of the data stream state

112
Q

How does AWS Elastic MapReduce’s data processing framework work?

A

Processes partitioned S3 data

113
Q

How does HDInsight’s data processing framework work?

A
  • Managed cloud service for processing big data in a scalable and fault tolerant way
  • Uses Hadoop, Spark or Hive
114
Q

What is Databricks’ data processing framework?

A
  • Managed platform for big data analysis and machine learning in the cloud
  • Deployed on Spark
115
Q

What utility does Databricks’ processing framework provide?

A
  • Data ingestion solutions for big data analysis and machine learning in the cloud
  • User management and utility tools
  • Highly scalable and can process large amounts of data