Distributed Data Flashcards

1
Q

What does system reliability mean?

A

The ability to carry out its functions consistently and without the overall system failing

2
Q

What needs to be combined to make a system resilient?

A

Different hardware and software solutions

3
Q

What are the hardware and software solutions used in system resilience?

A
  • Redundant Hardware
  • Data replication
  • Load Balancing
  • Data Backup and Recovery
  • Error Handling
  • Monitoring and maintenance
4
Q

What is the role of redundant hardware in system resilience?

A

Uses multiple devices to carry out the same tasks, such as disks, power supplies or network interfaces

5
Q

What is the role of data replication in system resilience?

A
  • Maintains copies of data on multiple nodes
  • Enables parallel processing and lower latency (when replicas are geographically close)
6
Q

What is the role of load balancing in system resilience?

A

Distributes the workload across different components to improve system availability

7
Q

What is the role of data backup and recovery in system resilience?

A
  • Managing backups and restoration
  • Backups should be regular and stored separately and securely
  • A recovery plan should be in place to outline how backups are restored
8
Q

What is the role of error management in system resilience?

A

Automatic detection and management of errors
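Automatic error management can be sketched as a small retry helper: the error is detected (the exception is caught) and managed (the call is retried after a backoff delay). This is an illustrative sketch; the `with_retries` helper and the `flaky` function are hypothetical names, not from any specific system.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff on transient failures."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                        # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # back off, then retry

calls = {"n": 0}
def flaky():
    """Simulated operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(with_retries(flaky))  # recovers after two transient failures: ok
```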

9
Q

What is the role of monitoring and maintenance in system resilience?

A

Reviewing system performance to prevent future incidents

10
Q

What are the three aspects of the CAP Theorem?

A
  • Consistency
  • Availability
  • Partition tolerance
11
Q

What does the CAP theorem state?

A

Only two out of the three aspects can be effective in a data system

12
Q

What is the CAP Theorem’s consistency aspect?

A

Ensuring data stored in different locations is always the same even after an update

13
Q

What is the CAP Theorem’s Availability aspect?

A

Data systems are always operational and responsive

14
Q

What is the CAP Theorem’s partition tolerance aspect?

A

Data systems remain functional even if nodes crash or lose communication

15
Q

Why can we only choose between availability and consistency in distributed data systems?

A
  • Distributed systems are partition tolerant by definition
  • This only leaves a choice between availability and consistency
16
Q

What is data replication?

A

Storing copies of data on multiple nodes; vital for ensuring the reliability of data-intensive systems

17
Q

What benefits does data replication provide?

A
  • Increased system availability
  • Reduced risk of data loss
  • Enables disaster recovery
  • Improved performance
18
Q

What are the advantages of using data replication?

A
  • Availability
  • Data backup and system recovery
  • Load balancing
  • Performance improvement
19
Q

What are the different data replication strategies?

A
  • Master-slave replication
  • Multi-leader replication
  • Leaderless replication
20
Q

What is master-slave replication?

A

Master node receives all updates and replicates the data to other nodes
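A minimal sketch of this arrangement, with hypothetical `Master` and `Node` classes standing in for real database nodes:

```python
class Node:
    """A replica node holding a copy of the data."""
    def __init__(self):
        self.data = {}

class Master(Node):
    """All writes go to the master, which replicates them to its slaves."""
    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value
        for slave in self.slaves:      # replicate the update to every slave
            slave.data[key] = value

slaves = [Node(), Node()]
master = Master(slaves)
master.write("user:1", "Alice")

# Reads can be served by any node, since every write is replicated.
print(slaves[0].data["user:1"])  # Alice
```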

21
Q

What is multi-leader replication?

A
  • Multiple master nodes which are simultaneously slave nodes to other master nodes
  • More resilient to master node failure
22
Q

What is leaderless replication?

A
  • Each node acts as a master and slave simultaneously
  • Writes accepted by all nodes and replicated to other nodes
  • Presents challenges with data consistency
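One common way leaderless systems address the consistency challenge is quorum reads and writes: with N replicas, a write succeeds on W nodes and a read consults R nodes, and choosing R + W > N guarantees the read set overlaps the latest write. A toy sketch (the `write`/`read` helpers and the N/W/R values are illustrative, not from any particular system):

```python
N, W, R = 3, 2, 2                      # R + W > N, so reads see the last write
replicas = [dict() for _ in range(N)]

def write(key, value, version):
    for node in replicas[:W]:          # write succeeds on W replicas
        node[key] = (version, value)

def read(key):
    # Consult R replicas and keep the value with the highest version.
    seen = [node[key] for node in replicas[-R:] if key in node]
    return max(seen)[1] if seen else None

write("k", "v1", version=1)
print(read("k"))  # v1 -- the read quorum overlaps the write quorum
```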
23
Q

What topologies does leaderless replication use?

A
  • Circular
  • Star
  • All-to-all
24
Q

What criteria should be used when choosing a data replication strategy?

A
  • Size and complexity of data
  • Acceptable latency between updates
  • Required availability or consistency
  • Disaster recovery capacity
25
Q

What is data replication in the cloud?

A

Distributing data across nodes as well as geographically spread locations

26
Q

What are the different types of cloud data replication?

A
  • Geographic replication
  • Cross-region replication
  • Zone-redundant replication
27
Q

What is geographic replication?

A
  • Creating multiple data copies in geographically dispersed locations
  • Provides robustness against disasters affecting a broad geographic location (natural disasters/military attacks)
28
Q

What is cross-region replication?

A
  • Distributes data copies across wider geographic areas such as continents and sub-continents
  • Provides low latency access from different global regions
  • Provides robustness against regional failures
29
Q

What is zone-redundant replication?

A
  • Multiple data copies stored across different availability zones within a single cloud region
  • Provides robustness against zone failures
30
Q

What are examples of cloud replication solutions?

A
  • AWS: Amazon S3 Cross-region replication
  • Azure: Geo-Redundant Storage (GRS)
31
Q

What is data partitioning?

A
  • Dividing large datasets into smaller parts (called partitions)
  • Partitions are distributed across nodes
32
Q

Why is data partitioning used?

A
  • Reliability
  • Better availability
  • Improved processing performance / parallel processing
33
Q

What are the two types of data partitioning?

A
  • Vertical
  • Horizontal
34
Q

What is vertical partitioning?

A

Splitting a table into multiple tables by columns

35
Q

What is horizontal partitioning?

A
  • Known as “sharding”
  • Splits up tables by row
  • Rows are stored in different clusters
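Both splits can be illustrated on a small in-memory table; the column grouping and the even/odd shard rule below are arbitrary examples:

```python
rows = [
    {"id": 1, "name": "Ana",  "email": "ana@example.com"},
    {"id": 2, "name": "Ben",  "email": "ben@example.com"},
    {"id": 3, "name": "Cara", "email": "cara@example.com"},
]

# Vertical partitioning: split by column into two narrower tables.
names  = [{"id": r["id"], "name": r["name"]}   for r in rows]
emails = [{"id": r["id"], "email": r["email"]} for r in rows]

# Horizontal partitioning (sharding): split by row across two shards.
shard_a = [r for r in rows if r["id"] % 2 == 0]
shard_b = [r for r in rows if r["id"] % 2 == 1]

print(len(shard_a), len(shard_b))  # 1 2
```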
36
Q

What are the disadvantages of data partitioning?

A
  • Requires additional computation and network resources
  • More complex than single partition strategies
37
Q

What are the different sharding strategies?

A
  • Round-robin
  • Hash
  • Range-based
  • Composite
38
Q

What is round-robin partitioning?

A

Distributing data between partitions in equal proportions
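The idea can be sketched as dealing records out one at a time, like playing cards; `round_robin_partition` is an illustrative helper, not a library function:

```python
def round_robin_partition(records, n_partitions):
    """Deal records out in turn, cycling through the partitions."""
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        partitions[i % n_partitions].append(record)
    return partitions

parts = round_robin_partition(list(range(10)), 3)
print([len(p) for p in parts])  # [4, 3, 3] -- near-equal proportions
```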

39
Q

What are the advantages of round-robin partitioning?

A
  • Straightforward
  • Appropriate for evenly distributed data
  • No additional information needed to create partitions
40
Q

What are the disadvantages of round-robin partitioning?

A

Unsuited for skewed data distributions

41
Q

What is hash partitioning?

A
  • Also called “key-based partitioning”
  • Calculates hash values based on data attributes
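A sketch of the idea using Python's built-in `hash` on a chosen attribute (a real system would use a stable hash such as MD5, since Python's string hashing is randomised per process); `hash_partition` is an illustrative helper:

```python
def hash_partition(records, key, n_partitions):
    """Route each record by the hash of one attribute, so records with
    the same key value always land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        idx = hash(record[key]) % n_partitions
        partitions[idx].append(record)
    return partitions

orders = [{"customer": c, "total": t}
          for c, t in [("ana", 10), ("ben", 5), ("ana", 7)]]
parts = hash_partition(orders, key="customer", n_partitions=4)

# Both "ana" orders share one partition, whichever one the hash selects.
```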
42
Q

What are the advantages of hash partitioning?

A
  • Records with the same attribute values are stored in the same partition
  • Can be used with skewed data distributions as partitions can be controlled
43
Q

What are the disadvantages of hash partitioning?

A
  • Requires additional information to be able to define the partition
  • Hash collisions can map records with different attributes to the same partition
44
Q

What is range-based partitioning?

A
  • Based on particular attributes
  • Uses sequential keys with equal intervals
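A sketch with hypothetical boundary values; a real system would choose boundaries matching the attribute's natural range:

```python
import bisect

def range_partition(records, key, boundaries):
    """Assign each record to the interval its key falls into, e.g.
    boundaries [100, 200] give the ranges <100, 100-199, >=200."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        idx = bisect.bisect_right(boundaries, record[key])
        partitions[idx].append(record)
    return partitions

sales = [{"amount": a} for a in (30, 150, 220, 90)]
parts = range_partition(sales, key="amount", boundaries=[100, 200])
print([len(p) for p in parts])  # [2, 1, 1]
```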
45
Q

What are the advantages of range-based partitioning?

A
  • Appropriate for attributes with a natural range of values
  • Partitions are a meaningful division of the records
46
Q

What are the disadvantages of range-based partitioning?

A
  • Imbalanced partitions if the values are unevenly distributed
47
Q

How do cloud solutions use horizontal partitioning?

A
  • Amazon Dynamo and Aurora spread partitions across cluster nodes
  • Azure Cosmos DB spreads partitions across different geographic regions
48
Q

What are partitioning strategies available for non-database use?

A
  • Directory-based partitioning
  • Geospatial partitioning
49
Q

What is directory-based partitioning?

A
  • Divides data into folders in a file hierarchy
  • Division is based on attributes, with date of data creation also considered
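Such a folder hierarchy might be derived as below; the `region=.../year=...` path convention is a common (Hive-style) illustration, not mandated, and `partition_path` is a hypothetical helper:

```python
from datetime import date

def partition_path(root, region, created):
    """Build a folder path from an attribute (region) plus the
    creation date -- the hierarchy directory-based partitioning uses."""
    return (f"{root}/region={region}"
            f"/year={created.year}/month={created.month:02d}")

print(partition_path("/data/events", "eu-west", date(2024, 3, 9)))
# /data/events/region=eu-west/year=2024/month=03
```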
50
Q

What is geospatial partitioning?

A
  • Considers geographic locations
  • Used when low latency is needed or data needs to be processed close to the origin
  • Used to comply with data protection regulations or data sovereignty requirements
51
Q

What are examples of non-database cloud partitioning solutions?

A
  • Azure Blob Storage & Azure Data Lake Storage - partition using containers and folders which may be located in different regions
  • Amazon S3 storage uses buckets and prefixes for partitions
52
Q

What components make up distributed data processing?

A
  • Data partitioning
  • Data shuffling
  • Task scheduling
  • Data-based code execution
  • Data Storage
  • Fault tolerance
  • Performance optimisation
53
Q

How is data shuffling used in distributed data processing?

A

Reviews data distributions to manage load balancing and ensure efficient processing

54
Q

How is task scheduling used in distributed data processing?

A
  • Assigns tasks to nodes
  • Supervises execution
  • Handles execution failures
55
Q

How is data-based code execution used in distributed data processing?

A
  • Brings the code to the data
  • Contrasts with the traditional approach of loading data into a centralised processing environment
56
Q

How is data storage used in distributed data processing?

A

Ensures fast access to data for processing, regardless of whether data is stored on disk or in memory

57
Q

How is fault tolerance used in distributed data processing?

A
  • Detects and manages failures
  • Ensures continuity of processing if a node fails
58
Q

How is performance optimisation used in distributed data processing?

A

Focuses on reducing data movements, intermediate steps and overhead communication

59
Q

What is the Hadoop Distributed File System (HDFS)?

A
  • Distributed data storage
  • Data storage foundation for the Hadoop Ecosystem
  • Implements redundant copies of data to provide high availability and fault tolerance
  • Unified view of data at the logical level (files/folders)
  • Resources are split internally and stored on different nodes
60
Q

What is the storage capacity of HDFS?

A

Petabytes of data

61
Q

How can HDFS be scaled?

A
  • Vertically: increasing node capacity
  • Horizontally: adding nodes to the cluster
62
Q

What is the configuration of HDFS?

A

Master-Slave configuration

63
Q

What are the components of HDFS?

A
  • NameNodes
  • DataNodes
64
Q

What does a HDFS NameNode do?

A
  • Maintains access to resources
  • Keeps system metadata
  • Maintains a table to map data blocks to DataNodes
65
Q

What does a HDFS DataNode do?

A
  • Responsible for data storage
  • Made up of commodity hardware with several disks for a large storage capacity
66
Q

How is HDFS fault tolerant?

A
  • High tolerance for failures
  • High availability
  • Addresses failover management at the application level
67
Q

Besides fault tolerance, what is another advantage of HDFS?

A

Parallelisable: able to process tasks simultaneously on several machines

68
Q

What is the typical size of an HDFS data block?

A

128 MB

69
Q

How many nodes is data replicated to in HDFS?

A

At least 3 nodes
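Putting the last two cards together, a file's storage footprint follows from the block size and the replication factor; a back-of-the-envelope sketch (`hdfs_footprint` is an illustrative helper, not an HDFS API):

```python
import math

BLOCK_SIZE_MB = 128   # typical HDFS block size
REPLICATION   = 3     # default replication factor

def hdfs_footprint(file_size_mb):
    """Number of blocks a file splits into, and total stored copies."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

blocks, copies = hdfs_footprint(1000)   # a 1000 MB file
print(blocks, copies)  # 8 blocks, stored as 24 replicated block copies
```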

70
Q

What happens if a DataNode fails?

A

The request is redirected to another DataNode

71
Q

What DataNode state is monitored by a NameNode?

A

Failure state, so that re-replication can be scheduled if the NameNode detects a failure

72
Q

Who originally created the Hadoop Ecosystem?

A

Doug Cutting and Mike Cafarella, building on Google’s GFS and MapReduce papers

73
Q

What is the aim of MapReduce?

A

To provide large scale distributed computing

74
Q

What are the two main functions of MapReduce?

A
  • Map
  • Reduce
75
Q

How does the map function work in MapReduce?

A
  • Takes an input value
  • Carries out stateless computation
  • Outputs a key-value pair
76
Q

Why is MapReduce code copied to machines holding data?

A

To keep data transfers to a minimum

77
Q

What is the shuffle phase of MapReduce?

A
  • An intermediate step between Map and Reduce
  • Sorts all of the keys
78
Q

What is the reduce function of MapReduce?

A

The values of the sorted keys are aggregated by key
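The map, shuffle and reduce steps from the last three cards can be sketched with the classic word-count example, here as plain single-machine Python (real MapReduce distributes each phase across nodes):

```python
from collections import defaultdict

def map_fn(line):
    """Map: stateless computation emitting (key, value) pairs."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values under their keys, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Reduce: aggregate the values of one key."""
    return key, sum(values)

lines = ["to be or not to be"]
pairs = [p for line in lines for p in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs))
print(counts["to"], counts["be"])  # 2 2
```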

79
Q

What are typical applications for MapReduce?

A
  • Rule based filtering
  • Aggregation
  • Counting
  • Descriptive statistics
80
Q

Which algorithm originally used MapReduce?

A

Google’s PageRank algorithm

81
Q

In what way was MapReduce built to be fault tolerant?

A

Expects frequent node failures and failovers

82
Q

What are the disadvantages of MapReduce?

A
  • Frequent read and writes to disk due to repeatable checkpoints to provide fault tolerance
  • Programming complex computations is an intricate and difficult task
83
Q

How does Apache Pig work?

A
  • Has a high-level abstraction of the data processing
  • Develops programs with a simplified scripting language which uses MapReduce under the hood
84
Q

How does Apache Hive work?

A
  • Adds higher level abstraction
  • Provides a SQL-like interface for grouping, querying and joining data
85
Q

What advantage does Spark have over MapReduce?

A
  • Provides a high level abstraction of data storage and processing
  • Faster
  • Can utilise a number of libraries
  • Groups intermediate steps and keeps them in memory as a Directed Acyclic Graph
86
Q

How can Spark be run?

A
  • On a single machine
  • Within a cluster of nodes
87
Q

What can Spark be used for?

A
  • Batch processing
  • Stream processing
88
Q

What programming languages does Spark provide APIs for?

A
  • Python
  • R
  • Scala
  • Java
89
Q

What is Apache Spark Core?

A
  • The foundation of Spark
  • Uses in-memory computation based on Resilient Distributed Datasets
90
Q

What are Resilient Distributed Datasets?

A

An immutable collection of objects distributed across multiple cluster nodes

91
Q

How do Resilient Distributed Datasets work?

A
  • Each RDD holds a collection of data objects split across nodes
  • Resilient refers to data replication and being able to avoid data loss by recovering from node failures
92
Q

What data sources can Spark be used with?

A
  • HDFS
  • S3
  • Relational Databases
  • NoSQL databases
93
Q

What are some of the different libraries available for Spark?

A
  • Spark SQL
  • Spark Streaming
  • Spark MLlib
  • Spark GraphX
  • PySpark
94
Q

What is Spark SQL?

A
  • Framework for processing structured data
  • SQL and DataFrames can be used to query/work with various data sources
95
Q

What is Spark Streaming?

A
  • Used for batch and stream processing
  • Provides scalable, high-throughput and fault-tolerant stream processing from different streaming services such as Kafka
  • Uses mini-batches for stream processing
96
Q

What is the interval size for a mini-batch used in Spark Streaming?

A

Batch interval limited to seconds or less
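The micro-batching idea can be sketched by bucketing timestamped events into fixed-width intervals; this is a single-machine illustration, not Spark API code, and `mini_batches` is a hypothetical helper:

```python
def mini_batches(events, batch_interval=1.0):
    """Group (timestamp, value) events into consecutive fixed-width
    batches, the way Spark Streaming's micro-batching slices a stream."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.4, "c"), (2.7, "d")]
print(mini_batches(events))  # [['a', 'b'], ['c'], ['d']]
```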

97
Q

What is Spark MLlib?

A
  • Machine Learning library
  • Implements a range of algorithms
  • Used in large scale machine learning
98
Q

What is Spark GraphX?

A

Used for distributed graph processing

99
Q

What is PySpark?

A
  • Python library
  • Used for writing parallelisable code for data processing and machine learning
  • Provides data processing across multiple cluster nodes
100
Q

What data processing can PySpark carry out?

A
  • Mapping
  • Filtering
  • Joining
  • Group-by-key operations
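These operations can be illustrated with pure-Python stand-ins (in real PySpark they would be `rdd.map`, `rdd.filter`, `rdd.join` and `rdd.groupByKey`, executed in parallel across cluster nodes; the data here is arbitrary):

```python
from itertools import groupby

data = [("a", 1), ("b", 2), ("a", 3)]

mapped   = [(k, v * 10) for k, v in data]             # map
filtered = [(k, v) for k, v in mapped if v > 10]      # filter
grouped  = {k: [v for _, v in g]                      # group-by-key
            for k, g in groupby(sorted(mapped), key=lambda kv: kv[0])}

print(grouped)  # {'a': [10, 30], 'b': [20]}
```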
101
Q

What is Apache Storm?

A

A framework for processing distributed data streams

102
Q

How does Apache Storm work?

A

Uses Directed Acyclic Graphs made up of small, discrete operations that make up the data transformation process

103
Q

How does Apache Storm process data?

A
  • Streaming data are transported between nodes, travelling along edges
  • A particular data transformation takes place at each node
104
Q

What is Apache Samza?

A

A framework for processing distributed data streams

105
Q

How does Apache Samza work?

A

Allows for stateful applications processing data from streaming resources such as Kafka

106
Q

What are the components of Apache Samza?

A
  • Producers: write to a topic
  • Consumers: read data from a topic
  • Brokers: handle the data processing in a cluster of machines
107
Q

How does Apache Samza process data?

A

Incoming data are sorted into topics

108
Q

What is Apache Flink?

A
  • A framework for processing distributed data streams
  • High performance solution
109
Q

How does Apache Flink work?

A
  • Implements batch and stream processing
  • Executes stateful processing such as windowed processing
110
Q

What are the typical processing steps in Apache Flink?

A
  • Input data is read from streams and partitioned for parallel processing
  • Data transformed using map, filter or reduce operations
  • Data is aggregated, has windowed processing applied and is then output to a data sink
111
Q

How is fault tolerance implemented in Apache Flink?

A

Uses checkpoints of the data stream state

112
Q

How does AWS Elastic MapReduce’s data processing framework work?

A

Processes partitioned S3 data

113
Q

How does HDInsight’s data processing framework work?

A
  • Managed cloud service for processing big data in a scalable and fault tolerant way
  • Uses Hadoop, Spark or Hive
114
Q

What is Databricks’ data processing framework?

A
  • Managed platform for big data analysis and machine learning in the cloud
  • Deployed on Spark
115
Q

What utility does Databricks’ processing framework provide?

A
  • Data ingestion solutions for big data analysis and machine learning in the cloud
  • User management and utility tools
  • Highly scalable and can process large amounts of data