Chapter 5: Data-Intensive Applications Flashcards

1
Q

What are the challenges of large-scale data processing?

A

Writing efficient parallel and distributed processing jobs is hard.
■ Extremely high performance/throughput is possible
■ Problem: highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: a suitable abstraction layer for developers

2
Q

What are the three requirements for an abstraction layer for data-intensive applications?

A
  1. Developers don't have to think about parallelization
    ♦ Can continue to write sequential code
    ♦ Code is independent of the degree of parallelism at runtime
  2. Developers don't have to think about fault tolerance
    ♦ Abstraction layer takes care of failed nodes
    ♦ Re-executes lost parts of computations if necessary
  3. Developers don't have to think about load balancing
    ♦ Abstraction layer is in charge of distributing the work evenly across the available compute nodes
3
Q

What is Map and Reduce?

A

MapReduce is a programming model.
Map and reduce are second-order functions that take first-order functions provided by the developer as input. The model operates on key-value (KV) pairs: data is passed as KV pairs in all phases.

4
Q

What are the Signature and Guarantees of the Map Function?

A

Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce zero or more ([0,*]) KV pairs as output
● Useful for projections, selections, …
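
For example, a minimal sketch in Python (illustration only, not from the source material) of a first-order map function for word count, assuming the input pair is (line number, line text):

def map_fn(k1, v1):
    # Invoked once per input KV pair; may emit zero or more output pairs.
    # Here: emit (word, 1) for every word in the line.
    return [(word, 1) for word in v1.split()]

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]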

5
Q

What are the Signature and Guarantees of the Reduce Function?

A

Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
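
For example, a minimal matching reduce function for word count (illustration only, in Python; assumes the map output sketched above):

def reduce_fn(k2, values):
    # All values for one key arrive in a single invocation; aggregate them.
    return [(k2, sum(values))]

print(reduce_fn("to", [1, 1]))  # [('to', 2)]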

6
Q

Name the five steps of Map and Reduce.

A
  • Input Data
  • Map Phase
  • Shuffle Phase - group intermediate results of the map phase by key
  • Reduce Phase
  • Output Data
7
Q

Explain how MapReduce works. (Example Word Count)

A
  1. Input: KV pairs (e.g., line number → line text) are passed as input to the map function
  2. Map Phase: the map function is executed on each KV pair → intermediate (word, 1) results for every word in a line
  3. Shuffle Phase: groups the intermediate results of the map phase by key
  4. Reduce Phase: the reduce function (here: a count) is executed on the grouped KV pairs
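
A minimal single-process sketch (plain Python, illustration only, not a distributed runtime) that wires the phases together for word count:

from collections import defaultdict

def map_fn(k1, v1):
    # Map: emit (word, 1) for every word in a line.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce: sum the counts for one word.
    return [(k2, sum(values))]

lines = ["to be or not to be", "to do or not to do"]

# 1. Input: KV pairs (line number, line text)
# 2. Map phase
intermediate = []
for k1, v1 in enumerate(lines):
    intermediate.extend(map_fn(k1, v1))

# 3. Shuffle phase: group intermediate results by key
groups = defaultdict(list)
for k2, v2 in intermediate:
    groups[k2].append(v2)

# 4. Reduce phase
output = []
for k2, values in groups.items():
    output.extend(reduce_fn(k2, values))

print(output)  # [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('do', 2)]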
8
Q

Explain the Map function.

A

■ First-order function provided by user

■ Specifies what happens to the data in the job's map phase

9
Q

Explain the Mapper.

A

■ A process running on a worker node

■ Invokes map function for each KV pair

10
Q

Explain the Reduce function.

A

■ First-order function provided by user

■ Specifies what happens to the data in the job's reduce phase

11
Q

Explain the Reducer.

A

■ Process invoking reduce function on grouped data

12
Q

How does the Distributed Execution of MapReduce work?

A
  1. Client partitions input file into input splits
  2. Client submits job to master
  3. A mapper is started for each input split
  4. Reducers pull intermediate data from the mappers over the network
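
For illustration, a minimal single-process sketch (plain Python, not an actual MapReduce runtime; all names are made up) of how input splits, hash partitioning and per-partition reducers fit together:

from collections import defaultdict

def map_fn(k1, v1):
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    return [(k2, sum(values))]

# 1. The "client" partitions the input into splits (here: lists of lines).
#    (Step 2, submitting the job to a master, has no counterpart in this toy sketch.)
input_splits = [["to be or not to be"], ["to do or not to do"]]
NUM_REDUCERS = 2

# 3. One "mapper" runs per input split; each mapper partitions its output
#    by hash(key) so that all pairs with the same key reach the same reducer.
partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for split in input_splits:
    for k1, v1 in enumerate(split):
        for k2, v2 in map_fn(k1, v1):
            partitions[hash(k2) % NUM_REDUCERS][k2].append(v2)

# 4. Each "reducer" pulls its partition and reduces the grouped values.
output = []
for partition in partitions:
    for k2, values in partition.items():
        output.extend(reduce_fn(k2, values))

print(output)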
13
Q

What are the limitations of MapReduce?

A
  1. Assumes finite input (files only)
    The limitation to finite input prevents stream processing
  2. Data between MR jobs must go through the Google File System (GFS)
    The constraint to write to GFS is especially detrimental for iterative algorithms
14
Q

What are three possible MapReduce failure scenarios?

A
  1. Mapper fails
  2. Reducer fails
  3. Entire worker node fails
15
Q

How is a Mapper failure handled?

A

■ Master detects failure through missing status report

■ Mapper is restarted on a different node and re-reads its input data from GFS

16
Q

How is a Reducer failure handled?

A

■ Again, detected through missing status report

■ Reducer is restarted on different node, pulls intermediate results for its partition from mappers again

17
Q

How is the failure of an entire worker node handled?

A

■ Master re-schedules lost mappers and reducers

■ Finished mappers may be restarted to recompute lost intermediate results

18
Q

Explain the master-worker pattern in MapReduce.

A

Master
■ Responsible for job scheduling
■ Monitoring worker nodes, detecting dead nodes
■ Load balancing
Workers
■ Executing map and reduce functions
■ Storing input/output data (in traditional setup)
■ Periodically report availability to master node

19
Q

Analytics Cluster Setup.

A
  • Applications
  • Processing Frameworks
  • Resource Management System
  • Distributed File System
20
Q

Comparison to HPC.

A
HPC:
- Flexible and fast low-level code
- Architecture-specific implementations
- High-performance hardware
Data-intensive Apps/ Distr. Dataflows:
- High-level dataflows
- Scalable fault-tolerant distr. engines
- Commodity Clusters
21
Q

GFS (Google File System).

A

Criteria:

  • Scalability
  • High performance
  • Support for commodity clusters
  • Fault-tolerance
22
Q

Distributed Storage.

A

A distributed file system and systems on top of it, running on commodity servers. Fault tolerance and parallel access are achieved through replication.

23
Q

HDFS (Hadoop Distributed File System).

A

All data is stored in blocks, which are replicated on multiple data nodes. For scalability, HDFS is a write-once-read-many file system.

24
Q

Google Bigtable.

A

Efficiently retrieves structured data. A high-throughput NoSQL store realized as a multi-dimensional sorted map on top of GFS.
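
As an illustration of the data model only (not the Bigtable API), a small Python sketch treating the table as a map from (row key, column, timestamp) to value:

# Toy stand-in for Bigtable's multi-dimensional sorted map; all names are
# hypothetical and nothing here talks to an actual Bigtable/GFS cluster.
table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def scan_row_prefix(prefix):
    # Return all cells whose row key starts with the prefix, in sorted key order.
    return sorted((key, val) for key, val in table.items() if key[0].startswith(prefix))

put("com.example/index.html", "contents:", "<html>v2</html>", timestamp=2)
put("com.example/index.html", "contents:", "<html>v1</html>", timestamp=1)
put("com.example/about.html", "anchor:home", "About", timestamp=1)

for key, val in scan_row_prefix("com.example/"):
    print(key, "->", val)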

25
Q

Spark.

A

At its core are parallel transformations of Resilient Distributed Datasets (RDDs). For streaming, Spark discretizes the stream into micro-batches.
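
For example, a minimal PySpark word-count sketch built from RDD transformations (assumes a local PySpark installation; the master "local[*]", the app name and the sample input are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.parallelize(["to be or not to be", "to do or not to do"])
counts = (lines
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # aggregate counts per word

print(counts.collect())
sc.stop()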

26
Q

Flink.

A

At its core is a streaming dataflow engine that supports both batch and stream processing. Jobs are expressed as directed acyclic graphs (DAGs); execution follows a master-slave paradigm.

27
Q

Spark vs. Flink.

A
  • Flink has lower latency (‘Real’ Streaming: tuple-wise processing)
  • Spark has higher throughput (Microbatches: processing of small batches of tuples)