Chapter 5: Data-Intensive Applications Flashcards

1
Q

What are the challenges of large-scale data processing?

A

Writing efficient parallel and distributed processing jobs is hard.
■ Extremely high performance/throughput is possible
■ Problem: highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: a suitable abstraction layer for developers

2
Q

What are the three requirements for an abstraction layer for data-intensive applications?

A
  1. Developers don't have to think about parallelization
    ♦ Can continue to write sequential code
    ♦ Code is independent of the degree of parallelism at runtime
  2. Developers don't have to think about fault tolerance
    ♦ Abstraction layer takes care of failed nodes
    ♦ Re-executes lost parts of computations if necessary
  3. Developers don't have to think about load balancing
    ♦ Abstraction layer is in charge of distributing the work evenly across the available compute nodes
3
Q

What is Map and Reduce?

A

MapReduce is a programming model.
Map and reduce are second-order functions that take first-order functions provided by the developer as input. The model operates on key-value (KV) pairs: data is passed as KV pairs in all phases.

4
Q

What are the Signature and Guarantees of the Map Function?

A

Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce zero or more ([0,*]) KV pairs as output
● Useful for projections, selections, …
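
For example, a minimal sketch in Python (illustration only, not from the source material) of a first-order map function for word count, assuming the input pair is (line number, line text):

def map_fn(k1, v1):
    # Invoked once per input KV pair; may emit zero or more output pairs.
    # Here: emit (word, 1) for every word in the line.
    return [(word, 1) for word in v1.split()]

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]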

5
Q

What are the Signature and Guarantees of the Reduce Function?

A

Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
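
For example, a minimal matching reduce function for word count (illustration only, in Python; assumes the map output sketched above):

def reduce_fn(k2, values):
    # All values for one key arrive in a single invocation; aggregate them.
    return [(k2, sum(values))]

print(reduce_fn("to", [1, 1]))  # [('to', 2)]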

6
Q

Name the five steps of Map and Reduce.

A
  • Input Data
  • Map Phase
  • Shuffle Phase - group intermediate results of the map phase by key
  • Reduce Phase
  • Output Data
7
Q

Explain how MapReduce works. (Example Word Count)

A
  1. Input: KV pairs (e.g., line number → line text) are passed as input to the map function
  2. Map Phase: the map function is executed on each KV pair → intermediate (word, 1) results for every word in a line
  3. Shuffle Phase: groups the intermediate results of the map phase by key
  4. Reduce Phase: the reduce function (here: a count) is executed on the grouped KV pairs
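
A minimal single-process sketch (plain Python, illustration only, not a distributed runtime) that wires the phases together for word count:

from collections import defaultdict

def map_fn(k1, v1):
    # Map: emit (word, 1) for every word in a line.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce: sum the counts for one word.
    return [(k2, sum(values))]

lines = ["to be or not to be", "to do or not to do"]

# 1. Input: KV pairs (line number, line text)
# 2. Map phase
intermediate = []
for k1, v1 in enumerate(lines):
    intermediate.extend(map_fn(k1, v1))

# 3. Shuffle phase: group intermediate results by key
groups = defaultdict(list)
for k2, v2 in intermediate:
    groups[k2].append(v2)

# 4. Reduce phase
output = []
for k2, values in groups.items():
    output.extend(reduce_fn(k2, values))

print(output)  # [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('do', 2)]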
8
Q

Explain the Map function.

A

■ First-order function provided by user

■ Specifies what happens to the data in the job's map phase

9
Q

Explain the Mapper.

A

■ A process running on a worker node

■ Invokes map function for each KV pair

10
Q

Explain the Reduce function.

A

■ First-order function provided by user

■ Specifies what happens to the data in the job's reduce phase

11
Q

Explain the Reducer.

A

■ Process invoking reduce function on grouped data

12
Q

How does the Distributed Execution of MapReduce work?

A
  1. Client partitions input file into input splits
  2. Client submits job to master
  3. A mapper is started for each input split
  4. Reducers pull intermediate data from the mappers over the network
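
For illustration, a minimal single-process sketch (plain Python, not an actual MapReduce runtime; all names are made up) of how input splits, hash partitioning and per-partition reducers fit together:

from collections import defaultdict

def map_fn(k1, v1):
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    return [(k2, sum(values))]

# 1. The "client" partitions the input into splits (here: lists of lines).
#    (Step 2, submitting the job to a master, has no counterpart in this toy sketch.)
input_splits = [["to be or not to be"], ["to do or not to do"]]
NUM_REDUCERS = 2

# 3. One "mapper" runs per input split; each mapper partitions its output
#    by hash(key) so that all pairs with the same key reach the same reducer.
partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for split in input_splits:
    for k1, v1 in enumerate(split):
        for k2, v2 in map_fn(k1, v1):
            partitions[hash(k2) % NUM_REDUCERS][k2].append(v2)

# 4. Each "reducer" pulls its partition and reduces the grouped values.
output = []
for partition in partitions:
    for k2, values in partition.items():
        output.extend(reduce_fn(k2, values))

print(output)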
13
Q

What are the limitations of MapReduce?

A
  1. Assumes finite input (files only)
    The limitation to finite input prevents stream processing
  2. Data between MR jobs must go through the Google File System (GFS)
    The constraint to write to GFS is especially detrimental for iterative algorithms
14
Q

What are three possible MapReduce failure scenarios?

A
  1. Mapper fails
  2. Reducer fails
  3. Entire worker node fails
15
Q

How is a Mapper failure handled?

A

■ Master detects failure through missing status report

■ Mapper is restarted on a different node and re-reads its input data from GFS

16
Q

How is a Reducer failure handled?

A

■ Again, detected through missing status report

■ Reducer is restarted on different node, pulls intermediate results for its partition from mappers again

17
Q

How is the failure of an entire worker node handled?

A

■ Master re-schedules lost mappers and reducers

■ Finished mappers may be restarted to recompute lost intermediate results

18
Q

Explain the master-worker pattern in MapReduce.

A

Master
■ Responsible for job scheduling
■ Monitoring worker nodes, detecting dead nodes
■ Load balancing
Workers
■ Executing map and reduce functions
■ Storing input/output data (in traditional setup)
■ Periodically report availability to master node

19
Q

Analytics Cluster Setup.

A
  • Applications
  • Processing Frameworks
  • Resource Management System
  • Distributed File System
20
Q

Comparison to HPC.

A
HPC:
- Flexible and fast low-level code
- Architecture-specific implementations
- High-performance hardware
Data-intensive Apps/ Distr. Dataflows:
- High-level dataflows
- Scalable fault-tolerant distr. engines
- Commodity Clusters
21
Q

GFS (Google File System).

A

Criteria:

  • Scalability
  • High performance
  • Support for commodity clusters
  • Fault-tolerance
22
Q

Distributed Storage.

A

A distributed file system and systems on top of it, running on commodity servers. Fault tolerance and parallel access are achieved through replication.

23
Q

HDFS (Hadoop Distributed File System).

A

All data is stored in blocks, which are replicated on multiple data nodes. For scalability, HDFS is a write-once-read-many file system.

24
Q

Google Bigtable.

A

Efficiently retrieves structured data. A high-throughput NoSQL store realized as a multi-dimensional sorted map on top of GFS.
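
As an illustration of the data model only (not the Bigtable API), a small Python sketch treating the table as a map from (row key, column, timestamp) to value:

# Toy stand-in for Bigtable's multi-dimensional sorted map; all names are
# hypothetical and nothing here talks to an actual Bigtable/GFS cluster.
table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def scan_row_prefix(prefix):
    # Return all cells whose row key starts with the prefix, in sorted key order.
    return sorted((key, val) for key, val in table.items() if key[0].startswith(prefix))

put("com.example/index.html", "contents:", "<html>v2</html>", timestamp=2)
put("com.example/index.html", "contents:", "<html>v1</html>", timestamp=1)
put("com.example/about.html", "anchor:home", "About", timestamp=1)

for key, val in scan_row_prefix("com.example/"):
    print(key, "->", val)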

25
Q

Spark.

A

At its core are parallel transformations of Resilient Distributed Datasets (RDDs). For streaming, Spark discretizes the stream into micro-batches.
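
For example, a minimal PySpark word-count sketch built from RDD transformations (assumes a local PySpark installation; the master "local[*]", the app name and the sample input are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.parallelize(["to be or not to be", "to do or not to do"])
counts = (lines
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # aggregate counts per word

print(counts.collect())
sc.stop()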

26
Q

Flink.

A

At its core is a streaming dataflow engine that supports both batch and stream processing. Jobs are expressed as directed acyclic graphs (DAGs); execution follows a master-slave paradigm.

27
Q

Spark vs. Flink.

A
  • Flink has lower latency (‘Real’ Streaming: tuple-wise processing)
  • Spark has higher throughput (Microbatches: processing of small batches of tuples)