Chapter 4: Flashcards

1
Q

What are the complexities of Data and Analytics? (4V Data)

A
  • volume (data size) ← Big Data
  • velocity (freshness, data rate, streams)
  • variety (format/media type)
  • veracity (uncertainty/quality)
2
Q

What are the complexities of Data and Analytics? (4I Analysis)

A
  • interactive (visual analytics, ad-hoc)
  • integrative (extraction, fusion)
  • iterative (learning, models)
  • incremental (mutable state, windows)
3
Q

What are the challenges of large-scale data processing?

A

● Large clusters/clouds have 100s/1000s of servers
■ Extremely high performance/throughput possible
■ Problem: Highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: A suitable abstraction layer for developers

4
Q

What are the three requirements for an abstraction layer for data-intensive applications?

A
  1. Developers don't have to think about parallelization
    ♦ Can continue to write sequential code
    ♦ Code is independent of the degree of parallelism at runtime
  2. Developers don't have to think about fault tolerance
    ♦ Abstraction layer takes care of failed nodes
    ♦ Re-executes lost parts of computations if necessary
  3. Developers don't have to think about load balancing
    ♦ Abstraction layer is in charge of distributing the work evenly across the available compute nodes
5
Q

What is Map and Reduce?

A

MapReduce is a programming model.
Map and reduce are second-order functions that take first-order functions provided by the developer as input. The model operates on key-value (KV) pairs: data is passed as KV pairs in all phases.
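The second-order idea in miniature, using Python's built-in map as an analogy only (this is not the distributed MapReduce itself):

```python
# map is a second-order function: it takes the first-order
# function len, supplied by the developer, as an argument.
lengths = list(map(len, ["big", "data"]))  # [3, 4]
```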

6
Q

What are the Signature and Guarantees of the Map Function?

A
Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce [0, *] KV pairs as output (zero, one, or many)
● Useful for projection, selection, ...
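A minimal sketch of such a first-order map function in Python, for word count; the function name and the choice of line number as input key are illustrative assumptions, not fixed by the model:

```python
def word_count_map(key, value):
    # key (k1): line number; value (v1): one line of text.
    # Emits [0, *] output pairs: one (word, 1) pair per word,
    # zero pairs for an empty line.
    return [(word, 1) for word in value.split()]
```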
7
Q

What are the Signature and Guarantees of the Reduce Function?

A

Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
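The matching first-order reduce function for word count, again a sketch with illustrative names; the shuffle phase guarantees that all counts for one word arrive in a single call:

```python
def word_count_reduce(key, values):
    # key (k2): a word; values (list(v2)): every 1 emitted for it.
    # Returns the aggregated pair (word, total count).
    return [(key, sum(values))]
```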

8
Q

Name the five steps of Map and Reduce.

A
  • Input Data
  • Map Phase
  • Shuffle Phase - group intermediate results of the map phase by key
  • Reduce Phase
  • Output Data
9
Q

Explain how MapReduce works. (Example: Word Count)

A
  1. Input: KV pairs are transferred as input to the map function
  2. Map Phase: The map function is executed on each KV pair → intermediate results for each word per line
  3. Shuffle Phase: Groups the results from the map phase by key
  4. Reduce Phase: The reduce function (count) is executed on the grouped KV pairs
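A minimal, single-process sketch of the whole word-count pipeline in Python (function and variable names are illustrative; a real framework runs these phases distributed across nodes):

```python
from collections import defaultdict

def word_count_map(key, line):
    # Map: (k1, v1) -> list(k2, v2); emit (word, 1) per word.
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Reduce: (k2, list(v2)) -> list(k3, v3); sum the counts.
    return [(word, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: invoke map_fn once per input KV pair.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle phase: group intermediate results by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: invoke reduce_fn once per key group.
    output = []
    for k2, values in groups.items():
        output.extend(reduce_fn(k2, values))
    return output

# Input: KV pairs of (line number, line of text).
lines = [(1, "to be or not to be"), (2, "to do")]
print(run_mapreduce(lines, word_count_map, word_count_reduce))
# [('to', 3), ('be', 2), ('or', 1), ('not', 1), ('do', 1)]
```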
10
Q

Explain the Map function.

A

■ First-order function provided by user

■ Specifies what happens to the data in the job's map phase

11
Q

Explain the Mapper.

A

■ A process running on a worker node

■ Invokes map function for each KV pair

12
Q

Explain the Reduce function.

A

■ First-order function provided by user

■ Specifies what happens to the data in the job's reduce phase

13
Q

Explain the Reducer.

A

■ Process invoking reduce function on grouped data

14
Q

How does the Distributed Execution of MapReduce work?

A
  1. Client partitions the input file into input splits
  2. Client submits the job to the master
  3. A mapper is started for each input split
  4. Reducers pull the data for their partition from the mappers over the network
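How intermediate keys map to reducer partitions is framework-specific; hash partitioning is a common default, sketched here (the modulo scheme is an assumption, not part of the model's contract):

```python
def partition(key, num_reducers):
    # Assigns every intermediate key to exactly one reducer,
    # so all KV pairs sharing a key are pulled by the same reducer.
    return hash(key) % num_reducers
```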
15
Q

What are the limitations of MapReduce?

A
  1. Assumes finite input (files only)
    The restriction to finite input prevents stream processing
  2. Data between MR jobs must go through the Google File System (GFS)
    The constraint to write to GFS is especially detrimental for iterative algorithms
16
Q

What are three possible MapReduce fault scenarios?

A
  1. Mapper fails
  2. Reducer fails
  3. Entire worker node fails
17
Q

How can a Mapper fault be corrected?

A

■ Master detects the failure through a missing status report

■ Mapper is restarted on a different node and re-reads its data from GFS

18
Q

How can a Reducer fault be corrected?

A

■ Again, detected through a missing status report

■ Reducer is restarted on different node, pulls intermediate results for its partition from mappers again

19
Q

How can an entire worker node fault be corrected?

A

■ Master re-schedules lost mappers and reducers

■ Finished mappers may be restarted to recompute lost intermediate results

20
Q

Explain the master-worker pattern in MapReduce.

A

Master
■ Responsible for job scheduling
■ Monitoring worker nodes, detecting dead nodes
■ Load balancing
Workers
■ Executing map and reduce functions
■ Storing input/output data (in the traditional setup)
■ Periodically report availability to the master node
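A toy sketch of the availability-report mechanism in Python; the class name, method names, and timeout value are assumptions for illustration:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value

class Master:
    def __init__(self):
        self.last_report = {}  # worker id -> time of last status report

    def receive_report(self, worker_id):
        # Workers periodically report availability to the master.
        self.last_report[worker_id] = time.time()

    def dead_workers(self):
        # A worker whose report is overdue is considered failed;
        # its mappers and reducers would be re-scheduled elsewhere.
        now = time.time()
        return [w for w, t in self.last_report.items()
                if now - t > HEARTBEAT_TIMEOUT]
```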