Chapter 4: Flashcards
What are the complexities of Data and Analytics (4V Data)?
- volume (data size) ← Big Data
- velocity (freshness, data rate, streams)
- variety (format/media type)
- veracity (uncertainty/quality)
What are the complexities of Data and Analytics (4I Analysis)?
- interactive (visual analytics, ad-hoc)
- integrative (extraction, fusion)
- iterative (learning, models)
- incremental (mutable state, windows)
What are the challenges of large-scale data processing?
● Large clusters/clouds have 100s/1000s of servers
■ Extremely high performance/throughput possible
■ Problem: Highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: Suitable abstraction layer for developers
What are the three requirements for an abstraction layer for data-intensive applications?
- Developers don't have to think about parallelization
♦ Can continue to write sequential code
♦ Code is independent of the degree of parallelism at runtime
- Developers don't have to think about fault tolerance
♦ Abstraction layer takes care of failed nodes
♦ Re-executes lost parts of computations if necessary
- Developers don't have to think about load balancing
♦ Abstraction layer is in charge of distributing the work evenly across the available compute nodes
What is Map and Reduce?
MapReduce is a programming model.
Map and reduce are second-order functions: they take first-order functions provided by the developer as input. The model operates on key-value (KV) pairs; data is passed as KV pairs in all phases.
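A minimal Python sketch of this idea (illustrative names, not any framework's actual API): run_map and run_reduce play the role of the second-order functions, and the developer plugs first-order functions into them.

```python
# Second-order functions: they take the developer's first-order
# functions as arguments and apply them to KV pairs.

def run_map(map_fn, records):
    # map_fn: (k1, v1) -> list of (k2, v2)
    out = []
    for k1, v1 in records:
        out.extend(map_fn(k1, v1))
    return out

def run_reduce(reduce_fn, grouped):
    # reduce_fn: (k2, list(v2)) -> list of (k3, v3)
    out = []
    for k2, values in grouped.items():
        out.extend(reduce_fn(k2, values))
    return out
```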
What are the Signature and Guarantees of the Map Function?
Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce [0,*] KV pairs as output
● Useful for projections, selections, ...
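For example, a word-count map function with this signature (a sketch; the name wc_map is hypothetical): it is invoked once per (line offset, line text) pair and may emit zero or more output pairs.

```python
def wc_map(key, value):
    # key: line offset (ignored); value: line text.
    # Emits one (word, 1) pair per word, i.e. [0,*] output pairs per input pair.
    return [(word, 1) for word in value.split()]
```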
What are the Signature and Guarantees of the Reduce Function?
Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
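The matching word-count reduce function (again a sketch with a hypothetical name): since all values for the same key arrive in a single invocation, a simple sum produces the final count.

```python
def wc_reduce(key, values):
    # values: every count emitted for this key, grouped by the shuffle phase.
    return [(key, sum(values))]
```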
Name the five steps of Map and Reduce.
- Input Data
- Map Phase
- Shuffle Phase: group intermediate results of the map phase by key
- Reduce phase
- Output Data
Explain how MapReduce works. (Example: Word Count)
- Input: KV pairs (e.g., line offset and line text) are passed as input to the map function
- Map Phase: the map function is executed on each KV pair → emits an intermediate (word, 1) pair for every word in the line
- Shuffle Phase: groups the results from the map phase by key
- Reduce Phase: executes the reduce function (count) on the grouped KV pairs
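Tying the sketches above together, a single-process simulation of the phases (reusing run_map, run_reduce, wc_map, and wc_reduce from the earlier cards; only the shuffle step is new):

```python
from collections import defaultdict

def shuffle(intermediate):
    # Shuffle phase: group intermediate (k2, v2) pairs by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    return groups

lines = [(0, "to be or not to be")]  # input data: (offset, line) KV pairs
counts = run_reduce(wc_reduce, shuffle(run_map(wc_map, lines)))
print(counts)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```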
Explain the Map function.
■ First-order function provided by user
■ Specifies what happens to the data in the job's map phase
Explain the Mapper.
■ A process running on a worker node
■ Invokes map function for each KV pair
Explain the Reduce function.
■ First-order function provided by user
■ Specifies what happens to the data in the job's reduce phase
Explain the Reducer.
■ Process invoking reduce function on grouped data
How does the Distributed Execution of MapReduce work?
- Client partitions the input file into input splits
- Client submits the job to the master
- A mapper is started for each input split
- Reducers pull data from the mappers over the network
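One detail worth a sketch: how mapper output is routed to reducers. A common scheme (an assumption here, not stated on the card) is a deterministic hash partitioner, so every mapper assigns the same key to the same reducer and each reducer pulls exactly its partition from all mappers.

```python
import zlib

NUM_REDUCERS = 4  # illustrative cluster configuration

def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic hash: all KV pairs with the same key land in the
    # same partition, and therefore at the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```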
What are the Limitations of MapReduce?
- Assumes finite input (files only)
♦ The limitation to finite input prevents stream processing
- Data between MR jobs must be written to the Google File System (GFS)
♦ The constraint to write to GFS is especially detrimental for iterative algorithms