MapReduce Flashcards

1
Q

MapReduce
A MapReduce __ is a unit of work that the client wants to be performed
Input data and the MapReduce program are deployed to ____ for execution
The job is broken down into ___
____ is a process of identifying _____
____ is a process of iterative ___ to identify desired values

A

A MapReduce job is a unit of work that the client wants to be performed
- Input data and the MapReduce program are deployed to server nodes for execution
- The job is broken down into tasks (map and reduce)
- Mapping is a process of identifying key/value pairs
- Reducing is a process of iterative sorting to identify desired values

2
Q

Key Value Mapping
___ function extracts the year and temperature (Celsius)
Year and temperature are the ___

A

Mapper function extracts the year and temperature (Celsius)
- Year and temperature values are the output
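For illustration only (not part of the original deck), here is a minimal Hadoop Streaming-style mapper sketch for this card's idea. It assumes each input line is a hypothetical comma-separated record such as `2024,station-42,31.5`; the actual record layout behind the deck's example is not shown here.

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming-style mapper (assumed record layout: "year,station,temp_celsius").
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 3:
        continue                      # skip malformed records
    year, temp = fields[0], fields[2]
    try:
        float(temp)                   # keep only parsable temperatures
    except ValueError:
        continue
    # Emit one key/value pair per record: the year is the key, the temperature is the value.
    print(f"{year}\t{temp}")
```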

3
Q

Output
__ are output and sorted

A

Values are output and sorted

4
Q

Reducing
___ iterates through the list and selects the maximum temperature values

A

Reducer
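As a companion to the mapper sketch above (again illustrative, not the deck's own code), a Hadoop Streaming-style reducer can select the maximum temperature per year. It assumes its input is the mapper's `year<TAB>temperature` output, already sorted by key as the shuffle would deliver it.

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming-style reducer: emits the maximum temperature per year.
import sys

current_year, max_temp = None, None

for line in sys.stdin:
    year, temp = line.strip().split("\t")
    temp = float(temp)
    if year != current_year:
        if current_year is not None:
            print(f"{current_year}\t{max_temp}")   # finished the previous key
        current_year, max_temp = year, temp
    else:
        max_temp = max(max_temp, temp)             # iterate through the values, keep the max

if current_year is not None:
    print(f"{current_year}\t{max_temp}")           # emit the last key
```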

5
Q

MapReduce Process Flow
Map extracts ____
Shuffle collects ___
Reduce locates ____
Unix command equivalent function is ____

A

- Map extracts values
- Shuffle collects values
- Reduce locates specific output
- The Unix-command equivalent function is shown under the flow
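To make the flow concrete, here is a small in-memory sketch (illustrative only; the sample lines are invented) that mimics the three stages on a word count. The stage boundaries mirror the Unix-pipeline analogy the card alludes to (mapper | sort | reducer).

```python
from itertools import groupby
from operator import itemgetter

lines = ["hadoop mapreduce is a framework", "mapreduce is parallel"]

# Map: emit (key, value) pairs -- here (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: sort by key and group all values for the same key together.
mapped.sort(key=itemgetter(0))
grouped = [(key, [v for _, v in pairs]) for key, pairs in groupby(mapped, key=itemgetter(0))]

# Reduce: collapse each key's value list into the desired output (a count).
reduced = [(key, sum(values)) for key, values in grouped]
print(reduced)   # [('a', 1), ('framework', 1), ('hadoop', 1), ('is', 2), ('mapreduce', 2), ('parallel', 1)]
```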

6
Q

Map Tasks
Input __
MapReduce ____
Configuration ____
Map ___
One map task is assigned to each ___

A

- Input data
- MapReduce program
- Configuration information
- Map tasks
- One map task is assigned to each split

7
Q

Map Selection of Nodes
Tries to run the map tasks on a node ____ where the data resides
If full looks for a ______ on the same server rack
If busy, an ___ off-rack node is supplied

A

- Tries to run the map tasks on a node where the data resides
- If full, looks for a free map node on the same server rack
- If busy, an off-rack node is supplied

8
Q

Reduce Tasks
Input is the ____ from all mappers
Output is ______ on the node where reduce is running

A

- Input is the output from all mappers
- Output is transferred and merged on the node where reduce is running

9
Q

The Shuffle
You can choose the ____ for a job
When multiple, map tasks ____, allocating one partition for each ____

A

- You can choose the number of reduce tasks for a job
- When there are multiple reducers, map tasks partition their output, allocating one partition for each reduce task
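As a sketch of the idea (Hadoop's actual default is a hash partitioner implemented in Java; this Python version is only an analogy, not the deck's code), each map task can route every key to one partition per reduce task:

```python
def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Hash partitioning sketch: route a key to one of the reduce tasks."""
    return hash(key) % num_reduce_tasks

# Each map task keeps one output partition per reducer.
num_reduce_tasks = 2
partitions = [[] for _ in range(num_reduce_tasks)]
for key, value in [("hadoop", 1), ("is", 1), ("is", 1)]:
    partitions[partition_for(key, num_reduce_tasks)].append((key, value))

print(partitions)   # pairs that share a key always land in the same partition
```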

10
Q

Combiner Function
_____ is used when there are multiple spill files from mappers

A

A combiner is used when there are multiple spill files from mappers

11
Q

YARN
-Yet Another Resource Negotiator
-Sometimes called the operating system of a ____
-With so many applications running, there was a need for something to coordinate access to the _____
-With YARN, ___ is not limited to MapReduce accessing data

A

- Yet Another Resource Negotiator
- Sometimes called the operating system of a cluster
- With so many applications running, there was a need for something to coordinate access to the system resources
- With YARN, Hadoop is not limited to MapReduce accessing data

12
Q

Multiple Application Engines
Batch programs (____,___)
Interactive SQL (___,___)
Advanced Analytics (___)
Streaming(_____)

A

- Batch programs (MapReduce, Spark)
- Interactive SQL (Hive, Impala)
- Advanced Analytics (Spark)
- Streaming (Spark Streaming)

12
Q

MapReduce Jobtracker performs job ____, ____ and ____
YARN uses ____ resources for these functions

A

- MapReduce Jobtracker performs job scheduling, progress monitoring, and job history
- YARN uses three separate resources for these functions

CHART
MapReduce 1 → YARN
- Jobtracker → Resource Manager, Application Master, Timeline Server
- Tasktracker → Node Manager
- Slot → Container

13
Q

Scalability Improvement
MapReduce can scale to _______
YARN can scale to ______
YARN provides ____
YARN manages a pool of ______

A

- MapReduce can scale to 4K nodes and 40,000 tasks
- YARN can scale to 10K nodes and 100,000 tasks
- YARN provides high availability
- YARN manages a pool of resources versus fixed slots

13
Q

YARN Daemons
-______(RM)
Runs on ___
Global resource ____
Arbitrates system resources between ______
Pluggable scheduler to support _____

-_____(NM)
Runs on ____
Communicates with ___

A

- Resource Manager (RM)
  - Runs on the master node
  - Global resource scheduler
  - Arbitrates system resources between competing applications
  - Pluggable scheduler to support different algorithms
- Node Manager (NM)
  - Runs on worker nodes
  - Communicates with the RM

14
Q

YARN Daemon Model

A

look at graph

15
Q

Applications on YARN
-_____
-Created by the RM ___
-Allocate a certain amount of resources (____,___) on a worker node
-Applications run in ______
-_________
-One per ___
- ___/___ specific
-runs in a ____
-Requests more containers to ____

A

- Containers
  - Created by the RM upon request
  - Allocate a certain amount of resources (memory, CPU) on a worker node
  - Applications run in one or more containers
- Application Master (AM)
  - One per application
  - Framework/application specific
  - Runs in a container
  - Requests more containers to run application tasks

16
Q

FIFO Scheduling
-Simple
-Not suitable for ______
-Large application will ____

A

- Simple
- Not suitable for shared clusters
- A large application will backlog others

17
Q

Capacity Scheduling
-provides ____ by queue
-____ can be aligned with organization function
-______ allows idle resources to be shared

A

- Provides multiple parallel jobs by queue
- Queues can be aligned with organization functions
- Queue elasticity allows idle resources to be shared

18
Q

Fair scheduler
-___ are dynamically balanced for resources
-As successive jobs are scheduled within a queue, that queue shares the _____ of its resource

A

- Queues are dynamically balanced for resources
- As successive jobs are scheduled within a queue, that queue shares an equal allocation of its resources

19
Q

Working with YARN
Hadoop includes ____ major YARN tools for developers
-
-
-
-____ need to be able to
-_____ to run on YARN cluster
-__ and ___ jobs

A

- Hadoop includes three major YARN tools for developers
  - HUE job browser
  - YARN Web UI
  - YARN command line
- Developers need to be able to
  - Submit jobs (applications) to run on YARN cluster
  - Monitor and manage jobs

20
Q

Working with YARN
HUE
-___ status of a job
-___ logs
-kill a ____

YARN Web UI
- ____ is the main entry point
-More detailed view than ___
-No __ or configuration

A

- HUE
  - Monitor status of a job
  - View logs
  - Kill a running job
- YARN Web UI
  - RM UI is the main entry point
  - More detailed view than HUE
  - No controls or configurations

21
Q

Working with the YARN Application
YARN command line
Most tools are for ___ versus developers
yarn <command>
yarn application
- use ____ to see running applications
- use ____ to kill a running application
- use _____ to view YARN logs

A

YARN command line
- Most tools are for administrators versus developers
- yarn <command>
- yarn application
  - Use -list to see running applications
  - Use -kill to kill a running application
- Use yarn logs -applicationId <app-id> to view YARN logs

22
Q

YARN MapReduce

A

See chart

23
Q

MapReduce Job Submission
____ is submitted from a client
receives ____
Ensures an output ______
computes input splits ____
Replicates job ____ file with the ___ across the cluster (10x)

A

- MapReduce job is submitted from a client
- Receives application ID
- Ensures an output specification/directory
- Computes input splits (parallel processing)
- Replicates job JAR file with the ID across the cluster (10x)

24
Q

YARN Resource Manager
-Coordinates allocation of ____ for the cluster
Schedules ___ for MapReduce tasks
-Engages the application _____
-__ the ____ every second for status

A

- Coordinates allocation of computing resources for the cluster
- Schedules containers for MapReduce tasks
- Engages the application master process in a node
- Polls the Application Master every second for status

25
Q

Resource Manager Failure
Single point of ___
____ is achieved with a standby ____
Status is captured in ______ or ____
Node manager is not stored in this _____

A

- Single point of failure
- High availability is achieved with a standby Resource Manager
- Status is captured in HDFS or ZooKeeper
- Node manager is not stored in this recovery file

26
Q

YARN Application Master
-___ the tasks running the ___
-Runs in a ___ with a ____
-keeps track of each ____

A

- Coordinates the tasks running the MapReduce job
- Runs in a container with a MapReduce task
- Keeps track of each node's progress

27
Q

Application Master Writing Splits
____ creates a map task for each input split
creates a number of ____ tasks based on setting
____ tasks to
-under 10 mappers/1 reducer (1 block runs in self)
-Large jobs assign more tasks in ___
Sets up ___

A

- Application Master creates a map task for each input split
- Creates a number of reduce tasks based on setting
- Evaluates/assigns tasks to:
  - Small jobs (under 10 mappers, 1 reducer, 1 block) run in the Application Master itself
  - Large jobs assign more tasks in parallel
- Sets up the job output

28
Q

Application Master Failure
___ heartbeats are sent to the ___
If an update is not received, _____ creates a new _____
AM may fail __ times (default) before the job is failed

A

- Periodic heartbeats are sent to the Resource Manager
- If an update is not received, the Resource Manager creates a new Application Master
- The AM may fail 2 times (default) before the job is failed

29
Q

YARN Node Managers
-___ and ___ compute containers on machines in the cluster

A

Launch and monitor compute containers on machines in the cluster

30
Q

Node Task Assignment
___ requests containers from the ____
Each task is provided ___ of memory and __ processing core
Requests for ___ are fulfilled first with ___
Requests for Reduce tasks are made after ___
Honors running reduce tasks on the ____
If the tasks are not on the ___ the execution is honored

A

- Application Master requests containers from the Resource Manager
- Each task is provided 1024 MB of memory and one processing core
- Requests for Map tasks are fulfilled first with data locality
- Requests for Reduce tasks are made after 5% map progress
- Honors running reduce tasks on the same node as the map
- If the tasks are not on the same rack, the execution is honored

31
Q

Node Task Execution
Application Master contacts the ____
Executes the task with ____
server nodes ____
output committers ____

A

- Application Master contacts the Node Manager
- Executes the task with local node resources
- Server nodes run independently
- Output committers write output files

32
Q

Node Manager/Task failure
_____ fails, node manager updates Application Master
Application Master kills tasks with n______
AM reschedules failed tasks on different ___
Tasks can only fail __ times (configurable) then the whole job is failed
___ (not used) tasks do not count
___ percentage of failures can also be configured

A

- Java virtual machine (JVM) fails, node manager updates the Application Master
- Application Master kills tasks with no update over 10 min
- AM reschedules failed tasks on a different node
- Tasks can only fail 4 times (configurable) before the whole job is failed
- AM-killed (not used) tasks do not count
- Percentage of failures can also be configured

33
Q

Application master updates
overall job status _____
status on each task in the job (same….)
Reading an input ___
Map progress _____
Reduce progress ____
writing an ____

A

- Overall job status (running, successful, completed, failed)
- Status on each task in the job (same ….)
- Reading an input record
- Map progress (% of task completed)
- Reduce progress (estimate % of input processed)
- Writing an output record

34
Q

Job Status updates
-Tasks report back to the ___
-Application Master reports ____
-On completion, Application Master and ___ clean up their working state
-history of the ____ is ___

A

- Tasks report back to the Application Master
- Application Master reports status to the client
- On completion, the Application Master and task containers clean up their working state
- History of the task completion is archived

35
Q

Shuffle
-___ to every reducer has a __
-Shuffle is the process used to __ and __ map output to reducers

A

- Input (map) to every reducer has a key
- Shuffle is the process used to sort and transfer map output to reducers

36
Q

Mapping Output
-100 MB ___ where output is written
-At ___ (80MB) the buffer begins the ___
-Continue in parallel, if buffer fills, map is blocked until spill is ____
-____ are created based on the __ reducer
-In-memory sort by __ is performed within the partition
-____ are merged into a ___ and sorted output file

A

- 100 MB memory buffer where output is written
- At 80% (80 MB) the buffer begins the spill to disk
- Continue in parallel; if the buffer fills, the map is blocked until the spill is completed
- Partitions are created based on the “to” reducer
- In-memory sort by key is performed within the partition
- Spill files are merged into a single partitioned and sorted output file

37
Q

Combiners
Combiner uses the output of each ____
If there are ___ or more spill files a combiner is run

A

- Combiner uses the output of each partition sort
- If there are 3 or more spill files, a combiner is run
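As an illustrative sketch (assuming word-count-style values; not the deck's own code), a combiner is simply a reducer-like function applied locally to a mapper's spill output before the shuffle, so fewer key/value pairs cross the network:

```python
from collections import defaultdict

# Raw map output for one spill: many repeated keys, each with value 1.
spill = [("is", 1), ("hadoop", 1), ("is", 1), ("is", 1), ("mapreduce", 1)]

def combine(pairs):
    """Reducer-like local aggregation run on the map side (a combiner)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

print(combine(spill))   # [('hadoop', 1), ('is', 3), ('mapreduce', 1)] -- fewer pairs to shuffle
```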

38
Q

Copy Phase
Reducer asks Application Master for ___
Reduce tasks start copying their map outputs as soon as they ______
Several ___ can be copying in ___

A

- Reducer asks Application Master for map output
- Reduce tasks start copying their map outputs as soon as they are ready, as the “copy phase”
- Several threads can be copying in parallel

39
Q

Merge Phase (Sort)
-After all map outputs are copied ___
-Aligned by the ___

A

- After all map outputs are copied, sorting begins
- Aligned by the “merge factor”

look at chart

40
Q

MapReduce Process

A

map to shuffle/sort to reduce

look at chart

41
Q

Input Text Data File

A

Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.
A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically, both the
input and the output of the job are stored in a file-system. The framework takes
care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically, the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture
Guide) are running on the same set of nodes. This configuration allows the framework
to effectively schedule tasks on the nodes where data is already present, resulting
in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture
Guide).
Minimally, applications specify the input/output locations and supply map and reduce
functions via implementations of appropriate interfaces and/or abstract-classes.
These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable etc.) and configuration
to the ResourceManager which then assumes the responsibility of distributing the
software/configuration to the workers, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java, MapReduce applications need
not be written in Java.

LOOK AT PIC

42
Q

Mapper Takes Key/Value Pairs

A

(0, Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.)

(246, A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.)

(653, Typically, the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.)

LOOK AT PIC

43
Q

Mappers
-Number of mappers determined by ___
Typically we can use __ for a small text file
-try to split at the ___ staying on one node
-____ will bog down a job with overhead
-Map spills are written to __________
-Map output is not saved, only if _____

A

- Number of mappers determined by input splits
- Typically, we use one mapper for a small text file
- Try to split at the block size, staying on one node
- Small splits will bog down a job with overhead
- Map spills are written to local disk, not HDFS
- Map output is not saved, only re-run if necessary
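As a rough illustration (assuming splits simply follow the HDFS block size, with 128 MB used here as a commonly cited default), the number of map tasks for a file can be estimated like this:

```python
import math

def estimated_map_tasks(file_size_bytes: int, block_size_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough estimate: one input split (and one map task) per HDFS block."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

print(estimated_map_tasks(1 * 1024**3))   # 1 GB file      -> about 8 map tasks
print(estimated_map_tasks(10 * 1024))     # small text file -> 1 map task (one mapper)
```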

44
Q

Mappers Provide Key Value Pairs

A

(Hadoop, 1)
(MapReduce, 1)
(is, 1)
(a, 1)
(software, 1)
(framework, 1)
(for, 1)
(easily, 1)
(writing, 1)
(applications, 1)
(which, 1)
(process, 1)
(vast, 1)
(amounts, 1)
(of, 1)
(data, 1)
(multi-terabyte, 1)
(data-sets, 1)
(in-parallel, 1)
(on, 1)
(large, 1)
(clusters, 1)
(thousands, 1)
(of, 1)
(nodes, 1)
(of, 1)
(commodity, 1)
(hardware, 1)
(in, 1)
(a, 1)
(reliable, 1)
(fault-tolerant, 1)
(manner, 1)

LOOK AT PIC

45
Q

Key Value Pairs Input to Reducers

A

(key, value)
(of,[1,1,1,1])
(a,[1,1,1,1])
(writing,1)
(which,1)
(vast,1)
(thousands,1)
(software,1)
(reliable,1)
(process,1)
(on,1)
(nodes,1)
(multi-terabyte,1)
(manner,1)
(large,1)
(is,1)
(key, value)
(in-parallel,1)
(in,1)
(hardware,1)
(framework,1)
(for,1)
(fault-tolerant,1)
(easily,1)
(data-sets,1)
(data,1)
(commodity,1)
(clusters,1)
(applications,1)
(amounts,1)
(MapReduce,1)
(Hadoop,1)

LOOK AT PIC

46
Q

Reducers
-Number of reducers is determined by the ___
-There is a file written for each _____
-Output of reducers is written to HDFS with ___
-If there are ____, how many files are saved
-Files are usually ___ across the network
-Files will have more than ____, all files will have ____

A

- Number of reducers is determined by the user
- There is a file written for each reducer/partition
- Output of reducers is written to HDFS with 3 replications
- If there are two partitions, how many files are saved?
- Files are usually transferred across the network
- Files will have more than one key, all files will have distinct keys

47
Q

Key value pairs after reduce

A

(of,4)
(a,4)
(writing,1)
(which,1)
(vast,1)
(thousands,1)
(software,1)
(reliable,1)
(process,1)
(on,1)
(nodes,1)
(multi-terabyte,1)
(manner,1)
(large,1)
(is,1)
(in-parallel,1)
(in,1)
(hardware,1)
(framework,1)
(for,1)
(fault-tolerant,1)
(easily,1)
(data-sets,1)
(data,1)
(commodity,1)
(clusters,1)
(applications,1)
(amounts,1)
(MapReduce,1)
(Hadoop,1)

LOOK AT PIC

48
Q

MapReduce Flow

A

map–map–map-map
|
shuffle/sort
|
reduce–reduce–reduce -reduce
LOOK AT PIC

49
Q

Summary

A

- MapReduce provides an efficient way to process large data sets in parallel
- MapReduce is the source code for all big data applications
- Mapping identifies key/value pairs
- Reducing identifies specific output for a solution
- Mapping writes to multiple nodes as close as possible
- Mappers may provide duplicate files to each Reducer
- YARN provides a more scalable, efficient approach to MapReduce
- Resource Manager coordinates allocation of computing resources for the cluster
- Application Master coordinates the tasks running the MapReduce job
- Node Managers launch and monitor compute containers on machines in the cluster
- Shuffle is the process used to sort and transfer map output to reducers