MapReduce Flashcards
MapReduce
A MapReduce __ is a unit of work that the client wants to be performed
Input data and the MapReduce program are deployed to ____ for execution
The job is broken down into ___
____ is a process of identifying _____
____ is a process of iterative ___ to identify desired values
A MapReduce job is a unit of work that the client wants to
be performed
Input data and the MapReduce program are deployed to
server nodes for execution
The job is broken down into tasks (map and reduce)
Mapping is a process of identifying key/value pairs
Reducing is a process of iterative sorting to identify
desired values
Key Value Mapping
___ function extracts the year and temp (celsius)
Year and temp are the ___
Mapper function extracts the year and temperature (celsius)
Year and temperature values are the output
Output
__ are output and sorted
Values are output and sorted
Reducing
___ iterates through the list and selects the maximum temp values
Reducer
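A minimal Java sketch of the cards above, assuming each input line holds a comma-separated year and temperature (the real record format is not given here, and the class names are illustrative): the mapper emits (year, temp) pairs and the reducer keeps the maximum per year.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  // Mapper: extracts the year and temperature from each line
  public static class TempMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",");   // assumed "year,temperature" layout
      context.write(new Text(parts[0]),                              // key: year
          new IntWritable(Integer.parseInt(parts[1].trim())));       // value: temperature
    }
  }

  // Reducer: iterates through the values for a year and keeps the maximum
  public static class MaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable t : temps) {
        max = Math.max(max, t.get());
      }
      context.write(year, new IntWritable(max));
    }
  }
}
```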
MapReduce Process Flow
Map extracts ____
Shuffle collects ___
Reduce locates ____
Unix command equivalent function is ____
Map extracts values
Shuffle collects values
Reduce locates specific output
Unix command equivalent functions are shown under the flow diagram
Map Tasks
Input __
MapReduce ____
Configuration ____
Map ___
One map task is assigned to each ___
Input data
MapReduce program
Configuration information
Map tasks
One map task is assigned to each split
Map Selection of Nodes
Tries to run the map tasks on a node ____ where the data resides
If full looks for a ______ on the same server rack
If busy, an ___ off-rack node is supplied
Tries to run the map tasks on a node where the data resides
If full, looks for a free map node on the same server rack
If busy, an off-rack node is supplied
Reduce Tasks
Input is the ____ from all mappers
Output is ______ on the node where reduce is running
Input is the output from all mappers
Output is transferred and merged on the node where reduce is
running
The Shuffle
You can choose the ____ for a job
When multiple, map tasks ____, allocating one partition for each ____
You can choose the number of reduce tasks for a job
When multiple, map tasks partition output, allocating one
partition for each reduce task
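A sketch of how the number of reduce tasks and the map-output partitioning can be set on a job. The YearPartitioner class here is only illustrative (Hadoop's default is HashPartitioner); the key/value types match the max-temperature sketch above.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitioningExample {

  // A custom Partitioner decides which reduce partition each map output record lands in
  public static class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      // hash the key and fold it into the range [0, numReduceTasks)
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

  static void configure(Job job) {
    job.setNumReduceTasks(4);                       // one partition per reduce task
    job.setPartitionerClass(YearPartitioner.class);
  }
}
```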
Combiner Function
_____ is used when there are multiple spill files from mappers
A combiner is used when there are multiple spill files from mappers
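A sketch of wiring in a combiner on the job. It reuses the reducer from the earlier max-temperature sketch, which works because taking a maximum is associative and commutative; whether that is appropriate depends on the actual reduce function.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
  static void configure(Job job) {
    // MaxTemperature.MaxReducer is the reducer sketched earlier in these notes
    job.setCombinerClass(MaxTemperature.MaxReducer.class);
    job.setReducerClass(MaxTemperature.MaxReducer.class);
  }
}
```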
YARN
-Yet Another Resource Negotiator
-Sometimes called the operating system of a ____
-With so many applications running, there was a need for something to coordinate access to the _____
-With YARN, ___ is not limited to MapReduce accessing data
Yet Another Resource Negotiator
Sometimes called the operating system of a cluster
With so many applications running, there was a need for
something to coordinate access to the system resources
With YARN, Hadoop is not limited to MapReduce accessing
data
Multiple Application Engines
Batch programs (____,___)
Interactive SQL (___,___)
Advanced Analytics (___)
Streaming(_____)
Batch programs (MapReduce, Spark)
Interactive SQL (Hive, Impala)
Advanced Analytics (Spark)
Streaming (Spark Streaming)
MapReduce Jobtracker performs job ____, ____ and ____
YARN uses ____ resources for these functions
MapReduce Jobtracker performs job scheduling, progress
monitoring, and job history
YARN uses three separate resources for these functions
CHART
MapReduce 1 (YARN equivalent)
-Jobtracker (Resource Manager, Application Master, Timeline Server)
-Tasktracker (Node Manager)
-Slot (Container)
Scalability Improvement
MapReduce can scale to _______
YARN can scale to ______
YARN provides ____
YARN manages a pool of ______
MapReduce can scale to 4K nodes and 40,000 tasks
YARN can scale to 10K nodes and 100,000 tasks
YARN provides high availability
YARN manages a pool of resources versus fixed slots
YARN Daemons
-______(RM)
Runs on ___
Global resource ____
Arbitrates system resources between ______
Pluggable scheduler to support _____
-_____(NM)
Runs on ____
Communicates with ___
Resource Manager (RM)
Runs on master node
Global resource scheduler
Arbitrates system resources between competing applications
Pluggable scheduler to support different algorithms
Node Manager (NM)
Runs on worker nodes
Communicates with the RM
YARN Daemon Model
look at graph
Applications on YARN
-_____
-Created by the RM ___
-Allocate a certain amount of resources (____,___) on a worker node
-Applications run in ______
-_________
-One per ___
- ___/___ specific
-runs in a ____
-Requests more containers to ____
Containers
Created by the RM upon request
Allocate a certain amount of resources (memory, CPU) on a
worker node
Applications run in one or more containers
Application Master (AM)
One per application
Framework/application specific
Runs in a container
Requests more containers to run application tasks
FIFO Scheduling
-Simple
-Not suitable for ______
-A large application will ____
Simple
Not suitable for shared clusters
A large application will backlog others
Capacity Scheduling
-provides ____ by queue
-____ can be aligned with organization functions
-______ allows idle resources to be shared
Provides multiple parallel jobs by queue
Queues can be aligned with organization functions
Queue elasticity allows idle resources to be shared
Fair Scheduler
-___ are dynamically balanced for resources
-As successive jobs are scheduled within a queue, that queue shares the _____ of its resources
Queues are dynamically balanced for resources
As successive jobs are scheduled within a queue, that queue shares an equal allocation of its resources
Working with YARN
Hadoop includes ____ major YARN tools for developers
-
-
-
-____ need to be able to
-_____ to run on YARN cluster
-__ and ___ jobs
Hadoop includes three major YARN tools for developers
HUE job browser
YARN Web UI
YARN command line
Developers need to be able to
Submit jobs (applications) to run on YARN cluster
Monitor and manage jobs
Working with YARN
HUE
-___ status of a job
-___ logs
-kill a ____
YARN Web UI
- ____ is the main entry point
-More detailed view than ___
-No __ or configuration
HUE
Monitor status of job
View logs
Kill a running job
YARN Web UI
RM UI is the main entry point
More detailed view than HUE
No controls or configurations
Working with the YARN Application
YARN command line
Most tools are for ___ versus developers
yarn <command>
yarn application
- use ____ to see running applications
- use ____ to kill a running application
-use _____ to view YARN logs
YARN command line
Most tools are for administrators versus developers
yarn <command>
yarn application
Use -list to see running applications
Use -kill to kill a running application
Use yarn logs -applicationId <app-id> to view YARN logs
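For completeness, a hedged sketch of doing the same listing and killing programmatically with the YarnClient API instead of the CLI commands above; it assumes yarn-site.xml is on the classpath, and the class name is illustrative.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListApplications {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());   // reads yarn-site.xml from the classpath
    yarn.start();
    try {
      // Roughly what "yarn application -list" shows
      List<ApplicationReport> apps = yarn.getApplications();
      for (ApplicationReport app : apps) {
        System.out.println(app.getApplicationId() + "  " + app.getName()
            + "  " + app.getYarnApplicationState());
      }
      // yarn.killApplication(appId) is the programmatic counterpart of "-kill"
    } finally {
      yarn.stop();
    }
  }
}
```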
YARN MapReduce
See chart
MapReduce Job Submission
____ is submitted from a client
receives ____
Ensures an output ______
computes input splits ____
Replicates job ____ file with the ___ across the cluster (10x)
MapReduce job is submitted from a client
Receives application ID
Ensures an output specification/directory
Computes input splits (for parallel processing)
Replicates the job JAR file with the application ID across the cluster (10x)
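A minimal client-side submission sketch using the standard Job API. The mapper and reducer classes are the illustrative ones from the earlier sketches, and the input/output paths are taken from command-line arguments as examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(SubmitJob.class);               // the job JAR replicated across the cluster
    job.setMapperClass(MaxTemperature.TempMapper.class);
    job.setReducerClass(MaxTemperature.MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits are computed from this
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // client receives the application ID here
  }
}
```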
YARN Resource Manager
-Coordinates allocation of ____ for the cluster
-Schedules ___ for MapReduce tasks
-Engages the application _____
-__ the ____ every second for status
Coordinates allocation of computing resources for the cluster
Schedules containers for MapReduce tasks
Engages the application master process in a node
Polls the Application Master every second for status
Resource Manager Failure
Single point of ___
____ is achieved with a standby ____
Status is captured in ______ or ____
Node manager is not stored in this _____
Single point of failure
High availability is achieved with a standby Resource Manager
Status is captured in HDFS or Zookeeper
Node manager is not stored in this recovery file
YARN Application Master
-___ the tasks running the ___
-Runs in a ___ with a ____
-Keeps track of each ____
Coordinates the tasks running the MapReduce job
Runs in a container with a MapReduce Task
Keeps track of each node's progress
Application Master Writing Splits
____ creates a map task for each input split
creates a number of ____ tasks based on setting
____ tasks to
-under 10 mappers/1 reducer/1 block runs in the AM's own JVM
-Large jobs assign more tasks in ___
Sets up ___
Application Master creates a map task for each input split
Creates a number of reduce tasks based on setting
Evaluates/assigns tasks to
- Small jobs (under 10 mappers, 1 reducer, input within 1 block) run in the AM's own JVM
- Large jobs assign more tasks in parallel
Sets up the job output
Application Master Failure
___ heartbeats are sent to the ___
if an update is not received, _____ creates a new _____
AM fail __ times (default) before the job is failed
Periodic heartbeats are sent to the Resource Manager
If an update is not received, Resource Manager creates a new
Application Master
AM may fail 2 times (default) before the job is failed
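A sketch of the retry setting behind this card. The property name mapreduce.am.max-attempts is the standard Hadoop one as I understand it (default 2), and the value shown is only an example.

```java
import org.apache.hadoop.conf.Configuration;

public class AmRetryConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.am.max-attempts", 2);   // default 2, as on the card above
    // Note: yarn.resourcemanager.am.max-attempts on the cluster also caps this value
    return conf;
  }
}
```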
YARN Node Managers
-___ and ___ compute containers on machines in the cluster
Launch and monitor compute containers on machines in the
cluster
Node Task Assignment
___ requests containers from the ____
Each task is provided ___ of memory and __ processing core
Requests for ___ are fulfilled first with ___
Requests for Reduce tasks are made after ___
Honors running reduce tasks on the ____
If the tasks are not on the ___, the execution is still honored
Application Master requests containers from the Resource
Manager
Each task is provided 1024 MB of memory and one processing
core
Requests for Map tasks are fulfilled first with data locality
Requests for Reduce tasks are made after 5% map progress
Honors running reduce tasks on the same node as the map
If the tasks are not on the same rack, the execution is still honored
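A sketch of the container-sizing and reduce slow-start settings behind this card; the property names are the standard Hadoop ones as I understand them, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;

public class TaskResourceConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);      // default 1024 MB per map task
    conf.setInt("mapreduce.reduce.memory.mb", 3072);   // default 1024 MB per reduce task
    conf.setInt("mapreduce.map.cpu.vcores", 1);        // default 1 processing core
    conf.setInt("mapreduce.reduce.cpu.vcores", 1);     // default 1 processing core
    // Reduce requests start once this fraction of maps has completed (default 0.05 = 5%)
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
    return conf;
  }
}
```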
Node Task Execution
Application Master contacts the ____
Executes the task with ____
Server nodes ____
Output committers ____
Application Master contacts the Node Manager
Executes the task with local node resources
Server nodes run independently
Output committers write output files
Node Manager/Task failure
_____ fails, node manager updates Application Master
Application Master kills tasks with no ______
AM reschedules failed tasks on a different ___
Tasks can only fail __ times (configurable) before the whole job is failed
___ (not used) tasks do not count
___ percentage of failures can also be configured
If the Java virtual machine (JVM) fails, the Node Manager updates the Application Master
Application Master kills tasks with no update over 10 min
AM reschedules failed tasks on different node
Tasks can only fail 4 times (configurable) before the whole job is failed
AM killed (not used) tasks do not count
Percentage of failures can also be configured
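A sketch of the retry and failure-tolerance settings behind this card; the property names are the standard Hadoop ones as I understand them, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;

public class TaskRetryConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 4);        // a map task may fail 4 times (default)
    conf.setInt("mapreduce.reduce.maxattempts", 4);     // same default for reduce tasks
    conf.setLong("mapreduce.task.timeout", 600_000L);   // kill a task with no update for 10 min
    // Optionally tolerate a percentage of failed tasks instead of failing the whole job
    conf.setInt("mapreduce.map.failures.maxpercent", 5);
    conf.setInt("mapreduce.reduce.failures.maxpercent", 5);
    return conf;
  }
}
```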
Application master updates
overall job status _____
status on each task in the job (same….)
Reading an input ___
Map progress _____
Reduce progress ____
writing an ____
Overall job status (running, successful, completed, failed)
Status on each task in the job (same ….)
Reading an input record
Map progress (% of task completed)
Reduce progress (estimate % of input processed)
Writing an output record
Job Status updates
-Tasks report back to the ___
-Application Master reports ____
-On completion, the Application Master and ___ clean up their working state
-history of the ____ is ___
Tasks report back to the Application Master
Application Master reports status to the client
On completion, the Application Master and task containers clean up their working state
History of the task completion is archived
Shuffle
-___ to every reducer has a __
-Shuffle is the process used to __ and __ map output to reducers
Input (map) to every reducer has a key
Shuffle is the process used to sort and transfer map output to
reducers
Mapping Output
-100 MB ___ where output is written
-At ___ (80MB) the buffer begins the ___
-Continues in parallel; if the buffer fills, the map is blocked until the spill is ____
-____ are created based on the __ reducer
-In-memory sort by __ is performed within the partition
-____ are merged into a ___ and sorted output file
100 MB memory buffer where output is written
At 80% (80 MB) the buffer begins the spill to disk
Spilling continues in parallel; if the buffer fills, the map is blocked until the spill is completed
Partitions are created based on the destination ("to") reducer
In-memory sort by key is performed within the partition
Spill files are merged into a single partitioned and sorted
output file
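A sketch of the buffer and spill settings described above; the property names are what I understand to be the standard Hadoop ones, and the values shown are the defaults mentioned on these cards.

```java
import org.apache.hadoop.conf.Configuration;

public class SpillConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 100);             // 100 MB in-memory output buffer
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill to disk at 80% full
    conf.setInt("mapreduce.task.io.sort.factor", 10);          // streams merged at once ("merge factor")
    conf.setInt("mapreduce.map.combine.minspills", 3);         // combiner runs with 3+ spill files
    return conf;
  }
}
```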
Combiners
Combiner uses the output of each ____
If there are ___ or more spill files, a combiner is run
Combiner uses the output of each partition sort
If there are 3 or more spill files a combiner is run
Copy Phase
Reducer asks the Application Master for ___
Reduce tasks start copying their map outputs as soon as they ______
Several ___ can be copying in ___
Reducer asks Application Master for map output
Reduce tasks start copying their map outputs as soon as they
are ready as the “copy phase”
Several threads can be copying in parallel
Merge Phase (Sort)
-After all map outputs are copied ___
-Aligned by the ___
After all map outputs are copied, sorting begins
Aligned by the “merge factor”
look at chart
MapReduce Process
map to shuffle/sort to reduce
look at chart
Input Text Data File
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.
A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically, both the
input and the output of the job are stored in a file-system. The framework takes
care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically, the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture
Guide) are running on the same set of nodes. This configuration allows the framework
to effectively schedule tasks on the nodes where data is already present, resulting
in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture
Guide).
Minimally, applications specify the input/output locations and supply map and reduce
functions via implementations of appropriate interfaces and/or abstract-classes.
These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable etc.) and configuration
to the ResourceManager which then assumes the responsibility of distributing the
software/configuration to the workers, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java, MapReduce applications need
not be written in Java.
LOOK AT PIC
Mapper Takes Key/Value Pairs
(0,Hadoop MapReduce is a software framework for easily writing
applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands
of nodes) of commodity hardware in a reliable, fault-tolerant
manner.)
(246,A MapReduce job usually splits the input data-set into
independent chunks which are processed by the map tasks in a
completely parallel manner. The framework sorts the outputs of
the maps, which are then input to the reduce tasks. Typically,
both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks,
monitoring them and re-executes the failed tasks.)
(653, Typically, the compute nodes and the storage nodes are
the same, that is, the MapReduce framework and the Hadoop
Distributed File System (see HDFS Architecture Guide) are
running on the same set of nodes. This configuration allows the
framework to effectively schedule tasks on the nodes where data
is already present, resulting in very high aggregate bandwidth
across the cluster.)
LOOK AT PIC
Mappers
-number of mappers determined by ___
Typically we can use __ for a small text file
-try to split at the ___ staying on one node
-____ will bog down a job with overhead
-Map spills are written to __________
-Map output is not saved, only ___ if _____
Number of mappers determined by input splits
Typically, we use one mapper for a small text file
Try to split at the block size staying on one node
Small splits will bog down a job with overhead
Map spills are written to local disk, not HDFS
Map output is not saved; the map is only re-run if necessary
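A minimal word-count mapper sketch that would emit pairs like the list that follows; tokenization is simplified to whitespace splitting and the class name is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The key is the byte offset of the line (0, 246, 653, ... above);
    // the value is the line of text. Emit (word, 1) for every token.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}
```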
Mappers Provide Key Value Pairs
(Hadoop, 1)
(MapReduce, 1)
(is, 1)
(a, 1)
(software, 1)
(framework, 1)
(for, 1)
(easily, 1)
(writing, 1)
(applications, 1)
(which, 1)
(process, 1)
(vast, 1)
(amounts, 1)
(of, 1)
(data, 1)
(multi-terabyte, 1)
(data-sets, 1)
(in-parallel, 1)
(on, 1)
(large, 1)
(clusters, 1)
(thousands, 1)
(of, 1)
(nodes, 1)
(of, 1)
(commodity, 1)
(hardware, 1)
(in, 1)
(a, 1)
(reliable, 1)
(fault-tolerant, 1)
(manner, 1)
LOOK AT PIC
Key Value Pairs Input to Reducers
(key, value)
(of,[1,1,1,1])
(a,[1,1,1,1])
(writing,1)
(which,1)
(vast,1)
(thousands,1)
(software,1)
(reliable,1)
(process,1)
(on,1)
(nodes,1)
(multi-terabyte,1)
(manner,1)
(large,1)
(is,1)
(key, value)
(in-parallel,1)
(in,1)
(hardware,1)
(framework,1)
(for,1)
(fault-tolerant,1)
(easily,1)
(data-sets,1)
(data,1)
(commodity,1)
(clusters,1)
(applications,1)
(amounts,1)
(MapReduce,1)
(Hadoop,1)
LOOK AT PIC
Reducers
-Number of reducers is determined by the ___
-There is a file written for each _____
-Output of reducers is written to HDFS with ___
-If there are ____, how many files are saved
-Files are usually ___ across the network
-Files will have more than ____, all files will have ____
Number of reducers is determined by the user
There is a file written for each reducer/partition
Output of reducers is written to HDFS with 3 replications
If there are two partitions, how many files are saved?
Files are usually transferred across the network
Files will have more than one key, all files will have distinct
keys
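The matching word-count reducer sketch: it sums the 1s for each key, turning, for example, (of,[1,1,1,1]) into (of,4) as in the list that follows. The class name is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    // Output is written to HDFS, one file per reducer/partition
    context.write(word, new IntWritable(sum));
  }
}
```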
Key value pairs after reduce
(of,4)
(a,4)
(writing,1)
(which,1)
(vast,1)
(thousands,1)
(software,1)
(reliable,1)
(process,1)
(on,1)
(nodes,1)
(multi-terabyte,1)
(manner,1)
(large,1)
(is,1)
(in-parallel,1)
(in,1)
(hardware,1)
(framework,1)
(for,1)
(fault-tolerant,1)
(easily,1)
(data-sets,1)
(data,1)
(commodity,1)
(clusters,1)
(applications,1)
(amounts,1)
(MapReduce,1)
(Hadoop,1)
LOOK AT PIC
MapReduce Flow
map–map–map-map
|
shuffle/sort
|
reduce–reduce–reduce -reduce
LOOK AT PIC
Summary
MapReduce provides an efficient way to process large data sets
in parallel
MapReduce is the foundation for many big data applications
Mapping identifies key/value pairs
Reducing identifies specific output for a solution
Mapping writes to multiple nodes as close as possible
Mappers may provide duplicate files to each Reducer
YARN provides a more scalable efficient approach to
MapReduce
Resource Manager coordinates allocation of computing
resources for the cluster
Application Master coordinates the tasks running the
MapReduce job
Node Managers launch and monitor compute containers on
machines in the cluster
Shuffle is the process used to sort and transfer map output to
reducer