MapReduce Flashcards
MapReduce
A MapReduce __ is a unit of work that the client wants to be performed
Input data and the MapReduce program are deployed to ____ for execution
The job is broken down into ___
____ is a process of identifying _____
____ is a process of iterative ___ to identify desired values
A MapReduce job is a unit of work that the client wants to
be performed
Input data and the MapReduce program are deployed to
server nodes for execution
The job is broken down into tasks (map and reduce)
Mapping is a process of identifying key/value pairs
Reducing is a process of iterative sorting to identify
desired values
Key Value Mapping
___ function extracts the year and temperature (Celsius)
Year and temperature values are the ___
Mapper function extracts the year and temperature (Celsius)
Year and temperature values are the output
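A minimal Python sketch of the mapper on this card. The input layout (one `year,temperature` record per line) and the name `map_temperature` are assumptions made for illustration; the card does not say how the records are formatted.

```python
from typing import Iterator, Tuple


def map_temperature(line: str) -> Iterator[Tuple[str, int]]:
    """Emit one (year, temperature) pair for a single input record.

    Assumes a hypothetical 'year,temperature' line format, with the
    temperature given in whole degrees Celsius.
    """
    fields = line.strip().split(",")
    if len(fields) != 2:
        return  # skip malformed records instead of failing the task
    year, temp = fields
    try:
        yield year, int(temp)
    except ValueError:
        return  # non-numeric temperature: emit nothing


records = ["1949,111", "1950,22", "1949,78", "bad line"]
pairs = [pair for line in records for pair in map_temperature(line)]
print(pairs)
```

The year is the key and the temperature is the value, so every later stage can group readings by year.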
Output
__ are output and sorted
Values are output and sorted
Reducing
___ iterates through the list and selects the maximum temperature values
Reducer
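A small Python sketch of such a reducer, assuming it receives one year together with all temperatures recorded for that year. The function name `reduce_max_temperature` is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def reduce_max_temperature(year: str, temps: Iterable[int]) -> Tuple[str, int]:
    """Iterate through all temperatures for one year and keep the maximum."""
    maximum: Optional[int] = None
    for temp in temps:
        if maximum is None or temp > maximum:
            maximum = temp
    if maximum is None:
        raise ValueError(f"no temperature values for year {year}")
    return year, maximum


print(reduce_max_temperature("1949", [111, 78]))
```

The explicit loop mirrors the "iterates through the list" wording; in practice `max(temps)` would do the same job.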
MapReduce Process Flow
Map extracts ____
Shuffle collects ___
Reduce locates ____
Unix command equivalent function is ____
Map extracts values
Shuffle collects values
Reduce locates specific output
Unix command equivalent function is under the flow
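The three stages can be sketched in plain Python as follows. This is a single-process illustration, not Hadoop code: `run_flow` and `parse` are hypothetical names, and the shuffle is approximated by a sort followed by grouping, much as a `sort` step would sit between a map script and a reduce script in a Unix pipeline.

```python
from itertools import groupby
from operator import itemgetter


def run_flow(lines, map_fn, reduce_fn):
    # Map: extract (key, value) pairs from every input record.
    mapped = [pair for line in lines for pair in map_fn(line)]
    # Shuffle: sort by key so all values for one key sit together, then group.
    mapped.sort(key=itemgetter(0))
    grouped = (
        (key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    )
    # Reduce: collapse each group of values to the specific output wanted.
    return [reduce_fn(key, values) for key, values in grouped]


def parse(line):
    # Assumed 'year,temperature' record layout.
    year, temp = line.split(",")
    return [(year, int(temp))]


result = run_flow(
    ["1950,0", "1949,111", "1950,22", "1949,78"],
    parse,
    lambda year, temps: (year, max(temps)),
)
print(result)
```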
Map Tasks
Input __
MapReduce ____
Configuration ____
Map ___
One map task is assigned to each ___
Input data
MapReduce program
Configuration information
Map tasks
One map task is assigned to each split
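As a rough illustration of "one map task per split", the number of map tasks can be estimated from the input size. The 128 MB split size is an assumption (a common HDFS block size), and `count_map_tasks` is a hypothetical helper; a real input format applies further rules when it computes splits.

```python
def count_map_tasks(file_size_bytes: int,
                    split_size_bytes: int = 128 * 1024 * 1024) -> int:
    """Estimate map tasks as one task per input split (ceiling division)."""
    if file_size_bytes < 0 or split_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and split size positive")
    # Integer ceiling division avoids floating-point rounding on large files.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


one_gib = 1024 * 1024 * 1024
print(count_map_tasks(one_gib))
```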
Map Selection of Nodes
Tries to run the map tasks on a node ____ where the data resides
If full, looks for a ______ on the same server rack
If busy, an ___ node is supplied
Tries to run the map tasks on a node where the data resides
If full, looks for a free map node on the same server rack
If busy, an off-rack node is supplied
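The preference order on this card can be sketched as a three-step lookup. The function `choose_map_node`, the node names, and the rack map are all hypothetical; a real scheduler weighs more factors than this.

```python
from typing import Dict, Optional, Set, Tuple


def choose_map_node(data_nodes: Set[str],
                    free_nodes: Set[str],
                    rack_of: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Pick a node for a map task: data-local, then rack-local, then off-rack."""
    # 1. Prefer a free node that already holds the input split.
    local = sorted(data_nodes & free_nodes)
    if local:
        return local[0], "data-local"
    # 2. Otherwise take a free node on the same rack as the data.
    data_racks = {rack_of[node] for node in data_nodes}
    same_rack = sorted(n for n in free_nodes if rack_of[n] in data_racks)
    if same_rack:
        return same_rack[0], "rack-local"
    # 3. Fall back to any free node on another rack.
    remaining = sorted(free_nodes)
    if remaining:
        return remaining[0], "off-rack"
    return None  # nothing free right now


racks = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b"}
print(choose_map_node({"n1"}, {"n2", "n3"}, racks))
```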
Reduce Tasks
Input is the ____ from all mappers
Output is ______ on the node where reduce is running
Input is the output from all mappers
Output is transferred and merged on the node where reduce is
running
The Shuffle
You can choose the ____ for a job
When there are multiple reduce tasks, map tasks ____, allocating one partition for each ____
You can choose the number of reduce tasks for a job
When there are multiple reduce tasks, map tasks partition their
output, allocating one partition for each reduce task
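A sketch of how map output might be partitioned, assuming a hash-modulo rule (the same general idea as a hash partitioner). The names `partition_for` and `partition_map_output` are hypothetical, and `zlib.crc32` stands in for whatever hash a real framework uses, chosen here only because it is stable across Python runs.

```python
import zlib
from typing import Dict, Iterable, List, Tuple


def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Map a key to a partition index in [0, num_reduce_tasks)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_map_output(pairs: Iterable[Tuple[str, int]],
                         num_reduce_tasks: int) -> Dict[int, List[Tuple[str, int]]]:
    """Split one mapper's output into one partition per reduce task."""
    partitions: Dict[int, List[Tuple[str, int]]] = {
        index: [] for index in range(num_reduce_tasks)
    }
    for key, value in pairs:
        partitions[partition_for(key, num_reduce_tasks)].append((key, value))
    return partitions


output = [("1949", 111), ("1950", 22), ("1949", 78), ("1951", 5)]
parts = partition_map_output(output, num_reduce_tasks=2)
print(parts)
```

Because the partition depends only on the key, every value for a given year reaches the same reduce task.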
Combiner Function
A _____ is used when there are multiple spill files from mappers
A combiner is used when there are multiple spill files from mappers
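For the maximum-temperature job, a combiner could look like the sketch below: it reduces each mapper's output locally so that less data crosses the network. `combine_max` is a hypothetical name. This works because taking a maximum gives the same answer whether it is applied once or in stages; a mean, for example, could not be combined this way.

```python
from typing import Dict, Iterable, List, Tuple


def combine_max(map_output: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep only the largest temperature per year from one mapper's output."""
    local_max: Dict[str, int] = {}
    for year, temp in map_output:
        if year not in local_max or temp > local_max[year]:
            local_max[year] = temp
    return sorted(local_max.items())


mapper_1 = [("1950", 0), ("1950", 20), ("1950", 10)]
mapper_2 = [("1950", 25), ("1950", 15)]

# Each mapper's output shrinks to one pair before it is sent to the reducer.
combined = combine_max(mapper_1) + combine_max(mapper_2)
final = max(temp for _, temp in combined)
print(combined, final)
```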
YARN
-Yet Another Resource Negotiator
-Sometimes called the operating system of a ____
-With so many applications running, there was a need for something to coordinate access to the _____
-With YARN, ___ is not limited to MapReduce accessing data
Yet Another Resource Negotiator
Sometimes called the operating system of a cluster
With so many applications running, there was a need for
something to coordinate access to the system resources
With YARN, Hadoop is not limited to MapReduce accessing
data
Multiple Application Engines
Batch programs (____,___)
Interactive SQL (___,___)
Advanced Analytics (___)
Streaming(_____)
Batch programs (MapReduce, Spark)
Interactive SQL (Hive, Impala)
Advanced Analytics (Spark)
Streaming (Spark Streaming)
MapReduce JobTracker performs job ____, ____, and ____
YARN uses ____ resources for these functions
MapReduce JobTracker performs job scheduling, progress
monitoring, and job history
YARN uses three separate resources for these functions
CHART
MapReduce 1 (YARN equivalent)
-JobTracker (Resource Manager, Application Master, Timeline Server)
-TaskTracker (Node Manager)
-Slot (Container)
Scalability Improvement
MapReduce can scale to _______
YARN can scale to ______
YARN provides ____
YARN manages a pool of ______
MapReduce can scale to 4K nodes and 40,000 tasks
YARN can scale to 10K nodes and 100,000 tasks
YARN provides high availability
YARN manages a pool of resources versus fixed slots
YARN Daemons
-______(RM)
Runs on ___
Global resource ____
Arbitrates system resources between ______
Pluggable scheduler to support _____
-_____(NM)
Runs on ____
Communicates with ___
Resource Manager (RM)
Runs on master node
Global resource scheduler
Arbitrates system resources between competing applications
Pluggable scheduler to support different algorithms
Node Manager (NM)
Runs on worker nodes
Communicates with RM
YARN Daemon Model
look at graph
Applications on YARN
-_____
-Created by the RM ___
-Allocate a certain amount of resources (____,___) on a worker node
-Applications run in ______
-_________
-One per ___
-___/___ specific
-Runs in a ____
-Requests more containers to ____
Containers
Created by the RM upon request
Allocate a certain amount of resources (memory, CPU) on a
worker node
Applications run in one or more containers
Application Master (AM)
One per application
Framework/application specific
Runs in a container
Requests more containers to run application tasks
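A toy model of these two ideas, assuming a first-fit placement rule. The classes `WorkerNode` and `Container` and the function `allocate_container` are hypothetical and do not correspond to the YARN API; they only show containers being carved out of a pool of memory and CPU rather than fixed slots.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerNode:
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass(frozen=True)
class Container:
    node: str
    memory_mb: int
    vcores: int


def allocate_container(nodes: List[WorkerNode],
                       memory_mb: int,
                       vcores: int) -> Optional[Container]:
    """Grant a container on the first worker with enough free resources."""
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return Container(node.name, memory_mb, vcores)
    return None  # request cannot be satisfied right now


nodes = [WorkerNode("worker-1", 4096, 2), WorkerNode("worker-2", 8192, 4)]
# The Application Master itself runs in a container...
am_container = allocate_container(nodes, memory_mb=1024, vcores=1)
# ...and then requests more containers to run the application's tasks.
task_containers = [allocate_container(nodes, 2048, 1) for _ in range(3)]
print(am_container, [c.node for c in task_containers if c])
```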
FIFO Scheduling
-Simple
-Not suitable for ______
-Large applications will ____
Simple
Not suitable for shared clusters
Large applications will backlog others
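The backlog effect can be seen in a tiny simulation. It assumes each application takes the whole cluster while it runs and that run times are known; `fifo_finish_times` and the time units are hypothetical.

```python
from typing import Dict, List, Tuple


def fifo_finish_times(apps: List[Tuple[str, int]]) -> Dict[str, int]:
    """Return the finish time of each app when run strictly in arrival order."""
    clock = 0
    finish: Dict[str, int] = {}
    for name, runtime in apps:
        clock += runtime  # the next app cannot start until this one ends
        finish[name] = clock
    return finish


times = fifo_finish_times([("large", 100), ("small-1", 2), ("small-2", 3)])
print(times)
```

The two small applications need only 5 time units between them but finish after 100, which is why FIFO is a poor fit for shared clusters.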
Capacity Scheduling
-Provides ____ by queue
-____ can be aligned with organization functions
-______ allows idle resources to be shared
Provides multiple parallel jobs by queue
Queues can be aligned with organization functions
Queue elasticity allows idle resources to be shared
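Queue elasticity can be sketched with percentages of cluster capacity. This is a simplified model under stated assumptions: guaranteed capacities sum to 100, idle capacity is handed out in queue order, and there is no per-queue upper limit (a real capacity scheduler can cap how far a queue grows). `capacity_shares` is a hypothetical helper.

```python
from typing import Dict


def capacity_shares(capacity: Dict[str, int],
                    demand: Dict[str, int]) -> Dict[str, int]:
    """Allocate cluster percentage per queue, lending out idle capacity."""
    # Each queue first receives what it asks for, up to its guarantee.
    alloc = {q: min(demand[q], capacity[q]) for q in capacity}
    idle = 100 - sum(alloc.values())
    # Elasticity: queues that still want more may borrow what is idle.
    for q in capacity:
        extra = min(idle, demand[q] - alloc[q])
        alloc[q] += extra
        idle -= extra
    return alloc


shares = capacity_shares({"prod": 60, "dev": 40}, {"prod": 20, "dev": 90})
print(shares)
```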
Fair Scheduling
-___ are dynamically balanced for resources
-As successive jobs are scheduled within a queue, those jobs share an _____ of the queue's resources
Queues are dynamically balanced for resources
As successive jobs are scheduled within a queue, those jobs
share an equal allocation of the queue's resources
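A minimal sketch of this balancing, assuming equal weights everywhere: active queues split the cluster evenly, and jobs inside a queue split that queue's share evenly. `fair_shares` and the queue and job names are hypothetical.

```python
from typing import Dict, List


def fair_shares(queues: Dict[str, List[str]]) -> Dict[str, float]:
    """Return each running job's fraction of the cluster."""
    active = {q: jobs for q, jobs in queues.items() if jobs}
    shares: Dict[str, float] = {}
    for jobs in active.values():
        queue_share = 1.0 / len(active)  # queues are balanced against each other
        for job in jobs:
            shares[job] = queue_share / len(jobs)  # equal split within the queue
    return shares


print(fair_shares({"A": ["job1"], "B": []}))
print(fair_shares({"A": ["job1"], "B": ["job2"]}))
print(fair_shares({"A": ["job1"], "B": ["job2", "job3"]}))
```

Starting a second job in queue B halves the first B job's share but leaves queue A's job untouched.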