Batch - YARN and MapReduce Flashcards
YARN (Yet Another Resource Negotiator)
Resource management system designed to handle distributed computing
YARN APIs
Used to request and work with cluster resources (called by the framework itself, not by user code!)
Fundamental idea of YARN
Split functionalities of resource management and job scheduling.
What makes Yarn scheduler a “pure scheduler”?
Doesn’t monitor application/job status
Doesn’t restart application/job on failure
Application
Single job or DAG of jobs
Applications Manager job
Accepts job submissions, negotiates the first container for executing the Application Master Process (AMP), and provides the service for restarting the AMP if it fails
FIFO scheduler
Runs jobs in submission order; needs no configuration, but bad for shared clusters because one long job blocks everything behind it
Capacity Scheduler
Reserves a fixed share of cluster capacity for each queue (typically one per organization), so a dedicated slice is always available
Fair Scheduler
Balances available resources between running jobs
Resource Manager (RM) (def) (2)
The ultimate authority for allocating containers.
1. Accepts job submissions from clients
2. Sets up the ApplicationMaster (with an initial container)
Node Manager (NM)
A per-machine agent that monitors the resource usage of containers and reports it to the RM
ApplicationMaster (2)
Manages the job lifecycle and requests containers from the RM
Upon request from client, RM finds a NM to launch ______ in a ___________.
Application Master Process; container
Container
A slice of a node's computing resources (CPU, memory) in which a task runs; reports task status to the AMP
Data Locality (YARN)
Ensuring tasks are run as close to the data as possible
4 Levels of Data Locality
- Node-level
- Rack-level
- Data Center-level
- Inter-data center
MapReduce
A programming model that allows developers to write programs that can process large amounts of data in parallel across a cluster
Map Phase
Dataset is partitioned into smaller chunks (input splits) and processed in parallel, turning data into key-value pairs
Sort & Shuffle Phase
Data is sorted by key and shuffled (moved) in groups to reducers.
Reduce Phase
Intermediate data is aggregated by reduce function
Combiner
An optional mini-reduce function run between the Map and Reduce phases as an optimization, so that less data is transferred across the network.
Combiner function must be _______________ and ______________
Associative: (a + b) + c = a + (b + c) - grouping doesn't matter
Commutative: a + b = b + a - order doesn't matter
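The two properties above can be checked with a small sketch: summation (as in word count) is associative and commutative, so running a combiner on each mapper's local output leaves the final result unchanged. This is plain Python for illustration, not the Hadoop Combiner API.

```python
# A sum combiner is safe because addition is associative and commutative:
# pre-aggregating each mapper's local output does not change the final result.

def combiner(values):
    return [sum(values)]  # mini-reduce on one mapper's local output

def reducer(values):
    return sum(values)    # final reduce over everything for one key

# Counts for the key "the" emitted by two separate mappers:
mapper1_out = [1, 1, 1]
mapper2_out = [1, 1]

without_combiner = reducer(mapper1_out + mapper2_out)
with_combiner = reducer(combiner(mapper1_out) + combiner(mapper2_out))
assert without_combiner == with_combiner == 5
```

With the combiner, each mapper ships one pre-summed value instead of its whole list of 1s, which is exactly the data-transfer saving the card describes.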
Map or Reduce task failure
Rescheduled on another node
Map or Reduce node failure
All tasks on the failed node are rescheduled on other nodes
Worst case: restart the entire job
5 Steps of MapReduce process
- Input divided into fixed-size splits
- User-defined map function for each record in split
- Key-value pairs sorted by key and stored on disk
- Sent to reducers that combine values of a given key
- Results written onto DFS
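The five steps above can be sketched as a single-process Python word count (illustrative only; in real Hadoop the splits live on HDFS and the map/reduce tasks run as distributed containers):

```python
# In-memory sketch of the five MapReduce steps for word count.
from itertools import groupby
from operator import itemgetter

documents = ["the cat sat", "the cat ran"]

# 1. Input divided into fixed-size splits (here: one document per split)
splits = documents

# 2. User-defined map function applied to each record in a split
def map_fn(record):
    return [(word, 1) for word in record.split()]

intermediate = [pair for split in splits for pair in map_fn(split)]

# 3. Key-value pairs sorted by key (the sort & shuffle phase)
intermediate.sort(key=itemgetter(0))

# 4. Reducer combines all values of a given key
def reduce_fn(key, values):
    return (key, sum(values))

results = [reduce_fn(k, [v for _, v in group])
           for k, group in groupby(intermediate, key=itemgetter(0))]

# 5. Results written to the DFS (here: just printed)
print(results)  # [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```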
One global reduce task solution
Originally, Hadoop had one reduce task for an entire job regardless of data size, making the single reducer a bottleneck
One reduce task per CPU solution
Number of reduce tasks equals the number of CPU cores, but this can cause an imbalance because some keys have more data than others
Many reduce tasks per CPU solution
Reduce tasks > CPU cores, each task handling fewer keys for more balanced workload, but more overhead
Rule of thumb for picking number of Reduce Tasks
Each reduce task should run for about 5 minutes and produce at least one HDFS block of output
Partition function
Optional MapReduce function to determine how intermediate data is distributed to reduce tasks.
Guarantees that all values for the same key go to the same reducer; a custom partitioner can also aim for a more balanced workload across reducers.
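A minimal Python sketch of the default hash-style partitioning scheme (Hadoop's HashPartitioner does the equivalent in Java with the key's hashCode(); crc32 is used here only as a stand-in deterministic hash):

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash of the key, modulo the number of reducers,
    # so every occurrence of a key lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

num_reducers = 4
r1 = partition("apple", num_reducers)
r2 = partition("apple", num_reducers)
assert r1 == r2                 # same key -> same reducer
assert 0 <= r1 < num_reducers   # valid reducer index
```

Skewed keys can still overload one reducer under pure hashing, which is why a custom partition function is sometimes written to balance the workload.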
How does job get submitted in MapReduce?
The developer launches the job from a client JVM (Java Virtual Machine), which then contacts the YARN RM