Batch - YARN and MapReduce Flashcards
YARN (Yet Another Resource Negotiator)
Resource management system designed to handle distributed computing
YARN APIs
Request and work with cluster resources. These calls are typically made by distributed computing frameworks (e.g. MapReduce), not directly by user code.
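Two key components of the YARN Resource Manager
- Scheduler (allocates resources to running applications)
- Applications Manager (accepts job submissions, negotiates the first container for the Application Master Process, or AMP)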
What makes the YARN scheduler a “pure scheduler”?
It does not monitor or track application status
It does not restart tasks that fail
Applications Manager job
Accept job submissions, negotiate the first container for executing the Application Master Process (AMP), and provide the service for restarting the AMP container if it fails
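YARN provides a choice of schedulers. Here are 3 types:
FIFO (no configuration necessary, but a poor fit for shared clusters because a large job blocks the jobs queued behind it)
Capacity Scheduler (a fixed amount of cluster capacity is reserved for each queue, and jobs are submitted to a queue)
Fair Scheduler (dynamically balances available resources between running jobs)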
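To make the difference between FIFO and fair sharing concrete, here is a minimal sketch. It is a toy model, not YARN's actual scheduler code: the function names, the single "capacity" number, and the job demands are all assumptions made for illustration.

```python
def fifo_allocation(jobs, capacity):
    """Grant resources strictly in arrival order (toy FIFO model)."""
    allocation = {}
    remaining = capacity
    for name, demand in jobs:
        granted = min(demand, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


def fair_allocation(jobs, capacity):
    """Split capacity evenly, then hand unused share to jobs that still need it."""
    allocation = {name: 0 for name, _ in jobs}
    unmet = {name: demand for name, demand in jobs}
    remaining = capacity
    while remaining > 0 and unmet:
        share = remaining // len(unmet)
        if share == 0:
            break  # fewer units left than jobs; leftovers stay idle in this sketch
        for name in list(unmet):
            granted = min(share, unmet[name])
            allocation[name] += granted
            unmet[name] -= granted
            remaining -= granted
            if unmet[name] == 0:
                del unmet[name]
    return allocation


# Hypothetical workload: a large job arrives first, then a small one.
jobs = [("big_job", 100), ("small_job", 10)]
fifo = fifo_allocation(jobs, capacity=100)
fair = fair_allocation(jobs, capacity=100)
print(fifo)
print(fair)
```

Under FIFO the small job gets nothing until the big job releases resources; under fair sharing it runs immediately and the big job keeps the rest.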
YARN Resource Manager (RM)
Global daemon (one per cluster) that manages resource allocation across the cluster
YARN Node Manager (NM)
A daemon running on each node in the cluster that launches containers, monitors their resource usage, and reports back to the Resource Manager.
Upon a request from a client, the RM finds an NM to launch ______ in a ___________.
Application Master Process; container
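The submission flow on this card can be sketched as a toy simulation. The classes below are hypothetical stand-ins written for illustration only; they are not the real YARN client API, and "first node with a free container" is an assumed placement rule.

```python
class NodeManager:
    """Toy stand-in for a per-node daemon that can launch containers."""

    def __init__(self, name, free_containers):
        self.name = name
        self.free_containers = free_containers
        self.running = []

    def launch(self, process):
        # Launching a process consumes one container on this node.
        self.free_containers -= 1
        self.running.append(process)
        return f"{process} running in a container on {self.name}"


class ResourceManager:
    """Toy stand-in for the single cluster-wide daemon."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit_application(self, app_name):
        # Find a node with spare capacity and start the AMP there.
        for nm in self.node_managers:
            if nm.free_containers > 0:
                return nm.launch(f"AMP({app_name})")
        raise RuntimeError("no free containers in the cluster")


rm = ResourceManager([NodeManager("node1", 0), NodeManager("node2", 2)])
result = rm.submit_application("wordcount")
print(result)
```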
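Job of AMP
Execute the computation! It either runs the work in its own container or requests more containers from the RM and coordinates the tasks running in them.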
Data Locality (YARN)
YARN optimizes job execution by trying to run tasks as close to the data as possible (move the computation, not the data).
4 Levels of Data Locality
- Node-level
- Rack-level
- Data Center-level
- Inter-data center
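The four levels listed above can be read as a preference order. A minimal sketch of that idea follows, assuming each location is described by a hypothetical (data center, rack, node) tuple; this layout and the function names are illustrative, not how YARN represents topology.

```python
# Preference order, best first, matching the four levels on the card.
LEVELS = ["node-level", "rack-level", "data center-level", "inter-data center"]


def locality_level(task_location, data_location):
    """Classify how close a candidate node is to the data.

    Each location is an assumed (data_center, rack, node) tuple.
    """
    if task_location == data_location:
        return "node-level"
    if task_location[:2] == data_location[:2]:
        return "rack-level"
    if task_location[0] == data_location[0]:
        return "data center-level"
    return "inter-data center"


def pick_node(candidates, data_location):
    """Choose the candidate with the best (closest) locality level."""
    return min(
        candidates,
        key=lambda c: LEVELS.index(locality_level(c, data_location)),
    )


data_location = ("dc1", "rack1", "node1")
candidates = [
    ("dc2", "rack9", "node7"),  # different data center
    ("dc1", "rack2", "node4"),  # same data center, different rack
    ("dc1", "rack1", "node3"),  # same rack, different node
]
best = pick_node(candidates, data_location)
print(best, locality_level(best, data_location))
```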
MapReduce
A programming model that allows developers to write programs that can process large amounts of data in parallel across a cluster
Map Phase
A large dataset is partitioned into smaller chunks (input splits), which map tasks process in parallel, turning the data into key-value pairs
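As an illustration of the map phase, here is a small word-count sketch in plain Python. The helper names and the sample lines are made up, and the map tasks run one after another here, whereas on a cluster they would run in parallel on different nodes.

```python
def make_splits(records, split_size):
    """Partition the input into fixed-size chunks (input splits)."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """Turn each word in the split into a (word, 1) key-value pair."""
    pairs = []
    for line in split:
        for word in line.split():
            pairs.append((word.lower(), 1))
    return pairs


# Hypothetical input data.
lines = [
    "YARN schedules containers",
    "MapReduce runs on YARN",
    "containers run tasks",
]
splits = make_splits(lines, split_size=2)
map_outputs = [map_task(split) for split in splits]
for output in map_outputs:
    print(output)
```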
Sort & Shuffle
Map output is sorted by key and shuffled (moved across the network) so that all values for the same key arrive at the same reducer.
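A minimal sketch of sort and shuffle, under stated assumptions: the map outputs are hard-coded sample data, and the partition function (a CRC32 hash modulo the number of reducers) is a stand-in chosen because it is deterministic, not a claim about the partitioner Hadoop uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Decide which reducer receives a key (assumed hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Group values by key per reducer, then sort each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


# Hypothetical output of two map tasks.
map_outputs = [
    [("yarn", 1), ("map", 1), ("yarn", 1)],
    [("reduce", 1), ("map", 1)],
]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)
for reducer_id, grouped in enumerate(reducer_inputs):
    print(reducer_id, grouped)
```

Whichever reducer a key lands on, that reducer sees every value emitted for the key, in key-sorted order.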