Batch - YARN and MapReduce Flashcards
YARN (Yet Another Resource Negotiator)
Resource management system designed to handle distributed computing
YARN APIs
Used to request and work with cluster resources (called by the framework itself, not by user code!)
Fundamental idea of YARN
Split functionalities of resource management and job scheduling.
What makes Yarn scheduler a “pure scheduler”?
Doesn’t monitor application/job status
Doesn’t restart application/job on failure
Application
Single job or DAG of jobs
Applications Manager job
Accepts job submissions, negotiates the first container for executing the Application Master Process (AMP), and provides the service for restarting the AMP if it fails
FIFO scheduler
Runs jobs in submission order; needs no configuration, but bad for shared clusters because one long job blocks everything behind it
Capacity Scheduler
Reserves a fixed share of cluster capacity for each queue (typically one per organization), so a dedicated slice is always available
Fair Scheduler
Balances available resources between running jobs
Resource Manager (RM) (def) (2)
The ultimate authority for allocating containers.
1. Accepts job submissions from clients
2. Sets up the ApplicationMaster (with an initial container)
Node Manager (NM)
A per-machine agent that monitors the resource usage of containers and reports it to the RM
ApplicationMaster (2)
Manages the job lifecycle and requests containers from the RM
Upon request from client, RM finds a NM to launch ______ in a ___________.
Application Master Process; container
Container
A slice of a node's computing resources (CPU, memory) in which a task runs; reports task status to the AMP
Data Locality (YARN)
Ensuring tasks are run as close to the data as possible
4 Levels of Data Locality
- Node-level
- Rack-level
- Data Center-level
- Inter-data center
MapReduce
A programming model that allows developers to write programs that can process large amounts of data in parallel across a cluster
Map Phase
Dataset is partitioned into smaller chunks (input splits) and processed in parallel, turning data into key-value pairs
Sort & Shuffle Phase
Data is sorted by key and shuffled (moved) in groups to reducers.
Reduce Phase
Intermediate data is aggregated by reduce function
Combiner
An optional mini-reduce function run between the Map and Reduce phases as an optimization, so that less data is transferred across the network.
Combiner function must be _______________ and ______________
Associative: (a + b) + c = a + (b + c) - grouping doesn't matter
Commutative: a + b = b + a - order doesn't matter
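The two properties above can be checked with a small sketch: summation (as in word count) is associative and commutative, so running a combiner on each mapper's local output leaves the final result unchanged. This is plain Python for illustration, not the Hadoop Combiner API.

```python
# A sum combiner is safe because addition is associative and commutative:
# pre-aggregating each mapper's local output does not change the final result.

def combiner(values):
    return [sum(values)]  # mini-reduce on one mapper's local output

def reducer(values):
    return sum(values)    # final reduce over everything for one key

# Counts for the key "the" emitted by two separate mappers:
mapper1_out = [1, 1, 1]
mapper2_out = [1, 1]

without_combiner = reducer(mapper1_out + mapper2_out)
with_combiner = reducer(combiner(mapper1_out) + combiner(mapper2_out))
assert without_combiner == with_combiner == 5
```

With the combiner, each mapper ships one pre-summed value instead of its whole list of 1s, which is exactly the data-transfer saving the card describes.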
Map or Reduce task failure
Rescheduled on another node
Map or Reduce node failure
All tasks on the failed node are rescheduled on other nodes
Worst case: restart the entire job
5 Steps of MapReduce process
- Input divided into fixed-size splits
- User-defined map function for each record in split
- Key-value pairs sorted by key and stored on disk
- Sent to reducers that combine values of a given key
- Results written onto DFS
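The five steps above can be sketched as a single-process Python word count (illustrative only; in real Hadoop the splits live on HDFS and the map/reduce tasks run as distributed containers):

```python
# In-memory sketch of the five MapReduce steps for word count.
from itertools import groupby
from operator import itemgetter

documents = ["the cat sat", "the cat ran"]

# 1. Input divided into fixed-size splits (here: one document per split)
splits = documents

# 2. User-defined map function applied to each record in a split
def map_fn(record):
    return [(word, 1) for word in record.split()]

intermediate = [pair for split in splits for pair in map_fn(split)]

# 3. Key-value pairs sorted by key (the sort & shuffle phase)
intermediate.sort(key=itemgetter(0))

# 4. Reducer combines all values of a given key
def reduce_fn(key, values):
    return (key, sum(values))

results = [reduce_fn(k, [v for _, v in group])
           for k, group in groupby(intermediate, key=itemgetter(0))]

# 5. Results written to the DFS (here: just printed)
print(results)  # [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```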
One global reduce task solution
Originally, Hadoop had one reduce task for an entire job regardless of data size, making the single reducer a bottleneck
One reduce task per CPU solution
Number of reduce tasks equals the number of CPU cores, but this can cause an imbalance because some keys have more data than others
Many reduce tasks per CPU solution
Reduce tasks > CPU cores, each task handling fewer keys for more balanced workload, but more overhead
Rule of thumb for picking number of Reduce Tasks
Each reduce task should run for about 5 minutes and produce at least one HDFS block of output
Partition function
Optional MapReduce function to determine how intermediate data is distributed to reduce tasks.
Guarantees that all values for the same key go to the same reducer; a custom partitioner can also aim for a more balanced workload across reducers.
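A minimal Python sketch of the default hash-style partitioning scheme (Hadoop's HashPartitioner does the equivalent in Java with the key's hashCode(); crc32 is used here only as a stand-in deterministic hash):

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash of the key, modulo the number of reducers,
    # so every occurrence of a key lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

num_reducers = 4
r1 = partition("apple", num_reducers)
r2 = partition("apple", num_reducers)
assert r1 == r2                 # same key -> same reducer
assert 0 <= r1 < num_reducers   # valid reducer index
```

Skewed keys can still overload one reducer under pure hashing, which is why a custom partition function is sometimes written to balance the workload.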
How does job get submitted in MapReduce?
The developer launches the job from a client JVM (Java Virtual Machine), which then contacts the YARN RM