Batch - YARN and MapReduce Flashcards

1
Q

YARN (Yet Another Resource Negotiator)

A

Hadoop's cluster resource management system, designed to allocate compute resources to distributed applications

2
Q

YARN APIs

A

APIs for requesting and working with cluster resources; calls are made by the distributed processing framework itself, not by user code!

3
Q

Fundamental idea of YARN

A

Split up the functionalities of resource management and job scheduling/monitoring into separate components.

4
Q

What makes Yarn scheduler a “pure scheduler”?

A

It performs scheduling based purely on applications' resource requirements:
Doesn't monitor application/job status
Doesn't restart applications/jobs on failure

5
Q

Application

A

Single job or DAG of jobs

6
Q

Applications Manager job

A

Accepts job submissions, negotiates the container for executing the ApplicationMaster process (AMP), and provides the service for restarting the AMP container if it fails

7
Q

FIFO scheduler

A

Runs jobs in order of submission with no configuration necessary, but bad for shared clusters: a long job blocks every job behind it

8
Q

Capacity Scheduler

A

Allocates a fixed share of cluster capacity to each queue of jobs, so a small job doesn't have to wait behind a large one

9
Q

Fair Scheduler

A

Balances available resources between running jobs

10
Q

Resource Manager (RM) (def) (2)

A

Ultimate authority allocating containers:
1. Accepts job submissions from clients
2. Sets up the ApplicationsMaster (with an initial container)

11
Q

Node Manager (NM)

A

A per-machine agent that launches containers, monitors their resource usage, and reports it to the RM

12
Q

ApplicationsMaster (2)

A

1. Manages the job lifecycle
2. Requests containers from the RM

13
Q

Upon request from client, RM finds a NM to launch ______ in a ___________.

A

Application Master Process; container

14
Q

Container

A

A slice of a machine's computing resources; the task running in it reports job status to the AMP

15
Q

Data Locality (YARN)

A

Ensuring tasks are run as close to the data as possible

16
Q

4 Levels of Data Locality

A
  1. Node-level
  2. Rack-level
  3. Data Center-level
  4. Inter-data center
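
As a toy illustration of locality preference (the host names and ranking function here are invented, not part of YARN's API), a scheduler effectively picks the candidate with the lowest locality distance:

```python
# Lower rank = closer to the data; levels match the four above.
LOCALITY_RANK = {"node": 0, "rack": 1, "data-center": 2, "inter-data-center": 3}

def best_placement(options):
    """Pick the (host, locality-level) candidate closest to the data."""
    return min(options, key=lambda opt: LOCALITY_RANK[opt[1]])

choice = best_placement([("host-a", "rack"), ("host-b", "node"), ("host-c", "data-center")])
# choice == ("host-b", "node")
```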
17
Q

MapReduce

A

A programming model that allows developers to write programs that can process large amounts of data in parallel across a cluster

18
Q

Map Phase

A

The dataset is partitioned into smaller chunks (input splits) and processed in parallel, with the map function turning each record into key-value pairs
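
A minimal Python sketch of a word-count map function (illustrative only; Hadoop's real Mapper API is Java):

```python
def map_fn(record):
    """Emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

# Records from each input split are fed through the mapper in parallel:
pairs = [kv for line in ["The cat", "the dog"] for kv in map_fn(line)]
# pairs == [("the", 1), ("cat", 1), ("the", 1), ("dog", 1)]
```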

19
Q

Sort & Shuffle Phase

A

Data is sorted by key and shuffled (moved) in groups to reducers.
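
The same idea as a toy Python sketch: sort the intermediate pairs by key, then group each key's values together for its reducer (in Hadoop this movement happens over the network, one partition per reducer):

```python
from itertools import groupby

def shuffle(pairs):
    """Sort intermediate pairs by key and group values per key."""
    ordered = sorted(pairs)
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

grouped = shuffle([("the", 1), ("cat", 1), ("the", 1)])
# grouped == [("cat", [1]), ("the", [1, 1])]
```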

20
Q

Reduce Phase

A

Intermediate data is aggregated by the user-defined reduce function
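
Continuing the word-count sketch, the reduce function collapses each key's grouped values into one result (illustrative Python, not Hadoop's Reducer API):

```python
def reduce_fn(key, values):
    """Aggregate all values that were grouped under one key."""
    return (key, sum(values))

results = [reduce_fn(k, vs) for k, vs in [("cat", [1]), ("the", [1, 1])]]
# results == [("cat", 1), ("the", 2)]
```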

21
Q

Combiner

A

In some cases, a mini-reduce function is used as an optimization between the Map and Reduce phases so that less data is transferred across the network.
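
A toy Python sketch of a combiner for word count: it pre-sums one mapper's local output so fewer pairs cross the network (illustrative; in Hadoop the combiner is often the reducer class itself):

```python
def combine(pairs):
    """Mini-reduce run locally on one mapper's output before the shuffle."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())

# One mapper's 4 output pairs shrink to 3 before being transferred:
combined = combine([("the", 1), ("cat", 1), ("the", 1), ("dog", 1)])
# combined == [("cat", 1), ("dog", 1), ("the", 2)]
```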

22
Q

Combiner function must be _______________ and ______________

A

Associative: (a + b) + c = a + (b + c) - grouping doesn't matter

Commutative: a + b = b + a - order doesn't matter
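
Both properties can be checked directly. Addition satisfies them (so summing is combiner-safe), while subtraction fails both, which is why a combiner cannot be an arbitrary operation:

```python
a, b, c = 3, 5, 7

# Addition: grouping and order don't change the result.
assert (a + b) + c == a + (b + c)   # associative
assert a + b == b + a               # commutative

# Subtraction: neither property holds, so combining partial results
# in a different grouping or order would change the final answer.
assert (a - b) - c != a - (b - c)
assert a - b != b - a
```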

23
Q

Map or Reduce task failure

A

Rescheduled on another node

24
Q

Map or Reduce node failure

A

All tasks that ran on the node are rescheduled on other nodes (even completed map tasks, since their intermediate output on that node is lost)
worst case: restart entire job

25
Q

5 Steps of MapReduce process

A
  1. Input divided into fixed-size splits
  2. User-defined map function for each record in split
  3. Key-value pairs sorted by key and stored on disk
  4. Sent to reducers that combine values of a given key
  5. Results written onto DFS
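
The five steps can be strung together in a single-process toy word count (a sketch of the data flow only; real Hadoop runs steps 2-4 across many machines):

```python
from itertools import groupby

lines = ["the quick fox", "the lazy dog", "the quick dog"]

# 1. Divide the input into fixed-size splits (here: one line per split).
splits = [[line] for line in lines]

# 2. Apply the user-defined map function to each record in each split.
pairs = [(word, 1) for split in splits for record in split for word in record.split()]

# 3. Sort the key-value pairs by key (Hadoop spills these to local disk).
pairs.sort()

# 4. "Send" to reducers, which combine all values of a given key.
counts = {key: sum(v for _, v in grp)
          for key, grp in groupby(pairs, key=lambda kv: kv[0])}

# 5. Write the results out (here: print instead of writing to the DFS).
print(counts)  # {'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```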
26
Q

One global reduce task solution

A

Originally, Hadoop had one reduce task for an entire job regardless of data size, making that single task a bottleneck

27
Q

One reduce task per CPU solution

A

The number of reduce tasks is based on the number of CPU cores, but this can cause an imbalance because some keys have more data than others

27
Q

Many reduce tasks per CPU solution

A

Reduce tasks > CPU cores, with each task handling fewer keys for a more balanced workload, but with more scheduling overhead

28
Q

Rule of thumb for picking number of Reduce Tasks

A

Each reduce task should run for about 5 minutes and produce at least one HDFS block of output
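
With an assumed 128 MB HDFS block size and a job expected to produce about 10 GB of reduce output (both numbers hypothetical), the block-size half of the rule caps the reducer count like this:

```python
block_size_mb = 128          # assumed HDFS block size (hypothetical)
total_output_mb = 10 * 1024  # assumed total reduce output: 10 GB (hypothetical)

# Each reduce task should write at least one full HDFS block,
# so the output size bounds the number of reduce tasks from above:
max_reducers = total_output_mb // block_size_mb
print(max_reducers)  # 80
```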

29
Q

Partition function

A

An optional MapReduce function that determines how intermediate key-value pairs are distributed to reduce tasks.

It must send the same keys to the same reducers, ideally while keeping the workload balanced across reducers.
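
Hadoop's default hash partitioner boils down to hashing the key modulo the number of reducers; a deterministic Python sketch of the same idea (using crc32, since Python's built-in string hash is salted per run):

```python
from zlib import crc32

def partition(key, num_reducers):
    """Map a key to a reducer index; equal keys always get the same index."""
    return crc32(key.encode()) % num_reducers

# Every occurrence of "the" lands on the same reducer:
assert partition("the", 4) == partition("the", 4)
assert 0 <= partition("dog", 4) < 4
```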

30
Q

How does job get submitted in MapReduce?

A

The developer launches the job from a client running in a Java Virtual Machine (JVM); the client then contacts the YARN RM to submit it