Batch - YARN and MapReduce Flashcards

1
Q

YARN (Yet Another Resource Negotiator)

A

Hadoop's cluster resource management system: it allocates cluster resources (such as memory and CPU) to distributed computing applications.

2
Q

YARN APIs

A

Used to request and work with cluster resources. These calls are typically made by the processing framework (e.g. MapReduce or Spark), not directly by user code.

3
Q

Two key components of the YARN Resource Manager

A
  1. Scheduler (allocates resources to running applications)
  2. Applications Manager (accepts job submissions and negotiates the first container for the Application Master Process, AMP)
4
Q

What makes the YARN scheduler a “pure scheduler”?

A

It only allocates resources:
it does not monitor or track application status
it gives no guarantee of restarting tasks that fail

5
Q

Applications Manager job

A

Accepts job submissions, negotiates the first container for executing the application's AMP, and provides the service for restarting the AMP if it fails.

6
Q

YARN provides choice of schedulers, here are 3 types:

A

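FIFO Scheduler (runs applications in submission order; no configuration necessary, but poorly suited to shared clusters because a large job blocks everything queued behind it)
Capacity Scheduler (each queue is guaranteed a configured share of cluster capacity)
Fair Scheduler (dynamically balances available resources between running jobs)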
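The difference between FIFO and fair sharing can be illustrated with a toy allocation model. This is only a sketch under simplifying assumptions: the function names `fifo_allocate` and `fair_allocate` are hypothetical, resources are counted as whole containers, and real YARN schedulers also deal with queues, weights, and preemption, none of which is modeled here.

```python
def fifo_allocate(capacity, demands):
    """Grant containers strictly in submission order (dict order)."""
    allocation = {}
    remaining = capacity
    for job, demand in demands.items():
        granted = min(demand, remaining)
        allocation[job] = granted
        remaining -= granted
    return allocation


def fair_allocate(capacity, demands):
    """Split containers evenly; capacity a job cannot use goes to the others."""
    allocation = {job: 0 for job in demands}
    unsatisfied = dict(demands)
    remaining = capacity
    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            break  # fewer leftover containers than waiting jobs; left idle in this toy model
        for job in list(unsatisfied):
            granted = min(share, unsatisfied[job])
            allocation[job] += granted
            remaining -= granted
            unsatisfied[job] -= granted
            if unsatisfied[job] == 0:
                del unsatisfied[job]
    return allocation


# A large job submitted first, then a small one, on a 10-container cluster.
demands = {"big": 10, "small": 2}
fifo_result = fifo_allocate(10, demands)   # the small job is starved
fair_result = fair_allocate(10, demands)   # the small job runs alongside the big one
```

Under FIFO the small job waits until the big one releases containers, which is the behavior that makes FIFO a poor fit for shared clusters.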

7
Q

YARN Resource Manager (RM)

A

A global daemon (one active per cluster) that manages resource allocation across the cluster.

8
Q

YARN Node Manager (NM)

A

A daemon running on each node in the cluster that launches containers, monitors their resource usage, and reports back to the Resource Manager.

9
Q

Upon request from a client, the RM finds an NM to launch ______ in a ___________.

A

Application Master Process; container

10
Q

Job of AMP

A

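Execute the computation! It can either run the computation in its own container or request more containers from the RM and use them to run a distributed computation.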

11
Q

Data Locality (YARN)

A

YARN optimizes job execution by trying to run each task as close to its data as possible (move the computation, not the data).

12
Q

4 Levels of Data Locality

A
  1. Node-level
  2. Rack-level
  3. Data Center-level
  4. Inter-data center
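The preference order across these four levels can be sketched as a small placement helper. This is an illustrative model only: the `(data_center, rack, node)` tuples and the names `locality_level` and `pick_closest` are assumptions made for this example, not part of any YARN API.

```python
# Locality levels ordered from most to least preferred.
LOCALITY_LEVELS = ["node", "rack", "data_center", "inter_data_center"]


def locality_level(candidate, data_location):
    """Classify how close a candidate node is to the data.

    Both arguments are hypothetical (data_center, rack, node) tuples.
    """
    if candidate == data_location:
        return "node"
    if candidate[:2] == data_location[:2]:
        return "rack"
    if candidate[0] == data_location[0]:
        return "data_center"
    return "inter_data_center"


def pick_closest(candidates, data_location):
    """Choose the free node with the best (lowest-index) locality level."""
    return min(
        candidates,
        key=lambda c: LOCALITY_LEVELS.index(locality_level(c, data_location)),
    )


data_at = ("dc1", "r1", "n1")
free_nodes = [("dc2", "r9", "n7"), ("dc1", "r2", "n4"), ("dc1", "r1", "n3")]
chosen = pick_closest(free_nodes, data_at)  # same rack as the data
```

When the node holding the data is busy, the helper falls back to a node on the same rack before considering anything further away.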
13
Q

MapReduce

A

A programming model that lets developers write programs that process large amounts of data in parallel across a cluster.

14
Q

Map Phase

A

The large input dataset is partitioned into smaller chunks (input splits), which are processed in parallel by map tasks that turn the data into intermediate key-value pairs.

15
Q

Sort & Shuffle

A

Map output is sorted by key and shuffled (moved) in groups to the reducers, so that all values for a given key arrive at the same reducer.

16
Q

Reduce phase

A

The intermediate data for each key is aggregated by the reduce function.
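The map, sort & shuffle, and reduce steps can be sketched as a single-process word count. This is a teaching sketch in plain Python, not Hadoop code: the names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are hypothetical, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_fn(line):
    """Map: turn one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Sort & shuffle: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_job(splits):
    # Each split would be handled by its own map task on a real cluster.
    intermediate = [
        pair for split in splits for line in split for pair in map_fn(line)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


splits = [["the cat sat"], ["the dog sat", "the end"]]
counts = run_job(splits)
```

Here `counts` maps each word to its total across both splits, for example `"the"` to 3 and `"sat"` to 2.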

17
Q

Combiner

A

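An optional mini-reduce function that runs on map output between the Map and Reduce phases, so that less data is transferred during the shuffle. It is only usable in some cases, since the operation must give the same result when applied to partial groups (e.g. sum or max).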
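A small sketch of the saving, assuming a word-count job where the combiner applies the same summing logic as the reducer. The names `map_fn` and `combine` are hypothetical and the code runs locally rather than on Hadoop.

```python
from collections import Counter


def map_fn(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def combine(pairs):
    """Pre-aggregate one mapper's output locally before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


split = ["to be or not to be"]
map_output = [pair for line in split for pair in map_fn(line)]
combined = combine(map_output)

# Number of records that would cross the network without and with the combiner.
records_without = len(map_output)
records_with = len(combined)
```

In this example six records shrink to four before the shuffle; the reducer still produces the same totals because summing partial sums gives the same result as summing the raw values.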

18
Q

Map or Reduce task failure

A

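The failed task is rescheduled, preferably on another node. If it keeps failing after the configured maximum number of attempts, the whole job fails.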
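The retry behavior can be sketched as a loop that avoids nodes where the task has already failed. This is a simplified model: `run_with_retries` and the node names are hypothetical, and the default of 4 attempts is used here as an illustrative value matching the usual Hadoop setting for map and reduce tasks.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node), retrying on a different node after each failure.

    Returns (result, node, attempt_number); raises if every attempt fails.
    """
    failed_nodes = []
    for attempt in range(1, max_attempts + 1):
        # Prefer nodes where this task has not failed yet.
        candidates = [n for n in nodes if n not in failed_nodes] or nodes
        node = candidates[0]
        try:
            return task(node), node, attempt
        except RuntimeError:
            failed_nodes.append(node)
    raise RuntimeError(f"task failed after {max_attempts} attempts; job fails")


def flaky_task(node):
    """Hypothetical task that always fails on node-a."""
    if node == "node-a":
        raise RuntimeError("disk error")
    return f"done on {node}"


result, node, attempt = run_with_retries(flaky_task, ["node-a", "node-b"])
```

The first attempt fails on `node-a`, so the second attempt is placed on `node-b` and succeeds.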

19
Q

Map or Reduce node failure

A

All tasks on the failed node are rescheduled on other nodes.
Worst case: the entire job is restarted.