Big Data Lecture 09: Massive Parallel Processing II: Spark Flashcards

1
Q

What do we have to do if one reduce task fails?

A

Depends on the job, but sometimes it is necessary to restart the whole MapReduce job!

2
Q

What are the responsibilities of the JobTracker in MapReduce v1? What is the issue?

A

There are too many!

- Resource management,
- scheduling,
- monitoring,
- job lifecycle,
- fault tolerance.

3
Q

What are 5 issues of MapReduce?

A

1. Scalability suffers due to overheads (only up to about 4,000 nodes and 40,000 tasks),
2. the JobTracker is a big bottleneck: all communication goes through it,
3. the JobTracker is a jack of all trades: it does both scheduling and monitoring,
4. the allocation of slots and tasks is static, which is not flexible,
5. slots are not fungible (interchangeable): we cannot change what they do!

4
Q

What is YARN? What are its main structures?

A

Yet Another Resource Negotiator!

At the top it has a ResourceManager, which virtualizes resources as Containers at the NodeManagers.

5
Q

How does the ResourceManager initiate a process?

A

It designates an 'Application Master' that manages everything and reports back to the ResourceManager. The Application Master has the right to allocate work across the other NodeManagers.

6
Q

What are the responsibilities of the Application Master?

A

It executes and monitors all the tasks, but it handles only this one application and nothing else!

7
Q

What resources can be blocked in a container?

A

Traditionally:

- memory,
- CPU,
- network bandwidth (the tutorial says yes, the lecture says no),

but now also sometimes:

- disk space.

8
Q

How are nodes being kept track of?

A

Using heartbeats: a node-list manager keeps track of all of them and their resources.

9
Q

What is delegated away from the ResourceManager?

A

- It does not have to monitor the tasks,
- it does not have to restart them on failure.

10
Q

What kind of schedulers are there?

A

- FIFO,
- hierarchical queues (depending on percentages),
- Capacity Scheduler (different queues based on percentages),
- Fair Scheduler (super flexible).

11
Q

What is Dominant Resource Fairness?

A

The game-theoretically optimal strategy for allocating resources:

- rank each user by whichever resource is the most constraining for them!

Each user gets the same share, but only in terms of their dominant resource! This way they will not lie about what they need :)
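
A minimal sketch of how a DRF allocation could be computed, assuming a toy cluster; the capacities, user names, and per-task demands are all made-up illustration values:

```python
# Toy Dominant Resource Fairness (DRF) allocator; all numbers are hypothetical.
total = {"cpu": 9, "memory_gb": 18}           # cluster capacity
demand = {                                    # per-task demand of each user
    "A": {"cpu": 1, "memory_gb": 4},          # A's dominant resource: memory
    "B": {"cpu": 3, "memory_gb": 1},          # B's dominant resource: CPU
}

def dominant_share(alloc):
    """Fraction of the cluster used on the allocation's most-used resource."""
    return max(alloc[r] / total[r] for r in total)

alloc = {u: {r: 0 for r in total} for u in demand}
used = {r: 0 for r in total}

# Repeatedly give one task to whichever user has the lowest dominant share.
while True:
    user = min(demand, key=lambda u: dominant_share(alloc[u]))
    if any(used[r] + demand[user][r] > total[r] for r in total):
        break  # the next task no longer fits in the cluster
    for r in total:
        alloc[user][r] += demand[user][r]
        used[r] += demand[user][r]

for u, a in alloc.items():
    print(u, a, f"dominant share = {dominant_share(a):.2f}")
# A ends up with 3 tasks and B with 2: both have a dominant share of 2/3.
```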

12
Q

How is MapReduce generalized to Spark?

A

From two operations (map and reduce), we generalize the whole computation to a DAG! (No directed cycles allowed, so we can toposort it.)

It uses RDDs (Resilient Distributed Datasets): a big partitioned dataset, no longer restricted to key-value pairs.

13
Q

What is the RDD lifecycle?

A

- Creation (e.g. read from HDFS),
- Transformation (e.g. remove duplicates),
- Action (e.g. store or print).
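
A minimal PySpark sketch of the three phases, assuming a local Spark installation; the HDFS path is hypothetical:

```python
# A minimal sketch of the RDD lifecycle; "hdfs:///data/words.txt" is made up.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lifecycle")

rdd = sc.textFile("hdfs:///data/words.txt")   # 1. Creation: read from HDFS
unique = rdd.distinct()                       # 2. Transformation: remove duplicates
print(unique.count())                         # 3. Action: triggers the computation

sc.stop()
```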

14
Q

How is the Spark DAG evaluated?

A

Toposorted and then evaluated, so it can have multiple entry nodes.

Lazy evaluation: only evaluate what you need for the action.
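
A small demo of the lazy evaluation, with made-up data: the transformations return immediately, and nothing is computed until the action at the end.

```python
# Lazy evaluation sketch: transformations build the DAG, the action runs it.
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval")

nums = sc.parallelize(range(1_000_000))
evens = nums.filter(lambda x: x % 2 == 0)   # returns instantly: no work done yet
doubled = evens.map(lambda x: x * 2)        # still no work done

print(doubled.take(5))                      # the action triggers evaluation: [0, 4, 8, 12, 16]
sc.stop()
```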

15
Q

What Spark 1-1 transformations are there?

A

- Filter,
- map,
- flatMap (actually the transformation that MapReduce's map most generally corresponds to),
- distinct,
- sample.
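
A sketch of the 1-to-1 transformations on a toy RDD; the data is made up:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "one-to-one")
lines = sc.parallelize(["to be", "or not", "to be"])

print(lines.filter(lambda s: "be" in s).collect())   # ['to be', 'to be']
print(lines.map(lambda s: len(s)).collect())         # [5, 6, 5]
print(lines.flatMap(lambda s: s.split()).collect())  # ['to', 'be', 'or', 'not', 'to', 'be']
print(lines.distinct().collect())                    # ['to be', 'or not'] (order may vary)
print(lines.sample(False, 0.5).collect())            # a random subset
sc.stop()
```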

16
Q

What Spark 2-1 transformations are there?

A

- Union,
- intersection,
- subtract,
- Cartesian product.
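
A sketch of the 2-to-1 transformations on two toy RDDs (made-up data):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "two-to-one")
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4])

print(a.union(b).collect())         # [1, 2, 3, 3, 4]  (keeps duplicates)
print(a.intersection(b).collect())  # [3]
print(a.subtract(b).collect())      # [1, 2]  (order may vary)
print(a.cartesian(b).collect())     # [(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)]
sc.stop()
```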

17
Q

What actions are there in Spark?

A

- Collect (gathers the RDD spread over multiple nodes, dangerous for big data),
- count,
- count by value,
- take (first elements) and top (largest elements),
- takeSample,
- reduce (and fold, which additionally takes a neutral element for empty/invalid elements).
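
A sketch of these actions on a toy RDD; expected outputs are shown as comments:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "actions")
nums = sc.parallelize([3, 1, 2, 3])

print(nums.collect())        # [3, 1, 2, 3] -- pulls everything to the driver!
print(nums.count())          # 4
print(nums.countByValue())   # defaultdict with counts {3: 2, 1: 1, 2: 1}
print(nums.take(2))          # [3, 1]  (first elements)
print(nums.top(2))           # [3, 3]  (largest elements)
print(nums.takeSample(False, 2))          # random sample of 2 elements
print(nums.reduce(lambda x, y: x + y))    # 9
print(nums.fold(0, lambda x, y: x + y))   # 9, with 0 as the neutral element
sc.stop()
```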

18
Q

What transformations on pair (key-value) RDDs are there?

A

- Keys,
- values,
- reduce by key,
- group by key,
- sort by key (the MapReduce shuffle),
- map values,
- join (on keys),
- subtract by key,
- count by key,
- lookup.
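
A sketch of the pair-RDD operations on toy key-value data:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "pair-rdds")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x")])

print(pairs.keys().collect())                           # ['a', 'b', 'a']
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)] (order may vary)
print(pairs.sortByKey().collect())                      # [('a', 1), ('a', 3), ('b', 2)]
print(pairs.mapValues(lambda v: v * 10).collect())      # [('a', 10), ('b', 20), ('a', 30)]
print(pairs.join(other).collect())                      # [('a', (1, 'x')), ('a', (3, 'x'))]
print(pairs.subtractByKey(other).collect())             # [('b', 2)]
print(pairs.countByKey())                               # {'a': 2, 'b': 1}
print(pairs.lookup("a"))                                # [1, 3]
sc.stop()
```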

19
Q

How can tasks be most efficiently parallelized in Spark?

A

- Tasks are spread over executors (YARN containers), which assign tasks to cores.
- We use narrow dependencies to do all the work where the data is (using the network as little as possible).

20
Q

What is the difference between narrow and wide dependency and how to exploit it?

A

Narrow dependency: the new computation depends only on a subset of the previous data (e.g. the same key);
wide dependency: the new computation depends on all the data, so it all has to be shuffled.

If there is a narrow dependency, we can do all the operations on the same node and skip transferring data, which makes everything more efficient.
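
A sketch contrasting a narrow and a wide dependency on toy data:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dependencies")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: mapValues works partition-by-partition, no data moves.
scaled = pairs.mapValues(lambda v: v * 2)

# Wide: groupByKey must shuffle all values with the same key to one node.
grouped = pairs.groupByKey().mapValues(list)

print(scaled.collect())   # [('a', 2), ('b', 4), ('a', 6)]
print(grouped.collect())  # [('a', [1, 3]), ('b', [2])] (order may vary)
sc.stop()
```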

21
Q

What is a stage? How can a whole job be described using stages?

A

A set of consecutive operations between which the data is never made persistent or emitted; they are all executed at once.

A job is a sequence of stages.

22
Q

What is the relation of Transformations, Stages, Tasks and Job?

A

(Diagram in the original: transformations are grouped into stages, stages consist of parallel tasks, and a job is the sequence of stages.)

23
Q

How is the Spark computational graph used to execute jobs on RDDs? How does this lead to inefficiency? And how is this solved?

A

The graph is a DAG, and the evaluation is lazy: it is only done on an action call (e.g. collect).

This means that the computation of the same nodes can be triggered multiple times for different purposes.

This can be solved by making RDDs persist in memory.
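
A sketch of caching to avoid recomputation (toy data); without cache(), each action below would recompute the map from scratch:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "persist")
nums = sc.parallelize(range(1_000_000))

expensive = nums.map(lambda x: x * x).cache()  # or .persist() for other storage levels

print(expensive.count())   # first action: computes and caches the RDD
print(expensive.sum())     # second action: served from memory
sc.stop()
```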

24
Q

How can wide dependencies be avoided?

A

By pre-partitioning the data (e.g. key-value pairs with the same key go onto the same machine).
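
A sketch of pre-partitioning with partitionBy on toy data, so that later key-based operations find co-located data and can avoid a full shuffle:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "prepartition")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Hash-partition into 4 partitions and keep the result in memory:
partitioned = pairs.partitionBy(4).cache()

# reduceByKey can now work within partitions, without a wide shuffle.
print(partitioned.reduceByKey(lambda x, y: x + y).collect())
sc.stop()
```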

25
Q

Why use Spark with dataframes? Do we have to provide a schema?

A

Spark got the reputation of being slow: if data is validated, it can be stored and processed more efficiently, so we use DataFrames.

The schema can be either provided or inferred (inference is slower, more costly and error-prone).
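
A sketch of providing vs. inferring a schema; the file name and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframes").getOrCreate()

# Explicit schema: no extra pass over the data, and types are validated.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("people.csv", schema=schema, header=True)

# Inferred schema: Spark must scan the data first (slower, more error-prone).
df_inferred = spark.read.csv("people.csv", header=True, inferSchema=True)

df.printSchema()
spark.stop()
```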

26
Q

How does Spark handle data independence?

A

We have the logical plan of execution (e.g. specified in Python), and then Spark creates the physical plan of execution (which is hidden from the user).

This is done through a complicated multi-step process that is hidden from the user (e.g. plan optimization…).
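
A sketch of peeking at those plans with explain() on a made-up DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plans").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

query = df.filter(df.id > 1).select("letter")

# Prints the parsed/analyzed/optimized logical plans and the physical plan:
query.explain(extended=True)
spark.stop()
```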

27
Q

Can we use SQL queries in Spark?

A

Yes, there is an independent SparkSQL dialect that translates SQL to Spark.
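
A sketch of running SQL against a DataFrame (made-up data): register the DataFrame as a temporary view, then query it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

df.createOrReplaceTempView("kv")   # register as a SQL-queryable view
spark.sql("SELECT key, SUM(value) AS total FROM kv GROUP BY key").show()
spark.stop()
```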

28
Q

How can we deal with nested arrays in Spark?

A

Explode: split arrays into different rows in the same column instead.

Then join using LATERAL VIEW!
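
A sketch of both routes on made-up nested data: explode() in the DataFrame API, and LATERAL VIEW in the SQL dialect.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode").getOrCreate()
df = spark.createDataFrame([("a", [1, 2]), ("b", [3])], ["key", "nums"])

# Each array element becomes its own row:
df.select("key", explode("nums").alias("num")).show()

# The same flattening, expressed with LATERAL VIEW in SQL:
df.createOrReplaceTempView("t")
spark.sql("SELECT key, num FROM t LATERAL VIEW explode(nums) e AS num").show()
spark.stop()
```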

29
Q

Can DataFrame be transformed into RDDs? And vice versa?

A

DataFrame to RDD is easy: we just call .rdd on the DataFrame.

The reverse is more difficult, but also possible.
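
A sketch of both directions on made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("convert").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

rdd = df.rdd                                      # DataFrame -> RDD of Row objects: easy
print(rdd.map(lambda row: row.value).collect())   # [1, 2]

# RDD -> DataFrame: we must supply (or let Spark infer) a schema.
df2 = spark.createDataFrame(rdd, ["key", "value"])
df2.show()
spark.stop()
```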