Big Data Lecture 09: Massive Parallel Processing II: Spark Flashcards
What do we have to do if one reduce task fails?
It depends on the job, but sometimes it is necessary to restart the whole MapReduce job!
What are the responsibilities of the JobTracker in MapReduce v1? What is the issue?
There are too many!<br></br><ul><li>Resource management,</li><li>scheduling,</li><li>monitoring,</li><li>job lifecycle,</li><li>fault-tolerance.</li></ul>
What are 5 issues of MapReduce?
<ol><li>Scalability is limited by overheads (only < 4,000 nodes and 40,000 tasks),</li><li>the JobTracker is a big bottleneck: all communication goes through it,</li><li>the JobTracker is a jack of all trades: it does both scheduling and monitoring,</li><li>the allocation of map and reduce slots is static, which is not flexible,</li><li>slots are not fungible (interchangeable): we cannot change what they do!</li></ol>
What is YARN? What are its main structures?
Yet Another Resource Negotiator!<br></br><br></br>At the top there is a ResourceManager, which hands out resources as Containers hosted by the NodeManagers on the worker nodes.
How does ResourceManager initiate a process?
It designates an ‘ApplicationMaster’ (running in a container on one of the NodeManagers) that manages the application and reports back to the ResourceManager. The ApplicationMaster has the right to request containers on other NodeManagers and run tasks there.
What are the responsibilities of the Application Master?
It executes and monitors all the tasks. But it handles only this one application and nothing else!
What resources can be blocked in a container?
Traditionally:<br></br><ul><li>memory,</li><li>CPU,</li><li>network bandwidth (tutorial says yes, lecture no)</li></ul>but now also sometimes<ul><li>disk space.</li></ul>
How are nodes being kept track of?
Using heartbeats; a node list manager keeps track of all of them and their resources.
What is delegated away from the ResourceManager?
<ul><li>It does not have to monitor the tasks,</li><li>it does not have to restart them on failure.</li></ul>These duties are delegated to the per-application ApplicationMasters.
What kind of schedulers are there?
<ul><li>FIFO,</li><li>Hierarchical queues (depending on percentages),</li><li>Capacity Scheduler (different queues based on percentages),</li><li>Fair Scheduler (super flexible).</li></ul>
What is Dominant Resource Fairness?
A game-theoretically sound strategy for allocating multiple resources:<br></br><ul><li>rank each user by their <b>dominant</b> resource, i.e. the resource of which they ask for the largest share of the cluster,</li><li>equalize these dominant shares across users.</li></ul>Each user gets the same share, but only in terms of their dominant resource! Because the mechanism is strategy-proof, users will not lie about what they need :)
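A minimal Python sketch (not from the lecture) of how DRF hands out tasks; the numbers follow the classic example from the original DRF paper (9 CPUs, 18 GB RAM; user A needs 1 CPU and 4 GB per task, user B needs 3 CPUs and 1 GB):

```python
# Cluster capacity and per-task demands (the classic DRF paper example).
total = {"cpu": 9.0, "ram": 18.0}
demand = {"A": {"cpu": 1.0, "ram": 4.0},
          "B": {"cpu": 3.0, "ram": 1.0}}
used = {u: {"cpu": 0.0, "ram": 0.0} for u in demand}

def dominant_share(user):
    # the largest fraction of any single resource this user currently holds
    return max(used[user][r] / total[r] for r in total)

def fits(user):
    # can the cluster still hold one more of this user's tasks?
    return all(sum(used[u][r] for u in used) + demand[user][r] <= total[r]
               for r in total)

while True:
    user = min(used, key=dominant_share)   # lowest dominant share is served first
    if not fits(user):                     # simplified: a real scheduler would try the next user
        break
    for r in total:
        used[user][r] += demand[user][r]

print(used)  # A ends up with 3 CPUs / 12 GB, B with 6 CPUs / 2 GB: both dominant shares are 2/3
```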
How is MapReduce generalized to Spark?
Instead of just two phases (map and reduce), the whole computation is generalized to a DAG of operations! (No directed cycles allowed, so we can toposort it.)<br></br><br></br>It works on RDDs (Resilient Distributed Datasets): big, partitioned datasets that no longer have to consist of key-value pairs.
What is the RDD lifecycle?
<ul><li>Creation (e.g. read from HDFS),</li><li>Transformation (e.g. remove duplicates),</li><li>Action (e.g. store or print).</li></ul>
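A minimal PySpark sketch of the three phases (not from the lecture; the app name and HDFS path are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-lifecycle")

lines = sc.textFile("hdfs:///data/input.txt")   # creation: read from HDFS
unique = lines.distinct()                       # transformation: remove duplicates (lazy)
print(unique.count())                           # action: triggers the actual computation

sc.stop()
```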
How is the Spark DAG evaluated?
It is toposorted and then evaluated, so it can have multiple entry nodes.<br></br><br></br>Lazy evaluation: only what is needed for the action gets evaluated.
What Spark 1-1 transformations are there?
<ul><li>Filter,</li><li>map,</li><li>flatMap (the most general one; it is what the map phase of MapReduce actually does),</li><li>distinct,</li><li>sample.</li></ul>
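A quick illustration in PySpark, assuming a SparkContext sc as in the lifecycle sketch above; all of these are lazy until an action is called:

```python
nums = sc.parallelize([1, 2, 2, 3, 4, 5])

evens   = nums.filter(lambda x: x % 2 == 0)                 # keep elements matching a predicate
scaled  = nums.map(lambda x: x * 10)                        # exactly one output per input
flat    = nums.flatMap(lambda x: range(x))                  # zero or more outputs per input
unique  = nums.distinct()                                   # remove duplicates
sampled = nums.sample(withReplacement=False, fraction=0.5)  # random subset
```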
What Spark 2-1 transformations are there?
<ul><li>Union,</li><li>intersection,</li><li>subtract,</li><li>Cartesian product.</li></ul>
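Again a small sketch, assuming the SparkContext sc from above; the comments show the resulting elements:

```python
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4])

a.union(b)         # 1, 2, 3, 3, 4   (duplicates are kept)
a.intersection(b)  # 3
a.subtract(b)      # 1, 2
a.cartesian(b)     # (1, 3), (1, 4), (2, 3), ... all 6 pairs
```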
What actions are there in Spark?
<ul><li>Collect (pulls the RDD, spread over multiple nodes, back to the driver; dangerous for big data),</li><li>count,</li><li>count by value,</li><li>take (first n elements) and top (largest n elements),</li><li>takeSample,</li><li>reduce (fold additionally takes a neutral element, so it also works on empty RDDs).</li></ul>
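A sketch of the actions, assuming the SparkContext sc from above (unlike transformations, these actually trigger computation):

```python
nums = sc.parallelize([5, 3, 1, 4, 1])

nums.collect()                    # whole RDD pulled to the driver: risky for big data
nums.count()                      # 5
nums.countByValue()               # {5: 1, 3: 1, 1: 2, 4: 1}
nums.take(2)                      # the first 2 elements
nums.top(2)                       # the 2 largest elements
nums.takeSample(False, 2)         # 2 randomly chosen elements
nums.reduce(lambda x, y: x + y)   # 14; fails on an empty RDD
nums.fold(0, lambda x, y: x + y)  # like reduce, but with a neutral element: works on empty RDDs too
```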
What transformations on pair (KeyValue) RDDs are there?
<ul><li>Keys,</li><li>values,</li><li>reduce by key,</li><li>group by key,</li><li>sort by key (the MapReduce shuffle),</li><li>map values,</li><li>join (on keys),</li><li>subtract by key,</li><li>count by key,</li><li>lookup (the last two are actually actions).</li></ul>
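A sketch of the pair-RDD operations, assuming the SparkContext sc from above:

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x")])

pairs.keys()                           # "a", "b", "a"
pairs.values()                         # 1, 2, 3
pairs.reduceByKey(lambda x, y: x + y)  # ("a", 4), ("b", 2)
pairs.groupByKey()                     # ("a", [1, 3]), ("b", [2])
pairs.sortByKey()                      # sorted by key, like the MapReduce shuffle output
pairs.mapValues(lambda v: v * 10)      # keys stay untouched
pairs.join(other)                      # ("a", (1, "x")), ("a", (3, "x"))
pairs.subtractByKey(other)             # ("b", 2)
pairs.countByKey()                     # {"a": 2, "b": 1}   (action)
pairs.lookup("a")                      # [1, 3]             (action)
```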
How can tasks be parallelized most efficiently in Spark?
<ul><li>Tasks are spread over executors (YARN containers), which assign tasks to cores.</li><li>We exploit narrow dependencies to run the computation where the data already is (using the network as little as possible).</li></ul>
What is the difference between narrow and wide dependency and how to exploit it?
Narrow dependency: each output partition depends on only one (or a bounded number of) input partitions,<br></br>wide dependency: an output partition may depend on data from all input partitions, so everything has to be shuffled.<br></br><br></br>If there is a narrow dependency, we can do all the operations on the same node and skip transferring data, which makes everything more efficient.
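A tiny sketch contrasting the two kinds of dependencies (assuming the sc from above):

```python
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

pairs.mapValues(lambda v: v + 1)  # narrow: each output partition depends on a single input partition
pairs.groupByKey()                # wide: an output partition may need data from all input partitions (shuffle)
```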
What is a stage? How can a whole job be described using stages?
A set of consecutive operations between which the data is never made persistent or emitted (i.e. no shuffle in between); they are all executed at once.<br></br><br></br>A job is a sequence of stages.
What is the relation of Transformations, Stages, Tasks and Job?
A job is triggered by an action; it is split into stages at shuffle (wide-dependency) boundaries; within a stage the transformations are pipelined; and each stage is executed as a set of tasks, one per partition.
How is Spark computational graph used to execute jobs on RDDs? How does this lead to inefficiency? And how is this solved?
The graph is a DAG, and the evaluation is lazy: it is only done on an action call (e.g. collect).<br></br><br></br>This means that the computation of the same nodes can be triggered multiple times for different purposes.<br></br><br></br>This can be solved by making the RDDs persist in memory.
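A small sketch of caching, assuming the sc from above and a hypothetical HDFS path:

```python
words = sc.textFile("hdfs:///data/input.txt").flatMap(lambda line: line.split())
words.persist()                  # or words.cache(): keep the RDD in memory once computed

print(words.count())             # first action: reads HDFS, computes and caches the RDD
print(words.distinct().count())  # second action: reuses the cached RDD instead of recomputing it
```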
How can wide dependencies be avoided?
By pre-partitioning the data (e.g. key-value pairs with the same key go onto the same machine).
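A sketch of pre-partitioning with PySpark's partitionBy, assuming the sc from above; the number of partitions (8) is arbitrary:

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

partitioned = pairs.partitionBy(8).persist()  # hash-partition by key once and keep the result around

partitioned.reduceByKey(lambda x, y: x + y)   # same partitioner is reused: no extra shuffle needed
```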
Why use Spark with dataframes? Do we have to provide a schema?
Spark got a reputation for being slow: if the data is validated against a schema, it can be stored and processed much more efficiently, so we use DataFrames.<br></br><br></br>The schema can either be provided or inferred (inference is slower, more costly and error-prone).
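A sketch of both variants, assuming a local SparkSession and a hypothetical HDFS path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-schema").getOrCreate()

# explicit schema: no extra pass over the data, and type problems surface early
schema = StructType([StructField("name", StringType()),
                     StructField("age", IntegerType())])
df1 = spark.read.schema(schema).json("hdfs:///data/people.json")

# inferred schema: Spark first scans the data to guess the types (slower, can guess wrong)
df2 = spark.read.json("hdfs:///data/people.json")
```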
How does Spark handle data independence?
We write the <b>logical plan</b> of execution (e.g. specified in Python), and Spark then creates the <b>physical plan</b> of execution (which is hidden from the user).<br></br><br></br>This is done through a complicated multi-step process that is hidden from the user (e.g. optimization of the plan, …).
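One way to peek at the plans is DataFrame.explain; a small sketch assuming the SparkSession spark from above:

```python
df = spark.range(100).filter("id % 2 = 0")
df.explain(True)  # prints the logical plans (parsed, analyzed, optimized) and the chosen physical plan
```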
Can we use SQL queries in Spark?
Yes: there is an independent Spark SQL dialect, and the SQL queries are translated into Spark jobs.
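A minimal sketch, assuming the SparkSession spark from above:

```python
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "score"])
df.createOrReplaceTempView("people")           # register the DataFrame as a SQL-queryable view

spark.sql("SELECT name FROM people WHERE score > 4").show()  # only 'bob'
```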
How can we deal with nestedness in arrays in Spark?
Explode: split each array into multiple rows (one per element) in the same column instead.<br></br><br></br>In SQL this is written with LATERAL VIEW, which joins the exploded rows back to the original row!
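A small sketch of both the DataFrame-API and SQL variants, assuming the SparkSession spark from above:

```python
from pyspark.sql.functions import explode

df = spark.createDataFrame([("alice", [1, 2]), ("bob", [3])], ["name", "nums"])

# DataFrame API: one output row per array element
df.select("name", explode("nums").alias("num")).show()

# SQL equivalent using LATERAL VIEW
df.createOrReplaceTempView("t")
spark.sql("SELECT name, num FROM t LATERAL VIEW explode(nums) exploded AS num").show()
```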
Can DataFrame be transformed into RDDs? And vice versa?
DataFrame to RDD is easy: we just take the DataFrame's rdd attribute.<br></br><br></br>The reverse is more difficult but also possible (a schema or column names have to be provided or inferred).
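A sketch of both directions, assuming the SparkSession spark from above:

```python
df = spark.createDataFrame([("alice", 3)], ["name", "score"])

rdd = df.rdd                                                        # DataFrame -> RDD of Row objects
df_back = spark.createDataFrame(rdd)                                # RDD of Rows -> DataFrame (schema comes from the Rows)
df_named = rdd.map(lambda r: (r.name, r.score)).toDF(["name", "score"])  # plain tuples: column names must be supplied
```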