Big Data Lecture 09: Massive Parallel Processing II: Spark Flashcards
What do we have to do if one reduce task fails?
It depends on the job, but sometimes it is necessary to restart the whole MapReduce job!
What are the responsibilities of the JobTracker in MapReduce v1? What is the issue?
There are too many!<br></br><ul><li>Resource management,</li><li>scheduling,</li><li>monitoring,</li><li>job lifecycle,</li><li>fault-tolerance.</li></ul>
What are 5 issues of MapReduce?
<ol><li>Scalability is limited by overheads (only < 4,000 nodes and 40,000 tasks),</li><li>the JobTracker is a big bottleneck: all communication goes through it,</li><li>the JobTracker is a jack of all trades: it does both scheduling and monitoring,</li><li>the allocation of map and reduce slots is static, which is not flexible,</li><li>slots are not fungible (interchangeable): we cannot change what they do!</li></ol>
What is YARN? What are its main structures?
Yet Another Resource Negotiator!<br></br><br></br>At the top there is a ResourceManager, which hands out resources as Containers hosted by the NodeManagers on the worker nodes.
How does ResourceManager initiate a process?
It designates an ‘ApplicationMaster’ (running in a container on one of the NodeManagers) that manages the application and reports back to the ResourceManager. The ApplicationMaster has the right to request containers on other NodeManagers and run tasks there.
What are the responsibilities of the Application Master?
It executes and monitors all the tasks. But it handles only this one application and nothing else!
What resources can be blocked in a container?
Traditionally:<br></br><ul><li>memory,</li><li>CPU,</li><li>network bandwidth (tutorial says yes, lecture no)</li></ul>but now also sometimes<ul><li>disk space.</li></ul>
How are nodes being kept track of?
Using heartbeats; a node list manager keeps track of all of them and their resources.
What is delegated away from the ResourceManager?
<ul><li>It does not have to monitor the tasks,</li><li>it does not have to restart them on failure.</li></ul>These duties are delegated to the per-application ApplicationMasters.
What kind of schedulers are there?
<ul><li>FIFO,</li><li>Hierarchical queues (depending on percentages),</li><li>Capacity Scheduler (different queues based on percentages),</li><li>Fair Scheduler (super flexible).</li></ul>
What is Dominant Resource Fairness?
A game-theoretically sound strategy for allocating multiple resources:<br></br><ul><li>rank each user by their <b>dominant</b> resource, i.e. the resource of which they ask for the largest share of the cluster,</li><li>equalize these dominant shares across users.</li></ul>Each user gets the same share, but only in terms of their dominant resource! Because the mechanism is strategy-proof, users will not lie about what they need :)
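A minimal Python sketch (not from the lecture) of how DRF hands out tasks; the numbers follow the classic example from the original DRF paper (9 CPUs, 18 GB RAM; user A needs 1 CPU and 4 GB per task, user B needs 3 CPUs and 1 GB):

```python
# Cluster capacity and per-task demands (the classic DRF paper example).
total = {"cpu": 9.0, "ram": 18.0}
demand = {"A": {"cpu": 1.0, "ram": 4.0},
          "B": {"cpu": 3.0, "ram": 1.0}}
used = {u: {"cpu": 0.0, "ram": 0.0} for u in demand}

def dominant_share(user):
    # the largest fraction of any single resource this user currently holds
    return max(used[user][r] / total[r] for r in total)

def fits(user):
    # can the cluster still hold one more of this user's tasks?
    return all(sum(used[u][r] for u in used) + demand[user][r] <= total[r]
               for r in total)

while True:
    user = min(used, key=dominant_share)   # lowest dominant share is served first
    if not fits(user):                     # simplified: a real scheduler would try the next user
        break
    for r in total:
        used[user][r] += demand[user][r]

print(used)  # A ends up with 3 CPUs / 12 GB, B with 6 CPUs / 2 GB: both dominant shares are 2/3
```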
How is MapReduce generalized to Spark?
Instead of just two phases (map and reduce), the whole computation is generalized to a DAG of operations! (No directed cycles allowed, so we can toposort it.)<br></br><br></br>It works on RDDs (Resilient Distributed Datasets): big, partitioned datasets that no longer have to consist of key-value pairs.
What is the RDD lifecycle?
<ul><li>Creation (e.g. read from HDFS),</li><li>Transformation (e.g. remove duplicates),</li><li>Action (e.g. store or print).</li></ul>
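A minimal PySpark sketch of the three phases (not from the lecture; the app name and HDFS path are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-lifecycle")

lines = sc.textFile("hdfs:///data/input.txt")   # creation: read from HDFS
unique = lines.distinct()                       # transformation: remove duplicates (lazy)
print(unique.count())                           # action: triggers the actual computation

sc.stop()
```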
How is the Spark DAG evaluated?
It is toposorted and then evaluated, so it can have multiple entry nodes.<br></br><br></br>Lazy evaluation: only what is needed for the action gets evaluated.
What Spark 1-1 transformations are there?
<ul><li>Filter,</li><li>map,</li><li>flatMap (the most general one; it is what the map phase of MapReduce actually does),</li><li>distinct,</li><li>sample.</li></ul>
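A quick illustration in PySpark, assuming a SparkContext sc as in the lifecycle sketch above; all of these are lazy until an action is called:

```python
nums = sc.parallelize([1, 2, 2, 3, 4, 5])

evens   = nums.filter(lambda x: x % 2 == 0)                 # keep elements matching a predicate
scaled  = nums.map(lambda x: x * 10)                        # exactly one output per input
flat    = nums.flatMap(lambda x: range(x))                  # zero or more outputs per input
unique  = nums.distinct()                                   # remove duplicates
sampled = nums.sample(withReplacement=False, fraction=0.5)  # random subset
```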
What Spark 2-1 transformations are there?
<ul><li>Union,</li><li>intersection,</li><li>subtract,</li><li>Cartesian product.</li></ul>
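Again a small sketch, assuming the SparkContext sc from above; the comments show the resulting elements:

```python
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4])

a.union(b)         # 1, 2, 3, 3, 4   (duplicates are kept)
a.intersection(b)  # 3
a.subtract(b)      # 1, 2
a.cartesian(b)     # (1, 3), (1, 4), (2, 3), ... all 6 pairs
```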
What actions are there in Spark?
<ul><li>Collect (pulls the RDD, spread over multiple nodes, back to the driver; dangerous for big data),</li><li>count,</li><li>count by value,</li><li>take (first n elements) and top (largest n elements),</li><li>takeSample,</li><li>reduce (fold additionally takes a neutral element, so it also works on empty RDDs).</li></ul>
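A sketch of the actions, assuming the SparkContext sc from above (unlike transformations, these actually trigger computation):

```python
nums = sc.parallelize([5, 3, 1, 4, 1])

nums.collect()                    # whole RDD pulled to the driver: risky for big data
nums.count()                      # 5
nums.countByValue()               # {5: 1, 3: 1, 1: 2, 4: 1}
nums.take(2)                      # the first 2 elements
nums.top(2)                       # the 2 largest elements
nums.takeSample(False, 2)         # 2 randomly chosen elements
nums.reduce(lambda x, y: x + y)   # 14; fails on an empty RDD
nums.fold(0, lambda x, y: x + y)  # like reduce, but with a neutral element: works on empty RDDs too
```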
What transformations on pair (KeyValue) RDDs are there?
<ul><li>Keys,</li><li>values,</li><li>reduce by key,</li><li>group by key,</li><li>sort by key (the MapReduce shuffle),</li><li>map values,</li><li>join (on keys),</li><li>subtract by key,</li><li>count by key,</li><li>lookup (the last two are actually actions).</li></ul>
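A sketch of the pair-RDD operations, assuming the SparkContext sc from above:

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x")])

pairs.keys()                           # "a", "b", "a"
pairs.values()                         # 1, 2, 3
pairs.reduceByKey(lambda x, y: x + y)  # ("a", 4), ("b", 2)
pairs.groupByKey()                     # ("a", [1, 3]), ("b", [2])
pairs.sortByKey()                      # sorted by key, like the MapReduce shuffle output
pairs.mapValues(lambda v: v * 10)      # keys stay untouched
pairs.join(other)                      # ("a", (1, "x")), ("a", (3, "x"))
pairs.subtractByKey(other)             # ("b", 2)
pairs.countByKey()                     # {"a": 2, "b": 1}   (action)
pairs.lookup("a")                      # [1, 3]             (action)
```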
How can tasks be parallelized most efficiently in Spark?
<ul><li>Tasks are spread over executors (YARN containers), which assign tasks to cores.</li><li>We exploit narrow dependencies to run the computation where the data already is (using the network as little as possible).</li></ul>
What is the difference between narrow and wide dependency and how to exploit it?
Narrow dependency: each output partition depends on only one (or a bounded number of) input partitions,<br></br>wide dependency: an output partition may depend on data from all input partitions, so everything has to be shuffled.<br></br><br></br>If there is a narrow dependency, we can do all the operations on the same node and skip transferring data, which makes everything more efficient.
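A tiny sketch contrasting the two kinds of dependencies (assuming the sc from above):

```python
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

pairs.mapValues(lambda v: v + 1)  # narrow: each output partition depends on a single input partition
pairs.groupByKey()                # wide: an output partition may need data from all input partitions (shuffle)
```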
What is a stage? How can a whole job be described using stages?
A set of consecutive operations between which the data is never made persistent or emitted (i.e. no shuffle in between); they are all executed at once.<br></br><br></br>A job is a sequence of stages.
What is the relation of Transformations, Stages, Tasks and Job?
A job is triggered by an action; it is split into stages at shuffle (wide-dependency) boundaries; within a stage the transformations are pipelined; and each stage is executed as a set of tasks, one per partition.
How is Spark computational graph used to execute jobs on RDDs? How does this lead to inefficiency? And how is this solved?
The graph is a DAG, and the evaluation is lazy: it is only done on an action call (e.g. collect).<br></br><br></br>This means that the computation of the same nodes can be triggered multiple times for different purposes.<br></br><br></br>This can be solved by making the RDDs persist in memory.
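A small sketch of caching, assuming the sc from above and a hypothetical HDFS path:

```python
words = sc.textFile("hdfs:///data/input.txt").flatMap(lambda line: line.split())
words.persist()                  # or words.cache(): keep the RDD in memory once computed

print(words.count())             # first action: reads HDFS, computes and caches the RDD
print(words.distinct().count())  # second action: reuses the cached RDD instead of recomputing it
```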
How can wide dependencies be avoided?
By pre-partitioning the data (e.g. key-value pairs with the same key go onto the same machine).
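A sketch of pre-partitioning with PySpark's partitionBy, assuming the sc from above; the number of partitions (8) is arbitrary:

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

partitioned = pairs.partitionBy(8).persist()  # hash-partition by key once and keep the result around

partitioned.reduceByKey(lambda x, y: x + y)   # same partitioner is reused: no extra shuffle needed
```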
Why use Spark with dataframes? Do we have to provide a schema?
Spark got a reputation for being slow: if the data is validated against a schema, it can be stored and processed much more efficiently, so we use DataFrames.<br></br><br></br>The schema can either be provided or inferred (inference is slower, more costly and error-prone).
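A sketch of both variants, assuming a local SparkSession and a hypothetical HDFS path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-schema").getOrCreate()

# explicit schema: no extra pass over the data, and type problems surface early
schema = StructType([StructField("name", StringType()),
                     StructField("age", IntegerType())])
df1 = spark.read.schema(schema).json("hdfs:///data/people.json")

# inferred schema: Spark first scans the data to guess the types (slower, can guess wrong)
df2 = spark.read.json("hdfs:///data/people.json")
```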
How does Spark handle data independence?
We write the <b>logical plan</b> of execution (e.g. specified in Python), and Spark then creates the <b>physical plan</b> of execution (which is hidden from the user).<br></br><br></br>This is done through a complicated multi-step process that is hidden from the user (e.g. optimization of the plan, …).
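One way to peek at the plans is DataFrame.explain; a small sketch assuming the SparkSession spark from above:

```python
df = spark.range(100).filter("id % 2 = 0")
df.explain(True)  # prints the logical plans (parsed, analyzed, optimized) and the chosen physical plan
```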
Can we use SQL queries in Spark?
Yes: there is an independent Spark SQL dialect, and the SQL queries are translated into Spark jobs.
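A minimal sketch, assuming the SparkSession spark from above:

```python
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "score"])
df.createOrReplaceTempView("people")           # register the DataFrame as a SQL-queryable view

spark.sql("SELECT name FROM people WHERE score > 4").show()  # only 'bob'
```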
How can we deal with nestedness in arrays in Spark?
Explode: split each array into multiple rows (one per element) in the same column instead.<br></br><br></br>In SQL this is written with LATERAL VIEW, which joins the exploded rows back to the original row!
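A small sketch of both the DataFrame-API and SQL variants, assuming the SparkSession spark from above:

```python
from pyspark.sql.functions import explode

df = spark.createDataFrame([("alice", [1, 2]), ("bob", [3])], ["name", "nums"])

# DataFrame API: one output row per array element
df.select("name", explode("nums").alias("num")).show()

# SQL equivalent using LATERAL VIEW
df.createOrReplaceTempView("t")
spark.sql("SELECT name, num FROM t LATERAL VIEW explode(nums) exploded AS num").show()
```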
Can DataFrame be transformed into RDDs? And vice versa?
DataFrame to RDD is easy: we just take the DataFrame's rdd attribute.<br></br><br></br>The reverse is more difficult but also possible (a schema or column names have to be provided or inferred).
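A sketch of both directions, assuming the SparkSession spark from above:

```python
df = spark.createDataFrame([("alice", 3)], ["name", "score"])

rdd = df.rdd                                                        # DataFrame -> RDD of Row objects
df_back = spark.createDataFrame(rdd)                                # RDD of Rows -> DataFrame (schema comes from the Rows)
df_named = rdd.map(lambda r: (r.name, r.score)).toDF(["name", "score"])  # plain tuples: column names must be supplied
```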