Big Data Lecture 09 Massive Parallel Processing II: Spark Flashcards
What do we have to do if 1 reduce job fails?
Depends on the job, but sometimes it is necessary to restart the whole MapReduce job!
What are the responsibilities of the JobTracker in MapReduce v1? What is the issue?
There are too many!<br></br><ul><li>Resource management,</li><li>scheduling,</li><li>monitoring,</li><li>job lifecycle,</li><li>fault-tolerance.</li></ul>
What are 5 issues of MapReduce?
<ol><li>Limited scalability due to overheads (fewer than 4,000 nodes and 40,000 tasks),</li><li>the JobTracker is a big bottleneck: all communication goes through it,</li><li>the JobTracker is a jack of all trades: it does both scheduling and monitoring,</li><li>the allocation of slots and tasks is static, which is inflexible,</li><li>slots are not fungible (interchangeable): a map slot cannot be reused as a reduce slot, or vice versa.</li></ol>
What is YARN? What are its main structures?
Yet Another Resource Negotiator!<br></br><br></br>At the top sits a ResourceManager, which hands out resources as Containers on the NodeManagers (one per node).<br></br><img></img>
How does the ResourceManager start an application?
It designates an ApplicationMaster (itself running in a container) that manages the application and reports back to the ResourceManager. The ApplicationMaster has the right to request containers on other NodeManagers to run its tasks.<br></br><br></br><img></img>
What are the responsibilities of the Application Master?
It executes and monitors all the tasks. But it handles only this one application and nothing else!
What resources can be blocked in a container?
Traditionally:<br></br><ul><li>memory,</li><li>CPU,</li><li>network bandwidth (tutorial says yes, lecture no)</li></ul>but now also sometimes<ul><li>disk space.</li></ul>
How are nodes being kept track of?
Using heartbeats: a nodes list manager keeps track of all of them and their resources.
What is delegated away from the ResourceManager?
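A minimal sketch of heartbeat-based node tracking, under stated assumptions: the class `NodeTracker`, its method names, and the timeout value are hypothetical and are not YARN's actual API. Timestamps are passed in explicitly so the behaviour is deterministic.

```python
class NodeTracker:
    """Toy nodes list manager: remembers the last heartbeat of each node."""

    def __init__(self, timeout_s):
        # A node with no heartbeat for longer than timeout_s counts as lost.
        self.timeout_s = timeout_s
        self.nodes = {}  # node id -> (last heartbeat time, free resources)

    def heartbeat(self, node_id, now, free_resources):
        # Each heartbeat refreshes the timestamp and the reported resources.
        self.nodes[node_id] = (now, dict(free_resources))

    def live_nodes(self, now):
        # Nodes whose last heartbeat is recent enough, in sorted order.
        return sorted(
            node_id
            for node_id, (last_seen, _) in self.nodes.items()
            if now - last_seen <= self.timeout_s
        )


tracker = NodeTracker(timeout_s=10)
tracker.heartbeat("node-1", now=0, free_resources={"memory_gb": 8, "vcores": 4})
tracker.heartbeat("node-2", now=5, free_resources={"memory_gb": 16, "vcores": 8})
live_at_12 = tracker.live_nodes(now=12)  # node-1 has been silent for 12 s
```

Here `node-1` drops out of the live list once its last heartbeat is older than the assumed 10-second timeout.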
<ul><li>It does not have to monitor the tasks,</li><li>does not have to restart them on failure.</li></ul>
What kind of schedulers are there?
<ul><li>FIFO,</li><li>Hierarchical queues (depending on percentages),</li><li>Capacity Scheduler (different queues based on percentages),</li><li>Fair Scheduler (super flexible).</li></ul>
What is Dominant Resource Fairness?
A game-theoretically sound strategy for sharing several resource types fairly:<br></br><ul><li>rank each user by the resource that constrains them the most (their dominant resource)!</li></ul>Each user gets the same share, but only in terms of their dominant resource! This way, users gain nothing by lying about what they need :)
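A small worked sketch of the idea, assuming a simple progressive-filling loop: repeatedly give one task to the user whose dominant share is currently lowest, until nobody's next task fits. The function names, the cluster size, and the per-task demands are illustrative assumptions, not taken from the lecture.

```python
from fractions import Fraction


def dominant_share(allocated, capacity):
    # A user's dominant share is their largest fractional share of any resource.
    return max(Fraction(allocated[r], capacity[r]) for r in capacity)


def drf_allocate(capacity, demands):
    """Return the number of tasks each user gets under progressive filling."""
    used = {r: 0 for r in capacity}
    alloc = {user: {r: 0 for r in capacity} for user in demands}
    tasks = {user: 0 for user in demands}
    active = set(demands)
    while active:
        # Pick the user with the lowest dominant share (ties broken by name).
        user = min(sorted(active), key=lambda u: dominant_share(alloc[u], capacity))
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            active.discard(user)  # this user's next task no longer fits
            continue
        for r in capacity:
            used[r] += need[r]
            alloc[user][r] += need[r]
        tasks[user] += 1
    return tasks


# Assumed example: 9 CPUs and 18 GB of memory shared by two users.
capacity = {"cpu": 9, "mem_gb": 18}
demands = {
    "A": {"cpu": 1, "mem_gb": 4},  # memory is A's dominant resource
    "B": {"cpu": 3, "mem_gb": 1},  # CPU is B's dominant resource
}
result = drf_allocate(capacity, demands)
```

With these numbers, A ends up with 12 of 18 GB and B with 6 of 9 CPUs, so both hold a dominant share of 2/3 even though their dominant resources differ.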
How is MapReduce generalized to Spark?
Instead of two fixed phases (map and reduce), the whole computation is generalized to a DAG! (No directed cycles allowed, so we can toposort it.)<br></br><br></br>It uses RDDs (Resilient Distributed Datasets): big, partitioned datasets whose elements no longer have to be key-value pairs.
What is the RDD lifecycle?
<ul><li>Creation (e.g. read from HDFS),</li><li>Transformation (e.g. remove duplicates),</li><li>Action (e.g. store or print).</li></ul>
How is the Spark DAG evaluated?
The DAG is topologically sorted and then evaluated in that order; it can have multiple entry nodes.<br></br><br></br>Lazy evaluation: only what is needed for the action is evaluated.
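A toy illustration of both points in plain Python (not Spark itself), assuming Python 3.9+ for `graphlib`. The function `evaluate` and the node names are hypothetical: it collects only the ancestors of the requested node, sorts them topologically, and computes them in that order.

```python
from graphlib import TopologicalSorter


def evaluate(parents, funcs, target):
    """Compute `target`, touching only the nodes it actually depends on."""
    # Lazy part: walk backwards from the target to find the needed nodes.
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(parents.get(node, []))
    # Toposort part: parents are always computed before their children.
    subgraph = {node: parents.get(node, []) for node in needed}
    results = {}
    for node in TopologicalSorter(subgraph).static_order():
        inputs = [results[p] for p in parents.get(node, [])]
        results[node] = funcs[node](*inputs)
    return results[target], set(results)


# Two entry nodes ("a" and "b"); "unused" is never requested by the action.
parents = {"joined": ["a", "b"], "unused": ["a"]}
funcs = {
    "a": lambda: [1, 2, 3],
    "b": lambda: [3, 4],
    "joined": lambda xs, ys: xs + ys,
    "unused": lambda xs: [x * 100 for x in xs],
}
value, evaluated = evaluate(parents, funcs, "joined")
```

Asking for `joined` computes `a`, `b`, and `joined`, while `unused` is skipped because no requested result depends on it.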
What Spark 1-1 transformations are there?
<ul><li>filter,</li><li>map,</li><li>flatMap (the most general one; it is what the map phase of MapReduce does),</li><li>distinct,</li><li>sample.</li></ul>
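To make the difference between these concrete, here are plain-Python analogues on a small list. This is a sketch and not the Spark API; the variable names and the input data are made up, and `sample` is left out because it is random.

```python
from itertools import chain

lines = ["a b", "b c", ""]

# filter: keep only the elements that satisfy a predicate (0 or 1 output each).
non_empty = [line for line in lines if line]

# map: exactly one output element per input element.
lengths = [len(line) for line in non_empty]

# flatMap: zero or more output elements per input element, flattened.
words = list(chain.from_iterable(line.split() for line in non_empty))

# distinct: remove duplicates (order is kept here; Spark does not promise that).
unique_words = list(dict.fromkeys(words))
```

flatMap can express both filter (return zero or one element) and map (return exactly one element), which is why it is the most general of the three.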