Module 8a - Apache Spark Flashcards
What is Apache Spark?
Apache Spark is an open-source framework for large-scale data processing.
Where is Spark data stored? What language is it written in?
- Spark data is stored in memory
- Written in Scala; runs on the JVM
What is a benefit of Spark running on the JVM?
- All Java libraries and methods can be invoked and used natively
- All the other benefits of running a program on a VM (safety, isolation, replicability, etc.)
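A tiny illustration of the interop point, assuming a spark-shell session where `sc` is the SparkContext: any Java class (here `java.time.Instant` from the standard library) can be used directly inside Spark code.

```scala
import java.time.Instant

// java.time.Instant is a plain Java class, invoked natively from Scala
val tagged = sc.parallelize(Seq("a", "b"))
               .map(s => (s, Instant.now().toString))

tagged.collect().foreach(println)
```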
What is a Resilient Distributed Dataset (RDD)? What does it represent? Is it mutable?
RDD is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel
RDDs (Resilient Distributed Datasets) are data structures. What are their key characteristics?
- Fault tolerant
- Immutable
- Partitioned (a collection of items split across nodes)
- Optimized to be operated on in parallel
- Lets users explicitly persist intermediate results in memory (see the sketch below)
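A minimal sketch of these properties, assuming a spark-shell session (`sc` is the SparkContext):

```scala
// An RDD is an immutable, partitioned collection (here, 4 partitions)
val numbers = sc.parallelize(1 to 1000, 4)

// Transformations return new RDDs; `numbers` itself is never modified
val squares = numbers.map(n => n * n)

// Explicitly persist an intermediate result in memory so that
// later actions reuse it instead of recomputing the whole lineage
squares.cache()

squares.sum()   // first action: computes the RDD and caches it
squares.max()   // second action: served from the in-memory copy
```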
What was the problem with Hadoop MapReduce that led to the development of Spark?
Hadoop MR lacked abstractions for leveraging distributed memory.
This made it inefficient for applications that reuse intermediate results across multiple computations.
What is lineage with respect to RDD?
How an RDD was derived from other RDDs. It is used to rebuild an RDD in the event of a failure
(direct definition)
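Lineage can be inspected directly: `toDebugString` prints the chain of RDDs an RDD was derived from. A minimal spark-shell sketch:

```scala
val base    = sc.parallelize(1 to 100)
val doubled = base.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Prints the lineage graph: evens <- doubled <- base.
// If a partition of `evens` is lost, Spark replays this chain to rebuild it.
println(evens.toDebugString)
```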
What are Transformations in Apache Spark?
Data operations that convert one RDD (or a pair of RDDs) into another RDD
<p>What are "Actions" in Spark?</p>
<p>Data operations that convert an RDD into an output. Ex: number or a sequence of value</p>
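A small spark-shell sketch of the distinction (transformations are lazy; only an action triggers execution):

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark"))

// Transformations: RDD -> RDD, evaluated lazily (nothing runs yet)
val pairs = words.map(w => (w, 1))

// Actions: RDD -> output, forcing the computation
val total   = pairs.count()     // a number
val results = pairs.collect()   // a sequence of values
```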
What happens with the Spark scheduler when an action is invoked on an RDD?
The Spark scheduler examines the lineage graph of the RDD and builds a directed acyclic graph (DAG) of transformations
After the Spark scheduler builds a DAG of transformations, the transformations are further grouped into "stages". What is a "stage"?
A stage is a collection of transformations with **narrow dependencies**, meaning that one partition of the output depends only on one partition of each input. Ex: filtering
In Apache Spark, what are characteristics of narrow dependencies?
What are characteristics of wide dependencies?
Narrow Dependencies:
- One partition of the output depends on only one partition of each input (e.g. map)
- Transformations with narrow dependencies can be pipelined together effectively
Wide Dependencies:
- One partition of the output depends on multiple partitions of some input (e.g. groupByKey)
- Transformations with wide dependencies require a shuffle
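A spark-shell sketch contrasting the two; `mapValues` is narrow (like `map`), while `groupByKey` is wide and forces a shuffle:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow: each output partition depends on exactly one input partition,
// so this can be pipelined with other narrow transformations
val incremented = pairs.mapValues(_ + 1)

// Wide: values for one key may live in many input partitions,
// so Spark must shuffle data across the cluster
val grouped = incremented.groupByKey()

grouped.collect().foreach(println)
```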
Describe how Spark typically executes a job in stages while leveraging pipelining. How does the program go from initial RDDs, to intermediate RDDs, to a final RDD?
What is the throughput benefit of executing jobs in stages?
We can define initial RDDs, intermediate RDDs (which lead toward the final result), and a final RDD. Narrow- and wide-dependency transformations are applied to the initial RDDs to produce the intermediate RDDs and eventually the final RDD.
These transformations are broken down into stages; within a stage, transformations with narrow dependencies are pipelined (applied back-to-back over each partition without materializing intermediate results), which significantly improves throughput.
Describe how the word count problem (get the frequency of every word in a file) can be done in Scala.
What does the code look like?
- Build an RDD using data in HDFS
- Transform textFile to counts by applying the following sequence of operators:
  - tokenize each line of input
  - map each word to the tuple (word, 1)
  - reduce the word counts
- Dump the output back to HDFS
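A typical spark-shell version of this (the HDFS paths are placeholders):

```scala
// Build an RDD from a file in HDFS
val textFile = sc.textFile("hdfs://.../input.txt")

val counts = textFile
  .flatMap(line => line.split(" "))   // tokenize each line of input
  .map(word => (word, 1))             // map each word to the tuple (word, 1)
  .reduceByKey(_ + _)                 // reduce the word counts

// Dump the output back to HDFS
counts.saveAsTextFile("hdfs://.../output")
```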
Classify the following transformations as narrow or wide dependency:
- FlatMap
- Map
- ReduceByKey
Note that a narrow dependency is a transformation in which each output partition depends on exactly one partition of each input, whereas a wide dependency is one in which an output partition depends on multiple input partitions
FlatMap - Narrow Dependency
Map - Narrow Dependency
ReduceByKey - Wide Dependency