Module 8a - Apache Spark Flashcards

1
Q

What is Apache Spark?

A

Apache Spark is an open-source framework/tool used for large-scale data processing.

2
Q

Where is Spark data stored?

What language is it written in?

A
  • Spark data is stored in-memory
  • Written in Scala; runs on the JVM

3
Q

What is a benefit of Spark running on the JVM?

A
  • All the Java methods can be invoked and used natively
  • All the other benefits of using a VM to run a program (safety, isolation, replicability, etc.)

4
Q

What is a Resilient Distributed Dataset (RDD)? What does it represent? Is it mutable?

A

An RDD is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

5
Q

RDDs (Resilient Distributed Datasets) are data structures. What are their key characteristics?

A
  • Fault tolerant
  • Immutable
  • Partitioned (a list of items)
  • Optimized to be operated on in parallel
  • Allow users to explicitly persist intermediate results in memory
6
Q

What was the problem with Hadoop MapReduce that led to the development of Spark?

A

Hadoop MR lacked abstractions for leveraging distributed memory.
This made it inefficient for applications that reuse intermediate results across multiple computations.

7
Q

What is lineage with respect to RDD?

A

How an RDD was derived from other RDDs. It is used to rebuild an RDD in the event of a failure.

(direct definition)

8
Q

What are Transformations in Apache Spark?

A

Data operations that convert one RDD or a pair of RDDs into another RDD
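
A small sketch (assuming an existing SparkContext named sc); map and filter are transformations, each producing a new RDD:

  val nums    = sc.parallelize(1 to 10)    // an initial RDD
  val doubled = nums.map(_ * 2)            // transformation: RDD -> new RDD
  val evens   = doubled.filter(_ % 4 == 0) // transformation: RDD -> new RDD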

9
Q

<p>What are "Actions" in Spark?</p>

A

Data operations that convert an RDD into an output, e.g., a number or a sequence of values
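
A small sketch (assuming an existing SparkContext sc); each call below is an action that returns a value to the driver rather than a new RDD:

  val nums = sc.parallelize(1 to 5)
  nums.count()       // action: returns 5
  nums.reduce(_ + _) // action: returns 15
  nums.collect()     // action: returns Array(1, 2, 3, 4, 5)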

10
Q

What happens with the Spark scheduler when an action is invoked on an RDD?

A

The Spark Scheduler examines the lineage graph of the RDD and builds a directed acyclic graph (DAG) of transformations

11
Q

After the Spark Scheduler builds a DAG of transformations, they are further grouped into "stages". What is a "stage"?

A

A stage is a collection of transformations with narrow dependencies, meaning that one partition of the output depends only on one partition of each input. Ex: filtering

12
Q

In Apache Spark, what are the characteristics of narrow dependencies?

What are the characteristics of wide dependencies?

A

Narrow Dependencies:

  • One partition of the output depends on only one partition of each input (eg: map)
  • Transformations with narrow dependencies can be pipelined together effectively

Wide Dependencies:

  • One partition of the output depends on multiple partitions of some input (eg: groupByKey)
  • Transformations with wide dependencies require a shuffle
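
A small sketch contrasting the two (assuming an existing SparkContext sc):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  pairs.mapValues(_ * 10) // narrow: each output partition depends on one input partition
  pairs.groupByKey()      // wide: an output partition may depend on many input partitions (shuffle)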
13
Q

Describe how Spark typically executes a job in stages while leveraging pipelining. How does the program go from initial RDDs, to intermediate RDDs, to a final RDD?

What is the benefit of executing jobs in stages for throughput?

A

We can define initial RDDs, intermediate RDDs (which lead toward the final result), and a final RDD. Narrow- and wide-dependency transformations are applied to the initial RDDs to produce the intermediate RDDs and, eventually, the final RDD.

These transformations are broken down into stages. Within a stage, narrow-dependency transformations are pipelined (applied back to back on each partition without materializing the intermediate results), which significantly improves throughput

14
Q

Describe how the word count problem (get the frequency of every word in a file) can be solved in Scala.
What does the code look like?

A
  1. Build an RDD using data in HDFS
  2. Transform textFile to counts by applying a sequence of the following operators (see the sketch below):
    - tokenize each line of input
    - map each word to the tuple (word, 1)
    - reduce the word counts
  3. Dump the output back to HDFS
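
A minimal sketch of the classic Spark word count (the HDFS paths are placeholders; sc is an existing SparkContext):

  val textFile = sc.textFile("hdfs://.../input.txt")      // 1. build an RDD from HDFS
  val counts = textFile.flatMap(line => line.split(" "))  // 2a. tokenize each line
                       .map(word => (word, 1))            // 2b. map each word to (word, 1)
                       .reduceByKey(_ + _)                // 2c. reduce the word counts
  counts.saveAsTextFile("hdfs://.../output")              // 3. dump the output back to HDFS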
15
Q

Classify the following transformations as narrow or wide dependency:

- FlatMap

- Map

- ReduceByKey

Note that a narrow dependency is a transformation in which there is a 1-to-1 mapping between input and output partitions, whereas a wide dependency is a transformation in which there is a many-to-1 mapping

A

FlatMap - Narrow Dependency
Map - Narrow Dependency
ReduceByKey - Wide Dependency

16
Q

What does a FlatMap transformation do?

A

Returns a new collection formed by applying a given function to each element, and then flattening the result by one level.
Equivalent to calling map() and then calling flat().
flat() takes everything in subarrays and moves it up a level (into the parent collection)
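
A small sketch in Spark (assuming an existing SparkContext sc):

  val lines = sc.parallelize(Seq("to be", "or not"))
  lines.map(_.split(" "))     // RDD of arrays: Array(to, be), Array(or, not)
  lines.flatMap(_.split(" ")) // flattened one level: "to", "be", "or", "not"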

17
Q

What is the Scala placeholder syntax?

A

Transformations are often specified using anonymous functions, which use Scala’s convenient placeholder syntax (“_” notation)

For example, “_ + 1” turns into “x => x+1”
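
A couple of sketched uses on an RDD (assuming an existing SparkContext sc):

  val rdd = sc.parallelize(1 to 5)
  rdd.map(_ + 1)     // same as rdd.map(x => x + 1)
  rdd.reduce(_ + _)  // same as rdd.reduce((a, b) => a + b)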

18
Q

In Scala, what can the data type of an element of an RDD be?

A

An element of an RDD can be a single item, an array of items, or a tuple (it can be any user-defined class)
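
A small sketch (assuming an existing SparkContext sc):

  val nums  = sc.parallelize(Array(1, 2, 3))                // elements are single items
  val rows  = sc.parallelize(Array(Array(1, 2), Array(3)))  // elements are arrays
  val pairs = sc.parallelize(Array(("a", 1), ("b", 2)))     // elements are tuples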

19
Q

Can you run Spark without Scala?

A

In this course's setting, no: Spark is used as an API for Scala, so Scala is required. (Spark itself also provides Java, Python, and R APIs.)

20
Q

In Scala, how can the underscore token be used with a tuple?

A

The underscore is used to select the different elements of a tuple:

  var t = (4, 3, 2)
  var sum = t._1 + t._2 + t._3  // _1, _2, _3 select the first, second, third elements; sum = 9

21
Q

In Scala, what are round brackets used for in the context of arrays? How do you remove the first `n` elements of an array/list in Scala?

A

Round brackets are used to select different elements of an array:

  val myarr = Array(4,3,2)
  if (myarr(0) < myarr(1)) {...}

Use drop(n) to remove the first `n` elements of an array/list
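
A quick sketch of drop(n):

  val myList = List(1, 2, 3, 4, 5)
  myList.drop(2)          // List(3, 4, 5)
  Array(4, 3, 2).drop(1)  // Array(3, 2)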
22
Q

What is the purpose of PageRank?
How does it work at a high level?
What is the idea behind it?
How many iterations does PageRank run for?

A

PageRank assigns each website/node/document a number that measures its importance. It works iteratively, with an arbitrary number of iterations

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites/documents/nodes.

23
Q

What is the purpose of the damping factor "d" in Scala PageRank?

A

The damping factor “d” (typically 0.85) is the probability, at each step, that the agent traversing the graph of connected documents will continue following links.

This guarantees convergence: the iterations could otherwise go on indefinitely, since rankings can always be updated. Damping stabilizes the system and ensures that the algorithm eventually converges.

24
Q

In PageRank, when the algorithm begins, what are the rankings typically initialized to?

A

All rankings in the collection of nodes/documents are initialized to 1/N, where N is the number of nodes/documents. If X is a node in the collection of documents, then PR(X) = 1/N for all X

25
Q

Suppose we're dealing with PageRank, and all the documents have been initialized. In each iteration, the ranking of each node is updated. What formula is PR(X) updated with?

Assume that the neighbors/connections of X are known

A

PR(X) = (1-d)/N + d * SUM(r/n) over each neighbor of X

d = damping factor, typically 0.85

N = number of nodes in the network

r = the ranking of that neighbor (this depends on the neighbor)

n = the number of outgoing links that neighbor has (this depends on the neighbor)
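
A hypothetical worked example: take N = 3 and d = 0.85, and suppose X has two neighbors pointing to it: A with rank 1/3 and 2 outgoing links, and B with rank 1/3 and 1 outgoing link. Then:

  PR(X) = (1 - 0.85)/3 + 0.85 * (1/6 + 1/3)
        = 0.05 + 0.85 * 0.5
        = 0.475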

26
Q

How is PageRank implemented in Scala?

A

- The nodes are loaded from the file system using HDFS

- The ranks are loaded into an RDD data structure

- A loop is run to populate the RDD data structure with the default ranks

- In the loop, each node in the RDD data structure is updated iteratively, one by one, with the contributions of each neighboring page

- Also in the loop, the damping term (1-d)/N is added to each value in the RDD data structure (see the sketch below)
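
A minimal sketch in the spirit of the standard Spark PageRank example; the file path, edge parsing, and iteration count are assumptions, and sc is an existing SparkContext:

  val d     = 0.85
  val iters = 10
  val links = sc.textFile("hdfs://.../links.txt")
                .map { l => val p = l.split("\\s+"); (p(0), p(1)) } // (src, dst) edge pairs
                .groupByKey()
                .persist()                    // cache the link structure across iterations
  val N     = links.count()
  var ranks = links.mapValues(_ => 1.0 / N)   // initialize every rank to 1/N
  for (i <- 1 to iters) {
    val contribs = links.join(ranks).values.flatMap {
      case (neighbors, rank) => neighbors.map(dst => (dst, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(c => (1 - d) / N + d * c)
  }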

27
Q

Describe the lineage graph for PageRank implemented in Spark. Remember that a lineage graph is a DAG showing how RDDs are derived from other RDDs through transformations

A
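
Following the sketch in the previous card: the persisted links RDD feeds every iteration. In each iteration, links is joined with the current ranks RDD, a flatMap over the join result produces the per-neighbor contribution pairs, and reduceByKey (followed by mapValues applying the damping term) yields the next ranks RDD. The lineage graph is therefore a chain ranks0 → contribs0 → ranks1 → contribs1 → … → ranksN, with the single links RDD branching into the join at every iteration.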
28
Q

PageRank can be implemented using Hadoop as well as Spark, though it is typically done in Spark. What are 2 weaknesses of using Hadoop for PageRank? Why is Spark better?

A

Hadoop weaknesses:

- The intermediate output is dumped to HDFS after each iteration, because each iteration requires a separate MapReduce job. This is very I/O intensive and inefficient

- If a convergence test is used to terminate the iterative loop instead of computing for a fixed number of iterations, then an additional MapReduce program is required in each iteration to evaluate the convergence condition.

Why Spark is better:

Spark is able to cache intermediate data (nodes/documents & ranks) in main memory, provided that it has sufficient resources

29
Q

<p>What does the ".persist()" flag added to the end of an RDDvariablesignify in Scala?</p>

A

It requests that Spark hold the data (for PageRank, this would be the links/documents) in main memory, in the RDD.

As such, when ".persist()" is added, the outputs of transformations are cached automatically (subject to resource limits) in main memory
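
A quick sketch (the path and the parseLink helper are hypothetical):

  val links = sc.textFile("hdfs://.../links.txt")
                .map(line => parseLink(line))  // parseLink is a hypothetical parser
                .persist()                     // ask Spark to keep this RDD in main memory for reuse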