Spark Flashcards
What are the differences between these 3 types of programming?
object-oriented
declarative
functional
object-oriented: e.g. Python; state lives in objects, and those objects are mutable
declarative: e.g. SQL; you describe the result you want, not the steps to compute it
functional: e.g. Spark; data flows through chained functions, and objects are immutable
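A minimal sketch of the mutable-vs-immutable contrast (assumes a SparkContext named sc, like the one created in the DataFrame example further down):
# Object-oriented / mutable: the list is changed in place.
nums = [1, 2, 3]
nums.append(4)                        # nums itself is now [1, 2, 3, 4]

# Functional / immutable: a Spark transformation never changes the RDD;
# it returns a new RDD and leaves the original untouched.
rdd = sc.parallelize([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)    # new RDD; rdd is unchanged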
What is MapReduce?
A simple two-part framework for distributed computing:
“Map” functions/algorithms: should be easily parallelizable, so each record can be processed independently on any worker.
“Reduce” functions/algorithms: should be commutative and associative, so partial results can be combined in any order.
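A sketch of the classic MapReduce example (word count) written in PySpark, assuming a SparkContext named sc as created below:
lines = sc.parallelize(["to be or not to be"])
pairs = lines.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1))        # "map" step: runs independently per record
counts = pairs.reduceByKey(lambda a, b: a + b)   # "reduce" step: + is commutative & associative
counts.collect()   # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)] (order varies)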
What are the two types of Spark operations?
Transformations and Actions
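A minimal sketch (assumes a SparkContext named sc): transformations only record what to do; an action makes Spark actually do it.
rdd = sc.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing is computed yet
squares = evens.map(lambda x: x * x)       # transformation: still nothing computed
squares.count()                            # action: the whole chain runs now, returns 5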
How is Spark similar to git?
They both keep track of the “roadmap” of how the dataset gets from one state to the next. This is what makes Spark’s RDDs (Resilient Distributed Datasets) resilient: if one worker node breaks, the “instructions” for doing its job are still intact.
Why are RDDs resilient?
Because Spark keeps track of the “roadmap” of how the dataset gets from one state to the next. So if one worker node breaks, the “instructions” for doing its job are still intact.
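You can inspect that “roadmap” (Spark calls it the lineage) directly; a sketch, using the RDDs from the transformations/actions example above:
print(squares.toDebugString())   # prints the chain of parent RDDs Spark would replay
                                 # (some PySpark versions return this as bytes)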
What are key characteristics of Spark RDDs or Dataframes?
- Immutable
- Resilient
Example imports/code to create a Spark dataframe from an RDD
Just these 2 lines are needed before you can create RDDs:
from pyspark import SparkContext
sc = SparkContext("local[*]", "temp")
my_rdd = sc.parallelize([("a", 1), ("b", 2)])   # any existing RDD works; this one is just example data

These are needed to turn an RDD into a DataFrame:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = my_rdd.toDF()
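Once the DataFrame exists, the usual DataFrame calls apply (a quick sketch):
df.printSchema()   # column names/types inferred from the RDD (here the defaults _1, _2)
df.show()          # action: materializes the data and prints it as a table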
How does caching Spark dataframes work?
Add .cache() at the end of a chain of transformations.
The next action called on it runs all the transformations and, as a side effect, stores (caches) the result.
Every later action on that DataFrame then reads the cached version instead of recomputing the whole chain of transformations.
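A sketch, using the df from the example above (its auto-generated columns are _1 and _2):
filtered = df.filter(df["_2"] > 1).cache()   # .cache() just marks it; nothing runs yet
filtered.count()    # 1st action: runs the transformations and fills the cache
filtered.count()    # later actions reuse the cached rows instead of recomputing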
Do both RDDs and Dataframes optimize the order of transformations?
No. DataFrames go through Spark’s Catalyst optimizer, which can reorder and combine transformations before running them; RDD transformations are executed as written (lazily, but without that query optimization).
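A sketch of how to see Catalyst’s work on a DataFrame, again using the df from above:
df.filter(df["_2"] > 1).select("_1").explain()   # prints the optimized physical plan Catalyst chose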