Spark Flashcards
What are the differences between these 3 types of programming?
object-oriented
declarative
functional
object-oriented: e.g. Python; state lives in objects, and those objects are mutable
declarative: e.g. SQL; you describe the result you want, not the steps to compute it
functional: e.g. Spark; data flows through chained functions, and objects are immutable
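A minimal sketch of the mutable-vs-immutable contrast (assumes a SparkContext named sc, like the one created in the DataFrame example further down):
# Object-oriented / mutable: the list is changed in place.
nums = [1, 2, 3]
nums.append(4)                        # nums itself is now [1, 2, 3, 4]

# Functional / immutable: a Spark transformation never changes the RDD;
# it returns a new RDD and leaves the original untouched.
rdd = sc.parallelize([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)    # new RDD; rdd is unchanged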
What is MapReduce?
A simple two-part framework for distributed computing:
“Map” functions/algorithms: should be easily parallelizable, so each record can be processed independently on any worker.
“Reduce” functions/algorithms: should be commutative and associative, so partial results can be combined in any order.
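A sketch of the classic MapReduce example (word count) written in PySpark, assuming a SparkContext named sc as created below:
lines = sc.parallelize(["to be or not to be"])
pairs = lines.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1))        # "map" step: runs independently per record
counts = pairs.reduceByKey(lambda a, b: a + b)   # "reduce" step: + is commutative & associative
counts.collect()   # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)] (order varies)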
What are the two types of Spark operations?
Transformations and Actions
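A minimal sketch (assumes a SparkContext named sc): transformations only record what to do; an action makes Spark actually do it.
rdd = sc.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing is computed yet
squares = evens.map(lambda x: x * x)       # transformation: still nothing computed
squares.count()                            # action: the whole chain runs now, returns 5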
How is Spark similar to git?
They both keep track of the “roadmap” of how the dataset gets from one state to the next. This is what makes Spark’s RDDs (Resilient Distributed Datasets) resilient: if one worker node breaks, the “instructions” for doing its job are still intact.
Why are RDDs resilient?
Because Spark keeps track of the “roadmap” of how the dataset gets from one state to the next. So if one worker node breaks, the “instructions” for doing its job are still intact.
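You can inspect that “roadmap” (Spark calls it the lineage) directly; a sketch, using the RDDs from the transformations/actions example above:
print(squares.toDebugString())   # prints the chain of parent RDDs Spark would replay
                                 # (some PySpark versions return this as bytes)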
What are key characteristics of Spark RDDs or Dataframes?
- Immutable
- Resilient
Example imports/code to create a Spark dataframe from an RDD
Just these 2 lines are needed before you can create RDDs:
from pyspark import SparkContext
sc = SparkContext("local[*]", "temp")
my_rdd = sc.parallelize([("a", 1), ("b", 2)])   # any existing RDD works; this one is just example data

These are needed to turn an RDD into a DataFrame:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = my_rdd.toDF()
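Once the DataFrame exists, the usual DataFrame calls apply (a quick sketch):
df.printSchema()   # column names/types inferred from the RDD (here the defaults _1, _2)
df.show()          # action: materializes the data and prints it as a table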
How does caching Spark dataframes work?
Add .cache() at the end of a chain of transformations.
The next action called on it runs all the transformations and, as a side effect, stores (caches) the result.
Every later action on that DataFrame then reads the cached version instead of recomputing the whole chain of transformations.
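A sketch, using the df from the example above (its auto-generated columns are _1 and _2):
filtered = df.filter(df["_2"] > 1).cache()   # .cache() just marks it; nothing runs yet
filtered.count()    # 1st action: runs the transformations and fills the cache
filtered.count()    # later actions reuse the cached rows instead of recomputing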
Do both RDDs and Dataframes optimize the order of transformations?
No. DataFrames go through Spark’s Catalyst optimizer, which can reorder and combine transformations before running them; RDD transformations are executed as written (lazily, but without that query optimization).
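A sketch of how to see Catalyst’s work on a DataFrame, again using the df from above:
df.filter(df["_2"] > 1).select("_1").explain()   # prints the optimized physical plan Catalyst chose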