Spark Flashcards

1
Q

What are the differences between these 3 types of programming?

object-oriented
declarative
functional

A

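object-oriented: e.g. Python; state lives in objects, which are typically mutable

declarative: e.g. SQL; you describe what result you want, not how to compute it

functional: e.g. Spark's API; data is immutable, and functions return new values instead of modifying existing ones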
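
A minimal sketch of the contrast in plain Python (not Spark code). The `Counter` class and variable names are made up for illustration; the declarative equivalent would be a query along the lines of SELECT SUM(n) FROM numbers.

```python
# Object-oriented style: state lives in an object and is mutated in place.
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value  # changes the existing object


counter = Counter()
for n in [1, 2, 3]:
    counter.add(n)

# Functional style: the input is never modified; each step returns a new value.
numbers = (1, 2, 3)                             # immutable tuple
doubled = tuple(map(lambda n: n * 2, numbers))  # new tuple; `numbers` is unchanged
functional_total = sum(numbers)
```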

2
Q

What is MapReduce?

A

A programming model for distributed computing that splits a job into two phases.

“Map” functions/algorithms: should be easily parallelizable

“Reduce” functions/algorithms: should be commutative and associative.
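
A small sketch of the idea using only Python's built-in `map` and `functools.reduce` (no cluster involved). The two "chunks" stand in for data held by two hypothetical workers.

```python
from functools import reduce

words = ["spark", "map", "reduce", "spark", "map", "spark"]

# Map: each element is processed independently, so this step parallelizes easily.
lengths = list(map(len, words))


# Reduce: addition is commutative and associative, so partial results
# from separate chunks can be combined in any order.
def add(a, b):
    return a + b


total = reduce(add, lengths)

# Simulate two workers, each reducing its own chunk, then merge the partials.
chunk_a, chunk_b = lengths[:3], lengths[3:]
partial_a = reduce(add, chunk_a)
partial_b = reduce(add, chunk_b)
merged = add(partial_b, partial_a)  # merge order does not change the result
```

A reduce function such as subtraction would not be safe here, because the result would depend on how the data was split and in which order the partials were merged.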

3
Q

What are the two types of Spark operations?

A

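Transformations and actions. Transformations (e.g. map, filter) are lazy and return a new dataset; actions (e.g. count, collect) trigger the computation and return a result.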
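
A toy model in plain Python of how the two kinds of operations differ. `LazyDataset` is a hypothetical stand-in, not Spark's implementation; it only borrows the method names `map`, `filter`, and `collect`.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, in order

    # Transformations: return a NEW dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Action: runs the recorded steps and returns a result.
    def collect(self):
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def square(x):
    calls.append(x)  # lets us observe when work actually happens
    return x * x


base = LazyDataset([1, 2, 3, 4])
pipeline = base.map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()       # the action triggers the computation
```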

4
Q

How is Spark similar to git?

A

They both keep track of the “roadmap” of how the dataset gets from one state to the next (Spark calls this the lineage). This is what makes Spark’s RDDs (Resilient Distributed Datasets) resilient: if one worker node fails, the “instructions” for doing its job are still intact, so the lost data can be recomputed.

5
Q

Why are RDDs resilient?

A

Because Spark keeps track of the “roadmap” (the lineage) of how the dataset gets from one state to the next. So if one worker node fails, the “instructions” for doing its job are still intact, and the lost data can be recomputed from them.
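
A plain-Python sketch of the recovery idea, assuming a dataset split into three partitions. All names here (`source_partitions`, `lineage`, `compute_partition`, `workers`) are hypothetical and only illustrate the concept.

```python
source_partitions = {0: [1, 2], 1: [3, 4], 2: [5, 6]}

# Lineage: the ordered "instructions" that turn source data into the final dataset.
lineage = [
    lambda part: [x * 10 for x in part],
    lambda part: [x + 1 for x in part],
]


def compute_partition(partition_id):
    """Rebuild one partition from the source by replaying the lineage."""
    data = source_partitions[partition_id]
    for step in lineage:
        data = step(data)
    return data


# Each "worker" holds the computed result for one partition.
workers = {pid: compute_partition(pid) for pid in source_partitions}
expected = dict(workers)

# Simulate a worker failure: its computed partition is lost...
del workers[1]

# ...but the lineage is intact, so only the missing partition is recomputed.
workers[1] = compute_partition(1)
```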

6
Q

What are key characteristics of Spark RDDs or Dataframes?

A
  • Immutable
  • Resilient
7
Q

Example imports/code to create a Spark dataframe from an RDD

A

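Just these 2 lines are needed to create an RDD:
from pyspark import SparkContext
sc = SparkContext("local[*]", "temp")

This is needed for DataFrames (SQLContext is the legacy entry point; Spark 2.0+ also offers SparkSession):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

my_rdd = sc.parallelize([(1, "a"), (2, "b")])  # example RDD of tuples
df = my_rdd.toDF()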

8
Q

How does caching Spark dataframes work?

A

Add .cache() at the end of a chain of transformations. Calling .cache() is itself lazy: it only marks the DataFrame for caching.

The next action called after that runs all the transformations and stores the result.

The 2nd, 3rd, etc. time an action references this DataFrame, Spark uses the cached version and does not have to recompute the transformations.
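
A toy model of this behavior in plain Python. `ToyFrame` and `expensive_transformations` are hypothetical names, not Spark's implementation; the `runs` list only records how often the transformations actually execute.

```python
runs = []


def expensive_transformations(data):
    """Stands in for a long chain of transformations."""
    runs.append(1)  # record each real execution
    return [x * 2 for x in data if x % 2 == 1]


class ToyFrame:
    def __init__(self, source):
        self._source = source
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True  # only marks the dataset; nothing is computed yet
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached  # reuse the stored result
        result = expensive_transformations(self._source)
        if self._use_cache:
            self._cached = result  # the first action fills the cache
        return result

    # Actions
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


df = ToyFrame([1, 2, 3, 4, 5]).cache()
runs_after_cache_call = len(runs)  # cache() alone does no work
first = df.count()                 # runs the transformations and stores the result
second = df.collect()              # served from the cache
```

Without the `.cache()` call, both actions would run the transformations again from the source.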

9
Q

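Do both RDDs and DataFrames optimize the order of transformations?

A

No. Only DataFrames do: their query plans go through Spark's Catalyst optimizer, which can reorder and combine operations. RDD transformations are lazy too, but Spark runs them in the order you wrote them.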
