SPARK Basics Flashcards
1
Q
Spark Context
A
- Every Spark application requires a Spark Context
- Spark Shell provides a preconfigured Spark Context called sc
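A minimal Spark-shell session using the preconfigured context (a sketch; it assumes the shell is already running, where sc is created for you):

```scala
// sc is created automatically when the Spark shell starts
scala> sc.appName              // name of the current application
scala> sc.defaultParallelism   // default number of partitions for RDDs
```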
2
Q
RDDs
(Resilient Distributed Datasets)
A
- RDD: Resilient Distributed Dataset
- Resilient - if data in memory is lost, it can be recreated
- Distributed - processed across the cluster
- Dataset - initial data can come from a file or be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing operations on RDDs
3
Q
Creating an RDD
A
- Three ways to create an RDD:
- From a file or set of files
- From data in memory
- From another RDD
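A sketch of all three creation paths in the Spark shell (file name and sample data are hypothetical):

```scala
// 1. From a file (or set of files)
scala> val fromFile = sc.textFile("purplecow.txt")

// 2. From data in memory
scala> val fromMemory = sc.parallelize(List("one", "two", "three"))

// 3. From another RDD (via a transformation)
scala> val fromRdd = fromMemory.map(s => s.toUpperCase)
```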
4
Q
A file based RDD
A
> val mydata = sc.textFile("purplecow.txt")
> mydata.count()
5
Q
RDD Operations
A
- Transformations: define a new RDD based on the current one(s)
- Actions: return a value to the driver program
6
Q
RDD Operations: ACTIONS
A
- Some common actions:
- count() - returns the number of elements
- take(n) - returns an array of the first n elements
- collect() - returns an array of all elements
- saveAsTextFile(file) - saves to text file(s)
Example:
for (line <- mydata.take(2))
println(line)
7
Q
RDD Operations: TRANSFORMATION
A
- Transformations create a new RDD from an existing one
- RDDs are immutable
- Data in an RDD is never changed
- Transform in sequence to modify the data as needed
- Some common transformations
- map(function) - creates a new RDD by performing a function on each record in the base RDD
- filter(function) - creates a new RDD by including or excluding each record in the base RDD according to a boolean function
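A short sketch chaining both transformations in the Spark shell (file name hypothetical, as in the earlier example):

```scala
scala> val mydata = sc.textFile("purplecow.txt")

// map: create a new RDD with every line upper-cased
scala> val upper = mydata.map(line => line.toUpperCase)

// filter: create a new RDD keeping only lines that start with "I"
scala> val istarts = upper.filter(line => line.startsWith("I"))

// transformations are lazy; the count() action triggers execution
scala> istarts.count()
```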