SPARK Basics Flashcards
1
Q
Spark Context
A
- Every Spark application requires a Spark Context
- Spark Shell provides a preconfigured Spark Context called sc
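A minimal Spark-shell session using the preconfigured context (a sketch; it assumes the shell is already running, where sc is created for you):

```scala
// sc is created automatically when the Spark shell starts
scala> sc.appName              // name of the current application
scala> sc.defaultParallelism   // default number of partitions for RDDs
```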
2
Q
RDDs
(Resilient Distributed Datasets)
A
- RDD: Resilient Distributed Dataset
- Resilient - if data in memory is lost, it can be recreated
- Distributed - processed across the cluster
- Dataset - initial data can come from a file or be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing operations on RDDs
3
Q
Creating an RDD
A
- Three ways to create an RDD:
- From a file or set of files
- From data in memory
- From another RDD
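A sketch of all three creation paths in the Spark shell (file name and sample data are hypothetical):

```scala
// 1. From a file (or set of files)
scala> val fromFile = sc.textFile("purplecow.txt")

// 2. From data in memory
scala> val fromMemory = sc.parallelize(List("one", "two", "three"))

// 3. From another RDD (via a transformation)
scala> val fromRdd = fromMemory.map(s => s.toUpperCase)
```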
4
Q
A file based RDD
A
> val mydata = sc.textFile("purplecow.txt")
> mydata.count()
5
Q
RDD Operations
A
- Transformations: define a new RDD based on the current one(s)
- Actions: return a value to the driver program
6
Q
RDD Operations: ACTIONS
A
- Some common actions:
- count() - returns the number of elements
- take(n) - returns an array of the first n elements
- collect() - returns an array of all elements
- saveAsTextFile(file) - saves to text file(s)
Example:
for (line <- mydata.take(2))
println(line)
7
Q
RDD Operations: TRANSFORMATION
A
- Transformations create a new RDD from an existing one
- RDDs are immutable
- Data in an RDD is never changed
- Transform in sequence to modify the data as needed
- Some common transformations
- map(function) - creates a new RDD by performing a function on each record in the base RDD
- filter(function) - creates a new RDD by including or excluding each record in the base RDD according to a boolean function
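A short sketch chaining both transformations in the Spark shell (file name hypothetical, as in the earlier example):

```scala
scala> val mydata = sc.textFile("purplecow.txt")

// map: create a new RDD with every line upper-cased
scala> val upper = mydata.map(line => line.toUpperCase)

// filter: create a new RDD keeping only lines that start with "I"
scala> val istarts = upper.filter(line => line.startsWith("I"))

// transformations are lazy; the count() action triggers execution
scala> istarts.count()
```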