In Action Flashcards
What are the components of Spark
Spark Core, Spark SQL, Spark Streaming, Spark GraphX, and Spark MLlib
RDD
Resilient Distributed Dataset
How does Spark Streaming use DStreams
It uses DStreams to periodically create RDDs from the incoming data, one per batch interval (micro-batching)
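A minimal sketch of the micro-batch idea, assuming a local run and a socket source on port 9999 (both hypothetical): on every batch interval the DStream yields a new RDD of the lines received in that window.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
// A 1-second batch interval: a new RDD is created from the stream every second
val ssc = new StreamingContext(conf, Seconds(1))

// Each micro-batch of text lines from the (hypothetical) socket arrives as an RDD
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()
ssc.awaitTermination()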
Spark MLlib models use ______ to represent data
DataFrames
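A minimal sketch, assuming the DataFrame-based spark.ml API: estimators such as LinearRegression take a DataFrame with label and features columns. The toy data here is made up.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()
import spark.implicits._

// Toy DataFrame in the (label, features) shape spark.ml expects
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
).toDF("label", "features")

val model = new LinearRegression().fit(training)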
What are the data sources for Spark SQL
- Relational Databases
- NoSQL Databases
- Hive
- JSON
- Parquet Files
- Protocol Buffers
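A short sketch of reading a few of these sources through a SparkSession (the file paths and the JDBC URL are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DataSourcesSketch").getOrCreate()

// JSON and Parquet readers are built in; paths are hypothetical
val people = spark.read.json("people.json")
val events = spark.read.parquet("events.parquet")

// Relational databases are reached over JDBC (URL and table are assumptions)
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/shop")
  .option("dbtable", "orders")
  .load()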
What does DStreams stand for
Discretised streams
What are broadcast variables
Read-only variables that are shipped once to every executor and cached there, so they are available whenever tasks need them
How do you ship off broadcast variables
Just use sc.broadcast()
How do you retrieve broadcast variables
Use .value on the Broadcast object returned by sc.broadcast()
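A small sketch putting the two broadcast cards together, using a made-up lookup table:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("BroadcastSketch"))

// Hypothetical lookup table, shipped once to every executor
val entryNames = Map(1 -> "TMIN", 2 -> "TMAX", 3 -> "TAVG")
val broadcastNames = sc.broadcast(entryNames)

// Inside a task, .value retrieves the executor-local copy
val codes = sc.parallelize(Seq(1, 2, 3))
codes.map(code => broadcastNames.value(code)).collect().foreach(println)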
If you want to use map to parse a group of lines and there is a chance that some lines are missing, what would you do
You would use flatMap
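One common pattern, assuming lines is an RDD[String] and a hypothetical parseLine helper: returning an Option lets flatMap keep good records and silently drop the bad ones.

import scala.util.Try

// Hypothetical parser: Some(parsed) on success, None for a missing or malformed line
def parseLine(line: String): Option[(String, Float)] = {
  val fields = line.split(",")
  Try((fields(0), fields(2).toFloat)).toOption
}

val parsed = lines.flatMap(line => parseLine(line))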
What are the different deploy modes for the Spark standalone cluster
cluster deploy mode - the driver runs on the cluster
client deploy mode - the driver runs on the client machine
Given an RDD lines consisting of the tuple (stationId, entryType, temperature), where entryType is one of three values (TMIN, TMAX, or TAVG), how would you create an RDD with only TMIN entries
val mins = lines.filter(x => x._2 == "TMIN")
Given an RDD lines consisting of the tuple (stationId, entryType, temperature), how would you return a new RDD with only (stationId, temperature)
val newRdd = lines.map(x => (x._1, x._3.toFloat))
Given the key-value pair RDD stationTemps, consisting of a stationId and a temperature, how would you compute the minimum temperature for each station
stationTemps.reduceByKey((x, y) => math.min(x, y))
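A sketch chaining the last three cards together end to end (the input file name and CSV layout are assumptions):

// Assumed CSV layout: stationId,entryType,temperature
val parsed = sc.textFile("temperatures.csv").map { line =>
  val f = line.split(",")
  (f(0), f(1), f(2).toFloat)
}

val minTemps = parsed
  .filter(x => x._2 == "TMIN")
  .map(x => (x._1, x._3))
  .reduceByKey((x, y) => math.min(x, y))

minTemps.collect().foreach(println)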
If you want a word count job to run on a cluster, why would you not use countByValue
countByValue returns a plain Scala Map to the driver, not an RDD, so the result is no longer distributed. Use reduceByKey instead, which returns an RDD.
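A distributed word count sketch along those lines (the input file is hypothetical): reduceByKey keeps the counts in an RDD across the cluster instead of pulling a Map back to the driver.

val counts = sc.textFile("book.txt")
  .flatMap(line => line.split("\\W+"))
  .filter(_.nonEmpty)
  .map(word => (word.toLowerCase, 1))
  .reduceByKey((x, y) => x + y)

counts.take(10).foreach(println)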