In Action Flashcards
What are the components of Spark
Spark Core, Spark SQL, Spark Streaming, Spark GraphX, and Spark MLlib
RDD
Resilient Distributed Dataset
How does Spark Streaming use DStreams
It uses DStreams to periodically create RDDs from the incoming data, one per batch interval (micro-batching)
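A minimal sketch of the micro-batch idea, assuming a local run and a socket source on port 9999 (both hypothetical): on every batch interval the DStream yields a new RDD of the lines received in that window.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
// A 1-second batch interval: a new RDD is created from the stream every second
val ssc = new StreamingContext(conf, Seconds(1))

// Each micro-batch of text lines from the (hypothetical) socket arrives as an RDD
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()
ssc.awaitTermination()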
Spark MLlib models use ______ to represent data
DataFrames
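A minimal sketch, assuming the DataFrame-based spark.ml API: estimators such as LinearRegression take a DataFrame with label and features columns. The toy data here is made up.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()
import spark.implicits._

// Toy DataFrame in the (label, features) shape spark.ml expects
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
).toDF("label", "features")

val model = new LinearRegression().fit(training)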
What are the data sources for Spark SQL
- Relational Databases
- NoSQL Databases
- Hive
- JSON
- Parquet Files
- Protocol Buffers
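A short sketch of reading a few of these sources through a SparkSession (the file paths and the JDBC URL are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DataSourcesSketch").getOrCreate()

// JSON and Parquet readers are built in; paths are hypothetical
val people = spark.read.json("people.json")
val events = spark.read.parquet("events.parquet")

// Relational databases are reached over JDBC (URL and table are assumptions)
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/shop")
  .option("dbtable", "orders")
  .load()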
What does DStreams stand for
Discretised streams
What are broadcast variables
Read-only variables that are shipped once to every executor and cached there, so they are available whenever tasks need them
How do you ship off broadcast variables
Just use sc.broadcast()
How do you retrieve broadcast variables
Use .value on the Broadcast object returned by sc.broadcast()
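A small sketch putting the two broadcast cards together, using a made-up lookup table:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("BroadcastSketch"))

// Hypothetical lookup table, shipped once to every executor
val entryNames = Map(1 -> "TMIN", 2 -> "TMAX", 3 -> "TAVG")
val broadcastNames = sc.broadcast(entryNames)

// Inside a task, .value retrieves the executor-local copy
val codes = sc.parallelize(Seq(1, 2, 3))
codes.map(code => broadcastNames.value(code)).collect().foreach(println)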
If you want to use map to parse a group of lines and there is a chance that some lines are missing, what would you do
You would use flatMap
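One common pattern, assuming lines is an RDD[String] and a hypothetical parseLine helper: returning an Option lets flatMap keep good records and silently drop the bad ones.

import scala.util.Try

// Hypothetical parser: Some(parsed) on success, None for a missing or malformed line
def parseLine(line: String): Option[(String, Float)] = {
  val fields = line.split(",")
  Try((fields(0), fields(2).toFloat)).toOption
}

val parsed = lines.flatMap(line => parseLine(line))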
What are the different deploy modes for the Spark standalone cluster
cluster deploy mode - the driver runs on the cluster
client deploy mode - the driver runs on the client machine
Given an RDD lines consisting of the tuple (stationId, entryType, temperature), where entryType is one of three values (TMIN, TMAX, or TAVG), how would you create an RDD with only TMIN entries
val mins = lines.filter(x => x._2 == "TMIN")
Given an RDD lines consisting of the tuple (stationId, entryType, temperature), how would you return a new RDD with only (stationId, temperature)
val newRdd = lines.map(x => (x._1, x._3.toFloat))
Given the key-value pair RDD stationTemps, consisting of a stationId and a temperature, how would you compute the minimum temperature for each station
stationTemps.reduceByKey((x, y) => math.min(x, y))
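A sketch chaining the last three cards together end to end (the input file name and CSV layout are assumptions):

// Assumed CSV layout: stationId,entryType,temperature
val parsed = sc.textFile("temperatures.csv").map { line =>
  val f = line.split(",")
  (f(0), f(1), f(2).toFloat)
}

val minTemps = parsed
  .filter(x => x._2 == "TMIN")
  .map(x => (x._1, x._3))
  .reduceByKey((x, y) => math.min(x, y))

minTemps.collect().foreach(println)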
If you want a word count job to run on a cluster, why would you not use countByValue
countByValue returns a plain Scala Map to the driver, not an RDD, so the result is no longer distributed. Use reduceByKey instead, which returns an RDD.
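A distributed word count sketch along those lines (the input file is hypothetical): reduceByKey keeps the counts in an RDD across the cluster instead of pulling a Map back to the driver.

val counts = sc.textFile("book.txt")
  .flatMap(line => line.split("\\W+"))
  .filter(_.nonEmpty)
  .map(word => (word.toLowerCase, 1))
  .reduceByKey((x, y) => x + y)

counts.take(10).foreach(println)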