RDD Flashcards
How to create an RDD ?
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Command to parallelize a collection ?
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
// or with an explicit number of partitions
val distData2 = sc.parallelize(data, 10)
What is the recommended number of partitions for an RDD ?
Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
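A minimal sketch of checking the partition count (assumes an existing SparkContext sc and Spark 1.6+, where RDD.getNumPartitions is available):
val data = Array(1, 2, 3, 4, 5)
val auto = sc.parallelize(data)        // partition count picked from cluster defaults
println(auto.getNumPartitions)
val manual = sc.parallelize(data, 10)  // partition count set explicitly
println(manual.getNumPartitions)       // 10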
What is a partition ?
One of the slices the dataset is cut into; the number of partitions conditions the amount of parallel work that can be done.
How to load an RDD from text files ?
val distFile = sc.textFile("data.txt") // return type: RDD[String]
Where can we load flat files from ?
sc.textFile(uri) takes a URI for the file: either a local path on the machine, or a hdfs://, s3n://, etc. URI.
What paths and patterns does sc.textFile accept ?
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well.
For example, you can use:
- textFile("/my/directory")
- textFile("/my/directory/*.txt")
- textFile("/my/directory/*.gz")
Describe the constraints on worker nodes when loading local files
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
How does Spark assign the number of partitions when loading text files ?
Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value as the second argument to textFile. Note that you cannot have fewer partitions than blocks.
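For example (a sketch, assuming a "data.txt" file; the second argument to textFile is a minimum number of partitions):
val distFile = sc.textFile("data.txt", 100)  // ask for at least 100 partitions
println(distFile.getNumPartitions)           // typically at least 100, never fewer than the block count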
What does SparkContext.wholeTextFiles do ?
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
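A sketch (the directory path is hypothetical):
val files = sc.wholeTextFiles("hdfs:///logs/small-files")   // RDD[(String, String)]
// each record is a whole file: (filename, content)
files.map { case (name, content) => (name, content.length) }.collect().foreach(println)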
What does RDD.saveAsObjectFile do ?
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
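A sketch with a hypothetical output path:
val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.saveAsObjectFile("hdfs:///tmp/ints-as-objects")               // writes serialized Java objects
val restored = sc.objectFile[Int]("hdfs:///tmp/ints-as-objects")  // element type given as a type parameter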
How to load data stored using Hadoop’s InputFormat ?
You can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).
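A sketch reading plain text through the "new" MapReduce API; the input path and the FileInputFormat input-directory property are assumptions for illustration (for the old API, use hadoopRDD with a JobConf instead):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/raw")  // hypothetical input path
val records = sc.newAPIHadoopRDD(
  conf,
  classOf[TextInputFormat],   // InputFormat from org.apache.hadoop.mapreduce
  classOf[LongWritable],      // key class: byte offset in the file
  classOf[Text])              // value class: the line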
How to load a SequenceFile ?
Use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
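A sketch with a hypothetical path (saveAsSequenceFile is available on RDDs of key/value pairs whose types can be converted to Writables):
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
pairs.saveAsSequenceFile("hdfs:///tmp/pairs-seq")
val back = sc.sequenceFile[Int, String]("hdfs:///tmp/pairs-seq")  // IntWritables/Texts read back as Int/String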
What are the two types of operations supported by RDDs ?
Transformations, which create a new dataset from an existing one (e.g. map)
Actions, which return a value to the driver program after running a computation on the dataset (e.g. reduce)
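For example (a sketch, file name assumed):
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)            // transformation: returns a new RDD
val totalLength = lineLengths.reduce((a, b) => a + b) // action: returns a value to the driver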
What is the execution model of RDD transformations ?
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
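To see the laziness in action (a sketch; toDebugString prints the recorded lineage without running anything):
val lines = sc.textFile("data.txt")   // returns immediately, no I/O yet
val lengths = lines.map(_.length)     // also lazy: only the lineage is recorded
println(lengths.toDebugString)        // inspect the remembered transformations
val total = lengths.reduce(_ + _)     // the action triggers the actual computation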