RDD Flashcards

1
Q

How to create an RDD?

A

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

2
Q

Command to parallelize a collection?

A

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// or with a specific number of partitions (here, 10)
val distData2 = sc.parallelize(data, 10)
3
Q

What is the recommended number of partitions for an RDD?

A

Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).

4
Q

What is a partition?

A

One of the slices the dataset is cut into. The number of partitions determines how much of the work can be done in parallel.
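A quick sketch (rdd.partitions.length reports how many slices an RDD was cut into):
val rdd = sc.parallelize(1 to 100, 4) // cut the data into 4 slices
println(rdd.partitions.length)        // 4 -- up to 4 tasks can run in parallel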

5
Q

How to load an RDD from text files?

A
val distFile = sc.textFile("data.txt")
// return type: RDD[String]
6
Q

Where can we load flat files from?

A

sc.textFile(uri) takes a URI for the file: either a local path on the machine, or an hdfs://, s3n://, etc. URI.
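For illustration, a few hedged examples (the paths, namenode host, and bucket name are made up):
val local = sc.textFile("/data/logs/app.log")                      // local path
val hdfs  = sc.textFile("hdfs://namenode:9000/data/logs/app.log")  // HDFS
val s3    = sc.textFile("s3n://my-bucket/logs/app.log")            // Amazon S3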

7
Q

What URIs can be given to sc.textFile to load local files?

A

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well.

For example, you can use:

  • textFile("/my/directory")
  • textFile("/my/directory/*.txt")
  • textFile("/my/directory/*.gz")
8
Q

Describe the constraints on worker nodes when loading local files

A

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

9
Q

How does Spark assign the number of partitions when loading text files?

A

Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
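For example (data.txt as in the earlier card; the second argument is optional):
val distFile = sc.textFile("data.txt", 10) // ask for at least 10 partitions instead of one per block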

10
Q

What does SparkContext.wholeTextFiles do?

A

SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
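A minimal sketch, assuming a hypothetical directory of small files:
val files = sc.wholeTextFiles("/my/directory") // RDD[(String, String)] of (filename, content) pairs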

11
Q

What does RDD.saveAsObjectFile do?

A

RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
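A round-trip sketch (the output directory name is made up):
val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("nums-obj")             // writes serialized Java objects
val restored = sc.objectFile[Int]("nums-obj") // restored: RDD[Int]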

12
Q

How to load data stored using Hadoop’s InputFormat?

A

You can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).
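A sketch of the old-API variant, reading plain text through Hadoop's TextInputFormat (the HDFS path is hypothetical):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

val conf = new JobConf()
FileInputFormat.setInputPaths(conf, "hdfs:///data/input")
val records = sc.hadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
// records: RDD[(LongWritable, Text)] -- byte offset and line content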

13
Q

How to load a SequenceFile?

A

Use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
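A minimal sketch, assuming a hypothetical SequenceFile with IntWritable keys and Text values:
val pairs = sc.sequenceFile[Int, String]("hdfs:///path/to/seqfile") // pairs: RDD[(Int, String)]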

14
Q

What are the two types of operations supported by RDDs?

A

Transformations, which create a new dataset from an existing one (e.g. map)

Actions, which return a value to the driver program after running a computation on the dataset (e.g. reduce)
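For example:
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)   // transformation: builds a new RDD, no work happens yet
val sum = doubled.reduce(_ + _) // action: runs the job and returns 30 to the driver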

15
Q

What is the execution model of RDD transformations?

A

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
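A small sketch of that map/reduce case (data.txt as in the earlier cards):
val lines = sc.textFile("data.txt") // nothing is read yet
val lengths = lines.map(_.length)   // still nothing computed
val total = lengths.reduce(_ + _)   // the action triggers the whole pipeline; only the final Int reaches the driver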

16
Q

How to pin a dataset in RAM?

A

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
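For example:
val lengths = sc.textFile("data.txt").map(_.length)
lengths.cache()       // same as persist() with the default MEMORY_ONLY storage level
lengths.reduce(_ + _) // first action computes the RDD and caches it
lengths.count()       // later actions reuse the in-memory copy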

17
Q

What are the two recommended ways to pass functions to the Spark API?

A

Anonymous function syntax, which can be used for short pieces of code.

Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)

Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method.
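For the first option, a one-line sketch of the anonymous function syntax (assuming myRdd is an RDD[String]):
myRdd.map(s => s.length)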

18
Q

What does rdd.collect() do?

A

Brings all elements of an RDD to the driver. This can cause the driver to run out of memory if the RDD is large.

19
Q

What does RDD.take(100) do?

A

Returns the first 100 elements of the RDD.

20
Q

What is an RDD?

A

A fault-tolerant collection of elements that can be operated on in parallel.