RDD Flashcards
How to create an RDD ?
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Command to parallelize a collection ?
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
// or with an explicit number of partitions
val distData2 = sc.parallelize(data, 10)
What is the recommended number of partitions for an RDD ?
Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
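A minimal sketch of checking the partition count (assumes an existing SparkContext sc and Spark 1.6+, where RDD.getNumPartitions is available):
val data = Array(1, 2, 3, 4, 5)
val auto = sc.parallelize(data)        // partition count picked from cluster defaults
println(auto.getNumPartitions)
val manual = sc.parallelize(data, 10)  // partition count set explicitly
println(manual.getNumPartitions)       // 10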
What is a partition ?
One of the slices the dataset is cut into; the number of partitions conditions the amount of parallel work that can be done.
How to load an RDD from text files ?
val distFile = sc.textFile("data.txt") // return type: RDD[String]
Where can we load flat files from ?
sc.textFile(uri) takes a URI for the file: either a local path on the machine, or a hdfs://, s3n://, etc. URI.
What paths and patterns does sc.textFile accept ?
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well.
For example, you can use:
- textFile("/my/directory")
- textFile("/my/directory/*.txt")
- textFile("/my/directory/*.gz")
Describe the constraints on worker nodes when loading local files
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
How does Spark assign the number of partitions when loading text files ?
Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value as the second argument to textFile. Note that you cannot have fewer partitions than blocks.
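For example (a sketch, assuming a "data.txt" file; the second argument to textFile is a minimum number of partitions):
val distFile = sc.textFile("data.txt", 100)  // ask for at least 100 partitions
println(distFile.getNumPartitions)           // typically at least 100, never fewer than the block count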
What does SparkContext.wholeTextFiles do ?
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
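A sketch (the directory path is hypothetical):
val files = sc.wholeTextFiles("hdfs:///logs/small-files")   // RDD[(String, String)]
// each record is a whole file: (filename, content)
files.map { case (name, content) => (name, content.length) }.collect().foreach(println)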
What does RDD.saveAsObjectFile do ?
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
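A sketch with a hypothetical output path:
val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.saveAsObjectFile("hdfs:///tmp/ints-as-objects")               // writes serialized Java objects
val restored = sc.objectFile[Int]("hdfs:///tmp/ints-as-objects")  // element type given as a type parameter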
How to load data stored using Hadoop’s InputFormat ?
You can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).
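A sketch reading plain text through the "new" MapReduce API; the input path and the FileInputFormat input-directory property are assumptions for illustration (for the old API, use hadoopRDD with a JobConf instead):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/raw")  // hypothetical input path
val records = sc.newAPIHadoopRDD(
  conf,
  classOf[TextInputFormat],   // InputFormat from org.apache.hadoop.mapreduce
  classOf[LongWritable],      // key class: byte offset in the file
  classOf[Text])              // value class: the line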
How to load a SequenceFile ?
Use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
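A sketch with a hypothetical path (saveAsSequenceFile is available on RDDs of key/value pairs whose types can be converted to Writables):
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
pairs.saveAsSequenceFile("hdfs:///tmp/pairs-seq")
val back = sc.sequenceFile[Int, String]("hdfs:///tmp/pairs-seq")  // IntWritables/Texts read back as Int/String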
What are the two types of operations supported by RDDs ?
Transformations, which create a new dataset from an existing one (e.g. map)
Actions, which return a value to the driver program after running a computation on the dataset (e.g. reduce)
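For example (a sketch, file name assumed):
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)            // transformation: returns a new RDD
val totalLength = lineLengths.reduce((a, b) => a + b) // action: returns a value to the driver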
What is the execution model of RDD transformations ?
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
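To see the laziness in action (a sketch; toDebugString prints the recorded lineage without running anything):
val lines = sc.textFile("data.txt")   // returns immediately, no I/O yet
val lengths = lines.map(_.length)     // also lazy: only the lineage is recorded
println(lengths.toDebugString)        // inspect the remembered transformations
val total = lengths.reduce(_ + _)     // the action triggers the actual computation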