Data Processing with Spark: Dataframe -1 Flashcards
What does RDD stand for in Apache Spark?
RDD stands for Resilient Distributed Dataset.
What is the key characteristic of RDDs that makes them resilient to failures?
RDDs are fault-tolerant, meaning they can automatically recover from failures by using lineage information to rebuild lost partitions.
Describe the immutability of RDDs in Spark.
Once created, RDDs cannot be modified. Any operation on an RDD creates a new RDD.
What is lazy evaluation in the context of RDDs?
Lazy evaluation means that transformations on RDDs are not executed immediately. Instead, Spark builds up a directed acyclic graph (DAG) of the transformations and performs them only when an action is invoked.
Name two types of operations that can be performed on RDDs in Spark.
Transformations and Actions.
Provide an example of a transformation operation in Spark RDDs.
An example of a transformation is the map operation, which applies a function to each element of the RDD and returns a new RDD.
What is the purpose of actions in Spark RDDs?
Actions in Spark RDDs trigger the execution of the RDD computations and return results to the driver program or write data to external storage systems.
How are RDDs distributed across a cluster in Apache Spark?
RDDs are divided into partitions, which are distributed across multiple nodes in a cluster so that each partition can be processed in parallel.
Can RDDs hold objects of any type?
Yes, RDDs can hold objects of any type, including user-defined classes.
What is the primary use case of RDDs in Apache Spark?
RDDs are the foundational data structure in Spark and are used for distributed data processing tasks, providing fault tolerance and parallel execution capabilities.