Data Processing with Spark: Dataframe -1 Flashcards
What does RDD stand for in Apache Spark?
RDD stands for Resilient Distributed Dataset.
What is the key characteristic of RDDs that makes them resilient to failures?
RDDs are fault-tolerant, meaning they can automatically recover from failures by using lineage information to rebuild lost partitions.
Describe the immutability of RDDs in Spark.
Once created, RDDs cannot be modified. Any operation on an RDD creates a new RDD.
What is lazy evaluation in the context of RDDs?
Lazy evaluation means that transformations on RDDs are not executed immediately. Instead, Spark builds up a directed acyclic graph (DAG) of the transformations and performs them only when an action is invoked.
Name two types of operations that can be performed on RDDs in Spark.
Transformations and Actions.
Provide an example of a transformation operation in Spark RDDs.
An example of a transformation is the map operation, which applies a function to each element of the RDD and returns a new RDD.
What is the purpose of actions in Spark RDDs?
Actions in Spark RDDs trigger the execution of the RDD computations and return results to the driver program or write data to external storage systems.
How are RDDs distributed across a cluster in Apache Spark?
RDDs are divided into partitions, which are distributed across multiple nodes in a cluster so that each partition can be processed in parallel.
Can RDDs hold objects of any type?
Yes, RDDs can hold objects of any type, including user-defined classes.
What is the primary use case of RDDs in Apache Spark?
RDDs are the foundational data structure in Spark and are used for distributed data processing tasks, providing fault tolerance and parallel execution capabilities.