Data Processing with Spark: Dataframe -1 Flashcards

1
Q

What does RDD stand for in Apache Spark?

A

RDD stands for Resilient Distributed Dataset.
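
For illustration, a minimal PySpark sketch of creating an RDD; it assumes a local Spark installation, and the app name, sample data, and file path are only placeholders.

```python
from pyspark import SparkContext

# A local SparkContext; the app name is arbitrary.
sc = SparkContext("local[*]", "rdd-flashcards")

# Two common ways to create an RDD:
numbers = sc.parallelize([1, 2, 3, 4, 5])   # from an in-memory collection
# lines = sc.textFile("data/input.txt")     # from a file (path shown only as an illustration)

print(numbers.collect())
```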

2
Q

What is the key characteristic of RDDs that makes them resilient to failures?

A

RDDs are fault-tolerant, meaning they can automatically recover from failures by using lineage information to rebuild lost partitions.
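
As a rough illustration of lineage, the sketch below (assuming an existing local SparkContext and made-up data) prints the chain of transformations Spark would replay to rebuild a lost partition.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

base = sc.parallelize(range(10))              # original dataset
doubled = base.map(lambda x: x * 2)           # recorded transformation
evens = doubled.filter(lambda x: x % 4 == 0)  # recorded transformation

# The lineage: the recipe Spark uses to recompute lost partitions.
print(evens.toDebugString().decode())
```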

3
Q

Describe the immutability of RDDs in Spark.

A

Once created, an RDD cannot be modified. Transformations applied to an RDD do not change it; they return a new RDD derived from the original.
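
A small sketch of this behavior, assuming a local SparkContext and illustrative data: the original RDD is unchanged after a map.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

numbers = sc.parallelize([1, 2, 3, 4])
squared = numbers.map(lambda x: x * x)   # returns a *new* RDD

print(numbers.collect())  # [1, 2, 3, 4]  -- the original is untouched
print(squared.collect())  # [1, 4, 9, 16] -- the result lives in a separate RDD
```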

4
Q

What is lazy evaluation in the context of RDDs?

A

Lazy evaluation means that transformations on RDDs are not executed immediately. Instead, Spark builds up a directed acyclic graph (DAG) of the transformations and performs them only when an action is invoked.
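
A minimal sketch of lazy evaluation, assuming a local SparkContext and made-up log lines: the transformations only record steps, and the action at the end runs the whole chain.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

logs = sc.parallelize(["INFO ok", "ERROR boom", "INFO ok", "ERROR crash"])

# Nothing is computed here; Spark only records these steps in the DAG.
errors = logs.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split(" ", 1)[1])

# The action below triggers execution of the recorded transformations.
print(messages.count())  # 2
```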

5
Q

Name two types of operations that can be performed on RDDs in Spark.

A

Transformations and Actions.
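
A compact contrast of the two, under the same local-SparkContext assumption as the earlier sketches: a transformation returns another RDD lazily, while an action runs the job and returns a value.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1, 2, 3])

tripled = rdd.map(lambda x: x * 3)  # transformation: returns another RDD, evaluated lazily
total = tripled.sum()               # action: executes the job and returns a value
print(total)                        # 18
```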

6
Q

Provide an example of a transformation operation in Spark RDDs.

A

An example of a transformation is the map operation, which applies a function to each element of the RDD and returns a new RDD.
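
For illustration, a minimal map example assuming a local SparkContext; the data and variable names are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

words = sc.parallelize(["spark", "rdd", "map"])
lengths = words.map(len)   # transformation: one output element per input element

print(lengths.collect())   # [5, 3, 3]
```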

7
Q

What is the purpose of actions in Spark RDDs?

A

Actions in Spark RDDs trigger the execution of the RDD computations and return results to the driver program or write data to external storage systems.
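
A few common actions, sketched under the same local-SparkContext assumption; the output path in the comment is purely illustrative.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1, 101))

print(rdd.count())                     # action: returns a number to the driver
print(rdd.take(3))                     # action: returns the first elements to the driver
print(rdd.reduce(lambda a, b: a + b))  # action: aggregates to a single value

# Actions can also write to external storage,
# e.g. rdd.saveAsTextFile("/tmp/output")  (path shown only as an illustration).
```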

8
Q

How are RDDs distributed across a cluster in Apache Spark?

A

An RDD is split into partitions, and those partitions are distributed across the nodes of the cluster so that they can be processed in parallel.
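
A rough sketch of partitioning, assuming a local SparkContext; on a real cluster the partitions would be scheduled onto different executors.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Ask for 4 partitions explicitly.
rdd = sc.parallelize(range(12), numSlices=4)

print(rdd.getNumPartitions())   # 4
print(rdd.glom().collect())     # the elements grouped per partition
```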

9
Q

Can RDDs hold objects of any type?

A

Yes, RDDs can hold objects of any type, including user-defined classes.
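
A small sketch with mixed element types, assuming a local SparkContext; user-defined classes work the same way, provided the instances are serializable and the class definition is available to the executors (for example, shipped in a module).

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A single RDD can hold arbitrary Python objects: numbers, strings,
# tuples, dicts, and so on (elements just need to be serializable).
mixed = sc.parallelize([42, "spark", ("a", 1), {"sensor": "b", "reading": 2.0}])

print(mixed.map(lambda x: type(x).__name__).collect())
# ['int', 'str', 'tuple', 'dict']
```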

10
Q

What is the primary use case of RDDs in Apache Spark?

A

RDDs are the foundational data structure in Spark and are used for distributed data processing tasks, providing fault tolerance and parallel execution capabilities.
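
Tying the pieces together, a small word-count sketch under the same local-SparkContext assumption, with made-up input lines: transformations build the computation and the final action runs it in parallel across partitions.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.parallelize([
    "spark makes distributed processing simple",
    "rdds make distributed processing fault tolerant",
])

counts = (lines.flatMap(lambda line: line.split())   # transformation
               .map(lambda word: (word, 1))          # transformation
               .reduceByKey(lambda a, b: a + b))     # transformation

print(sorted(counts.collect()))                      # action triggers the job
```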
