Spark Structured API Flashcards

1
Q

What is the problem with Resilient Distributed Datasets (RDDs)?

A

RDDs are a low-level API, so Spark jobs written against them are hard to understand and maintain. Efficiency is also an issue: RDD operations are executed as written, without automatic optimization.
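A minimal sketch of the readability point, in plain Python rather than Spark. The sample records and the `reduce_by_key` helper are hypothetical stand-ins invented for illustration; they are not Spark calls. The sketch computes the same average twice, once with positional tuples in the low-level style and once with named columns.

```python
from collections import defaultdict
from functools import reduce

# (name, department, age) records addressed by position, as in low-level code.
records = [("ada", "eng", 36), ("bob", "eng", 44), ("cy", "ops", 30)]


def reduce_by_key(pairs, fn):
    """Hypothetical local stand-in for a keyed reduce; not a Spark call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


# Low-level style: the reader must track what r[1], r[2], a[0] and a[1] mean.
pairs = [(r[1], (r[2], 1)) for r in records]
sums = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_age_low_level = {dept: total / count for dept, (total, count) in sums.items()}

# Structured style: named columns state the intent directly.
rows = [{"name": n, "dept": d, "age": a} for n, d, a in records]
by_dept = defaultdict(list)
for row in rows:
    by_dept[row["dept"]].append(row["age"])
avg_age_structured = {dept: sum(ages) / len(ages) for dept, ages in by_dept.items()}
```

Both versions give the same result; the difference is how much bookkeeping the reader has to carry. The sketch says nothing about performance, which in Spark depends on the optimizer rather than on the coding style alone.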

2
Q

What is the Spark SQL Module?

A

Adds structure to data through high-level APIs: DataFrames, Datasets and SQL.

Provides benefits such as better readability, type checking and faster execution.

Works with the Catalyst Optimizer to improve performance.

3
Q

What are Resilient Distributed Datasets (RDDs)?

A

Spark's low-level API for distributed collections. Optimization is manual, and the code is harder to read and maintain.

4
Q

What are DataFrames?

A

Distributed tables with a schema (columns and rows), inspired by data frames in Python and R. Commonly used for their simplicity and performance.

5
Q

What are Datasets?

A

Distributed collections that add type safety by working with domain-specific types. Preferred when strict control over data and compile-time error checking are required. The typed Dataset API is available in Scala and Java.

6
Q

What is the Catalyst Optimizer?

A

Analyzes and optimizes structured queries, converting DataFrame, Dataset and SQL operations into an optimized plan that is executed as RDD operations.
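To make "optimizes structured queries" concrete, here is a toy rule-based rewrite in plain Python. It is a sketch of the general idea only, not Catalyst's implementation or API: the `Lit`, `Col` and `Add` node types and the `fold_constants` rule are hypothetical names invented for this example. The rule performs constant folding, one kind of rewrite a query optimizer can apply to an expression tree before execution.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    """A literal value in an expression tree (hypothetical node type)."""
    value: int


@dataclass(frozen=True)
class Col:
    """A reference to a named column (hypothetical node type)."""
    name: str


@dataclass(frozen=True)
class Add:
    """Addition of two sub-expressions (hypothetical node type)."""
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite Add(Lit, Lit) into a single Lit, working bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr


# x + (1 + 2) is rewritten to x + 3 before anything is executed.
plan = Add(Col("x"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
print(optimized)
```

The rewrite is possible because the query is described as a data structure that can be inspected. An arbitrary function passed to an RDD operation is opaque, so there is nothing comparable for an optimizer to rewrite.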

7
Q

What is type safety?

A

Type safety here means that type errors are caught at compile time, before the job runs.
DataFrames: no compile-time type checking; runtime errors are possible.
Datasets: type-safe; compile-time errors prevent invalid operations.
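The contrast can be sketched with an analogy in plain Python. This is not Spark code, and Python has no compile step, so a static checker such as mypy stands in for the compiler here. The `Person` type and both helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# DataFrame-like: a row is addressed by column-name strings.
row = {"name": "Ada", "age": 36}


def age_from_row(r: dict) -> int:
    # Typo in the column name: nothing flags it until this line runs.
    return r["agee"]


try:
    age_from_row(row)
    untyped_error = None
except KeyError as exc:
    # The mistake only surfaces at run time.
    untyped_error = exc


# Dataset-like: a row is a declared type with known fields.
@dataclass(frozen=True)
class Person:
    name: str
    age: int


def age_from_person(p: Person) -> int:
    # Writing p.agee here would be reported by a static checker
    # before the program is ever run.
    return p.age


typed_age = age_from_person(Person("Ada", 36))
```

In the string-keyed version the typo is found only when that line executes, which in a distributed job may be well into a long run. In the typed version the same mistake can be caught before execution.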

8
Q

Why are Structured APIs (DataFrames/Datasets/SQL) more optimized than RDDs?

A

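They go through the Catalyst Optimizer, which optimizes the query before it is executed. RDD code is executed as written.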
