Spark Structured API Flashcards

Question 1

Q

What is the problem with Resilient Distributed Datasets (RDDs)?

Answer

A

RDDs are low-level objects, which makes Spark jobs written with them hard to understand and maintain. Efficiency is also an issue: RDD operations are executed as written, without automatic optimization.

Question 2

Q

What is the Spark SQL Module?

Answer

A

Adds structure to data through high-level APIs: DataFrames, Datasets, and SQL.

Provides benefits such as better readability, type checking, and faster execution.

Works with the Catalyst Optimizer to improve performance.

Question 3

Q

What are Resilient Distributed Datasets (RDDs)?

Answer

A

Spark's low-level API. Optimization is manual, and the code is harder to read and maintain.

Question 4

Q

What are DataFrames?

Answer

A

Distributed tables with a schema (named columns and rows), inspired by data frames in Python and R. Commonly used because of their simplicity and performance.

Question 5

Q

What are Datasets?

Answer

A

A structured API that adds type safety by working with domain-specific types (available in Scala and Java). Preferred when strict control over data and compile-time error checking are required.

Question 6

Q

What is the Catalyst Optimizer?

Answer

A

Spark's query optimizer. It analyzes and optimizes structured queries, compiling DataFrame, Dataset, and SQL code into an optimized execution plan that runs on RDDs.
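
To make the idea of rewriting a structured query concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the plan classes (`Scan`, `Project`, `Filter`) and the function `push_filter_below_project` are hypothetical names invented for illustration. It shows one rule in the spirit of predicate pushdown, where a filter is moved closer to the data source so fewer rows flow through later steps.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy logical-plan nodes (hypothetical; these are NOT Spark classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: object
    child: "Plan"


Plan = Union[Scan, Project, Filter]


def push_filter_below_project(plan):
    """Rewrite Filter(Project(x)) as Project(Filter(x)) when the
    filtered column is one of the projected columns."""
    if isinstance(plan, Filter):
        child = push_filter_below_project(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = Filter(plan.column, plan.op, plan.value, child.child)
            return Project(child.columns, push_filter_below_project(pushed))
        return Filter(plan.column, plan.op, plan.value, child)
    if isinstance(plan, Project):
        return Project(plan.columns, push_filter_below_project(plan.child))
    return plan  # Scan: nothing to rewrite


# Query as written: select name and age, then keep rows with age > 30.
query = Filter("age", ">", 30, Project(("name", "age"), Scan("people")))
optimized = push_filter_below_project(query)
print(optimized)
```

After the rewrite, the filter sits directly above the scan and the projection is applied last. A rule like this is only possible because the query is described as data (a plan tree) rather than as an arbitrary function, which is the property the structured APIs give the optimizer.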

Question 7

Q

What is type safety?

Answer

A

The guarantee that operations are checked against the declared data types before the job runs.
DataFrames: no compile-time type checking; runtime errors are possible.
Datasets: type-safe; compile-time errors prevent invalid operations.
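
The difference can be sketched with a plain-Python analogy. This is not Spark code, and Python itself is dynamically typed, so treat it only as an illustration: `Person`, `select_column`, and `ages` are hypothetical names. Looking up a column by string name (DataFrame-style) fails only when the line runs, while field access on a declared type (Dataset-style) is something a static checker such as mypy can verify before execution.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Person:  # hypothetical domain type
    name: str
    age: int


# DataFrame-style: rows are untyped records, columns are named by strings.
rows: List[Dict[str, object]] = [{"name": "Ada", "age": 36}]


def select_column(rows, column):
    # The column name is only checked when this code actually runs.
    return [row[column] for row in rows]


# Dataset-style: fields belong to a declared type, so a static checker
# would reject a misspelled attribute such as p.agee before execution.
people: List[Person] = [Person("Ada", 36)]


def ages(people: List[Person]) -> List[int]:
    return [p.age for p in people]


try:
    select_column(rows, "agee")  # typo in the column name
    typo_caught_at_runtime = False
except KeyError:
    typo_caught_at_runtime = True

print(typo_caught_at_runtime, ages(people))
```

In Spark the typed side of this contrast is provided by the compiler of the host language, which is why the compile-time guarantee applies to Datasets rather than DataFrames.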

Question 8

Q

Why are the Structured APIs (DataFrames/Datasets/SQL) better optimized than RDDs?

Answer

A

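The Catalyst Optimizer. Structured queries expose their schema and operations to Spark, so Catalyst can rewrite them before execution (for example, by filtering rows earlier or reading only the columns that are needed). RDD code consists of arbitrary functions that Spark runs as written.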
