Spark Structured API Flashcards
What is the problem with Resilient Distributed Datasets (RDDs)?
Resilient Distributed Datasets are a low-level API, so Spark jobs written with them are harder to understand and maintain. Efficiency is also an issue: RDD operations run as written, without automatic optimization.
What is the Spark SQL Module?
Adds structure to data through high-level APIs: DataFrames, Datasets and SQL.
Provides benefits such as better readability, type checking and faster execution.
Uses the Catalyst Optimizer to improve performance.
What are Resilient Distributed Datasets (RDDs)?
A low-level API where optimization is done manually; code is harder to read and maintain.
What are DataFrames?
Distributed tables with a schema (named columns and rows), inspired by the data frames of Python and R; commonly used for their simplicity and performance.
What are Datasets?
Add type safety by working with domain-specific types (available in Scala and Java); preferred when strict control over data and compile-time error checking are required.
What is the Catalyst Optimizer?
Analyzes and optimizes structured queries, converting DataFrame, Dataset and SQL operations into optimized RDD operations.
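Catalyst works by applying rewrite rules to a tree that represents the query. The sketch below imitates one such rule, constant folding, in plain Python. It is a toy illustration only: the classes `Lit`, `Col`, `Add` and the function `fold_constants` are hypothetical names invented here, not Spark's API.

```python
from dataclasses import dataclass

# Toy expression tree. These classes are hypothetical, not Spark's.
@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def fold_constants(expr):
    """Rewrite rule: replace Add(Lit, Lit) with a single Lit, bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    # Literals and column references are left unchanged.
    return expr

# price + (1 + 2) is rewritten to price + 3 before anything is executed.
plan = Add(Col("price"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
print(optimized)
```

The rewrite is possible only because the query is held as data (a tree) rather than as an opaque function.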
What is type safety?
DataFrames: No compile-time type checking; type errors surface at runtime.
Datasets: Type-safe; invalid operations are caught as compile-time errors.
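The contrast can be shown with a rough analogy in plain Python (not Spark code). A dictionary row stands in for a DataFrame row, where columns are looked up by string name, and a dataclass stands in for a typed Dataset record. Python has no compile step, so the "compile-time" side is played here by a static checker such as mypy. The names `get_column` and `Person` are hypothetical.

```python
from dataclasses import dataclass

# DataFrame-like: a column is a string resolved only when the code runs.
row = {"name": "Ana", "age": 31}

def get_column(row, column):
    return row[column]

try:
    get_column(row, "agee")  # typo, discovered only when this line executes
    runtime_error = None
except KeyError as exc:
    runtime_error = exc

# Dataset-like: fields are declared on a type, so a static checker can
# reject a misspelled field such as person.agee before the program runs.
@dataclass
class Person:
    name: str
    age: int

person = Person(name="Ana", age=31)
next_age = person.age + 1
```

In Scala or Java, the compiler itself performs the check that the static checker performs in this analogy.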
Why are Structured APIs (DataFrames/Datasets/SQL) more optimized than RDDs?
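The Catalyst Optimizer. Structured queries describe what to compute, so Spark can rewrite them into a more efficient plan before execution; RDD code runs as written.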
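A small plain-Python sketch of why a declarative plan helps: when a filter is described as data (a column and a value) rather than as an opaque function, an optimizer can move it ahead of an expensive step and process fewer rows. Everything here (`MapStep`, `FilterStep`, `optimize`, `execute`, `enrich`) is a hypothetical toy, not Spark's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MapStep:
    fn: Callable[[dict], dict]
    adds: frozenset  # columns this step creates

@dataclass
class FilterStep:
    column: str
    value: object

calls = {"enrich": 0}

def enrich(row):
    """Stand-in for an expensive per-row computation."""
    calls["enrich"] += 1
    return {**row, "score": row["amount"] * 2}

def optimize(plan):
    """Move a filter ahead of a preceding map that does not create its column."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, step = plan[i - 1], plan[i]
            if (isinstance(step, FilterStep) and isinstance(prev, MapStep)
                    and step.column not in prev.adds):
                plan[i - 1], plan[i] = step, prev
                changed = True
    return plan

def execute(plan, rows):
    for step in plan:
        if isinstance(step, MapStep):
            rows = [step.fn(r) for r in rows]
        else:
            rows = [r for r in rows if r[step.column] == step.value]
    return rows

rows = [{"country": "PT", "amount": 10},
        {"country": "BR", "amount": 20},
        {"country": "PT", "amount": 30}]
plan = [MapStep(enrich, frozenset({"score"})), FilterStep("country", "PT")]

result_as_written = execute(plan, rows)
calls_as_written = calls["enrich"]

calls["enrich"] = 0
result_optimized = execute(optimize(plan), rows)
calls_optimized = calls["enrich"]

print(calls_as_written, calls_optimized)  # 3 2
```

Both runs return the same rows, but the optimized plan calls the expensive step on two rows instead of three. With RDD-style code the filter would be an arbitrary function, and the engine could not safely reorder it.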