Revature Spark 2 Flashcards
What is Spark SQL?
Spark SQL is a Spark module for structured data processing that allows querying data using SQL or the DataFrame API.
How does a broadcast join work in Spark?
A broadcast join sends a small dataset to all worker nodes, ensuring each node has the data locally to join with larger datasets.
Why are broadcast joins significantly faster than shuffle joins?
Broadcast joins avoid the costly shuffling of data across nodes by distributing the smaller dataset to each worker node.
How does Spark SQL evaluate a SQL query?
Spark SQL parses the query into a logical plan, optimizes it using the Catalyst optimizer, and converts it into a physical plan for execution.
What is the Catalyst optimizer?
The Catalyst optimizer is Spark’s query optimization engine that generates efficient execution plans for SQL queries and DataFrame operations.
Why are there multiple APIs to work with Spark SQL?
Different APIs, like SQL, DataFrames, and Datasets, cater to various user needs, from SQL-like querying to programmatic manipulation.
What are DataFrames?
DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.
What is the SparkSession?
The SparkSession is the entry point to using Spark SQL and provides access to the Spark environment.
Can we access the SparkContext via a SparkSession?
Yes, the SparkContext can be accessed through the SparkSession using the .sparkContext attribute.
What other contexts are superseded by SparkSession?
The SparkSession replaces the SQLContext and HiveContext in newer Spark versions.
What are some data formats we can query with Spark SQL?
Spark SQL supports formats like Parquet, ORC, Avro, JSON, CSV, and more.
Are DataFrames lazily evaluated, like RDDs?
Yes, DataFrames are lazily evaluated, meaning transformations are not executed until an action is triggered.
What functions are available to us when using DataFrames?
Functions include select, filter, groupBy, join, agg, withColumn, drop, sort, and many more.
What’s the difference between aggregate and scalar functions?
Aggregate functions operate on groups of rows to produce a single value, while scalar functions operate on individual values.
What is the return type of spark.sql('SELECT * FROM mytable')?
The return type is a DataFrame.
How do you add a new column to a DataFrame?
Use the withColumn function, e.g., df.withColumn('new_col', some_expression).
How do you rename a column in a DataFrame?
Use the withColumnRenamed function, e.g., df.withColumnRenamed('old_name', 'new_name').
What is the difference between inner, left outer, right outer, and full outer joins?
Inner joins return matching rows, left outer joins return all rows from the left and matching ones from the right, right outer joins return all rows from the right and matching ones from the left, and full outer joins return all rows from both sides.
What is a cross join / Cartesian join?
A cross join combines every row of one dataset with every row of another, resulting in the Cartesian product.
If I join two datasets with 10 records each, what is the maximum possible number of records in the output?
The maximum is 100, achieved through a cross join.
How many records would be in the output of a cross join / Cartesian join?
The output would contain rows equal to the product of the number of rows in both datasets.
What is Parquet? ORC? Avro?
Parquet, ORC, and Avro are file formats for big data storage; Parquet and ORC are columnar, while Avro is row-based.
What does it mean that Parquet is a columnar storage format?
Columnar storage means data is stored column by column, optimizing for analytical queries by reading only the required columns.
How can we partition files in Spark?
Partition files using partitionBy when writing DataFrames, e.g., df.write.partitionBy('column').