Revature Spark 2 Flashcards
What is Spark SQL?
Spark SQL is a Spark module for structured data processing that allows querying data using SQL or the DataFrame API.
How does a broadcast join work in Spark?
A broadcast join sends a small dataset to all worker nodes, ensuring each node has the data locally to join with larger datasets.
Why are broadcast joins significantly faster than shuffle joins?
Broadcast joins avoid the costly shuffling of data across nodes by distributing the smaller dataset to each worker node.
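A minimal PySpark sketch (the DataFrames and sample data here are illustrative, not from the source):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US"), (2, "CA"), (3, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("US", "United States"), ("CA", "Canada")], ["country_code", "name"])

# broadcast() hints Spark to ship the small table to every executor,
# so the join runs locally on each partition of `orders` with no shuffle.
orders.join(broadcast(countries), on="country_code").show()
```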
How does Spark SQL evaluate a SQL query?
Spark SQL parses the query into a logical plan, optimizes it using the Catalyst optimizer, and converts it into a physical plan for execution.
What is the Catalyst optimizer?
The Catalyst optimizer is Spark’s query optimization engine that generates efficient execution plans for SQL queries and DataFrame operations.
Why are there multiple APIs to work with Spark SQL?
Different APIs, like SQL, DataFrames, and Datasets, cater to various user needs, from SQL-like querying to programmatic manipulation.
What are DataFrames?
DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.
What is the SparkSession?
The SparkSession is the entry point to using Spark SQL and provides access to the Spark environment.
Can we access the SparkContext via a SparkSession?
Yes, the SparkContext can be accessed through the SparkSession using the .sparkContext attribute.
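A minimal sketch of creating a SparkSession and reaching the SparkContext (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# getOrCreate() builds a new session or reuses an existing one.
spark = SparkSession.builder.appName("flashcards-demo").getOrCreate()

# The underlying SparkContext is exposed as an attribute.
sc = spark.sparkContext
print(sc.applicationId)
```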
What other contexts are superseded by SparkSession?
The SparkSession replaces the SQLContext and HiveContext in newer Spark versions.
What are some data formats we can query with Spark SQL?
Spark SQL supports formats like Parquet, ORC, Avro, JSON, CSV, and more.
Are DataFrames lazily evaluated, like RDDs?
Yes, DataFrames are lazily evaluated, meaning transformations are not executed until an action is triggered.
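A small sketch of lazy evaluation (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 250), (2, 80)], ["order_id", "amount"])

# Transformations only build up the plan; nothing executes yet.
projected = df.filter(df["amount"] > 100).select("order_id")

# An action triggers actual execution of the accumulated plan.
print(projected.count())  # 1
```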
What functions are available to us when using DataFrames?
Functions include select, filter, groupBy, join, agg, withColumn, drop, sort, and many more.
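Illustrative examples of several of these, using a hypothetical `df` (reused in the sketches on later cards):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("alice", "US", 120), ("bob", "US", 80), ("carol", "CA", 200)],
    ["name", "country", "amount"])

df.select("name", "amount")                        # project columns
df.filter(F.col("amount") > 100)                   # keep matching rows
df.withColumn("flag", F.col("amount") > 100)       # derive a column
df.drop("country")                                 # remove a column
df.sort(F.col("amount").desc())                    # order rows
df.groupBy("country").agg(F.sum("amount")).show()  # grouped aggregation
```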
What’s the difference between aggregate and scalar functions?
Aggregate functions operate on groups of rows to produce a single value, while scalar functions operate on individual values.
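Continuing with the hypothetical `df` from the previous card's sketch:

```python
from pyspark.sql import functions as F

# Scalar: one output value per input row.
df.select(F.upper(F.col("name")).alias("name_upper")).show()

# Aggregate: one output value per group of rows.
df.groupBy("country").agg(F.avg("amount").alias("avg_amount")).show()
```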
What is the return type of spark.sql('SELECT * FROM mytable')?
The return type is a DataFrame.
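For example (reusing `spark` and `df` from the earlier sketches):

```python
# Register the DataFrame as a temporary view, then query it with SQL.
df.createOrReplaceTempView("mytable")
result = spark.sql("SELECT * FROM mytable WHERE amount > 100")
print(type(result))  # a DataFrame
result.show()
```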
How do you add a new column to a DataFrame?
Use the withColumn function, e.g., df.withColumn('new_col', some_expression).
How do you rename a column in a DataFrame?
Use the withColumnRenamed function, e.g., df.withColumnRenamed('old_name', 'new_name').
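Both in one short sketch, reusing the hypothetical `df` (the tax multiplier is arbitrary):

```python
from pyspark.sql import functions as F

df2 = df.withColumn("amount_with_tax", F.col("amount") * 1.08)
df3 = df2.withColumnRenamed("name", "customer_name")
df3.show()
```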
What is the difference between inner, left outer, right outer, and full outer joins?
Inner joins return only rows that match in both datasets; left outer joins return all rows from the left plus matching rows from the right; right outer joins return all rows from the right plus matching rows from the left; full outer joins return all rows from both sides, filling in nulls where there is no match.
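A sketch of the four join types on two tiny made-up DataFrames (reusing `spark` from the earlier sketches):

```python
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l"])
right = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r"])

left.join(right, "id", "inner").show()  # id 2 only
left.join(right, "id", "left").show()   # ids 1, 2 (null r for 1)
left.join(right, "id", "right").show()  # ids 2, 3 (null l for 3)
left.join(right, "id", "full").show()   # ids 1, 2, 3
```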
What is a cross join / Cartesian join?
A cross join combines every row of one dataset with every row of another, resulting in the Cartesian product.
If I join two datasets with 10 records each, what is the maximum possible number of records in the output?
The maximum is 100, achieved through a cross join.
How many records would be in the output of a cross join/cartesian join?
The output would contain rows equal to the product of the number of rows in both datasets.
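A sketch confirming the 10 x 10 = 100 case (reusing `spark`):

```python
a = spark.range(10)  # one column named `id`
b = spark.range(10).withColumnRenamed("id", "id_b")

# Every row of `a` pairs with every row of `b`.
print(a.crossJoin(b).count())  # 100
```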
What is Parquet? ORC? Avro?
Parquet, ORC, and Avro are file formats for big data storage; Parquet and ORC are columnar, while Avro is row-based.
What does it mean that Parquet is columnar storage?
Columnar storage means data is stored column by column, optimizing for analytical queries by reading only the required columns.
How can we partition files in Spark?
Partition files using partitionBy when writing DataFrames, e.g., df.write.partitionBy('column').
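For example, reusing the hypothetical `df` (the output path is illustrative):

```python
# Each distinct value of `country` becomes its own directory,
# e.g. /tmp/orders_by_country/country=US/.
df.write.partitionBy("country").parquet("/tmp/orders_by_country")
```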
What are some benefits of storing your data in partitions?
Benefits include faster query performance, reduced data scanning, and parallel processing.
What is the lineage of an RDD?
The lineage of an RDD is its logical execution plan that tracks transformations to recover data if a partition is lost.
How can we see the Lineage?
Use the toDebugString method on an RDD to view its lineage.
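A small sketch (PySpark's toDebugString() returns bytes, hence the decode):

```python
rdd = spark.sparkContext.parallelize(range(100))
mapped = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)

# Prints the chain of transformations that produced this RDD.
print(mapped.toDebugString().decode())
```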
What is the logical plan? The physical plan?
The logical plan represents the high-level query structure, while the physical plan details the low-level execution strategy.
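For example, explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan Catalyst produced (reusing `df`):

```python
df.filter(df["amount"] > 100).explain(True)
```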
How many partitions does a single task work on?
A single task works on one partition at a time.
What’s the difference between cluster mode and client mode on YARN?
In cluster mode, the driver runs inside the cluster (on YARN, in the ApplicationMaster container); in client mode, the driver runs on the machine that submitted the application.
What is an executor? What are executors when we run Spark on YARN?
Executors are distributed worker processes responsible for running tasks and storing data; on YARN, executors run inside YARN containers allocated to the application.
What is AWS?
AWS (Amazon Web Services) is a cloud computing platform offering services like computing, storage, and databases.
EC2?
EC2 (Elastic Compute Cloud) provides scalable virtual servers for compute resources.
S3?
S3 (Simple Storage Service) is AWS’s object storage service for scalable, durable storage.
EMR?
EMR (Elastic MapReduce) is AWS’s managed service for big data processing frameworks like Spark and Hadoop.
What does it mean to run an EMR Step Execution?
Running an EMR Step Execution means executing a single processing job or set of tasks, like a Spark job, as part of an EMR cluster workflow.
What is the Spark History Server?
The Spark History Server provides a web UI to view and analyze completed Spark jobs.
What does it mean to ‘spill to disk’ when executing spark tasks?
Spilling to disk occurs when Spark’s in-memory data exceeds allocated memory, forcing it to write data to disk.
When during a Job do we need to pay attention to the number of partitions and adjust if necessary?
Adjust partitions when data skew or poor parallelism is observed, often during shuffles or key aggregations.
What is spark.driver.memory? What about spark.executor.memory?
spark.driver.memory specifies memory for the driver process, while spark.executor.memory specifies memory for each executor.
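A sketch of setting both (the sizes are arbitrary; tune them to your cluster):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-demo")
         .config("spark.driver.memory", "2g")    # driver JVM heap
         .config("spark.executor.memory", "4g")  # heap per executor
         .getOrCreate())

# Note: spark.driver.memory must be set before the driver JVM starts,
# so in practice it is often passed via spark-submit --driver-memory.
```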
What is a Spark Application? Job? Stage? Task?
A Spark Application is a complete user program; a Job is the work triggered by a single action; a Stage is a set of tasks that can run without a data shuffle; and a Task is the smallest unit of work, processing a single partition.
When do we cache an RDD?
We cache an RDD to reuse it in multiple computations, improving performance by avoiding recomputation.
What are Persistence Storage Levels in Spark?
Persistence Storage Levels define how RDDs or DataFrames are stored in memory and/or disk.
Some levels have _SER, what does this mean?
_SER indicates that data is serialized to reduce memory usage.
Some levels have _2, what does this mean?
_2 means two copies of data are stored for fault tolerance.
If the storage level for a persist is MEMORY_ONLY and there isn’t enough memory, what happens?
Partitions that don't fit in memory are simply not cached; they are recomputed from their lineage each time they are needed.
What is the storage level for .cache()?
For RDDs, .cache() uses MEMORY_ONLY; for DataFrames and Datasets, it uses MEMORY_AND_DISK.
Different ways to repartition?
Use repartition(), which performs a full shuffle and can increase or decrease the number of partitions, or coalesce(), which avoids a full shuffle but can only decrease it.
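For example, reusing `df` (the partition counts are arbitrary):

```python
df_more = df.repartition(8)  # full shuffle; can increase or decrease
df_less = df.coalesce(2)     # no full shuffle; can only decrease
print(df_more.rdd.getNumPartitions())  # 8
```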
How do we check the storage level?
Use the .storageLevel attribute on a DataFrame, or getStorageLevel() on an RDD.
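A sketch tying persistence and the storage-level check together (reusing `spark` and `df`):

```python
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)  # _2 = replicated on two nodes
print(rdd.getStorageLevel())

df.cache()  # DataFrame .cache() defaults to MEMORY_AND_DISK
print(df.storageLevel)
```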
What is ETL?
ETL (Extract, Transform, Load) is the process of extracting data, transforming it into a usable format, and loading it into a destination.
What is Audit, Balance, and Control?
Audit tracks and logs pipeline execution to ensure data accuracy and accountability, Balance verifies data integrity by reconciling counts and totals between source and target, and Control maintains proper governance over the pipeline's execution flow, including error handling.
What is a data warehouse?
A data warehouse is a centralized repository for storing structured data for reporting and analysis.
Give an example of a slowly changing dimension.
Customer address history is an example of a slowly changing dimension.
What about SCD Type 2?
SCD Type 2 tracks changes by creating a new record for each change with a version or effective date.
Difference between OLTP vs OLAP?
OLTP (Online Transaction Processing) is optimized for real-time transactions, while OLAP (Online Analytical Processing) is optimized for analytical queries.