Spark Flashcards
Can you explain what happens when a PySpark program is executed?
When a PySpark program is submitted, the first step is to construct a logical execution plan, represented as a DAG. This DAG captures the sequence of transformations and actions defined in the program.
What is a stage within the context of PySpark?
A stage is a collection of tasks that can be executed in parallel. PySpark optimizes the execution of transformations by dividing them into stages based on data dependencies. Stages are typically delineated by shuffle operations, such as joins or aggregations, where data needs to be exchanged between partitions.
How does PySpark determine the execution plan and break it down into stages?
PySpark’s Catalyst optimizer analyzes the logical execution plan and applies optimizations to generate an optimized physical execution plan. This plan is then broken down into stages based on data dependencies and transformations that can be performed in parallel.
What role do tasks play in the execution of a PySpark program?
Tasks are the smallest units of work in PySpark. Each stage consists of one or more tasks that are executed on partitions of the input data. Tasks perform transformations or actions on the data and are executed in parallel across the cluster.
Walk through the sequence of events that occur when a PySpark program is submitted.
When a PySpark program is submitted, the driver node creates a SparkContext, which coordinates the execution of the program. The program defines a series of transformations and actions on RDDs or DataFrames, which are translated into a logical execution plan represented as a DAG. This DAG is optimized by the Catalyst optimizer to generate an optimized physical execution plan. The plan is then broken down into stages, each consisting of tasks that are executed in parallel across the cluster.
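A minimal sketch of that flow (the names and the aggregation are illustrative): transformations only build the plan, explain() shows what Catalyst produced, and the action triggers stage and task execution.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The driver creates a SparkSession (which wraps the SparkContext).
spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations only build the logical plan -- nothing runs yet.
df = spark.range(1_000_000)
filtered = df.filter(F.col("id") % 2 == 0)
aggregated = filtered.groupBy((F.col("id") % 10).alias("bucket")).count()

# Inspect the plan Catalyst produced.
aggregated.explain()

# The action triggers stage/task creation and execution on the cluster.
aggregated.show()
```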
What would be your initial steps in troubleshooting a slow Spark job?
Check the Spark UI to gather information about the job’s execution, including task progress, stage durations, and resource utilization. This can help identify bottlenecks and performance issues within the job.
What are some specific metrics or indicators you’d look for in the Spark UI when analysing a slow job?
Task duration
Shuffle read/write times
Executor CPU
Memory utilization
Garbage collection activity
These metrics can provide insights into potential performance bottlenecks, such as data skew, resource contention, or inefficient task execution.
What steps would you take if you suspect resource contention as the cause of the slowdown?
Examine executor CPU and memory utilization to identify any resource bottlenecks.
Increasing executor memory or adjusting the number of executors can help alleviate resource contention and improve job performance.
Additionally, optimizing resource allocation and task scheduling parameters in the Spark configuration can further optimize resource utilization.
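A hedged sketch of what adjusting those knobs can look like; the config keys are standard Spark settings, but the values are purely illustrative.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune them against what the Spark UI shows.
spark = (
    SparkSession.builder
    .appName("resource-tuning-demo")
    .config("spark.executor.memory", "8g")      # heap per executor
    .config("spark.executor.cores", "4")        # cores per executor
    .config("spark.executor.instances", "10")   # number of executors
    # Optionally let Spark scale executor count with demand
    # (requires shuffle tracking or an external shuffle service).
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```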
Suppose you’ve optimized resource utilization, but the job is still running slower than expected. What other factors would you consider?
Investigate inefficient data processing logic, data skew, and suboptimal partitioning strategies.
Analyzing the job’s DAG and execution plan can provide insights into the data processing flow and identify opportunities for optimization.
How would you ensure the stability and reliability of a Spark job after optimization?
After troubleshooting and optimization, validate the stability and reliability of the Spark job under varying workload conditions and data scenarios. This includes performance testing, stress testing, and fault tolerance testing.
How do you troubleshoot memory-related issues in Spark jobs?
Check logs and WebUI
Tune Executor and Driver Memory, and Garbage Collection
Optimise Partitions / Data Skew
Look for large intermediate results
Avoid shuffling and wide transforms (reduceByKey and join)
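A hedged sketch of memory-related settings (values are illustrative; driver memory is normally passed on spark-submit, before the driver JVM starts).

```python
from pyspark.sql import SparkSession

# Illustrative memory settings -- validate against executor metrics and GC logs.
spark = (
    SparkSession.builder
    .appName("memory-tuning-demo")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "1g")   # off-heap / native overhead
    .config("spark.memory.fraction", "0.6")          # heap share for execution + storage
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  # GC choice
    .getOrCreate()
)
```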
What options are available in the WebUI for Memory related tuning?
Storage Tab: Shows RDDs cached in memory, their size, and their storage levels.
Stages Tab: Shows task-level memory usage during execution.
Executors Tab: Displays memory and garbage collection statistics for each executor.
Explain how Spark DAG works. Can you explain what happens under the hood?
A DAG (directed acyclic graph) is a data structure representing a series of tasks and their dependencies.
The DAG utilises lazy execution. That is, operations are defined, but not executed until an action like count() is triggered.
Step 1: DAG Creation
A logical plan of all required transformations.
Step 2: Stage Division
The DAG is divided into stages, usually at shuffle boundaries introduced by wide dependencies like groupBy and reduceByKey.
Step 3: Task Creation
Each stage is divided into multiple tasks based on data partitions. Each task operates on a partition of data, and tasks can run in parallel.
Step 4: Task Scheduling and Execution
Spark’s task scheduler submits tasks to executors, which the cluster manager has allocated across the cluster.
Step 5: Result Computation
After tasks finish executing, the results are sent back to the driver.
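A small sketch tying the steps above together, using the RDD API so the lineage is visible: the wide reduceByKey introduces the shuffle boundary that splits the job into stages, and nothing executes until the action.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-stages-demo").getOrCreate()
sc = spark.sparkContext

# Narrow transformation (map) stays within a stage; the wide
# transformation (reduceByKey) introduces a shuffle boundary.
rdd = sc.parallelize(range(100), numSlices=4)
pairs = rdd.map(lambda x: (x % 10, 1))          # narrow
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide -> new stage

# The lineage shows the shuffle that splits the job into stages.
print(counts.toDebugString().decode())

# Nothing has run yet (lazy execution); the action below triggers it.
print(counts.collect())
```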
What is the difference between narrow and wide transformations?
Narrow transformations, like map(), filter() and union(), operate on a single partition of data. They don’t require any data to be shuffled across partitions.
Wide transformations, like join() and groupByKey(), require data to be shuffled between partitions.
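A short DataFrame sketch of the distinction (column names are illustrative); the wide transformation’s physical plan contains an Exchange (shuffle) operator, while the narrow one’s does not.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()

df = spark.range(1000).withColumn("key", F.col("id") % 10)

# Narrow: each output partition depends on a single input partition.
narrow = df.filter(F.col("id") > 100).withColumn("doubled", F.col("id") * 2)

# Wide: rows with the same key must be shuffled onto the same partition.
wide = df.groupBy("key").agg(F.sum("id").alias("total"))

# Compare the physical plans: only the wide one shows an Exchange.
narrow.explain()
wide.explain()
```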
How to best handle small files in Spark?
Processing lots of small files incurs significant overhead in task scheduling and launching, and can lead to excessive shuffling and under-utilisation of executors.
Solutions:
Merge files before processing, if you can, perhaps converting them to a columnar format too.
Maybe use coalesce() to consolidate files into fewer partitions.
Increase the spark.sql.files.maxPartitionBytes setting to allow Spark to read larger chunks of data.
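A hedged sketch combining those ideas; the paths are hypothetical and the maxPartitionBytes value is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

# Let Spark pack more data into each input partition when reading
# many small files (illustrative value; the default is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

# Hypothetical input path containing many small JSON files.
df = spark.read.json("/data/landing/small-files/")

# Compact to fewer partitions (no full shuffle) and rewrite as Parquet,
# a columnar format that is far cheaper to re-read later.
df.coalesce(8).write.mode("overwrite").parquet("/data/curated/compacted/")
```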
How do you broadcast a variable, and when should you NOT use it?
Broadcasting a variable means making that variable available to all worker nodes without needing to send it over the network each time it is used.
Use the broadcast() function from the SparkContext
Avoid broadcasting for large variables, frequently changing variables, and small clusters where the overhead may outweigh the benefits.
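A minimal sketch using a small, read-only lookup table as the broadcast candidate (the data is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Small lookup table -- a good broadcast candidate because it is tiny
# and read-only for the life of the job.
country_lookup = {"UK": "United Kingdom", "FR": "France", "DE": "Germany"}
bc_lookup = sc.broadcast(country_lookup)

rdd = sc.parallelize([("UK", 10), ("FR", 7), ("DE", 3)])

# Each task reads the broadcast value locally instead of shipping the
# dictionary inside every task's closure.
expanded = rdd.map(lambda kv: (bc_lookup.value[kv[0]], kv[1]))
print(expanded.collect())
```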
What happens if you cache a DataFrame but don’t have enough memory?
Depending on configuration:
Spills to disk incurring a big performance hit
Other operations in the job get throttled, or might fail
OutOfMemoryError, likely crashing the job
What’s the difference between persist(StorageLevel.MEMORY_AND_DISK) and cache()?
Both commonly used to store intermediate data but:
cache() uses the default storage level (for DataFrames, memory with disk fallback)
persist() allows you to choose the storage level explicitly, fine-tuning where the result is kept (memory only, memory and disk, disk only, etc.)
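A short sketch of the difference (DISK_ONLY chosen purely for illustration):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
df = spark.range(1_000_000)

# cache() uses the default storage level for DataFrames
# (memory, spilling to disk if it does not fit).
df.cache()

# persist() lets you choose the storage level explicitly,
# e.g. keep it only on disk to spare executor memory.
df_disk = spark.range(1_000_000).persist(StorageLevel.DISK_ONLY)

df.count()        # actions materialise the cached data
df_disk.count()

df.unpersist()
df_disk.unpersist()
```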
How does Spark SQL optimize query execution?
Rule-based and cost-based optimisation via the Catalyst optimizer
Predicate and aggregation pushdown, to reduce the amount of data processed as close to the source as possible
Adaptive Query Execution (AQE), which refines the plan during execution using runtime statistics gathered at shuffle boundaries and reported to the driver, re-engaging Catalyst to re-optimise the remaining plan
There are other techniques too, like shuffle optimisation and partition pruning…
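A small sketch showing two of these in action, assuming a hypothetical Parquet source: AQE is enabled explicitly (it is on by default in recent versions) and the filter can be pushed down to the scan.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-optimizer-demo").getOrCreate()

# Adaptive Query Execution re-optimises the plan at shuffle boundaries
# using runtime statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Hypothetical Parquet source; the filter below can be pushed down so
# only matching data is read from the files.
orders = spark.read.parquet("/data/orders/")
recent = orders.filter(F.col("order_date") >= "2024-01-01")

# explain() shows PushedFilters on the scan and AdaptiveSparkPlan at the top.
recent.explain()
```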
Why does Spark shuffle data, and how can you reduce shuffling?
Shuffling is a “symptom” of partitioning: sometimes data needs to be brought back together from multiple partitions. Spark needs to ensure that data with the same relevant key ends up on the same node for processing, e.g. aggregations.
To reduce:
Avoid Wide Transformations When Possible
Bucketing ensures that the data is pre-shuffled into fixed-size buckets based on a column, so Spark can process the data more efficiently during a join (see the sketch after this list)
Careful use of Repartition (a full shuffle) and Coalesce (to reduce partitions)
Cache or Persist Intermediate Results
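A hedged sketch of the bucketing idea from the list above; table names, paths, and the bucket count are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-demo").getOrCreate()

# Hypothetical source tables that are joined repeatedly on customer_id.
orders = spark.read.parquet("/data/orders/")
customers = spark.read.parquet("/data/customers/")

# Write both sides bucketed by the join key; Spark can then join the
# bucketed tables without a full shuffle.
orders.write.bucketBy(32, "customer_id").sortBy("customer_id") \
    .mode("overwrite").saveAsTable("orders_bucketed")
customers.write.bucketBy(32, "customer_id").sortBy("customer_id") \
    .mode("overwrite").saveAsTable("customers_bucketed")

joined = spark.table("orders_bucketed").join(
    spark.table("customers_bucketed"), "customer_id"
)
joined.explain()  # should show no Exchange on the join key when bucket counts match
```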
Give a use case for repartition() and coalesce()
Repartition: It can help when data is unevenly distributed or to speed up operations by increasing parallelism.
Coalesce: After a wide transformation, you can use .coalesce() to reduce the number of partitions in the data without a full shuffle, leading to smaller overheads.
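A short sketch of both, with hypothetical paths and an illustrative partition count:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("repartition-coalesce-demo").getOrCreate()

# Hypothetical skewed / under-partitioned input.
events = spark.read.parquet("/data/events/")

# repartition(): full shuffle, here to spread work evenly across 200
# partitions, colocating rows by customer_id to increase parallelism.
balanced = events.repartition(200, "customer_id")

summary = balanced.groupBy("customer_id").agg(F.count("*").alias("n"))

# coalesce(): shrink to a handful of partitions without a full shuffle,
# so the output is not written as hundreds of tiny files.
summary.coalesce(4).write.mode("overwrite").parquet("/data/summary/")
```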
Can you explain how checkpointing works in Spark Streaming?
Checkpointing aids fault tolerance. It maintains a record of state, enabling faster recovery from crashes, etc.
What are the two main forms of checkpointing?
Metadata Checkpoints
These don’t store the actual data but rather information about how to continue processing the stream from the point where it left off.
Data Checkpoints
These store the actual state of the stream. They are critical for operations like windowed computations, where Spark needs to retain the state of the stream over time.
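A minimal sketch using the legacy DStream API (which this metadata/data checkpoint distinction refers to); the checkpoint directory, host, port, and window sizes are illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///checkpoints/word-counts"  # hypothetical path

def create_context():
    sc = SparkContext(appName="checkpoint-demo")
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint(CHECKPOINT_DIR)  # enables metadata and data checkpoints

    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

    # Stateful windowed computation -- this is what needs data checkpoints,
    # since the windowed state must survive a failure.
    counts = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b, lambda a, b: a - b,
        windowDuration=60, slideDuration=20,
    )
    counts.pprint()
    return ssc

# On restart, recover the driver's state from the checkpoint if present.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```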