Spark Flashcards

1
Q

Can you explain what happens when a PySpark program is executed?

A

When a PySpark program is submitted, the first step is to construct a logical execution plan, represented as a DAG (directed acyclic graph). The DAG captures the sequence of transformations defined in the program; because transformations are lazy, nothing executes until an action triggers evaluation of the plan.
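
A minimal sketch of this behaviour (assuming a local SparkSession): the transformations below only build up the DAG, explain() prints the plan constructed so far, and nothing runs until the action at the end.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations are lazy: these lines only extend the logical plan (the DAG).
df = spark.range(1_000_000)
evens = df.filter(df.id % 2 == 0)
doubled = evens.selectExpr("id * 2 AS doubled")

# Inspect the plan built so far without executing anything.
doubled.explain()

# The action finally submits a job and executes the DAG.
print(doubled.count())
```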

2
Q

What is a stage within the context of PySpark?

A

A stage is a collection of tasks that can be executed in parallel. PySpark optimizes the execution of transformations by dividing them into stages based on data dependencies. Stages are typically delineated by shuffle operations, such as joins or aggregations, where data needs to be exchanged between partitions.
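
A small illustration (local session assumed): the aggregation below forces a shuffle, and the physical plan shows an Exchange operator marking the stage boundary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# filter() is a narrow transformation and stays in the same stage as the scan;
# the aggregation requires a shuffle, which begins a new stage.
agg = df.filter(F.col("id") > 100).groupBy("bucket").count()

# The physical plan shows an Exchange operator at the shuffle (stage) boundary.
agg.explain()
```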

3
Q

How does PySpark determine the execution plan and break it down into stages?

A

PySpark’s Catalyst optimizer analyzes the logical plan of a DataFrame or SQL query and applies a series of optimization rules to produce an optimized physical execution plan. The scheduler then breaks this plan into stages at shuffle boundaries, so that each stage contains only work that can run in parallel without exchanging data between partitions.
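
To see Catalyst’s work directly, explain(extended=True) prints each plan it produces; a sketch (local session assumed):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1_000).withColumn("grp", F.col("id") % 5)

# extended=True prints every plan Catalyst produces: parsed logical,
# analyzed logical, optimized logical, and the final physical plan.
df.groupBy("grp").agg(F.sum("id").alias("total")).explain(extended=True)
```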

4
Q

What role do tasks play in the execution of a PySpark program?

A

Tasks are the smallest units of work in PySpark. Each stage consists of one or more tasks that are executed on partitions of the input data. Tasks perform transformations or actions on the data and are executed in parallel across the cluster.
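
Since each stage launches one task per partition, the partition count controls the stage’s parallelism; a quick sketch (local session assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-demo").getOrCreate()

df = spark.range(1_000_000)

# Each stage launches one task per partition, so this number is the
# maximum parallelism of the stage that scans df.
print(df.rdd.getNumPartitions())

# Repartitioning changes how many tasks later stages will run.
print(df.repartition(8).rdd.getNumPartitions())
```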

5
Q

Walk through the sequence of events that occur when a PySpark program is submitted.

A

When a PySpark program is submitted, the driver process creates a SparkContext (wrapped by a SparkSession in modern applications), which coordinates the execution of the program. The program defines a series of transformations and actions on RDDs or DataFrames, which are translated into a logical execution plan represented as a DAG. For DataFrame and SQL workloads, the Catalyst optimizer turns this logical plan into an optimized physical execution plan. The plan is then broken down into stages at shuffle boundaries, each consisting of tasks that are executed in parallel across the cluster.
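
A condensed sketch of that sequence (the local master setting is an assumption for running outside a cluster):

```python
from pyspark.sql import SparkSession

# Creating the SparkSession on the driver also creates the SparkContext
# that coordinates executors for the rest of the application.
spark = (
    SparkSession.builder
    .appName("submission-demo")
    .master("local[4]")  # assumed local run; on a cluster this comes from spark-submit
    .getOrCreate()
)
print(spark.sparkContext.applicationId)

# Transformations extend the DAG; the action below submits a job,
# which the scheduler splits into stages and then into tasks.
print(spark.range(100).selectExpr("sum(id) AS total").collect())
```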

6
Q

What would be your initial steps in troubleshooting a slow Spark job?

A

Check the Spark UI to gather information about the job’s execution, including task progress, stage durations, and resource utilization. This can help identify bottlenecks and performance issues within the job.
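
The live UI on the driver (port 4040 by default) disappears with the application, so it helps to enable event logging for the History Server; a sketch where the log directory is an assumption:

```python
from pyspark.sql import SparkSession

# Event logging lets the Spark History Server replay the UI for finished jobs;
# the directory below is an assumption -- use a path your cluster can write to.
spark = (
    SparkSession.builder
    .appName("ui-demo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .getOrCreate()
)

# URL of the live Spark UI for this application (port 4040 by default).
print(spark.sparkContext.uiWebUrl)
```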

7
Q

What are some specific metrics or indicators you’d look for in the Spark UI when analysing a slow job?

A

Task duration
Shuffle read/write times
Executor CPU
Memory utilization
Garbage collection activity

These metrics can provide insights into potential performance bottlenecks, such as data skew, resource contention, or inefficient task execution.
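
These metrics are also exposed programmatically through the Spark UI’s REST API, which can be useful for scripted checks; a rough sketch (the field names follow the monitoring API but should be treated as assumptions for your Spark version):

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metrics-demo").getOrCreate()
sc = spark.sparkContext

spark.range(10_000_000).selectExpr("sum(id)").collect()  # a job to inspect

# The live UI serves the same data as JSON under /api/v1.
base = f"{sc.uiWebUrl}/api/v1/applications/{sc.applicationId}"
for stage in requests.get(f"{base}/stages").json():
    print(
        stage["stageId"],
        stage.get("executorRunTime"),
        stage.get("shuffleReadBytes"),
        stage.get("shuffleWriteBytes"),
    )
```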

8
Q

What steps would you take if you suspect resource contention as the cause of the slowdown?

A

Examine executor CPU and memory utilization to identify any resource bottlenecks.

Increasing executor memory or adjusting the number of executors can help alleviate resource contention and improve job performance.

Additionally, tuning resource allocation and task scheduling parameters in the Spark configuration can further improve resource utilization.
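
For example, resource settings can be raised through the Spark configuration (values below are illustrative, not recommendations; on YARN or Kubernetes they are usually passed to spark-submit instead):

```python
from pyspark.sql import SparkSession

# Illustrative values only -- size these to your cluster and workload.
spark = (
    SparkSession.builder
    .appName("resource-demo")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "10")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```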

9
Q

Suppose you’ve optimized resource utilization, but the job is still running slower than expected. What other factors would you consider?

A

Investigate inefficient data processing logic, data skew, and suboptimal partitioning strategies.

Analyzing the job’s DAG and execution plan can provide insights into the data processing flow and identify opportunities for optimization.
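
A quick way to check for skew is to count rows per partition and per key; the input path and the user_id column below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input

# Rows per partition: a few huge partitions next to many small ones is classic skew.
df.groupBy(F.spark_partition_id().alias("partition")).count() \
  .orderBy(F.desc("count")).show()

# Rows per key: skewed keys make a handful of tasks do most of the work.
df.groupBy("user_id").count().orderBy(F.desc("count")).show(10)

# Repartitioning (or salting hot keys) spreads the work more evenly.
balanced = df.repartition(200, "user_id")
```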

10
Q

How would you ensure the stability and reliability of a Spark job after optimization?

A

After troubleshooting and optimization, validate the stability and reliability of the Spark job under varying workload conditions and data scenarios. This includes performance testing, stress testing, and fault tolerance testing.
