Overview 3: Flashcards
What is the Spark driver?
The Driver is the central coordinating component in Spark. It runs the SparkContext (or the SparkSession, which wraps it) and orchestrates the entire job's execution.
The Driver is responsible for:
Creating the SparkContext that connects to the cluster.
Building the Directed Acyclic Graph (DAG) of operations.
Scheduling tasks and sending them to executors.
Collecting results after tasks complete.
In simple terms: The Driver is the “brain” that controls the flow of execution in a Spark job.
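For illustration, here is a minimal PySpark sketch of a driver program (the app name and local master URL are placeholders, not prescribed values): building the SparkSession creates the SparkContext, transformations only build the DAG, and the action makes the driver schedule tasks and collect results.

from pyspark.sql import SparkSession

# The driver process starts here: building the SparkSession also creates the
# SparkContext, which connects to the cluster manager.
spark = (
    SparkSession.builder
    .appName("driver-demo")   # illustrative application name
    .master("local[2]")       # assumption: local mode; on a real cluster this would point at YARN, Kubernetes, etc.
    .getOrCreate()
)

df = spark.range(1_000_000)                  # transformations only build the DAG on the driver
result = df.selectExpr("sum(id)").collect()  # action: the driver schedules tasks and collects the results
print(result)

spark.stop()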
What is the Spark cluster manager?
The Cluster Manager is responsible for allocating resources in the cluster.
It manages the overall execution environment for Spark jobs.
It assigns resources (CPU, memory) to the Executors based on availability and requirements.
Common Cluster Managers Spark can use:
YARN (Hadoop’s resource manager).
Standalone (Spark’s built-in resource manager).
Kubernetes (for containerized execution).
In simple terms: The Cluster Manager allocates cluster resources (CPU, memory) so that executors have what they need to run their tasks.
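As a rough sketch, the master URL chooses which cluster manager the application talks to, and standard configs describe the resources requested for the executors (all values below are illustrative, not recommendations):

from pyspark.sql import SparkSession

# Sketch: the master URL selects the cluster manager; the configs are the
# resource requests it uses when launching executors.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("yarn")                           # alternatives: "spark://host:7077" (standalone), "k8s://https://host:6443"
    .config("spark.executor.instances", "4")  # number of executors to request
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # CPU cores per executor
    .getOrCreate()
)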
What is the Spark executor?
Executors are the workers in a Spark job. They are responsible for actually executing the code that you define in the transformations and actions of your Spark job.
Every Spark job runs on one or more executors, and each executor runs tasks in parallel.
- Each Executor is responsible for executing the tasks assigned to it.
- Executors store cached data in memory and perform the data processing work (e.g., map and reduce operations).
Key Points:
Executors are distributed across the cluster.
They are managed by the Cluster Manager.
They store intermediate results and return the final result to the Driver.
In simple terms: Executors are the “workers” that process the data and perform the actual computation.
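A small sketch of work that actually runs on the executors (assuming spark is an existing SparkSession, as in the earlier example):

# assumption: `spark` is an existing SparkSession
rdd = spark.sparkContext.parallelize(range(1000), 8)  # 8 partitions, spread across the executors

squared = rdd.map(lambda x: x * x)  # the lambda is shipped to executors and runs there, not on the driver
squared.cache()                     # executors keep their computed partitions in memory for reuse
print(squared.sum())                # each executor processes its partitions and returns results to the driver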
What is the Spark task?
A task is a unit of work that gets executed on a partition of the data.
- Spark divides a job (DAG) into stages, and each stage is divided into smaller tasks.
- Each task runs independently on a single partition of the data.
- Tasks are scheduled and sent to Executors by the Driver.
In simple terms: A task is the smallest unit of work executed on a single partition.
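A quick sketch of the one-task-per-partition relationship (again assuming an existing spark session):

# assumption: `spark` is an existing SparkSession
df = spark.range(0, 1_000_000, numPartitions=8)  # the data is split into 8 partitions
print(df.rdd.getNumPartitions())                 # 8: the stage that scans this data runs 8 tasks
print(df.count())                                # the action triggers the job; those tasks run on executors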
What is the Spark stage? How does it relate to the DAG?
- Spark's DAG Scheduler divides the job into stages; each stage consists of tasks that can run in parallel.
- A stage is defined by a set of transformations that do not require shuffling data (like map and filter).
- Shuffling (for example, groupBy or reduceByKey) introduces boundaries between stages because data must be redistributed across nodes.
In simple terms: A stage is a group of parallel tasks between shuffle boundaries; every shuffle starts a new stage.
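For example, a small sketch (assuming an existing spark session) where the narrow map stays in one stage and the reduceByKey shuffle starts a new one:

# assumption: `spark` is an existing SparkSession
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)

doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))  # narrow transformation: no shuffle, stays in the same stage
totals = doubled.reduceByKey(lambda x, y: x + y)  # wide transformation: shuffle, so a new stage starts here
print(totals.collect())                           # this job appears in the Spark UI as two stages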