Spark Architecture: Conceptual Understanding Flashcards

1
Q

Which of the following data structures are Spark DataFrames built on top of?

A. Arrays
B. Strings
C. RDDs
D. Vectors
E. SQL Tables

A

C. RDDs

2
Q

Which of the following options describes the responsibility of the executors in Spark?

A. The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.

B. The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.

C. The executors accept jobs from the driver, analyze those jobs, and return results to the driver.

D. The executors accept tasks from the driver, execute those tasks, and return results to the driver.

E. The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.

A

D. The executors accept tasks from the driver, execute those tasks, and return results to the driver.

3
Q

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

A. Slot is another name for executor.

B. An executor runs on a single core.

C. Tasks run in parallel via slots.

D. There has to be a greater number of slots than tasks.

E. There has to be a smaller number of executors than tasks.

A

C. Tasks run in parallel via slots.

Correct. Under this assumption, an executor has one or more “slots”, the number of which is given by spark.executor.cores / spark.task.cpus. With the executor’s resources divided into slots, each task occupies one slot, and multiple tasks can be executed in parallel.
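
As a concrete illustration, the slot count can be computed from those two settings (a minimal PySpark sketch; the app name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("slots-demo").getOrCreate()

# Cores each executor gets, and CPUs each task claims
# (spark.task.cpus defaults to 1).
executor_cores = int(spark.conf.get("spark.executor.cores", "1"))
task_cpus = int(spark.conf.get("spark.task.cpus", "1"))

# Slots per executor: how many tasks one executor can run in parallel.
print(executor_cores // task_cpus)
```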

4
Q

Which of the following describes the role of tasks in the Spark execution hierarchy?

A. Stages with narrow dependencies can be grouped into one task.

B. Tasks are the smallest element in the execution hierarchy.

C. Tasks are the second-smallest element in the execution hierarchy.

D. Tasks with wide dependencies can be grouped into one stage.

E. Within one task, the slots are the unit of work done for each partition of the data.

A

B. Tasks are the smallest element in the execution hierarchy.

5
Q

Which of the following is the deepest level in Spark’s execution hierarchy?

A. Job
B. Task
C. Executor
D. Slot
E. Stage

A

B. Task

The hierarchy is, from top to bottom: Job, Stage, Task.

Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
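
To make this concrete, here is a minimal PySpark sketch in which one action triggers one job, a shuffle splits that job into stages, and each stage runs one task per partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hierarchy-demo").getOrCreate()

df = spark.range(1_000_000)  # a DataFrame split across several partitions

# collect() is an action, so it triggers one job. The groupBy forces a
# shuffle, splitting the job into (at least) two stages; within each
# stage, Spark runs one task per partition, in parallel across slots.
df.groupBy((df.id % 10).alias("bucket")).count().collect()
```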

6
Q

Which of the following describes the role of the cluster manager?

A. The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.

B. The cluster manager schedules tasks on the cluster in client mode.

C. The cluster manager schedules tasks on the cluster in local mode.

D. The cluster manager allocates resources to the DataFrame manager

E. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.

A

E. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.

Correct. The cluster manager allocates resources to Spark applications and maintains the executor processes; “remote mode” (option A) is not a Spark deployment mode. In cluster mode, the cluster manager is located on a node other than the client machine. From there it starts and ends executor processes on the cluster nodes as required by the Spark application running on the Spark driver.

7
Q

Which of the following is the idea behind dynamic partition pruning in Spark?

A. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.

B. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.

C. Dynamic partition pruning performs wide transformations on disk instead of in memory.

D. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.

A

D. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.

Correct. Dynamic partition pruning provides an efficient way to selectively read data from files by skipping data that is irrelevant for the query. For example, if a query filters for rows with values greater than 12 in the purchases column, Spark reads only the rows matching this criterion from the underlying files. The method works best when the table providing the filter is non-partitioned and the table to be pruned is partitioned.
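
A sketch of a query shape that can benefit from dynamic partition pruning (the paths, the date partition column, and the holiday flag are hypothetical):

```python
# Dynamic partition pruning is enabled by default since Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

sales = spark.read.parquet("/data/sales")  # large fact table, partitioned by "date"
dates = spark.read.parquet("/data/dates")  # small, non-partitioned dimension table

# The filter on the dimension side is translated at runtime into a
# filter on sales.date, so Spark reads only the matching partitions.
result = sales.join(dates, "date").where(dates.holiday == True)
result.explain()  # look for the dynamic pruning subquery in the plan
```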

8
Q

Which of the following is one of the big performance advantages that Spark has over Hadoop?

A. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.

B. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.

C. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.

D. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.

E. Spark achieves performance gains for developers by extending Hadoop’s DataFrames with a user-friendly API.

A

A. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.

9
Q

Which of the following statements about garbage collection in Spark is incorrect?

A. Serialized caching is a strategy to increase the performance of garbage collection.

B. Manually persisting RDDs in Spark prevents them from being garbage collected.

C. Optimizing garbage collection performance in Spark may limit caching ability.

D. Garbage collection information can be accessed in the Spark UI’s stage detail view.

E. In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.

A

B. Manually persisting RDDs in Spark prevents them from being garbage collected.

This statement is incorrect, and thus the correct answer to the question. Spark’s garbage collector will remove even persisted objects, albeit in an “LRU” fashion. LRU stands for least recently used. So, during a garbage collection run, the objects that were used the longest time ago will be garbage collected first.
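
A short sketch of managing a persisted DataFrame explicitly rather than relying on LRU eviction (names are illustrative):

```python
df = spark.range(10_000_000)

df.persist()    # mark for caching; still evictable under memory pressure (LRU)
df.count()      # an action materializes the cache
# ... reuse df in further queries ...
df.unpersist()  # release the cached blocks explicitly

# Switching to the G1 collector (option E) is done via JVM options, e.g.:
#   spark.executor.extraJavaOptions=-XX:+UseG1GC
```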

10
Q

Which of the following describes characteristics of the Dataset API?

A. The Dataset API does not provide compile-time type safety.

B. The Dataset API does not support unstructured data.

C. The Dataset API is available in Scala, but it is not available in Python.

D. In Python, the Dataset API’s schema is constructed via type hints.

E. In Python, the Dataset API mainly resembles Pandas’ DataFrame API.

A

C. The Dataset API is available in Scala, but it is not available in Python.

Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In Python, you use the DataFrame API, which is based on the Dataset API.
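
Since Python lacks the Dataset API, a PySpark user works with untyped Rows through the DataFrame API instead; a minimal sketch:

```python
from pyspark.sql import Row

# No compile-time typed Dataset[T] as in Scala; the schema is
# inferred and checked at runtime.
df = spark.createDataFrame([Row(name="a", age=1), Row(name="b", age=2)])
df.printSchema()
```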

11
Q

Which of the following describes the difference between client and cluster execution modes?

A. In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.

B. In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.

C. In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.

D. In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.

E. In cluster mode, the driver runs on the worker nodes, while in client mode, the driver runs on the client machine.

A

E. In cluster mode, the driver runs on the worker nodes, while in client mode, the driver runs on the client machine.

12
Q

Which of the following statements about RDDs is incorrect?

A. An RDD consists of a single partition

B. The high-level DataFrame API is built on top of the low-level RDD API

C. RDD stands for Resilient Distributed Dataset

D. RDDs are great for precisely instructing Spark on how to do a query

E. RDDs are immutable

A

A. An RDD consists of a single partition
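
A quick PySpark sketch confirming that an RDD is split across multiple partitions:

```python
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())  # 8 -- the data is spread over 8 partitions
```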

13
Q

Which of the following statements about Spark’s execution hierarchy is correct?

A. In Spark’s execution hierarchy, tasks are one layer above slots.

B. In Spark’s execution hierarchy, a job may reach over multiple stage boundaries.

C. In Spark’s execution hierarchy, a stage comprises multiple jobs.

D. In Spark’s execution hierarchy, executors are the smallest unit.

E. In Spark’s execution hierarchy, manifests are one layer above jobs.

A

B. In Spark’s execution hierarchy, a job may reach over multiple stage boundaries.

14
Q

Which of the following describes slots?

A. Slots are the communication interface for executors and are used for receiving commands and sending results to the driver.

B. Slots are dynamically created and destroyed in accordance with an executor’s workload.

C. A slot is always limited to a single core.

D. A Java Virtual Machine (JVM) working as an executor can be considered a pool of slots for task execution.

E. To optimize I/O performance, Spark stores data on disk in multiple slots.

A

D. A Java Virtual Machine (JVM) working as an executor can be considered a pool of slots for task execution.

15
Q

Which of the following describes executors?

A. Executors are located in slots inside worker nodes.

B. After the start of the Spark application, executors are launched on a per-task basis.

C. Executors are responsible for carrying out work assigned to them by the driver.

D. The executors’ storage is ephemeral and as such it defers the task of caching data directly to the worker node thread.

E. Executors host the Spark driver on a worker-node basis.

A

C. Executors are responsible for carrying out work assigned to them by the driver.

16
Q

Which is the highest level in Spark’s execution hierarchy?

A. Task
B. Slot
C. Executor
D. Job
E. Stage

A

D. Job

Spark’s execution hierarchy, from top to bottom, is: job, stage, task. Slots are part of executors, and a task is executed in a slot, but slots are a means of executing tasks rather than part of the hierarchy itself. Executors are a component of a Spark cluster, not of the execution hierarchy.

17
Q

Which of the following describes the conversion of a computational query into an execution plan in Spark?

A. Spark uses the catalog to resolve the optimized logical plan.

B. The executed physical plan depends on a cost optimization from a previous stage.

C. The catalog assigns specific resources to the physical plan

D. Depending on whether the DataFrame API or the SQL API is used, the physical plan may differ.

E. The catalog assigns specific resources to the optimized memory plan.

A

B. The executed physical plan depends on a cost optimization from a previous stage.

Correct! Spark considers multiple physical plans on which it performs a cost analysis and selects the final physical plan in accordance with the lowest-cost outcome of that analysis. That final physical plan is then executed by Spark.
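
The full chain from logical plans to the selected physical plan can be inspected with explain(); a minimal sketch:

```python
df = spark.range(100).where("id > 50").groupBy().sum("id")

# extended=True prints the parsed and analyzed logical plans, the
# optimized logical plan, and the physical plan Spark selected.
df.explain(extended=True)
```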

18
Q

Which of the following describes characteristics of the Spark driver?

A. The Spark driver requests the transformation of operations into DAG computations from the worker nodes.

B. If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.

C. The Spark driver processes partitions in an optimized, distributed fashion.

D. In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.

E. The Spark driver’s responsibility includes scheduling queries for execution on worker nodes.

A

E. The Spark driver’s responsibility includes scheduling queries for execution on worker nodes.

19
Q

Which of the following statements about DAGs is correct?

A. DAGs can be decomposed into tasks that are executed in parallel.

B. DAG stands for “Directing Acyclic Graph”.

C. Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

D. DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

E. In contrast to transformations, DAGs are never lazily executed.

A

A. DAGs can be decomposed into tasks that are executed in parallel.

Correct. DAGs follow Spark’s workload hierarchy: a DAG describes a job, which consists of stages, which in turn consist of tasks. Some of those tasks may be executed in parallel, since they do not depend on each other. A great way to explore DAGs is through the Spark UI.

20
Q

Which of the following describes how Spark achieves fault tolerance?

A. Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.

B. Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.

C. If an executor on a worker node fails while calculating an RDD, that RDD can be recomputed by another executor using the lineage.

D. Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.

E. Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.

A

C. If an executor on a worker node fails while calculating an RDD, that RDD can be recomputed by another executor using the lineage.
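
The lineage that Spark would replay after such a failure can be printed with toDebugString(); a minimal sketch:

```python
rdd = (spark.sparkContext
       .parallelize(range(100))
       .map(lambda x: x * 2)
       .filter(lambda x: x > 10))

# Shows the chain of transformations Spark uses to recompute lost
# partitions if an executor fails.
print(rdd.toDebugString().decode())
```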

21
Q

Which of the following describes Spark’s way of managing memory?

A. Spark’s memory usage can be divided into three categories: Execution, transaction, and storage.

B. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.

C. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.

D. Storage memory is used for caching partitions derived from DataFrames.

E. Spark uses a subset of the reserved system memory.

A

D. Storage memory is used for caching partitions derived from DataFrames.
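
A sketch of the two settings that shape the unified execution/storage memory region (the values shown are Spark’s defaults; the app name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-demo")
         # fraction of (heap minus ~300 MB reserved) shared by execution and storage
         .config("spark.memory.fraction", "0.6")
         # portion of that unified region set aside for storage (cached partitions)
         .config("spark.memory.storageFraction", "0.5")
         .getOrCreate())
```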

22
Q

Which of the following is the default storage level for persist() for a non-streaming DataFrame/Dataset?

A. MEMORY_AND_DISK
B. MEMORY_AND_DISK_SER
C. DISK_ONLY
D. MEMORY_ONLY_SER
E. MEMORY_ONLY

A

A. MEMORY_AND_DISK
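
This can be checked directly, since a persisted DataFrame exposes its storage level (a minimal sketch):

```python
df = spark.range(1000)
df.persist()            # no argument: the default storage level is used
print(df.storageLevel)  # memory-and-disk: spills to disk what does not fit in memory
df.unpersist()
```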