Spark Architecture: Conceptual Understanding Flashcards
Which of the following data structures are Spark DataFrames built on top of?
A. Arrays
B. Strings
C. RDDs
D. Vectors
E. SQL Tables
C. RDDs
Which of the following options describes the responsibility of the executors in Spark?
A. The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.
B. The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.
C. The executors accept jobs from the driver, analyze those jobs, and return results to the driver.
D. The executors accept tasks from the driver, execute those tasks, and return results to the driver.
E. The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.
D. The executors accept tasks from the driver, execute those tasks, and return results to the driver.
Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?
A. Slot is another name for executor.
B. An executor runs on a single core.
C. Tasks run in parallel via slots.
D. There has to be a greater number of slots than tasks.
E. There has to be a smaller number of executors than tasks.
C. Tasks run in parallel via slots.
Correct. Given the assumption, an executor then has one or more “slots”, defined by the equation spark.executor.cores / spark.task.cpus. With the executor’s resources divided into slots, each task takes up a slot and multiple tasks can be executed in parallel.
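The slot arithmetic above can be sketched in plain Python (the configuration values below are hypothetical examples; the only real default shown is spark.task.cpus = 1):

```python
# Hypothetical cluster configuration (example values, not Spark defaults,
# except spark.task.cpus, which defaults to 1).
executor_cores = 4   # spark.executor.cores: cores per executor JVM
task_cpus = 1        # spark.task.cpus: cores each task claims (default 1)
num_executors = 3    # executors granted to the application

# Slots per executor, per the formula in the explanation above.
slots_per_executor = executor_cores // task_cpus

# Upper bound on tasks the application can run in parallel at any moment.
total_parallel_tasks = num_executors * slots_per_executor

print(slots_per_executor)    # 4
print(total_parallel_tasks)  # 12
```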
Which of the following describes the role of tasks in the Spark execution hierarchy?
A. Stages with narrow dependencies can be grouped into one task.
B. Tasks are the smallest element in the execution hierarchy.
C. Tasks are the second-smallest element in the execution hierarchy.
D. Tasks with wide dependencies can be grouped into one stage.
E. Within one task, the slots are the unit of work done for each partition of the data.
B. Tasks are the smallest element in the execution hierarchy.
Which of the following is the deepest level in Spark’s execution hierarchy?
A. Job
B. Task
C. Executor
D. Slot
E. Stage
B. Task
The hierarchy is, from top to bottom: Job, Stage, Task.
Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
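As a rough, non-Spark illustration of that hierarchy (the partition counts are hypothetical): one action triggers a job, shuffle boundaries split the job into stages, and each stage runs one task per partition.

```python
# Toy model of the Job > Stage > Task hierarchy (not real Spark code).
# Assume a job that scans 8 partitions, then shuffles into 4 partitions.
stage_partitions = [8, 4]  # one entry per stage, split at the shuffle boundary

jobs = 1                          # one action -> one job
stages = len(stage_partitions)    # shuffle boundaries delimit stages
tasks = sum(stage_partitions)     # one task per partition, per stage

print(jobs, stages, tasks)  # 1 2 12
```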
Which of the following describes the role of the cluster manager?
A. The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
B. The cluster manager schedules tasks on the cluster in client mode.
C. The cluster manager schedules tasks on the cluster in local mode.
D. The cluster manager allocates resources to the DataFrame manager.
E. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
E. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
Correct. In client mode, the cluster manager is located on a node other than the client machine. From there it starts and stops executor processes on the cluster nodes as required by the Spark application running on the Spark driver.
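For reference, the deploy mode is chosen when the application is submitted; a hedged sketch (the master URL and application file `app.py` are placeholders):

```shell
# Client mode: the driver runs on the submitting machine; the cluster
# manager still allocates executors on the cluster nodes.
spark-submit --master yarn --deploy-mode client  app.py

# Cluster mode: the cluster manager launches the driver itself inside
# the cluster, alongside the executors.
spark-submit --master yarn --deploy-mode cluster app.py
```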
Which of the following is the idea behind dynamic partition pruning in Spark?
A. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
B. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
C. Dynamic partition pruning performs wide transformations on disk instead of in memory.
D. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
E. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
D. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
Correct. Dynamic partition pruning provides an efficient way to selectively read data from files by skipping data that is irrelevant to the query. For example, if a query filters for rows with values >12 in column purchases, Spark reads only the data that can match this criterion from the underlying files. The technique is most effective in join scenarios: when the large table being scanned is partitioned on the join key and the filter comes from a smaller, nonpartitioned table, Spark derives the relevant partitions at runtime and skips the rest.
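A toy, pure-Python illustration of the pruning idea (not Spark's implementation; the data and partition keys are invented): when data is laid out by a partition key, a filter on that key lets the reader skip whole partitions without opening them.

```python
# Toy partitioned layout: partition key -> rows in that "file".
# (Hypothetical data; real Spark prunes directories/files, not dicts.)
partitions = {
    "2023": [("2023", 5), ("2023", 20)],
    "2024": [("2024", 13), ("2024", 8)],
    "2025": [("2025", 40)],
}

def read_with_pruning(wanted_keys):
    """Open only the partitions whose key survives the filter."""
    rows = []
    for key, part_rows in partitions.items():
        if key not in wanted_keys:  # pruned: this "file" is never read
            continue
        rows.extend(part_rows)
    return rows

# Filtering on year == "2024" scans one of the three partitions.
print(read_with_pruning({"2024"}))  # [('2024', 13), ('2024', 8)]
```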
Which of the following is one of the big performance advantages that Spark has over Hadoop?
A. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.
B. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
C. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
D. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
E. Spark achieves performance gains for developers by extending Hadoop’s DataFrames with a user-friendly API.
A
Which of the following statements about garbage collection in Spark is incorrect?
A. Serialized caching is a strategy to increase the performance of garbage collection.
B. Manually persisting RDDs in Spark prevents them from being garbage collected.
C. Optimizing garbage collection performance in Spark may limit caching ability.
D. Garbage collection information can be accessed in the Spark UI’s stage detail view.
E. In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
B. Manually persisting RDDs in Spark prevents them from being garbage collected.
This statement is incorrect, and thus the correct answer to the question. Spark’s garbage collector will remove even persisted objects, albeit in an “LRU” fashion. LRU stands for least recently used. So, during a garbage collection run, the objects that were used the longest time ago will be garbage collected first.
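A toy sketch of LRU eviction using Python's OrderedDict (it illustrates the eviction policy only, not Spark's memory manager; the capacity and key names are invented):

```python
from collections import OrderedDict

# Toy LRU cache with room for 2 "persisted" objects.
cache = OrderedDict()
CAPACITY = 2

def touch(key):
    """Access or insert a cached object, evicting the least recently used."""
    if key in cache:
        cache.move_to_end(key)        # mark as most recently used
    else:
        cache[key] = object()
        if len(cache) > CAPACITY:
            cache.popitem(last=False) # evict the least recently used entry

touch("rdd_a")
touch("rdd_b")
touch("rdd_a")      # rdd_a is now more recent than rdd_b
touch("rdd_c")      # capacity exceeded -> rdd_b (the LRU entry) is evicted
print(list(cache))  # ['rdd_a', 'rdd_c']
```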
Which of the following describes characteristics of the Dataset API?
A. The Dataset API does not provide compile-time type safety.
B. The Dataset API does not support unstructured data.
C. The Dataset API is available in Scala, but it is not available in Python.
D. In Python, the Dataset API’s schema is constructed via type hints.
E. In Python, the Dataset API mainly resembles Pandas’ DataFrame API.
C. The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In Python, you use the DataFrame API, which is based on the Dataset API.
Which of the following describes the difference between client and cluster execution modes?
A. In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.
B. In cluster mode, the driver runs on the edge node, while in client mode, the driver runs on a worker node.
C. In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.
D. In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.
E. In cluster mode, the driver runs on the worker nodes, while in client mode, the driver runs on the client machine.
E. In cluster mode, the driver runs on the worker nodes, while in client mode, the driver runs on the client machine.
Which of the following statements about RDDs is incorrect?
A. An RDD consists of a single partition.
B. The high-level DataFrame API is built on top of the low-level RDD API.
C. RDD stands for Resilient Distributed Dataset.
D. RDDs are great for precisely instructing Spark on how to do a query.
E. RDDs are immutable.
A. An RDD consists of a single partition.
Which of the following statements about Spark’s execution hierarchy is correct?
A. In Spark’s execution hierarchy, tasks are one layer above slots.
B. In Spark’s execution hierarchy, a job may reach over multiple stage boundaries.
C. In Spark’s execution hierarchy, a stage comprises multiple jobs.
D. In Spark’s execution hierarchy, executors are the smallest unit.
E. In Spark’s execution hierarchy, manifests are one layer above jobs.
B. In Spark’s execution hierarchy, a job may reach over multiple stage boundaries.
Which of the following describes slots?
A. Slots are the communication interface for executors and are used for receiving commands and sending results to the driver.
B. Slots are dynamically created and destroyed in accordance with an executor’s workload.
C. A slot is always limited to a single core.
D. A Java Virtual Machine (JVM) working as an executor can be considered as a pool of slots for task execution.
E. To optimize I/O performance, Spark stores data on disk in multiple slots.
D. A Java Virtual Machine (JVM) working as an executor can be considered as a pool of slots for task execution.
Which of the following describes executors?
A. Executors are located in slots inside worker nodes.
B. After the start of the Spark application, executors are launched on a per-task basis.
C. Executors are responsible for carrying out work that they get assigned by the driver.
D. The executors’ storage is ephemeral and as such it defers the task of caching data directly to the worker node thread.
E. Executors host the Spark driver on a worker-node basis.
C. Executors are responsible for carrying out work that they get assigned by the driver.