Spark Architecture: Applied Understanding Flashcards
Which of the following describes characteristics of the Spark UI?
A. Via the Spark UI, workloads can be manually distributed across executors.
B. There is a place in the Spark UI that shows the property spark.executor.memory.
C. Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.
D. Via the Spark UI, stage execution speed can be modified.
E. The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.
B. There is a place in the Spark UI that shows the property spark.executor.memory.
Correct, you can see Spark properties such as spark.executor.memory in the Environment tab.
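As a quick cross-check, the same properties that appear in the Environment tab can also be read programmatically. A minimal sketch, assuming a local session; the app name and the 2g value are illustrative choices, not recommendations:

    from pyspark.sql import SparkSession

    # Start a session with an explicit executor memory setting.
    spark = (SparkSession.builder
             .appName("ui-properties-demo")
             .config("spark.executor.memory", "2g")
             .getOrCreate())

    # The same value is listed in the Spark UI under the Environment tab.
    print(spark.conf.get("spark.executor.memory"))  # -> 2g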
Which of the following statements about broadcast variables is correct?
A. Broadcast variables are commonly used for tables that do not fit into memory.
B. Broadcast variables are serialized with every single task.
C. Broadcast variables are immutable.
D. Broadcast variables are local to all worker nodes and not shared across the cluster.
E. Broadcast variables are occasionally dynamically updated on a per-task basis.
C. Broadcast variables are immutable.
Correct. Once created, a broadcast variable cannot be changed; it is shipped to each executor as a read-only value rather than being serialized with every task.
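A minimal sketch of how a broadcast variable is created and read, assuming a local session; the lookup dictionary is a made-up example. Executors only read broadcast.value, they cannot modify it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
    sc = spark.sparkContext

    # Ship a small read-only lookup table to every executor once,
    # instead of serializing it with every single task.
    lookup = sc.broadcast({"DE": "Germany", "FR": "France"})

    codes = sc.parallelize(["DE", "FR", "DE"])
    # Tasks read lookup.value; the broadcast variable itself is immutable.
    print(codes.map(lambda c: lookup.value[c]).collect())
    # ['Germany', 'France', 'Germany']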
Which of the following is a viable way to improve Spark’s performance when dealing with large amounts of data, given that there is only a single application running on the cluster?
A. Decrease values for the properties spark.default.parallelism and spark.sql.partitions
B. Increase values for the properties spark.sql.parallelism and spark.sql.partitions
C. Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
D. Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
E. Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
E. Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
Correct. Both of these properties actually exist: spark.default.parallelism sets the default number of partitions for RDD operations, and spark.sql.shuffle.partitions sets the number of partitions used when shuffling data for joins and aggregations. Increasing them spreads large amounts of data across more tasks.
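A minimal sketch of how these two properties can be set, assuming a local session; the value 200 is purely illustrative. Note that spark.default.parallelism must be set before the context starts, while spark.sql.shuffle.partitions can also be changed at runtime:

    from pyspark.sql import SparkSession

    # spark.default.parallelism is read when the SparkContext starts.
    spark = (SparkSession.builder
             .appName("parallelism-demo")
             .config("spark.default.parallelism", "200")
             .getOrCreate())

    # spark.sql.shuffle.partitions can be adjusted while the app runs.
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> 200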
Which of the following describes Spark’s Adaptive Query Execution?
A. Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
B. Adaptive Query Execution is enabled in Spark by default.
C. Adaptive Query Execution applies to all kinds of queries.
D. Adaptive Query Execution reoptimizes queries at execution points.
D. Adaptive Query Execution reoptimizes queries at execution points.
Correct. AQE uses runtime statistics collected at stage boundaries to reoptimize the remaining query plan while the query is executing.
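A minimal sketch of the configuration switches involved, assuming Spark 3.x; the session setup is illustrative. These are the knobs for AQE itself and for the features listed in option A:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

    # Enable Adaptive Query Execution and two of its features:
    # dynamically coalescing shuffle partitions and skew-join handling.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")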
Which of the following describes the difference between transformations and actions?
A. Transformations work on DataFrames/Datasets while actions are reserved for native language objects.
B. Transformations are business logic operations that do not induce execution while actions are execution triggers focused on returning results.
C. Actions work on DataFrames/Datasets while transformations are reserved for native language objects.
D. Actions are business logic operations that do not induce execution while transformations are execution triggers focused on returning results.
E. There is no difference between actions and transformations.
B. Transformations are business logic operations that do not induce execution while actions are execution triggers focused on returning results.
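A minimal sketch of the distinction, assuming a local session; the range and filter are placeholder logic. The transformations only build up a query plan, and execution happens when the action runs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    df = spark.range(1_000_000)

    # Transformations: recorded in the query plan, nothing runs yet.
    filtered = df.filter(df.id % 2 == 0).select("id")

    # Action: triggers execution and returns a result to the driver.
    print(filtered.count())  # -> 500000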
Which of the following describes a shuffle?
A. A shuffle is a process that compares data between partitions.
B. A shuffle is a Spark operation that results from DataFrame.coalesce().
C. A shuffle is a process that allocates partitions to executors.
D. A shuffle is a process that is executed during a broadcast hash join.
E. A shuffle is a process that compares data across executors.
A. A shuffle is a process that compares data between partitions.
This is correct. During a shuffle, data is compared between partitions because shuffling includes sorting, and sorting requires comparing data. Since, by definition, more than one partition is involved in a shuffle, data is compared across partitions.
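A minimal sketch that makes a shuffle visible, assuming a local session; the bucket column is a made-up grouping key. groupBy() forces rows with the same key into the same partition, which shows up as an Exchange node in the physical plan:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

    df = spark.range(100).withColumn("bucket", F.col("id") % 4)

    # A wide transformation: the plan printed below contains an
    # Exchange (shuffle) step that moves data between partitions.
    df.groupBy("bucket").count().explain()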
Which of the following statements about Spark’s DataFrames is incorrect?
A. Spark’s DataFrames are equal to Python’s or R’s DataFrames.
B. Spark’s DataFrames are immutable.
C. RDDs are at the core of DataFrames.
D. Data in DataFrames is organized into named columns.
E. The data in DataFrames may be split into multiple chunks.
A. Spark’s DataFrames are equal to Python’s or R’s DataFrames.
Incorrect. They are only similar. A major difference is that Spark’s DataFrames are distributed across a cluster, whereas Python’s and R’s DataFrames live in a single process.
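A minimal sketch of that distributed nature, assuming a local session; the row count is arbitrary. Unlike a pandas or R DataFrame, a Spark DataFrame is split into partitions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distributed-demo").getOrCreate()

    df = spark.range(1_000_000)

    # The DataFrame is split into partitions that can be processed
    # on different executors in parallel.
    print(df.rdd.getNumPartitions())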
Which of the following describes a way for resizing a DataFrame from 16 to 8 partitions in the most efficient way?
A. Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
B. Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
C. Use a narrow transformation to reduce the number of partitions.
D. Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
E. Use a wide transformation to reduce the number of partitions.
C. Use a narrow transformation to reduce the number of partitions.
Correct! DataFrame.coalesce(n) is a narrow transformation and, of the options listed, the most efficient way to resize the DataFrame. One would run DataFrame.coalesce(8) to resize it to 8 partitions.
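A minimal sketch of the resize, assuming a local session; the DataFrame contents are placeholders. coalesce(8) merges existing partitions without a full shuffle, which is what makes it cheaper than repartition(8) here:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

    # Start from 16 partitions.
    df = spark.range(1000).repartition(16)

    # Narrow transformation: partitions are merged, no full shuffle.
    resized = df.coalesce(8)
    print(resized.rdd.getNumPartitions())  # -> 8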
Which of the following DataFrame operators is never classified as a wide transformation?
A. DataFrame.sort()
B. DataFrame.repartition()
C. DataFrame.join()
D. DataFrame.select()
E. DataFrame.aggregate()
D. DataFrame.select()
If an operation involves a shuffle, it is a wide transformation; if not, it is usually narrow.
A wide transformation includes a shuffle, meaning that data from a single input partition may end up in several different output partitions. This is expensive and causes traffic across the cluster. The select() operation, however, is applied to each partition independently: Spark does not need to exchange data across partitions, so no wide transformation is triggered.
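A minimal sketch that contrasts the two plans, assuming a local session. select() produces a plan without an Exchange node, while a wide transformation such as sort() includes one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("narrow-wide-demo").getOrCreate()

    df = spark.range(100)

    # Narrow: no Exchange (shuffle) appears in the physical plan.
    df.select("id").explain()

    # Wide: the physical plan contains an Exchange node for the shuffle.
    df.sort("id").explain()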
Which of the following describes the characteristics of accumulators?
A. Accumulators are immutable.
B. If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
C. All accumulators used in a Spark application are listed in the Spark UI.
D. Accumulators are used to pass around lookup tables across the cluster.
E. Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
B. If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
Correct, when Spark tries to rerun a failed action that includes an accumulator, it will only update the accumulator if the action succeeded.
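A minimal sketch of accumulator usage, assuming a local session; the even-counting logic is a made-up example. Note that the accumulator is created via the SparkContext, not the pyspark.RDD module as option E claims:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
    sc = spark.sparkContext

    counter = sc.accumulator(0)

    def add_if_even(x):
        # Tasks may only add to the accumulator; the driver reads it.
        if x % 2 == 0:
            counter.add(1)

    sc.parallelize(range(10)).foreach(add_if_even)  # foreach is an action
    print(counter.value)  # -> 5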
Which of the following statements about storage levels is incorrect?
A. DISK_ONLY will not use the worker node’s memory.
B. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
C. In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node’s memory.
D. Caching can be undone using the DataFrame.unpersist() operator.
E. The cache operator on DataFrames is evaluated like a transformation.
B. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is incorrect: MEMORY_AND_DISK does not replicate data. Spark prioritizes storage in memory and spills to disk only the partitions that do not fit.
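A minimal sketch of caching with an explicit storage level, assuming a local session; the DataFrame is a placeholder. Partitions stay in memory where possible and spill to disk otherwise:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-demo").getOrCreate()

    df = spark.range(1000)

    # Memory first, disk only for partitions that do not fit.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # caching is lazy; an action materializes it

    # Caching can be undone:
    df.unpersist()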