Exam Question Flashcards
What are slots?
Slots are resources for parallelization within a Spark application
What is a combination of a block of data and a set of transformations that will run on a single executor?
Task
What is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines?
Stage
What is a shuffle?
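A shuffle is the process by which data is redistributed across partitions (and often across executors) so that related records can be compared or combined, for example during a join or a groupBy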
What operations will trigger evaluation? (Actions!)
show(), save(), count(), collect()
What is the difference between transformations and actions?
Transformations are business logic operations that do not induce execution, while actions are execution triggers focused on returning results.
Describe Spark’s execution/deployment mode?
Spark’s execution/deployment mode determines where the driver and executors are physically located when a Spark application is run
Describe out-of-memory errors in Spark?
An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.
Explain persist and its default storage level?
persist() stores an RDD, DataFrame, or Dataset in memory and/or on disk so that it can be reused. The storage level, passed as an optional argument to persist(), specifies how and where the data is kept, including whether it is replicated. cache() is shorthand for persist() with the default storage level, which is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames and Datasets. Both methods optimize Spark computations by saving intermediate results so that later actions can reuse them instead of recomputing them.
Explain cache?
A mechanism that stores frequently accessed data in the memory of a cluster's workers to speed up subsequent queries. cache() is lazy: calling it only marks the data for caching, and the data is actually stored the first time an action (for example, count(), show(), take(), or a write) runs on the same DataFrame, Dataset, or RDD.
What is a Broadcast Variable?
A broadcast variable is a read-only (immutable) value that is cached in full on each worker node, so it doesn't need to be shipped or shuffled between nodes with each stage.
Which operations can be used to convert a DataFrame column from one type to another type?
col().cast()
What does explode() do?
It is a function that is used to transform an array or a map column into multiple rows, with each row containing one element of the array or map.
Write code that returns a DataFrame where rows in DataFrame storesDF containing missing values in every column have been dropped?
storesDF.na.drop("all")
Write code that returns the number of rows in DataFrame storesDF?
storesDF.count()