Exam Q's Flashcards

1
Q

What are slots?

A

Slots are resources for parallelization within a Spark application

2
Q

What is a combination of a block of data and a set of transformations that will run on a single executor?

A

Task

3
Q

What is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines?

A

Stage

4
Q

What is a shuffle?

A

A shuffle is the process by which data is redistributed across partitions, which may require moving data between executors.

5
Q

What operations will trigger evaluation? (Actions!)

A

show(), save(), count(), collect()

6
Q

Which of the following describes the difference between transformations and actions?

A

Transformations are business logic operations that do not induce execution while actions are execution triggers focused on returning results.

7
Q

Describe Spark’s execution/deployment mode.

A

Spark’s execution/deployment mode determines where the driver and executors are physically located when a Spark application is run

8
Q

Describe out-of-memory errors in Spark.

A

An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.

9
Q

Explain persist() and its default storage level.

A

In Apache Spark, persist() is used to cache or persist an RDD, DataFrame, or Dataset in memory and/or on disk. The storage level, passed as an argument to persist(), specifies how and where the data is stored. cache() is shorthand for persist() with the default storage level, which is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames and Datasets. Both methods optimize Spark computations by saving interim results so they can be reused in subsequent stages; these results are kept in memory or on disk and can also be replicated.

10
Q

Explain cache().

A

A mechanism that stores frequently accessed data in the memory of a cluster’s workers to speed up subsequent queries. Caching is lazy: since cache() is a transformation, the data is actually cached only when a Spark action (for example, count(), show(), take(), or write()) is subsequently called on the same DataFrame, Dataset, or RDD.

11
Q

What is a Broadcast Variable?

A

A broadcast variable is cached in its entirety on each worker node, so it doesn’t need to be shipped or shuffled between nodes with each stage. Broadcast variables are read-only (immutable).

12
Q

Which operations can be used to convert a DataFrame column from one type to another type?

A

col().cast()

13
Q

What does explode() do?

A

It is a function that is used to transform an array or a map column into multiple rows, with each row containing one element of the array or map.

14
Q

Write a code block that returns a DataFrame where rows in DataFrame storesDF containing missing values in every column have been dropped.

A

storesDF.na.drop("all")

15
Q

Write code that returns the number of rows in DataFrame storesDF.

A

storesDF.count()
