Architecture Flashcards

1
Q

A dataframe is immutable, True/False

A

True

2
Q

How are changes tracked on dataframes

A

The initial state is immutable and kept on each node; modifications are tracked as a lineage of transformations and shared with each node

3
Q

How can you see the lineage of a data frame

A

.explain("formatted")

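A minimal sketch (Spark 3.0+, where explain() accepts a mode string):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).filter("id > 5")
    df.explain("formatted")   # prints the physical plan in the formatted layout
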
4
Q

What triggers the execution of transformations on a dataframe

A

action

5
Q

A transformation where one partition results in one output partition is called what

A

Narrow Transformation or Narrow Dependency

6
Q

In the parsed logical plan and the analyzed logical plan, which uses the catalog

A

analyzed

7
Q

How many CPU cores per partition

A

1

8
Q

T/F Cluster Manager is a component of a Spark App

A

False

9
Q

Where is the driver in deploy-mode cluster

A

On a node inside the cluster. The Cluster Manager is responsible for maintaining the cluster and executor nodes

10
Q

Where is the driver in deploy-mode client

A

On a node not in the cluster

11
Q

Is there a performance difference between writing SQL queries and DataFrame code

A

No

12
Q

What kind of programming model is Spark

A

Functional: the same inputs lead to the same outputs, and transformations are deterministic

13
Q

When you perform a shuffle, Spark outputs how many partitions

A

200 (the default value of spark.sql.shuffle.partitions)

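A sketch of tuning that default in a local session; note that AQE, if enabled, may coalesce the shuffle output further:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "64")   # default is 200

    counts = spark.range(1000).groupBy((col("id") % 10).alias("bucket")).count()
    print(counts.rdd.getNumPartitions())   # 64 after the shuffle, not 200
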
14
Q

What is schema inference

A

Spark takes a best guess at what the schema of the data frame should be by sampling the data

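A sketch, assuming an active SparkSession named spark; the CSV path is hypothetical:

    df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")   # sample the file to guess column types
            .csv("/data/flights.csv"))       # hypothetical path
    df.printSchema()
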
15
Q

What port does the Spark UI run on

A

4040

16
Q

What type of transformation is aggregation

A

wide

17
Q

What type of transformation is filter

A

Narrow

18
Q

What are the 3 kinds of actions

A
  • View data in the console
  • Collect data to native objects in the respective language
  • Write to output data sources

19
Q

.count() is an example of what

A

an action

20
Q

What is predicate pushdown

A

Automatically pushing filters down to the data source so that less data is read

21
Q

What is lazy evaluation

A

Spark will wait till the very last moment to execute the graph of computation instructions

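A sketch of the idea, assuming an active SparkSession named spark:

    df = spark.range(100)          # transformation: nothing runs yet
    small = df.filter("id < 10")   # still just building the plan
    small.count()                  # action: the whole plan executes now
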
22
Q

Shuffles will perform filters and then…

A

Write to disk

23
Q

What is pipelining

A

On narrow transformations, operations such as filters are performed in memory, without writing intermediate results to disk

24
Q

A wide dependency is

A

Input partitions contributing to many output partitions

25
Q

What is a narrow dependency

A

Each input partition will contribute to only one output partition

26
Q

What are the 2 types of transformations

A

Narrow dependencies and wide dependencies

27
Q

Spark will not act on transformations until

A

an action is called

28
Q

Core data structures are mutable or immutable

A

immutable

29
Q

With DataFrames, you have to manipulate partitions manually. T/F

A

False

30
Q

If you have one partition and many executors, what parallelism do you have

A

1

31
Q

What is a partition

A

A collection of rows that sit on one physical machine

32
Q

To allow every executor to perform in parallel, Spark breaks the data into

A

Partitions

33
Q

What is a dataframe

A

Structured API that represents a table of data with rows and columns

34
Q

How many SparkSessions can you have across a Spark App

A

1

35
Q

You control your SparkApp through a driver process called

A

Spark Session

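A minimal sketch of creating one; the app name is hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("flashcards-demo")   # hypothetical app name
             .master("local[*]")           # local mode: driver and executors on one machine
             .getOrCreate())
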
36
Q

What are Spark’s Language APIs

A

Scala, JAVA, R, Python, SQL

37
Q

What is the point of the cluster manager

A

Keep track of resources available

38
Q

What is local mode

A

Driver and Executor live on the same machine

39
Q

What are the 3 core cluster managers

A

Spark’s Standalone Manager
YARN
Mesos

40
Q

The driver process is responsible for what 3 things

A

Maintaining info about the Spark App
Responding to the user’s program and input
Analyzing, distributing, and scheduling work across executors

41
Q

Which process runs your main() function

A

driver

42
Q

A spark app consists of what two processes

A

Driver

Executor

43
Q

Executors are responsible for what two things

A

Executing code assigned to it

Reporting state of the computation back to the driver node

44
Q

At which stage do the first set of optimizations take place?

A

Logical Optimization

45
Q

When using DataFrame.persist() data on disk is always serialized. T/F

A

True

46
Q

Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions. Which property needs to be enabled to achieve this?

A

spark.sql.adaptive.skewJoin.enabled

47
Q

The goal of Dynamic Partition Pruning (DPP) is to allow you to read only as much data as you need. Which property needs to be set in order to use this functionality?

A

spark.sql.optimizer.dynamicPartitionPruning.enabled

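A sketch covering both of the flags above, assuming an active SparkSession named spark; recent Spark 3.x releases enable them by default, and skew-join handling also requires AQE itself to be on:

    spark.conf.set("spark.sql.adaptive.enabled", "true")   # AQE master switch
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
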
48
Q

The DataFrame class does not have an uncache() operation T/F

A

True

49
Q

What are worker nodes

A

Worker nodes are the nodes of a cluster that perform computations

50
Q

For text files, we can only write a single column of the DataFrame. T/F

A

True

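A sketch, assuming a DataFrame df with a hypothetical string column named message:

    # DataFrameWriter.text() accepts exactly one string column, so select it first
    df.select("message").write.text("/tmp/messages_out")   # column and path hypothetical
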
51
Q

How do you specify a left outer join

A

left_outer

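A sketch with hypothetical employees and departments DataFrames sharing a dept_id column:

    # "left_outer" (or simply "left") is passed as the join-type argument
    joined = employees.join(departments, on="dept_id", how="left_outer")
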
52
Q

A job is

A

A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).

53
Q

What is the relationship between an executor and a worker

A

An executor is a Java Virtual Machine (JVM) running on a worker node.

54
Q

How are global temp views addressed

A

spark.read.table("global_temp.<view name>")

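A sketch, assuming an active SparkSession named spark; the view name is hypothetical:

    df.createGlobalTempView("my_view")               # registered under the global_temp database
    spark.read.table("global_temp.my_view").show()
    spark.sql("SELECT * FROM global_temp.my_view")   # the SQL form works the same way
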
55
Q

When is a data frame writer treated as a global external/unmanaged table

A

Spark manages the metadata, while you control the data location. As soon as you add the ‘path’ option to the DataFrame writer, the table is treated as a global external/unmanaged table. When you drop the table, only the metadata gets dropped. A global unmanaged/external table is available across all clusters.

56
Q

What are the possible strategies to decrease garbage collection time?

A

Persist objects in serialized form
Create fewer objects
Increase java heap space size

57
Q

Which property is used to scale up and down dynamically based on the application’s current number of pending tasks in a Spark cluster?

A

Dynamic Allocation

58
Q

If spark is running in client mode, where is the driver located

A

on the client machine that submitted the application

59
Q

What causes a stage boundary

A

a shuffle

60
Q

What function will avoid a shuffle if the new partitions are known to be less than the existing partitions

A

.coalesce(n), where n is less than the current number of partitions

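A sketch, assuming a DataFrame df with more than two partitions:

    print(df.rdd.getNumPartitions())      # e.g. 8
    fewer = df.coalesce(2)                # merges partitions in place: no shuffle
    print(fewer.rdd.getNumPartitions())   # 2
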
61
Q

When will a broadcast join be forced

A

By default, spark.sql.autoBroadcastJoinThreshold = 10MB; a table larger than this threshold will not be broadcast automatically. A broadcast can also be forced explicitly with the broadcast() hint.

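A sketch with hypothetical large_df and small_df DataFrames joined on a hypothetical key column:

    from pyspark.sql.functions import broadcast

    # the hint broadcasts small_df regardless of the 10 MB threshold
    joined = large_df.join(broadcast(small_df), "key")
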
62
Q

What command can we use to get the number of partitions of a DataFrame named df?

A

df.rdd.getNumPartitions()

63
Q

Layout the Catalyst Optimizer steps

A

SQL Query / DataFrame
→ Unresolved logical plan
→ (analysis, using the Catalog)
→ Logical plan
→ (logical optimization)
→ Optimized logical plan
→ (physical planning)
→ Physical plans
→ (cost model)
→ Selected physical plan
→ (code generation)
→ RDDs

64
Q

What is dynamic allocation?

A

If you are running multiple Spark Applications on the same cluster, Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload.

65
Q

What is required to turn on dynamic allocation

A

Set spark.dynamicAllocation.enabled to true
Set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application

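A sketch of setting both properties at session creation:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.shuffle.service.enabled", "true")   # external shuffle service
             .getOrCreate())
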
66
Q

What is the purpose of the external shuffle service

A

The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them.

67
Q

What is the default file format for output

A

Parquet

68
Q

Is .25 an acceptable input for a fraction

A

no

69
Q

What does adaptive query execution (AQE) allow you to do?

A

AQE attempts to do the following at runtime:

  1. Reduce the number of reducers in the shuffle stage by decreasing the number of shuffle partitions.
  2. Optimize the physical execution plan of the query, for example by converting a SortMergeJoin into a BroadcastHashJoin where appropriate.
  3. Handle data skew during a join.
70
Q

What can be done with the Spark Catalyst optimizer

A
  1. Dynamically convert physical plans to RDDs.
  2. Dynamically reorganize query orders.
  3. Dynamically select physical plans based on cost.
71
Q

What is an equivalent code block to:

df.filter(col("count") < 2)

A

df.where("count < 2")

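A sketch, assuming a DataFrame df with a count column:

    from pyspark.sql.functions import col

    a = df.filter(col("count") < 2)   # Column-expression form
    b = df.where("count < 2")         # SQL-string form; where() is an alias of filter()
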
72
Q

What is the purpose of a cluster manager

A

The cluster manager allocates resources to the Spark Applications and maintains the executor process in client mode

73
Q

What is the idea behind dynamic partition pruning in Spark

A

skip over data you do not need in the results of the query

74
Q

Will spark’s garbage collector clean up persisted objects

A

Yes, but in least-recently-used (LRU) order

75
Q

The Dataset API is not available in Python T/F

A

True

76
Q

A viable way to improve Spark’s performance when dealing with large amounts of data, given that there is only a single application running on the cluster

A

increase the values for spark.default.parallelism and spark.sql.shuffle.partitions

77
Q

Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of what

A

Dynamic Partition Pruning

78
Q

What is a Stage

A

A stage represents a group of tasks that can be executed together to perform the same operation on multiple executors in parallel.
A stage is a combination of transformations which does not cause any shuffling of data across nodes.
Spark starts a new stage when shuffling is required in the job.

79
Q

How many executors is a task sent to

A

1

80
Q

What is a task

A

Each task is a combination of blocks of data and a set of transformations that will run on a single executor.

81
Q

What is a possibility if the number of partitions is too small

A

If the number is too small it will reduce concurrency and possibly cause data skewing.

82
Q

If there are too many partitions…

A

there will be a mismatch between task scheduling and task execution.

83
Q

What is coalesce

A

Collapses partitions on the same worker to avoid a shuffle.

84
Q

What are some examples of transformations

A
select
sum
groupBy
orderBy
filter
limit
85
Q

What are examples of an action

A

show
count
collect
save

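A sketch combining the two cards above, assuming a DataFrame df with hypothetical color and price columns:

    summary = (df.select("color", "price")   # transformations: recorded lazily
                 .filter("price > 10")
                 .groupBy("color")
                 .sum("price"))
    summary.show()                           # action: triggers execution
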
86
Q

Coalesce cannot be used to increase the number of partitions T/F

A

True

87
Q

Is printSchema considered an action

A

No

88
Q

Is first considered an action

A

Yes

89
Q

When choosing a storage level, what suffix means serialized

A

SER (e.g., MEMORY_ONLY_SER)

90
Q

A driver

A

runs your main() function
assigns work to be done in parallel
maintains information about the Spark Application

91
Q

What happens at a stage boundary in spark

A

Data is written to disk by tasks in the parent stage and then fetched over the network by tasks in the child stage

92
Q

Is .foreach() an action

A

yes

93
Q

Is limit() considered an action

A

no

94
Q

In cluster mode, the driver will be put onto a worker node. T/F

A

True