Architecture Flashcards
A dataframe is immutable. True/False
True
How are changes tracked on dataframes
The initial state is immutable and kept on each node; modifications are expressed as transformations, which are shared with each node
How can you see the lineage of a data frame
.explain("formatted")
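A minimal PySpark sketch (the DataFrame here is just an illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).withColumn("doubled", col("id") * 2)
df.explain("formatted")  # "formatted" mode (Spark 3.0+) splits the plan into an overview and details
```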
What triggers a transformation on a dataframe
An action
A transformation where one partition results in one output partition is called what
Narrow Transformation or Narrow Dependency
In the parsed logical plan and the analyzed logical plan, which uses the catalog
analyzed
How many CPU cores per partition
1
T/F Cluster Manager is a component of a Spark App
False
Where is the driver in deploy-mode cluster
On a node inside the cluster. The Cluster Manager is responsible for maintaining the cluster and executor nodes
Where is the driver in deploy-mode client
On a node not in the cluster
Is there a performance difference between writing SQL queries and DataFrame code
NO
What kind of programming model is Spark
Functional - Same inputs lead to the same outputs; transformations are constant
When you perform a shuffle, Spark outputs how many partitions
200 (the default value of spark.sql.shuffle.partitions)
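A small PySpark sketch of where the 200 shows up (the bucketing expression is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden

df = spark.range(1000)
counts = df.groupBy((df.id % 5).alias("bucket")).count()  # groupBy triggers a shuffle
print(counts.rdd.getNumPartitions())  # 200 by default (AQE may coalesce this at runtime)
```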
What is schema inference
Take the best guess at what the schema of our data frame should be
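A sketch of triggering schema inference when reading a file; the CSV path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "/tmp/people.csv" is a hypothetical path.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")  # sample the file and guess each column's type
      .csv("/tmp/people.csv"))
df.printSchema()
```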
What port does the Spark UI run on
4040
What type of transformation is aggregation
wide
What type of transformation is filter
Narrow
What are the 3 kinds of actions
- View data in the console
- Collect data to native objects in the respective language
- Write to output data sources
.count() is an example of a what
an action
What is predicate pushdown
Automatically pushing filters down to the data source, so less data is read into Spark
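A sketch of how you might observe this; the Parquet path and column name are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# "/tmp/events.parquet" and the "year" column are made up for illustration.
df = spark.read.parquet("/tmp/events.parquet")
df.filter(col("year") == 2020).explain(True)
# The physical plan lists the filter under "PushedFilters",
# meaning the Parquet reader itself skips non-matching data.
```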
What is lazy evaluation
Spark will wait until the very last moment to execute the graph of computation instructions
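A small PySpark illustration (the column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
evens = df.filter(df.id % 2 == 0)          # transformation: nothing runs yet
doubled = evens.selectExpr("id * 2 AS x")  # still nothing runs
print(doubled.count())                     # action: the whole plan executes now
```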
Shuffles will perform filters and then…
Write the results to disk
What is pipelining
On narrow transformations, a chain of operations (such as filters) is performed in memory, without writing intermediate results to disk
A wide dependency is
Input partitions contributing to many output partitions
What is a narrow dependency
Each input partition will contribute to only one output partition
What are the 2 types of transformations
Narrow dependencies and wide dependencies
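A quick PySpark illustration of each (the expressions are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)
narrow = df.filter(df.id > 10)                     # narrow: one input partition -> one output partition
wide = df.groupBy((df.id % 3).alias("k")).count()  # wide: input partitions feed many output partitions (shuffle)
```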
Spark will not act on transformations until
an action is called
Are core data structures mutable or immutable
immutable
With Dataframes, you have to manipulate partitions manually
False
If you have one partition and many executors, what parallelism do you have
1
What is a partition
A collection of rows that sit on one physical machine
To allow every executor to perform in parallel, Spark breaks the data into
Partitions
What is a dataframe
Structured API that represents a table of data with rows and columns
How many spark sessions can you have across a Spark App
1
You control your SparkApp through a driver process called
SparkSession
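A minimal sketch of creating one (the app name is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-app")    # hypothetical name
         .master("local[*]")   # local mode: driver and executors share one machine
         .getOrCreate())
print(spark.version)
```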
What are Spark’s Language APIS
Scala, Java, R, Python, SQL
What is the point of the cluster manager
Keep track of resources available
What is local mode
Driver and Executor live on the same machine
What are the 3 core cluster managers
Spark’s Standalone Manager
YARN
Mesos
The driver process is responsible for what 3 things
Maintaining info about the Spark App
Respond to the user program and input
Analyze, distribute and schedule work across executors
Which process runs your main() function
driver
A spark app consists of what two processes
Driver
Executor
Executors are responsible for what two things
Executing code assigned to it
Reporting state of the computation back to the driver node
At which stage do the first set of optimizations take place?
Logical Optimization
When using DataFrame.persist() data on disk is always serialized. T/F
True
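One way to see this in PySpark (sketch only; the DataFrame is arbitrary):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
df.persist(StorageLevel.DISK_ONLY)  # data stored on disk is kept serialized
df.count()        # an action materializes the cached data
df.unpersist()    # note: the method is unpersist(); there is no uncache()
```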
Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions. Which property needs to be enabled to achieve this?
spark.sql.adaptive.skewJoin.enabled
The goal of Dynamic Partition Pruning (DPP) is to allow you to read only as much data as you need. Which property needs to be set in order to use this functionality?
spark.sql.optimizer.dynamicPartitionPruning.enabled
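A sketch of setting both properties in PySpark (skew-join handling also assumes AQE itself is enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Skew-join handling is part of AQE, so AQE must be on as well.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Dynamic Partition Pruning (on by default in Spark 3.x).
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```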
The DataFrame class does not have an uncache() operation T/F
True
What are worker nodes
Worker nodes are the nodes of a cluster that perform computations
For text files, we can only have one column of a dataframe we want to write T/F
True
How do you specify a left outer join
left_outer
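A tiny PySpark example (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame([(1, "Ann"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 9.99)], ["id", "amount"])

# "left_outer" (alias: "left") keeps every row from the left DataFrame.
people.join(orders, on="id", how="left_outer").show()
```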
A job is
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
What is the relationship between an executor and a worker
An executor is a Java Virtual Machine (JVM) running on a worker node.
How are global temp views addressed
spark.read.table("global_temp.<view name>")
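A sketch, assuming a view named my_view:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(5).createOrReplaceGlobalTempView("my_view")

# Global temp views live in the reserved global_temp database and are
# visible to every SparkSession in the application.
spark.read.table("global_temp.my_view").show()
```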
When is a data frame writer treated as a global external/unmanaged table
Spark manages the metadata, while you control the data location. As soon as you add the 'path' option in the DataFrame writer, it will be treated as a global external/unmanaged table. When you drop the table, only the metadata gets dropped. A global unmanaged/external table is available across all clusters.
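A sketch; the path and table name are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adding the 'path' option makes the table external/unmanaged:
# dropping it later removes only the metadata, not the files.
(spark.range(10)
 .write
 .option("path", "/tmp/unmanaged_demo")  # hypothetical location
 .saveAsTable("demo_table"))
```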
What are the possible strategies to decrease garbage collection time?
Persist objects in serialized form
Create fewer objects
Increase Java heap space size
Which property is used to scale up and down dynamically based on an application's current number of pending tasks in a Spark cluster?
Dynamic allocation (spark.dynamicAllocation.enabled)
If spark is running in client mode, where is the driver located
on the client machine that submitted the application
What causes a stage boundary
a shuffle
What function will avoid a shuffle if the new partitions are known to be less than the existing partitions
.coalesce(lesser number)
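A quick PySpark illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).repartition(8)
print(df.coalesce(2).rdd.getNumPartitions())   # 2: collapsed without a full shuffle
print(df.coalesce(16).rdd.getNumPartitions())  # still 8: coalesce cannot increase partitions
```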
When will a broadcast join be forced
By default spark.sql.autoBroadcastJoinThreshold = 10MB; tables below this threshold are broadcast automatically, while anything above it will not be. A broadcast join can be forced explicitly with the broadcast() hint.
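A sketch of forcing a broadcast with the broadcast() hint (the data is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# broadcast() forces a broadcast join regardless of the size threshold.
large.join(broadcast(small), on="key").explain()  # look for BroadcastHashJoin
```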
What command can we use to get the number of partitions of a DataFrame named df?
df.rdd.getNumPartitions()
Lay out the Catalyst Optimizer steps
SQL Query / DataFrame
→ Unresolved Logical Plan
→ (analysis, using the Catalog)
→ Logical Plan
→ (logical optimization)
→ Optimized Logical Plan
→ (physical planning)
→ Physical Plans
→ (cost model)
→ Selected Physical Plan
→ (code generation)
→ RDDs
What is dynamic allocation?
If you are running multiple Spark Applications on the same cluster, Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload.
What is required to turn on dynamic allocation
- Set spark.dynamicAllocation.enabled to true
- Set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application
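A sketch of the configuration in PySpark; in practice these are usually passed via spark-submit --conf, since they must be set before the application starts:

```python
from pyspark.sql import SparkSession

# Illustration only: both settings must be in place at application startup.
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())
```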
What is the purpose of the external shuffle service
The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them.
What is the default file format for output
Parquet
is .25 an acceptable input for a fraction
no
What does adaptive query execution (AQE) allow you to do?
AQE attempts to do the following at runtime:
- Reduce the number of reducers in the shuffle stage by decreasing the number of shuffle partitions.
- Optimize the physical execution plan of the query, for example by converting a SortMergeJoin into a BroadcastHashJoin where appropriate.
- Handle data skew during a join.
What can be done with Spark catalyst optimizer
- Dynamically convert physical plans to RDDs.
- Dynamically reorganize query orders.
- Dynamically select physical plans based on cost.
What is an equivalent code block to:
df.filter(col("count") < 2)
df.where("count < 2")
What is the purpose of a cluster manager
The cluster manager allocates resources to the Spark Applications and maintains the executor process in client mode
What is the idea behind dynamic partition pruning in Spark
skip over data you do not need in the results of the query
Will spark’s garbage collector clean up persisted objects
Yes, in least-recently-used (LRU) fashion
The Dataset API is not available in Python T/F
True
A viable way to improve Spark’s performance when dealing with large amounts of data, given that there is only a single application running on the cluster
increase the values for spark.default.parallelism and spark.sql.shuffle.partitions
Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of what
Dynamic Partition Pruning
What is a Stage
A stage represents a group of tasks that can be executed together to perform the same operation on multiple executors in parallel.
A stage is a combination of transformations which does not cause any shuffling of data across nodes.
Spark starts a new stage when shuffling is required in the job.
How many executors is a task sent to
1
What is a task
Each task is a combination of blocks of data and a set of transformations that will run on a single executor.
What is a possibility if the number of partitions is too small
If the number is too small it will reduce concurrency and possibly cause data skewing.
If there are too many partitions…
there will be a mismatch between task scheduling and task execution: scheduling overhead can outweigh the actual work done per task.
What is coalesce
Collapses partitions on the same worker to avoid shuffling.
What are some examples of transformations
select, sum, groupBy, orderBy, filter, limit
What are examples of an action
show
count
collect
save
Coalesce cannot be used to increase the number of partitions T/F
True
Is printSchema considered an action
No
Is first considered an action
Yes
When choosing a storage level, what denotes serialized
SER
A driver
runs your main() function
assigns work to be done in parallel
maintains information about the Spark Application
What happens at a stage boundary in spark
data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage
is foreach() an action
yes
is limit() considered an action
no
In cluster mode the driver will be put onto a worker node t/f
true