Spark Flashcards
What is Apache Spark?
Unified analytics engine for big data processing, focused on in-memory processing and a more flexible programming paradigm
MapReduce shortcomings (2)
- Doesn't take advantage of modern hardware (writes intermediate results to disk instead of keeping them in memory)
- Strict programming paradigm (suffers with iterative algorithms)
Two main components in Spark architecture
- Driver (Master)
- Executors (Slaves)
What does the Driver do in Spark?
Converts the user program into tasks (computes the DAG) and sends the tasks to executors
Initializes the SparkContext
Also launches a web UI
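A minimal sketch of a driver program starting up (assuming PySpark; the app name and local master are placeholders):

```python
from pyspark import SparkConf, SparkContext

# The driver initializes the SparkContext, which connects to the
# cluster manager and coordinates the executors.
conf = SparkConf().setAppName("flashcards-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
# ... define RDDs and run jobs here ...
sc.stop()
```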
What do Spark Executors do?
Perform the computations (run tasks assigned by the driver) and store RDD partitions in memory
Spark Cluster Manager
Handles communication between the driver and executors and allocates cluster resources (can be YARN, as in Hadoop)
Resilient Distributed Dataset (RDD)
Immutable abstraction of a big dataset partitioned across the nodes in a cluster
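A short PySpark sketch (reusing `sc` from the driver example above) showing an RDD as a partitioned collection:

```python
# parallelize() distributes a local collection across the cluster as an RDD;
# numSlices controls how many partitions it is split into.
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())  # 4
```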
Directed Acyclic Graph
Allows pipelining in Spark; builds a logical execution plan from the user's operations on an RDD so it can be optimized before execution
Lazy evaluation
Because of the DAG, RDD transformations (map, filter, etc.) are not executed until an action is invoked, letting Spark optimize the whole computation chain before processing any data
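A quick illustration of laziness in PySpark (a sketch; `sc` is an existing SparkContext):

```python
# These transformations only record the computation; nothing runs yet.
doubled = sc.parallelize(range(10)).map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)
# Only the action below triggers Spark to build the DAG and execute it.
print(evens.count())  # 5
```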
Fault tolerance in Spark (process)
Spark stores the instructions for how an RDD was derived (its lineage graph) and rebuilds lost partitions from that lineage if needed, without replicating the data
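You can inspect the lineage Spark would replay using `toDebugString()` (a sketch; in PySpark it returns bytes):

```python
rdd = sc.parallelize(range(10)).map(lambda x: x + 1).filter(lambda x: x > 5)
# Prints the lineage graph: the chain of transformations Spark can
# re-run to rebuild lost partitions without replicated data.
print(rdd.toDebugString().decode())
```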
Transformation
Operations that define a new RDD from an existing one by applying a function
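For example (PySpark sketch), each transformation returns a new RDD and leaves the input RDD unchanged:

```python
words = sc.parallelize(["spark", "rdd", "dag"])
upper = words.map(lambda w: w.upper())       # map: apply a function to each element
short = upper.filter(lambda w: len(w) <= 3)  # filter: keep matching elements
```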
Action
Computes a result that is either returned to the driver program or saved to an external storage system
examples of actions
collect() - returns all elements to the driver (from the executors)
count() - # of elements in the RDD
saveAsTextFile() - saves the RDD to a DFS
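The same three actions in a PySpark sketch (the output path is a hypothetical HDFS location):

```python
nums = sc.parallelize([1, 2, 3, 4])
print(nums.collect())  # [1, 2, 3, 4], returned to the driver
print(nums.count())    # 4
# Writes one part-file per partition; the path is a placeholder.
nums.saveAsTextFile("hdfs:///tmp/nums-out")
```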
Narrow Dependencies
each partition of the parent RDD is used by at most one partition of the child, allowing pipelining on the same cluster node for a more optimized execution plan
Wide dependencies
each partition of the parent RDD may be used by multiple child partitions, so they can't be pipelined and require data shuffling across the network
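A sketch contrasting the two in PySpark: `mapValues` is a narrow dependency (each output partition depends on one input partition), while `reduceByKey` is wide (it shuffles records by key across partitions):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
scaled = pairs.mapValues(lambda v: v * 10)        # narrow: pipelined, no shuffle
summed = scaled.reduceByKey(lambda a, b: a + b)   # wide: triggers a shuffle
print(summed.collect())  # e.g. [('a', 40), ('b', 20)]
```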