Spark Flashcards

1
Q

What is Apache Spark?

A

Unified analytics engine for big data processing focused on in-memory processing and more flexible programming paradigms

2
Q

MapReduce shortcomings (2)

A
  1. Doesn’t take advantage of modern hardware
  2. Rigid programming paradigm (struggles with iterative algorithms)
3
Q

Two main components in Spark architecture

A
  1. Driver (master)
  2. Executors (workers)
4
Q

What does the Driver do in Spark?

A

Converts the user program into tasks (computes the DAG) and sends tasks to the executors

Initializes the SparkContext
Also launches a web UI

5
Q

What do Spark Executors do?

A

Perform computations and store RDD partitions in memory

6
Q

Spark Cluster Manager

A

Handles communication between the driver and executors and allocates resources (can be YARN, as in Hadoop MapReduce)

7
Q

Resilient Distributed Dataset (RDD)

A

Abstraction of a big data file, partitioned across the nodes of a cluster

8
Q

Directed Acyclic Graph (DAG)

A

Allows pipelining in Spark; builds a logical execution plan from the user's operations on RDDs so the chain can be optimized before execution

9
Q

Lazy evaluation

A

Because of DAGs, RDD transformations (map, filter, etc.) are not executed until an action is invoked, letting Spark optimize the computation chain before processing any data
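
The idea can be sketched in plain Python (this is not Spark's implementation; LazyPipeline is a made-up class for illustration): transformations only record an operation, and the work happens when the action runs.

```python
# Minimal pure-Python sketch of lazy evaluation (not real Spark):
# transformations only record functions; nothing runs until an action.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.ops = []          # recorded transformations (the "plan")

    def map(self, fn):
        self.ops.append(("map", fn))
        return self            # returns immediately; no data touched

    def filter(self, pred):
        self.ops.append(("filter", pred))
        return self

    def collect(self):         # the action: only now is the plan executed
        out = self.data
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

p = LazyPipeline([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# No computation has happened yet; collect() triggers the whole chain:
print(p.collect())  # [20, 30, 40]
```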

10
Q

Fault tolerance in Spark (process)

A

Spark stores the instructions by which an RDD was transformed (its lineage graph) and rebuilds the RDD if needed, without replicating the data
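
A plain-Python sketch of the lineage idea (base, lineage, and rebuild are hypothetical names, not Spark's API): a lost partition is recomputed by replaying the recorded transformations from the base data.

```python
# Pure-Python sketch: lineage = base data + the list of transformations,
# so a lost partition can be recomputed instead of replicated.

base = [[1, 2], [3, 4], [5, 6]]                 # partitions of the base RDD
lineage = [lambda x: x + 1, lambda x: x * 2]    # recorded transformations

def rebuild(partition_idx):
    """Recompute one partition by replaying its lineage from the base data."""
    part = base[partition_idx]
    for fn in lineage:
        part = [fn(x) for x in part]
    return part

# The executor holding partition 1 dies; replay lineage for just that partition:
print(rebuild(1))  # [8, 10]
```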

11
Q

Transformation

A

Operations that define a new RDD from an existing one by applying a function

12
Q

Action

A

Computes a result that is either returned to the driver program or saved to an external storage system

13
Q

examples of actions

A

collect() - returns all elements to the driver (from the executors)
count() - returns the number of elements in the RDD
saveAsTextFile() - saves the RDD to a distributed file system (DFS)
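
Python's own lazy map gives a rough analogy for the transformation/action split (this is plain Python, not Spark code):

```python
# Transformation-like step: Python 3's map is lazy; nothing is computed here.
data = range(1, 5)
doubled = map(lambda x: x * 2, data)

# Action-like steps: materializing pulls concrete results back, like
# collect() and count() return results to the driver.
materialized = list(doubled)   # collect()-like
print(materialized)            # [2, 4, 6, 8]
print(len(materialized))       # count()-like: 4
```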

14
Q

Narrow Dependencies

A

Allow pipelining on the same cluster node (each child partition depends on a single parent partition), enabling a more optimized execution plan

15
Q

Wide dependencies

A

Can’t be pipelined; each child partition depends on multiple parent partitions, so data shuffling is required

16
Q

2 Spark techniques to minimize shuffling

A

hash partitioning
range partitioning

17
Q

hash partitioning

A

Divides data evenly across partitions by hashing keys, for joins/aggregations

18
Q

range partitioning

A

Partitions data based on value ranges (e.g., to aggregate the salaries of people aged 40-50, records in that age range land in the same partition)

19
Q

2 shared variables in Spark that assist in distributed operations

A

Accumulators (e.g., counting errors, tracking performance)
Broadcast variables (cached on each node for easy access)
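
Both partitioning schemes (cards 17-18) can be sketched in plain Python (hash_partition and range_partition are hypothetical helpers, not Spark's internals):

```python
# Pure-Python sketch of the two partitioning schemes.

def hash_partition(key, num_partitions):
    # Same key always lands in the same partition within a run, so a
    # join/aggregation on that key needs no further shuffling.
    return hash(key) % num_partitions

def range_partition(value, boundaries):
    # Partition by value range: boundaries [30, 40, 50] put ages 40-49
    # together in partition 2.
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

assert hash_partition("user42", 4) == hash_partition("user42", 4)
print(range_partition(45, [30, 40, 50]))  # 2
```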

20
Q

Which of the two shared variables acts almost as a key/legend when the same data is needed across tasks?

A

Broadcast variables