Spark Flashcards
What is Apache Spark?
Unified analytics engine for big data processing, focused on in-memory processing and a more flexible programming paradigm
MapReduce shortcomings (2)
- Doesn't take advantage of modern hardware (writes intermediate results to disk instead of keeping them in memory)
- Strict programming paradigm (suffers with iterative algorithms)
Two main components in Spark architecture
- Driver (Master)
- Executors (Slaves)
What does the Driver do in Spark?
Converts the user program into tasks (computes the DAG) and sends the tasks to executors
Initializes the SparkContext
Also launches a web UI
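A minimal sketch of a driver program starting up (assuming PySpark; the app name and local master are placeholders):

```python
from pyspark import SparkConf, SparkContext

# The driver initializes the SparkContext, which connects to the
# cluster manager and coordinates the executors.
conf = SparkConf().setAppName("flashcards-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
# ... define RDDs and run jobs here ...
sc.stop()
```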
What do Spark Executors do?
Perform the computations (run tasks assigned by the driver) and store RDD partitions in memory
Spark Cluster Manager
Handles communication between the driver and executors and allocates cluster resources (can be YARN, as in Hadoop)
Resilient Distributed Dataset (RDD)
Immutable abstraction of a big dataset partitioned across the nodes in a cluster
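A short PySpark sketch (reusing `sc` from the driver example above) showing an RDD as a partitioned collection:

```python
# parallelize() distributes a local collection across the cluster as an RDD;
# numSlices controls how many partitions it is split into.
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())  # 4
```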
Directed Acyclic Graph
Allows pipelining in Spark; builds a logical execution plan from the user's operations on an RDD so it can be optimized before execution
Lazy evaluation
Because of the DAG, RDD transformations (map, filter, etc.) are not executed until an action is invoked, letting Spark optimize the whole computation chain before processing any data
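A quick illustration of laziness in PySpark (a sketch; `sc` is an existing SparkContext):

```python
# These transformations only record the computation; nothing runs yet.
doubled = sc.parallelize(range(10)).map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)
# Only the action below triggers Spark to build the DAG and execute it.
print(evens.count())  # 5
```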
Fault tolerance in Spark (process)
Spark stores the instructions for how an RDD was derived (its lineage graph) and rebuilds lost partitions from that lineage if needed, without replicating the data
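You can inspect the lineage Spark would replay using `toDebugString()` (a sketch; in PySpark it returns bytes):

```python
rdd = sc.parallelize(range(10)).map(lambda x: x + 1).filter(lambda x: x > 5)
# Prints the lineage graph: the chain of transformations Spark can
# re-run to rebuild lost partitions without replicated data.
print(rdd.toDebugString().decode())
```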
Transformation
Operations that define a new RDD from an existing one by applying a function
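For example (PySpark sketch), each transformation returns a new RDD and leaves the input RDD unchanged:

```python
words = sc.parallelize(["spark", "rdd", "dag"])
upper = words.map(lambda w: w.upper())       # map: apply a function to each element
short = upper.filter(lambda w: len(w) <= 3)  # filter: keep matching elements
```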
Action
Computes a result that is either returned to the driver program or saved to an external storage system
examples of actions
collect() - returns all elements to the driver (from the executors)
count() - # of elements in the RDD
saveAsTextFile() - saves the RDD to a DFS
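The same three actions in a PySpark sketch (the output path is a hypothetical HDFS location):

```python
nums = sc.parallelize([1, 2, 3, 4])
print(nums.collect())  # [1, 2, 3, 4], returned to the driver
print(nums.count())    # 4
# Writes one part-file per partition; the path is a placeholder.
nums.saveAsTextFile("hdfs:///tmp/nums-out")
```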
Narrow Dependencies
each partition of the parent RDD is used by at most one partition of the child, allowing pipelining on the same cluster node for a more optimized execution plan
Wide dependencies
each partition of the parent RDD may be used by multiple child partitions, so they can't be pipelined and require data shuffling across the network
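A sketch contrasting the two in PySpark: `mapValues` is a narrow dependency (each output partition depends on one input partition), while `reduceByKey` is wide (it shuffles records by key across partitions):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
scaled = pairs.mapValues(lambda v: v * 10)        # narrow: pipelined, no shuffle
summed = scaled.reduceByKey(lambda a, b: a + b)   # wide: triggers a shuffle
print(summed.collect())  # e.g. [('a', 40), ('b', 20)]
```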