Week 6: Apache Spark Flashcards
Apache Spark
It’s a unified engine for distributed data processing. Spark extends the MapReduce programming model with an abstraction that allows efficient data reuse.
Resilient Distributed Datasets (RDDs)
They allow for efficient data reuse. RDDs are read-only, fault-tolerant collections of records that can be operated on in parallel. They’re data structures that serve as the core unit of data in Spark. An RDD can be manipulated, its partitioning controlled, and it can be made persistent in memory. RDDs need not be materialised at all times.
RDD: Lineage
Each RDD knows its lineage, which is how it was derived from other datasets, and can compute its partitions from data in stable storage. Only RDDs that can be reconstructed after failure can be referenced by the user’s programme.
The RDD representation tracks lineage and provides transformations that can be composed arbitrarily. A representation includes a set of partitions, a set of dependencies on the parent RDDs, a function for computing the dataset based on its parents, and metadata about the partitioning scheme and data placement.
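Lineage Example
A minimal sketch of inspecting lineage in PySpark, assuming an existing SparkContext sc; toDebugString() describes an RDD and its recursive dependencies.
rdd = sc.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
print(doubled.toDebugString())  # prints the RDD's dependency chain back to the source data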
RDD: Persistence
Users can specify which RDDs they’ll reuse and choose storage strategies for them.
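Persistence Example
A minimal sketch of marking an RDD for reuse, assuming an existing SparkContext sc.
rdd = sc.parallelize(range(10000)).map(lambda x: x * x)
rdd.cache()   # shorthand for persist() with the default in-memory storage level
rdd.count()   # the first action computes the RDD and keeps it in memory
rdd.sum()     # later actions reuse the cached copy instead of recomputing the lineage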
RDD: Partitioning
Users can specify how the records of an RDD are partitioned across machines based on a key in each record.
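Partitioning Example
A minimal sketch of key-based partitioning, assuming an existing SparkContext sc; partitionBy() hash-partitions a pair RDD so records with the same key land in the same partition.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(2)   # hash-partition by key into 2 partitions
partitioned.glom().collect()         # glom() groups the records per partition for inspection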
Filter Example
rdd = sc.parallelize([1,2,3,4,5])
rdd.filter(lambda x: x % 2 == 0).collect()
Output
[2,4]
Map Example
rdd = sc.parallelize([2,3,4])
rdd.map(lambda x: list(range(1, x))).collect()
Output
[[1],[1,2],[1,2,3]]
FlatMap Example
rdd = sc.parallelize([2,3,4])
rdd.flatMap(lambda x: range(1,x)).collect()
Output
[1, 1, 2, 1, 2, 3]
RDD: Storage Strategies
By default, RDDs are stored in RAM. If there isn’t enough RAM, the RDDs are spilled to disk. Users can also store RDDs only on disk, replicate RDDs across machines (via flags to persist), or set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
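Storage Strategy Example
A minimal sketch of choosing a storage strategy, assuming an existing SparkContext sc; the storage levels shown are standard PySpark options.
from pyspark import StorageLevel
rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in RAM, spill to disk if memory runs out
# Other options: StorageLevel.DISK_ONLY (disk only), StorageLevel.MEMORY_ONLY_2 (replicated in RAM on two nodes)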
RDD: Lazy Evaluation
Spark executes a computation only when it encounters an action; transformations are evaluated lazily, with Spark just recording metadata about them. Lazy evaluation gives Spark a global view of the job, so it can optimise the required computation by grouping operations together, and it can recover from failures and slow workers.
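Lazy Evaluation Example
A minimal sketch, assuming an existing SparkContext sc: the transformation only records metadata, and nothing runs until the action.
rdd = sc.parallelize([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x * 10)   # transformation: only metadata is recorded, nothing runs yet
mapped.count()                       # action: triggers execution of the recorded lineage
Output
5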
RDD: Narrow Dependencies
Each partition of the parent RDD is used by at most one partition of the child RDD. Narrow dependencies allow pipelined execution on a single node (see the sketch below). Failure recovery is efficient: only the lost parent partitions need to be recomputed, and the recomputation can run in parallel.
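Narrow Dependency Example
A minimal sketch, assuming an existing SparkContext sc: map and filter each use one parent partition per child partition, so Spark pipelines them within a partition.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)
rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0).collect()
Output
[2, 4, 6]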
RDD: Wide Dependencies
Multiple child partitions may depend on a single partition of the parent RDD. Data from all parent partitions must be available and shuffled across nodes (see the sketch below). Failure recovery may involve many RDDs, and a complete re-execution may be needed.
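Wide Dependency Example
A minimal sketch, assuming an existing SparkContext sc: reduceByKey needs data from all parent partitions, so it triggers a shuffle across nodes.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()   # e.g. [('a', 3), ('b', 4)]; ordering may vary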
RDD: Advantages
- The main alternative is Distributed Shared Memory (DSM), a global address space that applications can read and write at arbitrary locations; the points below contrast RDDs with DSM.
- RDDs are created by coarse-grained transformations (applied to the entire dataset), but reads can be fine-grained (read from a specific location).
- RDDs are restricted to applications performing bulk writes, but this provides efficient fault tolerance.
- RDDs are read-only, so Spark can run backup copies of slow tasks without the copies interfering by writing to the same memory. Spark can distribute data over different nodes to run computations in parallel.
- Runtime can schedule tasks based on data locality to improve performance.
- RDDs degrade gracefully when there isn’t enough memory to store them.
RDD: Applications
They’re great for batch applications that perform the same operations on the entire dataset.
RDDs aren’t suitable for applications that make asynchronous fine-grained updates to shared state.
Spark Programming Interface
The driver programme defines the RDDs, invokes actions on them, and tracks the RDDs’ lineage. Workers are processes that receive tasks from the driver programme, read data blocks from the distributed file system, and store RDD partitions in RAM.
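Driver Programme Example
A minimal sketch of a driver programme in local mode (the application name is hypothetical): the driver creates the SparkContext, defines RDDs, and invokes an action that ships tasks to the workers.
from pyspark import SparkContext
sc = SparkContext("local[2]", "flashcard-demo")                # the driver creates the SparkContext
squares = sc.parallelize(range(1, 101)).map(lambda x: x * x)   # defines an RDD; the driver tracks its lineage
print(squares.reduce(lambda a, b: a + b))                      # action: tasks run on the workers; prints 338350
sc.stop()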
Spark’s Cluster Mode Components
- Cluster Manager: allocates resources across applications.
- Driver Programme: must listen for and accept incoming connections from its executors throughout its lifetime. It should run close to the workers, ideally in the same local area network.
- Executor: this is a process that performs computations and stores data.
- Task: unit of work that’ll be sent to the executor.
- SparkContext: connects to the cluster manager, acquires executors, and sends application code and tasks to the executors.
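Cluster Mode Example
A minimal sketch of wiring these components together (the master URL and memory setting are hypothetical): SparkConf describes the application, and creating the SparkContext connects to the cluster manager and acquires executors.
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("flashcard-demo")
        .setMaster("spark://master-host:7077")    # hypothetical standalone cluster manager URL
        .set("spark.executor.memory", "2g"))      # resources requested for each executor
sc = SparkContext(conf=conf)                      # connects to the cluster manager and acquires executors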