Week 6: Apache Spark Flashcards
Apache Spark
It’s a unified engine for distributed data processing. Spark extends the MapReduce programming model with an abstraction that allows efficient data reuse.
Resilient Distributed Datasets (RDDs)
They allow for efficient data reuse. RDDs are read-only, fault-tolerant collections of records that can be operated on in parallel. They’re data structures that serve as the core unit of data in Spark. An RDD can be manipulated, its partitioning controlled, and it can be made persistent in memory. RDDs need not be materialised at all times.
RDD: Lineage
Each RDD knows its lineage, which is how it was derived from other datasets, and can compute its partitions from data in stable storage. Only RDDs that can be reconstructed after failure can be referenced by the user’s programme.
The RDD representation tracks lineage and provides transformations that can be composed arbitrarily. A representation includes a set of partitions, a set of dependencies on the parent RDDs, a function for computing the dataset based on its parents, and metadata about the partitioning scheme and data placement.
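Lineage Example
A minimal sketch of inspecting lineage in PySpark, assuming an existing SparkContext sc; toDebugString() describes an RDD and its recursive dependencies.
rdd = sc.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
print(doubled.toDebugString())  # prints the RDD's dependency chain back to the source data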
RDD: Persistence
Users can specify which RDDs they’ll reuse and choose storage strategies for them.
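Persistence Example
A minimal sketch of marking an RDD for reuse, assuming an existing SparkContext sc.
rdd = sc.parallelize(range(10000)).map(lambda x: x * x)
rdd.cache()   # shorthand for persist() with the default in-memory storage level
rdd.count()   # the first action computes the RDD and keeps it in memory
rdd.sum()     # later actions reuse the cached copy instead of recomputing the lineage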
RDD: Partitioning
Users can specify how the records of an RDD are partitioned across machines based on a key in each record.
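Partitioning Example
A minimal sketch of key-based partitioning, assuming an existing SparkContext sc; partitionBy() hash-partitions a pair RDD so records with the same key land in the same partition.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(2)   # hash-partition by key into 2 partitions
partitioned.glom().collect()         # glom() groups the records per partition for inspection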
Filter Example
rdd = sc.parallelize([1,2,3,4,5])
rdd.filter(lambda x: x % 2 == 0).collect()
Output
[2,4]
Map Example
rdd = sc.parallelize([2,3,4])
rdd.map(lambda x: list(range(1, x))).collect()
Output
[[1],[1,2],[1,2,3]]
FlatMap Example
rdd = sc.parallelize([2,3,4])
rdd.flatMap(lambda x: range(1,x)).collect()
Output
[1, 1, 2, 1, 2, 3]
RDD: Storage Strategies
By default, RDDs are stored in RAM. If there isn’t enough RAM, the RDDs are spilled to disk. Users can also store RDDs only on disk, replicate RDDs across machines (via flags to persist), or set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
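Storage Strategy Example
A minimal sketch of choosing a storage strategy, assuming an existing SparkContext sc; the storage levels shown are standard PySpark options.
from pyspark import StorageLevel
rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in RAM, spill to disk if memory runs out
# Other options: StorageLevel.DISK_ONLY (disk only), StorageLevel.MEMORY_ONLY_2 (replicated in RAM on two nodes)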
RDD: Lazy Evaluation
Spark executes a computation only when it encounters an action; transformations are evaluated lazily, with Spark just recording metadata about them. Lazy evaluation gives Spark a global view of the job, so it can optimise the required computation by grouping operations together, and it can recover from failures and slow workers.
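Lazy Evaluation Example
A minimal sketch, assuming an existing SparkContext sc: the transformation only records metadata, and nothing runs until the action.
rdd = sc.parallelize([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x * 10)   # transformation: only metadata is recorded, nothing runs yet
mapped.count()                       # action: triggers execution of the recorded lineage
Output
5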
RDD: Narrow Dependencies
Each partition of the parent RDD is used by at most one partition of the child RDD. Narrow dependencies allow pipelined execution on a single node (see the sketch below). Failure recovery is efficient: only the lost parent partitions need to be recomputed, and the recomputation can run in parallel.
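Narrow Dependency Example
A minimal sketch, assuming an existing SparkContext sc: map and filter each use one parent partition per child partition, so Spark pipelines them within a partition.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)
rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0).collect()
Output
[2, 4, 6]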
RDD: Wide Dependencies
Multiple child partitions may depend on a single partition of the parent RDD. Data from all parent partitions must be available and shuffled across nodes (see the sketch below). Failure recovery may involve many RDDs, and a complete re-execution may be needed.
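Wide Dependency Example
A minimal sketch, assuming an existing SparkContext sc: reduceByKey needs data from all parent partitions, so it triggers a shuffle across nodes.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()   # e.g. [('a', 3), ('b', 4)]; ordering may vary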
RDD: Advantages
- The main alternative is Distributed Shared Memory (DSM), a global address space that applications can read and write at arbitrary locations; the points below contrast RDDs with DSM.
- RDDs are created by coarse-grained transformations (applied to the entire dataset), but reads can be fine-grained (read from a specific location).
- RDDs are restricted to applications performing bulk writes, but this provides efficient fault tolerance.
- RDDs are read-only, so Spark can run backup copies of slow tasks without the copies interfering by writing to the same memory. Spark can distribute data over different nodes to run computations in parallel.
- Runtime can schedule tasks based on data locality to improve performance.
- RDDs degrade gracefully when there isn’t enough memory to store them.
RDD: Applications
They’re great for batch applications that perform the same operations on the entire dataset.
RDDs aren’t suitable for applications that make asynchronous fine-grained updates to shared state.
Spark Programming Interface
The driver programme defines the RDDs, invokes actions on them, and tracks the RDDs’ lineage. Workers are processes that receive tasks from the driver programme, read data blocks from the distributed file system, and store RDD partitions in RAM.
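Driver Programme Example
A minimal sketch of a driver programme in local mode (the application name is hypothetical): the driver creates the SparkContext, defines RDDs, and invokes an action that ships tasks to the workers.
from pyspark import SparkContext
sc = SparkContext("local[2]", "flashcard-demo")                # the driver creates the SparkContext
squares = sc.parallelize(range(1, 101)).map(lambda x: x * x)   # defines an RDD; the driver tracks its lineage
print(squares.reduce(lambda a, b: a + b))                      # action: tasks run on the workers; prints 338350
sc.stop()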
Spark’s Cluster Mode Components
- Cluster Manager: allocates resources across applications.
- Driver Programme: must listen for and accept incoming connections from its executors throughout its lifetime. It should run close to the workers, ideally in the same local area network.
- Executor: this is a process that performs computations and stores data.
- Task: unit of work that’ll be sent to the executor.
- SparkContext: connects to the cluster manager, acquires executors, and sends application code and tasks to the executors.
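Cluster Mode Example
A minimal sketch of wiring these components together (the master URL and memory setting are hypothetical): SparkConf describes the application, and creating the SparkContext connects to the cluster manager and acquires executors.
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("flashcard-demo")
        .setMaster("spark://master-host:7077")    # hypothetical standalone cluster manager URL
        .set("spark.executor.memory", "2g"))      # resources requested for each executor
sc = SparkContext(conf=conf)                      # connects to the cluster manager and acquires executors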