Week 6: Apache Spark Flashcards

1
Q

Apache Spark

A

It's a unified engine for distributed data processing. Spark extends the MapReduce programming model with an abstraction (the RDD) that allows efficient data reuse.

2
Q

Resilient Distributed Datasets (RDDs)

A

RDDs allow for efficient data reuse. An RDD is a read-only, fault-tolerant collection of records that can be operated on in parallel; RDDs are the core unit of data in Spark. An RDD can be manipulated, its partitioning controlled, and it can be persisted in memory. An RDD may or may not be materialised at any given time.

3
Q

RDD: Lineage

A

Each RDD knows its lineage, i.e. how it was derived from other datasets, and can compute its partitions from data in stable storage. A program can only reference RDDs that it can reconstruct after a failure.

The RDD representation tracks lineage and provides transformations that can be composed arbitrarily. A representation includes a set of partitions, a set of dependencies on parent RDDs, a function for computing the dataset from its parents, and metadata about the partitioning scheme and data placement.
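
A minimal sketch, assuming a SparkContext sc is already available (as in the pyspark shell): toDebugString() prints the lineage Spark has recorded for an RDD.

rdd = sc.parallelize(range(100))            # base RDD from a local collection
evens = rdd.filter(lambda x: x % 2 == 0)    # derived RDD
pairs = evens.map(lambda x: (x % 10, x))    # derived again
# toDebugString() returns the recorded lineage (the chain of parent RDDs);
# decode because PySpark returns it as bytes.
print(pairs.toDebugString().decode("utf-8"))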

4
Q

RDD: Persistence

A

Users can specify which RDDs they'll reuse and choose storage strategies for them.
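
A minimal sketch, assuming a SparkContext sc and a hypothetical input file data.txt:

lines = sc.textFile("data.txt")                       # data.txt is a hypothetical input file
errors = lines.filter(lambda line: "ERROR" in line)
errors.persist()     # equivalently errors.cache(): keep the RDD in memory once computed
errors.count()       # first action computes the RDD and caches its partitions
errors.count()       # later actions reuse the cached partitions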

5
Q

RDD: Partitioning

A

Users can specify how an RDD's records are partitioned across machines based on the key in each record.
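
For example (a sketch assuming a SparkContext sc), partitionBy() hash-partitions key-value records so that records with the same key end up in the same partition:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
partitioned = pairs.partitionBy(2)     # 2 partitions, assigned by hashing each record's key
partitioned.getNumPartitions()         # -> 2
partitioned.glom().collect()           # shows which records landed in which partition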

6
Q

Filter Example

rdd = sc.parallelize([1,2,3,4,5])
rdd.filter(lambda x: x % 2 == 0).collect()

A

Output
[2,4]

7
Q

Map Example

rdd = sc.parallelize([2,3,4])
rdd.map(lambda x: list(range(1, x))).collect()

A

Output
[[1],[1,2],[1,2,3]]

8
Q

FlatMap Example

rdd = sc.parallelize([2,3,4])
rdd.flatMap(lambda x: range(1,x)).collect()

A

Output
[1, 1, 2, 1, 2, 3]

9
Q

RDD: Storage Strategies

A

By default, RDDs are stored in RAM. If there isn't enough RAM, RDDs are spilled to disk. Users can also store RDDs only on disk, replicate RDDs across machines via flags to persist, or set a persistence priority on each RDD to specify which in-memory data should spill to disk first.

10
Q

RDD: Lazy Evaluation

A

Spark evaluates transformations lazily, recording only metadata for them, and executes them only once it sees the first action. Lazy evaluation gives Spark a global view of the computation, so it can optimise the required calculations by grouping operations together, and it can recover from failures and slow workers.
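
A small sketch (assuming a SparkContext sc): the transformations below only record metadata; nothing runs until collect() is called.

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)              # transformation: recorded, not executed
evens = doubled.filter(lambda x: x % 4 == 0)    # transformation: still nothing has run
evens.collect()                                 # action: executes the whole chain
Output: [0, 4, 8, 12, 16]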

11
Q

RDD: Narrow Dependencies

A

Each partition of the parent RDD is used by at most one partition of the child RDD. Narrow dependencies allow pipelined execution on a single node. Failure recovery is efficient, as only the lost parent partitions need to be recomputed, and the recomputation can run in parallel.
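
For instance (assuming a SparkContext sc), map and filter create narrow dependencies, so Spark pipelines them within each partition without moving data between nodes:

rdd = sc.parallelize(range(8), 4)          # 4 partitions
piped = rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
# Each output partition depends on exactly one parent partition,
# so map and filter run back-to-back on the same node.
piped.collect()
Output: [2, 4, 6, 8]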

12
Q

RDD: Wide Dependencies

A

Multiple child partitions may depend on a single partition of the parent RDD. Wide dependencies require data from all parent partitions to be available and shuffled across nodes. Failure recovery involves many RDDs, and a complete re-execution may be needed.
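
By contrast (assuming a SparkContext sc), reduceByKey creates a wide dependency: values for a key may live in any parent partition, so they must be shuffled across nodes:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 2)
sums = pairs.reduceByKey(lambda x, y: x + y)   # wide dependency: requires a shuffle
sorted(sums.collect())
Output: [('a', 4), ('b', 6)]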

13
Q

RDD: Advantages

A
  1. The main alternative is Distributed Shared Memory (DSM), a global address space that applications can read and write at arbitrary locations; RDDs give up this generality in exchange for the benefits below.
  2. RDDs are created by coarse-grained transformations (applied to the entire dataset), but reads can be fine-grained (read from a specific location).
  3. RDDs are restricted to applications performing bulk writes, but in return they provide efficient fault tolerance.
  4. RDDs are read-only, so Spark can run backup copies of slow tasks without them accessing the same memory. Spark can distribute data over different nodes to run computations in parallel.
  5. The runtime can schedule tasks based on data locality to improve performance.
  6. RDDs degrade gracefully when there isn't enough memory to store them.
14
Q

RDD: Applications

A

They’re great for batch applications that perform the same operations on the entire dataset.

RDDs aren't suitable for applications that make asynchronous, fine-grained updates to shared state.

15
Q

Spark Programming Interface

A

The driver program defines the RDDs, invokes actions on them, and tracks the RDDs' lineage. Worker processes receive tasks from the driver program, read data blocks from the distributed file system, and store RDD partitions in RAM.
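
A minimal standalone driver program sketch (the app name, master URL, and input path are illustrative assumptions):

from pyspark import SparkContext

# The driver creates the SparkContext, defines RDDs, and invokes actions on them.
sc = SparkContext(master="local[2]", appName="Week6Example")
words = sc.textFile("file:///usr/share/dict/words")   # workers read blocks of this file
long_words = words.filter(lambda w: len(w) > 10)      # lineage is tracked by the driver
print(long_words.count())                             # action: tasks are sent to the workers
sc.stop()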

16
Q

Spark’s Cluster Mode Components

A
  1. Cluster Manager: allocates resources across applications.
  2. Driver Program: must listen for and accept incoming connections from its executors throughout its lifetime. It should run close to the workers, i.e. on the same local area network.
  3. Executor: this is a process that performs computations and stores data.
  4. Task: unit of work that’ll be sent to the executor.
  5. SparkContext: connects to the cluster manager, acquires executors, sends application code and tasks to executors.
17
Q

pySpark

A

This is Spark’s Python interface.

18
Q

pySpark Commands: Start Shell

A

./bin/pyspark --master local

19
Q

pySpark Commands: Start Shell With k Worker Threads

A

./bin/pyspark --master local[k]
(ideally, k = number of cores)

20
Q

pySpark Commands: Create RDD

A

tf = sc.textFile("file:///usr/share/dict/words")
(tf is a pointer to the file; no loading is performed yet)
(sc is the SparkContext variable, which is created automatically)

21
Q

pySpark Commands: Parallelised Collections

A

d=[1,2,3,4,5]
parallel_col = sc.parallelize(d)
(sc is the SparkContext variable, which is created automatically)

22
Q

pySpark Commands: Count

A

parallel_col.count()
(parallel_col is a previously defined parallelised collection)

23
Q

pySpark Commands: Filter

A

lines_nonempty = tf.filter(lambda x: len(x) > 0)
(tf is a pointer to the file; no loading is performed yet)

24
Q

pySpark Commands: Map

A

nums = sc.parallelize([1,2,3,4])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print(num, end=",")
Output: 1,4,9,16,

25
Q

pySpark Commands: FlatMap

A

x = sc.parallelize(["a b", "c d"])
y = x.flatMap(lambda s: s.split(" ")).collect()
Output: ['a', 'b', 'c', 'd']

26
Q

pySpark Commands: Union

A

rddA.union(rddB)

27
Q

pySpark Commands: Intersection

A

rddA.intersection(rddB)

28
Q

pySpark Commands: Subtract

A

rddA.subtract(rddB)

29
Q

pySpark Commands: Cartesian

A

rddA.cartesian(rddB)

30
Q

pySpark Commands: Join

A

rddA.join(rddB, [number of reduce tasks])
(the second argument is optional)
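
A small usage sketch (assuming a SparkContext sc); join matches pairs by key:

rddA = sc.parallelize([("a", 1), ("b", 2)])
rddB = sc.parallelize([("a", "x"), ("a", "y")])
sorted(rddA.join(rddB).collect())
Output: [('a', (1, 'x')), ('a', (1, 'y'))]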

31
Q

pySpark Commands: Reduce

A

rdd = sc.parallelize([1,2,3,4,5])
sum = rdd.reduce(lambda x, y: x + y)
sum
Output: 15

32
Q

pySpark Commands: Fold

A

Works the same way as reduce, but the first argument is the zero (identity) value for the aggregation, which also determines the type of the result.

rdd = sc.parallelize([1,2,3,4,5])
sum = rdd.fold(0.0, lambda x, y: x + y)
sum
Output: 15.0

33
Q

Accumulators

A

Accumulators only work for operations that are both associative and commutative (e.g. addition).

Example:
accum = sc.accumulator(0)
accum
Output: Accumulator<id=0, value=0>
sc.parallelize([1,2,3,4,5,6]).foreach(lambda x: accum.add(x))
accum.value
Output: 15

34
Q

Storage Strategy: MEMORY_ONLY

A

useDisk: False
useMemory: True
deserialised: True
replication: 1

35
Q

Storage Strategy: MEMORY_ONLY_2

A

useDisk: False
useMemory: True
deserialised: True
replication: 2

36
Q

Storage Strategy: MEMORY_ONLY_SER

A

useDisk: False
useMemory: True
deserialised: False
replication: 1

37
Q

Storage Strategy: DISK_ONLY

A

useDisk: True
useMemory: False
deserialised: False
replication: 1

38
Q

Storage Strategy: MEMORY_AND_DISK

A

useDisk: True
useMemory: True
deserialised: True
replication: 1

39
Q

Storage Strategy: MEMORY_AND_DISK_SER

A

useDisk: True
useMemory: True
deserialised: False
replication: 1
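
These levels correspond to pyspark.StorageLevel values passed to persist(); a sketch (assuming a SparkContext sc):

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in RAM, spill to disk if memory is short
rdd.count()                                 # the action materialises the persisted RDD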

40
Q

Transformation

A

In Apache Spark, a transformation creates a new RDD from an existing dataset and returns that new dataset. Transformations are evaluated lazily, which means they are chained together to build an execution plan.

41
Q

Actions

A

In Apache Spark, actions perform the actual computation and return a value (e.g. a count or a list of elements) or nothing at all. An action forces the execution of the entire lineage of transformations leading up to it.
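
For example (assuming a SparkContext sc), count(), take() and collect() are actions that force the lineage to run:

rdd = sc.parallelize([5, 3, 1, 4, 2]).map(lambda x: x * 10)   # lazy transformation
rdd.count()      # Output: 5
rdd.take(2)      # Output: [50, 30]
rdd.collect()    # Output: [50, 30, 10, 40, 20]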

42
Q

Spark’s Cluster Mode Workflow

A

SparkContext connects to the cluster manager, which allocates executors on the worker nodes.

SparkContext then communicates directly with the executors, sending them application code and tasks to run.