IDT lecture 5 Flashcards
Data flow model: benefits
Increases performance by making better use of memory.
ex: Apache Spark
Apache Spark: what is it?
A different implementation of MapReduce, developed at Berkeley.
Big difference: it works in memory.
Lambda expressions: what are they?
Small anonymous functions (functions without a name).
- can take any number of arguments
- contain only a single expression, whose result is returned
ex: lambda a, b, c: a + b + c
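A minimal Python sketch (the names add and nums are only illustrative):
  add = lambda a, b, c: a + b + c
  add(1, 2, 3)                        # 6
  # lambdas are typically passed to higher-order functions:
  nums = [1, 2, 3]
  list(map(lambda x: x * 2, nums))    # [2, 4, 6]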
Storage layer
requirements:
- scalability
- efficiency: speed
- simplicity
- fault tolerance
Spark vs Hadoop: in Spark, data is kept in memory, whereas in Hadoop it is stored on disk (HDD).
RDD: meaning
Resilient Distributed Datasets (RDDs)
- D: dataset: a collection of data
- D: distributed: its parts are placed on different computers
- R: resilient: can recover from failures
RDD: properties
Core properties:
- immutable (read-only): cannot be changed; any change produces a new RDD
- distributed
- lazily evaluated
- cacheable (by default in memory; see the caching sketch after this list)
- replicated
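A minimal PySpark sketch of caching; it assumes a SparkContext named sc already exists (for example the one provided by the pyspark shell):
  from pyspark import StorageLevel
  rdd1 = sc.parallelize(range(1000))
  rdd1.cache()                                # keep this RDD in memory (the default level)
  rdd2 = sc.parallelize(range(1000))
  rdd2.persist(StorageLevel.MEMORY_AND_DISK)  # or pick an explicit storage level
  rdd1.count()                                # the first action computes and caches the data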
What do RDDs contain?
Details about the data: the data location or the actual data.
Lineage information (history):
ex: RDD2 = RDD1.Function(); RDD3 = RDD2.Function()
Spark: dataflow and RDDs
RDDs enable operations:
- Transformations (lazy operations): map, filter, join, etc.
- Actions: return, show, or save values
Chain of RDD transformations to implement the required functionality
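A minimal sketch of such a chain in PySpark, again assuming an existing SparkContext named sc:
  nums = sc.parallelize([1, 2, 3, 4, 5])
  evens = nums.filter(lambda x: x % 2 == 0)   # transformation (lazy)
  doubled = evens.map(lambda x: x * 2)        # transformation (lazy)
  doubled.collect()                           # action -> [4, 8]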
Spark transformations
Create a new RDD from an existing one (an RDD cannot be modified).
All transformations in Spark are lazy:
- transformations are not executed right away; execution happens only when an action is called.
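A short sketch of the lazy behaviour (assuming sc as before):
  words = sc.parallelize(["spark", "hadoop", "rdd"])
  upper = words.map(lambda w: w.upper())   # returns immediately, no job runs yet
  upper.count()                            # the action triggers the computation -> 3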
Map transformation
Returns a new RDD formed by passing each element of the source through the given function.
ex: RDD.map(f)
works like a for-each over the elements
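A minimal sketch (assuming sc as before):
  rdd = sc.parallelize([1, 2, 3])
  rdd.map(lambda x: x + 10).collect()   # [11, 12, 13]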
Filter transformation:
Returns a new RDD formed by keeping those elements of the source for which the function returns true.
rdd.filter(f)
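A minimal sketch (assuming sc as before):
  rdd = sc.parallelize([1, 2, 3, 4])
  rdd.filter(lambda x: x > 2).collect()   # [3, 4]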
FlatMap transformation
Similar to the map transformation, but each input element can be mapped to 0 or more output elements.
ex: rdd.flatMap(f)
ex: an input element "John Doe" can be split into two output elements, "John" and "Doe"
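A minimal sketch contrasting map and flatMap (assuming sc as before):
  lines = sc.parallelize(["John Doe", "Jane Roe"])
  lines.map(lambda s: s.split()).collect()      # [['John', 'Doe'], ['Jane', 'Roe']]
  lines.flatMap(lambda s: s.split()).collect()  # ['John', 'Doe', 'Jane', 'Roe']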
ReduceByKey transformation
Processes elements that are (K, V) pairs and creates another set of (K, V) pairs, combining the values that share the same key with the given function.
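A minimal sketch (assuming sc as before; the order of the result may vary):
  pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
  pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 3), ('b', 1)]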
groupByKey transformation
Processes elements that are (K, V) pairs.
[… refer to slides…]
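Not covered in detail here, but as a minimal sketch of the usual behaviour (values sharing a key are grouped together; assuming sc as before):
  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
  pairs.groupByKey().mapValues(list).collect()   # [('a', [1, 3]), ('b', [2])]  (order may vary)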
Actions (most frequent ones)
RDD > transformation 1 > transformation 2 > … > Action
.collect() -> retrieves the RDD contents as a local collection
.take(k) -> returns the first k elements
.count() -> counts the number of elements
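A minimal sketch of the three actions (assuming sc as before):
  rdd = sc.parallelize([5, 3, 1, 4, 2])
  rdd.collect()   # [5, 3, 1, 4, 2]  (all elements as a local list)
  rdd.take(3)     # [5, 3, 1]
  rdd.count()     # 5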