IDT lecture 5 Flashcards

1
Q

Data flow model: benefits

A

Increases performance by making better use of memory.

Example: Apache Spark

2
Q

Apache Spark: what is it?

A

A data processing engine from UC Berkeley that generalizes the MapReduce model.

Main difference: it keeps working data in memory.

3
Q

Lambda Expressions: what is it?

A

Small anonymous functions (functions without a name).
- any number of arguments
- only one expression, whose value is returned

ex: lambda a, b, c: a + b + c
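
A minimal Python sketch of the example above. The names `add3`, `result`, and `squares` are made up for illustration:

```python
# A lambda is an anonymous function whose body is a single expression.
# Binding it to a name (add3, hypothetical) is only done here to call it.
add3 = lambda a, b, c: a + b + c
result = add3(1, 2, 3)

# Lambdas are usually passed inline to another function, here to map().
squares = list(map(lambda x: x * x, [1, 2, 3]))

print(result, squares)
```

The same style of inline function is what gets passed to Spark operations such as map and filter.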

4
Q

Storage layer

A

Requirements:
- scalability
- efficiency (speed)
- simplicity
- fault tolerance

Spark vs Hadoop: Spark keeps data in memory, whereas Hadoop MapReduce stores it on disk (HDD).

5
Q

RDD: meaning

A

Resilient Distributed Datasets (RDDs)

  • D: dataset: a collection of data
  • D: distributed: parts are placed on different computers
  • R: resilient: can recover from failures
6
Q

RDD: properties

A

Core properties:

  • immutable (read-only): cannot be changed; any change produces a new RDD
  • distributed
  • lazily evaluated
  • cacheable (by default in memory)
  • replicated
7
Q

What do RDDs contain?

A
Details about the data: the data location or the actual data
Lineage information (the history of how the RDD was derived), e.g.:
RDD2 = RDD1.Function()
RDD3 = RDD2.Function()
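
The lineage idea can be sketched in plain Python, without Spark. `ToyRDD` and its methods are hypothetical names invented for this illustration and are not part of any Spark API; the point is only that each derived object remembers its parent and the function that produced it, so its contents can be recomputed from the source.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records how it was derived, not the data."""

    def __init__(self, name, source=None, parent=None, func=None):
        self.name = name
        self.source = source  # only set on the root of the chain
        self.parent = parent  # the ToyRDD this one was derived from
        self.func = func      # function applied to the parent's contents

    def derive(self, name, func):
        # Never modifies self: returns a new object that points back to it.
        return ToyRDD(name, parent=self, func=func)

    def lineage(self):
        # Walk back to the root to list the history, oldest step first.
        chain = []
        node = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain[::-1]

    def compute(self):
        # Rebuild the contents from the source by replaying the lineage.
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.compute())


rdd1 = ToyRDD("source", source=[1, 2, 3, 4])
rdd2 = rdd1.derive("double", lambda xs: [x * 2 for x in xs])
rdd3 = rdd2.derive("keep > 4", lambda xs: [x for x in xs if x > 4])

print(rdd3.lineage())
print(rdd3.compute())
```

Because `rdd3` can always be rebuilt from `rdd1` plus the recorded steps, losing its computed contents is recoverable, which is the intuition behind "resilient".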
8
Q

Spark: dataflow and RDDs

A

RDDs support two kinds of operations:
Transformations (lazy operations): map, filter, join, etc.
Actions: return, show, or save values

The required functionality is implemented as a chain of RDD transformations.

9
Q

Spark transformations

A

Create a new RDD from an existing one (the existing RDD cannot be modified).

All transformations in Spark are lazy:
- transformations are not executed when they are defined. They run only when an action needs their result.
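
Laziness can be observed with Python's built-in `map`, which is also lazy. This is an analogy rather than Spark itself, and the names `double` and `calls` are made up for the illustration:

```python
calls = []  # records every time the function actually runs


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3]

# "Transformation": only describes the work, nothing runs yet.
mapped = map(double, data)
calls_before = list(calls)

# "Action": consuming the result forces the computation.
result = list(mapped)
calls_after = list(calls)

print(calls_before, result, calls_after)
```

Here `calls_before` is still empty because defining the mapping did not execute `double`; the function runs only once the result is requested.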

10
Q

Map transformation

A

Returns a new RDD formed by passing each element of the source through the given function.

ex: rdd.map(f)
Works like a for-each over the elements, collecting the results.
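
A plain-Python sketch of what map does to the elements (no Spark involved; `words` and `lengths` are illustrative names). One output element is produced per input element:

```python
words = ["spark", "rdd", "map"]

# Apply the function to every element; the source list is left untouched.
lengths = [len(w) for w in words]

print(lengths)
```

In PySpark the corresponding call on an RDD would be `rdd.map(len)`.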

11
Q

Filter transformation

A

Returns a new RDD formed by keeping those elements of the source for which the function returns true.

rdd.filter(f)
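
The same behaviour sketched in plain Python (no Spark involved; `numbers`, `is_even`, and `evens` are illustrative names). The output has at most as many elements as the input:

```python
numbers = [1, 2, 3, 4, 5, 6]


def is_even(x):
    return x % 2 == 0


# Keep only the elements for which the predicate returns True.
evens = [x for x in numbers if is_even(x)]

print(evens)
```

In PySpark the corresponding call on an RDD would be `rdd.filter(is_even)`.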

12
Q

FlatMap transformation

A

Similar to the map transformation, but each input element can be mapped to 0 or more output elements.

rdd.flatMap(f)

ex: splitting the element "John Doe" on spaces yields two elements: "John" and "Doe"
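
A plain-Python sketch of the difference between map and flatMap (no Spark involved; the second name "Jane Roe" and the variable names are made up for the illustration):

```python
names = ["John Doe", "Jane Roe"]

# map-style: one output per input, so the result is a list of lists.
mapped = [n.split(" ") for n in names]

# flatMap-style: the per-element results are flattened into one collection.
flat = [word for n in names for word in n.split(" ")]

print(mapped)
print(flat)
```

In PySpark the flattened version would correspond to `rdd.flatMap(lambda n: n.split(" "))`.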

13
Q

reduceByKey transformation

A

Processes elements that are (K, V) pairs.

Creates another set of (K, V) pairs, in which the values of each key have been combined by the given function.
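
A plain-Python sketch of these semantics, assuming a word-count style input. `reduce_by_key` is a hypothetical helper written for this illustration, not a Spark function, and it ignores how Spark distributes the work:

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical helper: combine all values that share a key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # One output pair per distinct key.
    return {key: reduce(func, values) for key, values in grouped.items()}


pairs = [("a", 1), ("b", 1), ("a", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)

print(counts)
```

In PySpark the corresponding call on an RDD of pairs would be `rdd.reduceByKey(lambda x, y: x + y)`.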

14
Q

groupByKey transformation

A

Processes elements that are (K, V) pairs.

[… refer to slides…]

15
Q

Actions (most frequent ones)

A

RDD > transformation 1 > transformation 2 > … > action

.collect() -> retrieves the RDD contents as a local collection

.take(k) -> returns the first k elements

.count() -> counts the number of elements
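
The three actions can be mimicked in plain Python on a lazy generator pipeline. This is an analogy only; `pipeline` is a made-up function standing in for a chain of transformations:

```python
from itertools import islice


def pipeline():
    # Lazy chain: keep odd numbers from 1..5, then square them.
    return (x * x for x in range(1, 6) if x % 2 == 1)


# collect-style: materialize everything as a local list.
collected = list(pipeline())

# take-style: pull only the first k elements.
first_two = list(islice(pipeline(), 2))

# count-style: count the elements without keeping them.
n = sum(1 for _ in pipeline())

print(collected, first_two, n)
```

Each of the three lines forces the lazy pipeline to run, which is what distinguishes an action from a transformation.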

16
Q

Parallelization and synchronization in Spark

A

Actions and transformations on RDDs are fully parallelizable.

Synchronization is required only when shuffling!
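
A rough plain-Python sketch of why shuffling needs synchronization, under the assumption that data is split into partitions and that a key is assigned to a partition by hashing. The function `shuffle` and the sample data are invented for this illustration and do not reflect Spark's actual implementation:

```python
# Two partitions, as if they lived on two different machines.
partitions = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]

# Per-element work: each partition is processed independently,
# so no partition has to wait for another one.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]


def shuffle(parts, num_partitions):
    """Hypothetical shuffle: route every pair to the partition chosen by its key."""
    out = [[] for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            out[hash(key) % num_partitions].append((key, value))
    return out


# Every input partition may contribute to every output partition, so all
# of them must finish before the next step can start.
shuffled = shuffle(mapped, 2)

print(shuffled)
```

After the shuffle, all pairs with the same key sit in the same partition, which is what key-based operations such as reduceByKey and groupByKey rely on.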

17
Q

Lazy evaluation in Spark

A

Spark: static rule-based optimizations

  • transformations are not executed until an action is called.