IDT lecture 5 Flashcards
Data flow model: benefits
Increases performance by making better use of memory.
ex: Apache Spark
Apache Spark: what is it?
A different implementation of MapReduce, developed at Berkeley.
Big difference: it works in memory.
Lambda expressions: what are they?
Small anonymous functions (functions without a name).
- can take any number of arguments
- contain only a single expression, whose result is returned
ex: lambda a, b, c: a + b + c
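A minimal Python sketch (the names add and nums are only illustrative):
  add = lambda a, b, c: a + b + c
  add(1, 2, 3)                        # 6
  # lambdas are typically passed to higher-order functions:
  nums = [1, 2, 3]
  list(map(lambda x: x * 2, nums))    # [2, 4, 6]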
Storage layer
requirements:
- scalability
- efficiency: speed
- simplicity
- fault tolerance
Spark vs Hadoop: in Spark, data is kept in memory, whereas in Hadoop it is stored on disk (HDD).
RDD: meaning
Resilient Distributed Datasets (RDDs)
- D: dataset: a collection of data
- D: distributed: its parts are placed on different computers
- R: resilient: can recover from failures
RDD: properties
Core properties:
- immutable (read-only): cannot be changed; any change produces a new RDD
- distributed
- lazily evaluated
- cacheable (by default in memory; see the caching sketch after this list)
- replicated
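A minimal PySpark sketch of caching; it assumes a SparkContext named sc already exists (for example the one provided by the pyspark shell):
  from pyspark import StorageLevel
  rdd1 = sc.parallelize(range(1000))
  rdd1.cache()                                # keep this RDD in memory (the default level)
  rdd2 = sc.parallelize(range(1000))
  rdd2.persist(StorageLevel.MEMORY_AND_DISK)  # or pick an explicit storage level
  rdd1.count()                                # the first action computes and caches the data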
What do RDDs contain?
Details about the data: the data location or the actual data.
Lineage information (history):
ex: RDD2 = RDD1.Function(); RDD3 = RDD2.Function()
Spark: dataflow and RDDs
RDDs enable operations:
- Transformations (lazy operations): map, filter, join, etc.
- Actions: return, show, or save values
Chain of RDD transformations to implement the required functionality
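A minimal sketch of such a chain in PySpark, again assuming an existing SparkContext named sc:
  nums = sc.parallelize([1, 2, 3, 4, 5])
  evens = nums.filter(lambda x: x % 2 == 0)   # transformation (lazy)
  doubled = evens.map(lambda x: x * 2)        # transformation (lazy)
  doubled.collect()                           # action -> [4, 8]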
Spark transformations
Create a new RDD from an existing one (an RDD cannot be modified).
All transformations in Spark are lazy:
- transformations are not executed right away; execution happens only when an action is called.
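A short sketch of the lazy behaviour (assuming sc as before):
  words = sc.parallelize(["spark", "hadoop", "rdd"])
  upper = words.map(lambda w: w.upper())   # returns immediately, no job runs yet
  upper.count()                            # the action triggers the computation -> 3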
Map transformation
Returns a new RDD formed by passing each element of the source through the given function.
ex: RDD.map(f)
works like a for-each over the elements
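A minimal sketch (assuming sc as before):
  rdd = sc.parallelize([1, 2, 3])
  rdd.map(lambda x: x + 10).collect()   # [11, 12, 13]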
Filter transformation:
Returns a new RDD formed by keeping those elements of the source for which the function returns true.
rdd.filter(f)
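A minimal sketch (assuming sc as before):
  rdd = sc.parallelize([1, 2, 3, 4])
  rdd.filter(lambda x: x > 2).collect()   # [3, 4]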
FlatMap transformation
Similar to the map transformation, but each input element can be mapped to 0 or more output elements.
ex: rdd.flatMap(f)
ex: an input element "John Doe" can be split into two output elements, "John" and "Doe"
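A minimal sketch contrasting map and flatMap (assuming sc as before):
  lines = sc.parallelize(["John Doe", "Jane Roe"])
  lines.map(lambda s: s.split()).collect()      # [['John', 'Doe'], ['Jane', 'Roe']]
  lines.flatMap(lambda s: s.split()).collect()  # ['John', 'Doe', 'Jane', 'Roe']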
ReduceByKey transformation
Processes elements that are (K, V) pairs and creates another set of (K, V) pairs, combining the values that share the same key with the given function.
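A minimal sketch (assuming sc as before; the order of the result may vary):
  pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
  pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 3), ('b', 1)]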
groupByKey transformation
Processes elements that are (K, V) pairs.
[… refer to slides…]
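Not covered in detail here, but as a minimal sketch of the usual behaviour (values sharing a key are grouped together; assuming sc as before):
  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
  pairs.groupByKey().mapValues(list).collect()   # [('a', [1, 3]), ('b', [2])]  (order may vary)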
Actions (most frequent ones)
RDD > transformation 1 > transformation 2 > … > Action
.collect() -> retrieves the RDD contents as a local collection
.take(k) -> returns the first k elements
.count() -> counts the number of elements
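A minimal sketch of the three actions (assuming sc as before):
  rdd = sc.parallelize([5, 3, 1, 4, 2])
  rdd.collect()   # [5, 3, 1, 4, 2]  (all elements as a local list)
  rdd.take(3)     # [5, 3, 1]
  rdd.count()     # 5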