Week 1 & 2 Flashcards
What are 2 things that can be done at large scale that cannot be done at small scale?
extract new insights
create new forms of value
What are the four V's of big data?
Volume
Velocity (how fast data is coming in and how fast you have to analyze and use it)
Variety (number and diversity of sources)
Veracity (can you trust the data/source/process? e.g., data cleaning, user-entry errors)
Why was MapReduce made?
to provide an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance
what does the mapper do?
maps input key/value pairs to another set of intermediate key/value pairs (one input pair may map to zero or many output pairs)
what does the reducer do?
reduces the set of intermediate values that share a key to a smaller set of values
what are the 3 phases of the reducer?
shuffle, sort, reduce
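The mapper, the shuffle/sort phases, and the reducer can be sketched in plain Python. This is a hypothetical word-count example, not the actual framework code; the function names and the single-process simulation are assumptions for illustration.

```python
from collections import defaultdict

def mapper(_key, line):
    # Emits one (word, 1) pair per word; may emit zero or many pairs per input.
    for word in line.split():
        yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle: group all intermediate values that share a key.
    # Sort: order the grouped keys before handing them to the reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduces the values sharing one key to a smaller set (here, one sum).
    return (key, sum(values))

lines = ["big data big scale", "data data"]
pairs = [p for i, line in enumerate(lines) for p in mapper(i, line)]
result = dict(reducer(k, vs) for k, vs in shuffle_sort(pairs))
print(result)  # {'big': 2, 'data': 3, 'scale': 1}
```

In the real framework, map and reduce calls run in parallel across machines; the sequential loop here only shows the data flow.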
what are some challenges that MapReduce solves/still has?
dividing the work into equal-size pieces, being limited by the slowest node, combining results when done
what does the programmer need to specify in map and reduce?
the types of the input and output key/value pairs
map function
reduce function
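The three things the programmer supplies can be sketched as a typed skeleton. The type aliases and function names below are hypothetical, chosen for a word-count job; the point is that the key/value types plus the two functions fully specify the job.

```python
from typing import Iterator, Tuple

# Hypothetical type choices for a word-count job:
InKey, InVal = int, str      # input: line number -> line text
MidKey, MidVal = str, int    # intermediate: word -> partial count
OutKey, OutVal = str, int    # output: word -> total count

def map_fn(key: InKey, value: InVal) -> Iterator[Tuple[MidKey, MidVal]]:
    # The programmer-supplied map function.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key: MidKey, values: Iterator[MidVal]) -> Tuple[OutKey, OutVal]:
    # The programmer-supplied reduce function.
    return (key, sum(values))

print(reduce_fn("data", iter([1, 1, 1])))  # ('data', 3)
```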
what will this be reduced to?