Week 11 (IO patterns) Flashcards
why would we not want to load or store data the way hadoop does out of the box
load data directly from the original source w/o storing it in hdfs first
feed MR output directly to the next process instead of storing it in hdfs
2 general ways to modify the way data is loaded from disk
input format: configure how contiguous chunks of input are generated from blocks in hdfs
record reader: configure how records appear in the map phase
2 general ways to modify the way data is stored on disk
output format: configure where and how the job output is written
record writer: configure how key/value pairs are written out as records
what are the roles of input format in hadoop
validate the input configuration for the job (i.e., make sure the data is there)
split input blocks and files into logical chunks (input splits), each assigned to a map task
create the record reader used to generate key/value pairs from the raw input split
what type of view of the input does an inputsplit represent
byte-oriented
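The cards above (input format makes byte-oriented splits, record reader turns a split into records) can be sketched in plain Python. This is only an illustration, not hadoop's actual code: `compute_splits` and `read_records` are made-up names, and the boundary rule (skip the partial first line, read past the end of the split to finish the last line) is an assumption modeled on how line-based readers are usually described.

```python
def compute_splits(file_size, split_size):
    """Return (start, length) byte ranges that cover the whole file."""
    return [(start, min(split_size, file_size - start))
            for start in range(0, file_size, split_size)]


def read_records(data, start, length):
    """Yield (byte_offset, line) pairs for one byte-oriented split.

    The split knows nothing about line boundaries, so the reader fixes them:
    a split that does not start at byte 0 skips everything up to and including
    the first newline at or after its start (the previous split's reader owns
    that line), and every reader is allowed to start a line at or before the
    end of its split and read past the end to finish it.
    """
    end = start + length
    pos = start
    if start != 0:
        newline = data.find(b"\n", start)
        if newline == -1:
            return
        pos = newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


# Hypothetical sample input: four lines, split every 10 bytes.
sample = b"alpha\nbravo\ncharlie\ndelta\n"
splits = compute_splits(len(sample), 10)
records = [rec for s, n in splits for rec in read_records(sample, s, n)]
```

Even though the 10-byte splits cut through the middle of lines, each line comes out exactly once, keyed by its byte offset.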
what is partition pruning
configure which files get loaded into MR (and which are dropped) based on the name of the file
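A minimal sketch of the pruning idea, assuming the partition key (here a date) is encoded in the file name. The paths, the pattern and the `prune_partitions` helper are all hypothetical. In hadoop this logic would live in the input format, not in standalone Python.

```python
from fnmatch import fnmatchcase


def prune_partitions(paths, pattern):
    """Keep only the paths whose file name (last path component) matches pattern."""
    return [p for p in paths if fnmatchcase(p.rsplit("/", 1)[-1], pattern)]


# Hypothetical layout: one log file per day.
all_files = [
    "/logs/2014-01-01.log",
    "/logs/2014-01-02.log",
    "/logs/2014-02-01.log",
]
january = prune_partitions(all_files, "2014-01-*.log")
```

Files that fail the name check are never read, so the job does not pay any IO for them.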
what is the goal of a recommendation system
predict the rating or preference that a user would give to an item
what is collaborative filtering
the process of identifying similar users and recommending what similar users like
in collab filtering, when are users similar
if their vectors are close according to some distance measure (jaccard or cosine distance)
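The two distance measures named above, as a small sketch. Assumptions: jaccard distance is taken over sets of items (1 minus intersection over union), and cosine distance is taken as 1 minus cosine similarity over rating vectors of equal length. Some texts define cosine distance as the angle itself instead.

```python
import math


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def cosine_distance(u, v):
    """1 - cos(angle between u and v) for two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: a user with no ratings is similar to no one
    return 1.0 - dot / (norm_u * norm_v)
```

Smaller distance means more similar users. Two users whose rating vectors point in the same direction, e.g. `[1, 2]` and `[2, 4]`, have a cosine distance of roughly 0.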
big O of collab filtering and what it ends up being in practice
m = num of customers
n = num of product/catalog items
worst case: O(MN)
in practice closer to O(M+N), because most customer vectors are sparse (each customer has only a few items)
what does item-to-item collab filtering do
matches each of the user's purchased items to similar items
combines those similar items into a recommendation list
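A toy sketch of the two steps above: build an item-to-item similarity table, then combine the similar items for everything the user bought. Assumptions: similarity is the cosine between the sets of customers who bought each item, scores are summed across the user's items, and all item pairs are compared, which is fine for a toy catalog but not how a production system would build the table. The names and data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_similarities(purchases):
    """purchases: dict user -> set of items. Returns item -> {other item: similarity}."""
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    sims = defaultdict(dict)
    for a, b in combinations(sorted(buyers), 2):
        common = len(buyers[a] & buyers[b])
        if common:
            score = common / math.sqrt(len(buyers[a]) * len(buyers[b]))
            sims[a][b] = score
            sims[b][a] = score
    return sims


def recommend(user_items, sims, top_n=3):
    """Match each purchased item to similar items and merge them into one ranked list."""
    scores = defaultdict(float)
    for item in user_items:
        for other, score in sims.get(item, {}).items():
            if other not in user_items:  # never recommend what the user already has
                scores[other] += score
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical purchase history.
purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"lamp", "desk"},
    "dan": {"book", "pen"},
}
sims = item_similarities(purchases)
```

The similarity table can be built offline. At request time only the lookup in `recommend` runs, and its cost depends on how many items the user bought rather than on the number of customers.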