Week 6 (large scale data analysis using mapreduce Flashcards
what are these examples of?
data:image/s3,"s3://crabby-images/3eb01/3eb019dc053500fae81458513b421a17805c7c0e" alt=""
distance measures
what is the goal behind link analysis
understanding large problem w/ unstructured data
how did early search engines work
crawled the web and listed the terms in an inverted index
what is an inverted index in the the context of storage of terms
data structure which makes it easy, given a term, to find pointer to all places where that term appears
create an inverted index for the given texts
data:image/s3,"s3://crabby-images/28cf3/28cf31023081e5946f75a08bc7a6a75f4abc765b" alt=""
data:image/s3,"s3://crabby-images/6a187/6a187229c59bd2d2552d1806d43d0a39cbaf93bd" alt=""
a term search for the terms Colorado State University, would give what set
data:image/s3,"s3://crabby-images/5a388/5a388d9d80b625fff22046fc7901463a673926b0" alt=""
data:image/s3,"s3://crabby-images/7c9fe/7c9fe21b2b4267dd828429b459913fdc8927bb78" alt=""
what is term spam
data:image/s3,"s3://crabby-images/e46d3/e46d30b8d76f97a14c1a6745ec701f250ac5ac36" alt=""
goals behind a page rank algorithim
provide effective sumnmaries for search results
ordering/ranking results
why does a pagerank algorithim simulate random surfers
pages that would have a large number of surfers were considered more “important” than pages that would be barely visited
how is a page judged in a page rank algorithim (basic)
terms occuring on the page
and terms used in or near links to that page
what is the highest level definition of pagerank
function which assigns a real number to each page on the web
what does a higher pagerank of a page mean
the more important it is
given this graph, suppose a random surfer starts at page A, what are the probs that a surfer will be on each page in the next step
data:image/s3,"s3://crabby-images/d2944/d2944b1053f3f795d2f38ce1a35cc58d43a59cdf" alt=""
data:image/s3,"s3://crabby-images/393de/393de74623e81223e22f110238c1e1dc4cd5f489" alt=""
now suppose surfer at B, whats the prob of next step
data:image/s3,"s3://crabby-images/b9485/b94856dcc63bcb317923436fb391bdbc7da4f6f6" alt=""
data:image/s3,"s3://crabby-images/19e4f/19e4f80462103d50eed067ad246eecea90b8fec4" alt=""
for pagerank, what is the dimension of the transition matrix M
n columns and n rows
what does the transition matrix of this graph look like
data:image/s3,"s3://crabby-images/922b4/922b48fd6d9e708d2d8c0c385d6dfc3e205d8fa0" alt=""
data:image/s3,"s3://crabby-images/a323f/a323fdeb35389649d1709ce4ac4dd6fc5d87031c" alt=""
what does the first and second column of this transition matrix mean
data:image/s3,"s3://crabby-images/dae47/dae47b4db251fe77b62241714abfab92fb1c71e7" alt=""
data:image/s3,"s3://crabby-images/7cbc4/7cbc4e7c936a5347242a425621e5c6033e59ab28" alt=""
what does this matrix mean/represent?
data:image/s3,"s3://crabby-images/7377c/7377cbbfe112962d7c13e893f3a1443bf5d3e1d1" alt=""
data:image/s3,"s3://crabby-images/8d065/8d065f978544909eae8af87ef33658a66e1b3ce1" alt=""
for the transition matrix M, if we surf at any of the n page of web with equal prob. What will the inital vector look like and where is it located
1/n for each component
farthest left, first col vector (i think)
what does the matrix mean and what does the vector mean?
and what does this equal
data:image/s3,"s3://crabby-images/90177/9017739ac0ef0307f45160d2ed38f771b80d1887" alt=""
M = prob for next step from current location
Vec = probability for being in the current location
== distribution of the surfer after this step, probability for being in the next possible location
what is v_j and m_ij ?
data:image/s3,"s3://crabby-images/85a19/85a19ed0cf27fe71418ce51e3fc377617943e33f" alt=""
m_ij is prob that a surfer at node J will move to node i at the next step
v_j is ther prob that the surfer was at node j at prev step
what is x_i
data:image/s3,"s3://crabby-images/64fb5/64fb51fba5dcf69e9f96e7d213f00c15d3f62f2c" alt=""
prob that a random surfer will be at node i at the next step
what 2 conditions
data:image/s3,"s3://crabby-images/1240a/1240a4c4c1da3bdb0090d3d2f00b3c4a0a9e65e1" alt=""
graph is strongly connected
there are no dead ends (nodes that have no arcs out)
how many iters to have v_j converge
50-75 iters
data:image/s3,"s3://crabby-images/b0e8f/b0e8f81562bf145445b9e44d14c981fface21a6a" alt=""
data:image/s3,"s3://crabby-images/b6678/b6678ffa05e2858eba51af1f3f07eff82a954c43" alt=""
why do we need to do matrix vector multiplication in mapreduce with a sufficiently large dataset
vector v cannot fit in main memory
more than 1.4 billion pages
in matrix-vector multiplication, what will be stored in the files
matrix M and vector V stored seperatly
in matrix-vector multiplication, what does the map function do
data:image/s3,"s3://crabby-images/4cb33/4cb33d6700775fb9711ac34276c36217103a54f8" alt=""
in matrix-vector multiplication, what does the reduce function look like
data:image/s3,"s3://crabby-images/75b1b/75b1b80a19798609ee3bf4582c22aff42f82ec09" alt=""
in matrix-vector multiplication, what do you do if the vector v cant fit in main memory
data:image/s3,"s3://crabby-images/53512/53512c3459540daabe63352dfb1c10ae43f9db58" alt=""
in matrix-vector multiplication, what does stripe-ing look like
data:image/s3,"s3://crabby-images/b04c4/b04c42e79de57f4a206bd18a70dd35845f5a1bd8" alt=""
what is a dead end and why is it a problem
a page with no links out, surfers reaching page will dissapear
what is a spider trap
groups of pages that all have outlinks but never link to any other pages
what would happen if dead ends were allowed
data:image/s3,"s3://crabby-images/155ca/155cabdc6d0ee5c4ea46fc374133264bc1b763cf" alt=""
2 solutions for dealing with dead ends
data:image/s3,"s3://crabby-images/22630/226309c5e4cc846440205d0a95a6f1460fd10fe2" alt=""
in recursive deletion, on the pages that are deleted, whats their page rank
the parent, or if the parent is deleted then its the parent of that
it just goes down the tree
with recursive deletion, the sums will eventually exceed 1. how does this change the definition of our matrix
data:image/s3,"s3://crabby-images/38b64/38b642b735fb05c7239f92776562a6ba755ab714" alt=""