Week 5 Flashcards
Is the input to every reducer sorted?
Yes, each reducer receives its input sorted by key.
What is the Jaccard coefficient measuring?
What does a higher Jaccard similarity mean?
A higher fraction of the data is shared between P and Q.
What is a simple deficiency of Jaccard similarity?
In cosine similarity, what are v and w?
How do you measure the weight of a token?
Term frequency-inverse document frequency (tf-idf).
Term frequency equation
1
What two things can cause the tf-idf of a word to go up?
An increase in the number of occurrences of the word within a document.
The word being rarer across the collection of documents.
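A minimal sketch of both effects. The exact equations from the course are not reproduced in these notes, so this assumes one common variant: tf is the count of the term divided by the document length, and idf is log(N / df). The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """tf-idf of `term` in `doc`, relative to the collection `docs`.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)       # documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

docs = [
    ["big", "data", "big", "cluster"],
    ["data", "mining"],
    ["data", "stream"],
]

# "big" occurs twice in the first document, "cluster" once: more occurrences, higher score.
print(tf_idf("big", docs[0], docs) > tf_idf("cluster", docs[0], docs))
# "data" appears in every document, so its idf (and its tf-idf) is 0.
print(tf_idf("data", docs[0], docs))
```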
What is a safe value of n for n-grams in research?
n=9
What is an application of near neighbor search?
What is a minhash of a set?
Jaccard similarity of two sets, in plain English
The size of their intersection divided by the size of their union.
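That definition can be sketched in a few lines of Python. The function name `jaccard_similarity` and the choice to treat two empty sets as identical are assumptions made for this example.

```python
def jaccard_similarity(p, q):
    """Size of the intersection divided by the size of the union."""
    p, q = set(p), set(q)
    union = p | q
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(p & q) / len(union)

p = {"a", "b", "c"}
q = {"b", "c", "d"}
print(jaccard_similarity(p, q))  # 2 shared elements out of 4 distinct ones: 0.5
```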
How is minhash related to Jaccard similarity?
The probability (over all permutations of the rows) that h(c1) = h(c2) equals jaccard_sim(c1, c2).
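A rough simulation of this claim, assuming the usual definition of the minhash as the first element of the set under a random permutation of the rows. It samples random permutations instead of enumerating all of them, so the result is only an estimate; the names `minhash` and `estimate_jaccard` are hypothetical, and both sets are assumed to be non-empty.

```python
import random

def minhash(s, rank):
    """Element of s that comes first under the permutation described by rank."""
    return min(s, key=rank.__getitem__)

def estimate_jaccard(c1, c2, num_perms=2000, seed=0):
    """Fraction of sampled permutations on which the two minhashes agree."""
    # Rows outside c1 | c2 never become the minimum, so they can be left out.
    universe = sorted(c1 | c2)
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_perms):
        order = universe[:]
        rng.shuffle(order)                          # one random permutation of the rows
        rank = {x: i for i, x in enumerate(order)}  # position of each row in that order
        if minhash(c1, rank) == minhash(c2, rank):
            matches += 1
    return matches / num_perms

c1 = {1, 2, 3, 4}
c2 = {3, 4, 5, 6}
# True Jaccard similarity is 2 / 6; the estimate should land close to that value.
print(estimate_jaccard(c1, c2))
```

The two minhashes agree exactly when the first row in the permuted order belongs to both sets, which is why the agreement rate tracks the intersection-over-union ratio.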