C4: vector space model Flashcards

Question 1

Q

ranked retrieval (formal)

Answer

A

for each query Qi and document Dj, compute a score RSV(Qi, Dj): Retrieval Status Value
- at least ordinal scale
- value increases with relevance
- rank documents by RSV

Question 2

Q

set-based approach

Answer

A

query and document are both a set of terms => use Jaccard coefficient
RSV(Q,D) = Jaccard(terms(Q), terms(D))

Question 3

Q

Jaccard coefficient

Answer

A

Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Jaccard distance = 1 - Jaccard similarity (satisfies triangle inequality, so can be used as a metric/distance function)
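The set-based score can be sketched in a few lines of Python. This is only an illustration: the function name `jaccard`, the example term sets, and the choice to return 0.0 when both sets are empty are assumptions, not part of the cards.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumption: two empty sets score 0.0 (some texts define this case as 1).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical query and document, each reduced to a set of terms.
query_terms = {"vector", "space", "model"}
doc_terms = {"vector", "space", "retrieval", "ranking"}

similarity = jaccard(query_terms, doc_terms)  # 2 shared terms out of 5 distinct terms
distance = 1 - similarity                     # Jaccard distance
```

Here RSV(Q, D) is simply `similarity`. Note that repeating a term in the document would not change the score, which is one of the weaknesses of Jaccard as a scoring function.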

Question 4

Q

issues with Jaccard for scoring

Answer

A

Jaccard does not normalize adequately for length (of the document or query)
Jaccard does not consider term frequency
Jaccard does not consider that rare terms in a collection are more informative than frequent terms

Question 5

Q

term-weighting approach

Answer

A

measures term importance in documents

Question 6

Q

log frequency of term t in d

Answer

A

w_t,d = 1 + log(tf_t,d) if tf_t,d > 0, otherwise 0
quantitative evidence of d being relevant for query term t

document score for a query: RSV = sum_(t in q) w_t,d
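A minimal sketch of this weighting, assuming a base-10 logarithm (the card does not fix the base) and a weight of 0 for terms that do not occur in the document. The names `log_tf_weight` and `rsv_log_tf` and the toy document are hypothetical.

```python
import math
from collections import Counter


def log_tf_weight(tf: int) -> float:
    """w_t,d = 1 + log10(tf) for tf > 0, and 0 for absent terms (assumed convention)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def rsv_log_tf(query_terms, doc_tokens) -> float:
    """Sum the log-frequency weights of the query terms in one document."""
    counts = Counter(doc_tokens)  # missing terms count as 0
    return sum(log_tf_weight(counts[t]) for t in query_terms)


# Toy document: "cat" occurs 10 times, "dog" once, "bird" never.
doc = ["cat"] * 10 + ["dog"]
score = rsv_log_tf(["cat", "dog", "bird"], doc)  # (1 + 1) + (1 + 0) + 0
```

The logarithm dampens raw counts: ten occurrences of "cat" give a weight of 2, not ten times the weight of a single occurrence.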

Question 7

Q

document frequency df_t

Answer

A

the number of documents that contain t: an inverse measure of the informativeness of t (the more documents contain t, the less it helps to identify a particular document)

Question 8

Q

inverse document frequency

Answer

A

idf_t = log(N / df_t), where N is the number of documents in the collection
quantifies how surprised we are to see term t in a document
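To get a feel for the scale, here is a small sketch assuming a base-10 logarithm and a hypothetical collection of one million documents. The function name `idf` and the sample document frequencies are illustrative, and df_t is assumed to be at least 1.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """idf_t = log10(N / df_t); assumes df >= 1 (the term occurs somewhere)."""
    return math.log10(n_docs / df)


N = 1_000_000  # hypothetical collection size

idf_rare = idf(N, 1)          # term in a single document: log10(10^6)
idf_medium = idf(N, 1_000)    # term in 1,000 documents: log10(10^3)
idf_everywhere = idf(N, N)    # term in every document: log10(1)
```

A term that appears in every document gets idf 0, so it contributes nothing to a score, while the rarest term gets the highest value.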

Question 9

Q

tf.idf

Answer

A

tf.idf(t,d) = (1 + log(tf_t,d)) × log(N / df_t)
so it increases with the number of times t occurs in d and with the rarity of the term in the collection => evidence of d being relevant when looking for t

to find the relevance for a total query, sum tf.idf for each term in the query
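The summation over query terms can be sketched as follows, assuming base-10 logarithms, documents given as token lists, and a contribution of 0 for terms missing from the document or the collection. The names `tf_idf` and `score` and the three-document corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """(1 + log10 tf) * log10(N / df); 0 if the term is absent (assumed convention)."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


def score(query_terms, doc_tokens, corpus) -> float:
    """Sum tf.idf over the query terms for one document of the corpus."""
    counts = Counter(doc_tokens)
    total = 0.0
    for term in query_terms:
        df = sum(1 for doc in corpus if term in doc)  # document frequency of term
        total += tf_idf(counts[term], df, len(corpus))
    return total


# Toy corpus of three tokenized documents.
corpus = [["cat", "cat", "dog"], ["dog", "bird"], ["fish"]]
cat_score = score(["cat"], corpus[0], corpus)  # tf = 2, df = 1
dog_score = score(["dog"], corpus[1], corpus)  # tf = 1, df = 2
```

In this toy corpus `cat_score` is larger than `dog_score`, because "cat" is both more frequent in its document and rarer in the collection.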

Question 10

Q

document ranking in vector space model

Answer

A

represent queries and documents as sparse vectors (dimensions are terms)

rank documents according to their proximity to the query in the vector space

use cosine similarity (not Euclidean distance, because Euclidean distance is strongly affected by vector length)

Question 11

Q

cosine similarity

Answer

A

based on the angle between vectors q and d
cos(q,d) = (q · d) / (|q| × |d|)
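A small sketch of the formula on sparse vectors, stored here as dictionaries from term to weight (a representation chosen for this example). The helper names, the toy vectors, and the convention of returning 0.0 for a zero-length vector are assumptions. The Euclidean distance is included only to show its sensitivity to length.

```python
import math


def norm(v: dict) -> float:
    """Euclidean length |v| of a sparse vector."""
    return math.sqrt(sum(w * w for w in v.values()))


def cosine(q: dict, d: dict) -> float:
    """(q · d) / (|q| × |d|); 0.0 if either vector has zero length (assumed)."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    denom = norm(q) * norm(d)
    return dot / denom if denom else 0.0


def euclidean(q: dict, d: dict) -> float:
    """Euclidean distance between two sparse vectors."""
    terms = set(q) | set(d)
    return math.sqrt(sum((q.get(t, 0.0) - d.get(t, 0.0)) ** 2 for t in terms))


q = {"vector": 1.0, "space": 1.0}
d_short = {"vector": 1.0, "space": 1.0}
d_long = {"vector": 3.0, "space": 3.0}  # same direction as d_short, three times longer

cos_short, cos_long = cosine(q, d_short), cosine(q, d_long)
dist_short, dist_long = euclidean(q, d_short), euclidean(q, d_long)
```

Both documents point in the same direction as the query, so their cosine scores are essentially equal, while the Euclidean distance to `d_long` is much larger than to `d_short`.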

Question 12

Q

vector space ranking with term weighting

Answer

A

represent each document and query as a tf-idf weighted vector
compute cosine similarity score for the query vector and each document vector
rank documents with respect to the query by score
return top K to the user
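The four steps above can be sketched end to end. This is a toy illustration under stated assumptions, not an efficient implementation: base-10 logarithms, documents as token lists, query terms unknown to the collection ignored, ties broken by document index, and all function names hypothetical.

```python
import math
from collections import Counter


def build_idf(docs):
    """idf_t = log10(N / df_t) for every term in the collection."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log10(len(docs) / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms absent from the collection are skipped."""
    counts = Counter(tokens)
    return {t: (1 + math.log10(c)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 for zero-length vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    denom = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / denom if denom else 0.0


def rank(query_tokens, docs, k):
    """Return the top-k (score, document index) pairs for the query."""
    idf = build_idf(docs)
    q_vec = tfidf_vector(query_tokens, idf)
    scored = [(cosine(q_vec, tfidf_vector(doc, idf)), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return scored[:k]


docs = [["cat", "sat", "mat"], ["dog", "barked"], ["cat", "cat", "dog"]]
top = rank(["cat"], docs, k=2)
top_ids = [doc_id for _, doc_id in top]
```

For the query "cat", document 2 ranks above document 0: it mentions the term twice and contains fewer other weighted terms, so its vector points closer to the query. A real system would score documents through an inverted index instead of comparing the query with every document.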

Question 13

Q

scope vs. verbosity hypothesis

Answer

A

scope: a long document consists of a number of unrelated short documents concatenated together (covering more topics)

verbosity: a long document covers a similar scope to a short document, but uses more words (more of the same)
