C4: vector space model Flashcards

1
Q

ranked retrieval (formal)

A

for each query Q_i and document D_j, compute a score RSV(Q_i, D_j): the Retrieval Status Value
- measured on at least an ordinal scale
- value increases with relevance
- rank documents by descending RSV

2
Q

set-based approach

A

query and document are both a set of terms => use Jaccard coefficient
RSV(Q,D) = Jaccard(terms(Q), terms(D))

3
Q

Jaccard coefficient

A

Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Jaccard distance = 1 - Jaccard similarity (satisfies the triangle inequality, so it can be used as a metric/distance function)
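
a minimal sketch in Python (the helper and the example documents are my own illustration, not from the card):

def jaccard(a, b):
    # Jaccard similarity of two term sets: |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# RSV(Q, D) = Jaccard(terms(Q), terms(D))
print(jaccard("ides of march".split(), "caesar died in march".split()))  # 1/6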

4
Q

issues with Jaccard for scoring

A
  1. we need to normalize for length (of the document or query)
  2. Jaccard does not consider term frequency (see the check below)
  3. Jaccard does not consider that rare terms in a collection are more informative than frequent terms
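
a quick check of issue 2, reusing the jaccard helper from card 3's sketch: because documents are reduced to sets, repeating a term changes nothing.

d1 = "cat sat".split()              # "cat" appears once
d2 = "cat cat cat cat sat".split()  # "cat" appears four times
q = ["cat"]
print(jaccard(q, d1) == jaccard(q, d2))  # True: both 1/2, term frequency is invisible
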
5
Q

term-weighting approach

A

measures term importance in documents

6
Q

log frequency of term t in d

A

w_t,d = 1 + log(tf_t,d) if tf_t,d > 0, else 0 (the log of zero is undefined)
quantitative evidence of d being relevant for the query term t

document score for a query: RSV(q,d) = sum_(t in q) w_t,d
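
a sketch of this weighting, assuming base-10 logs (the card does not fix a base; any base gives the same ranking):

import math

def log_tf_weight(tf):
    # w_t,d = 1 + log(tf_t,d) if tf_t,d > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def rsv(query_terms, doc_tf):
    # doc_tf: dict mapping term -> raw count in the document
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

print(log_tf_weight(0), log_tf_weight(1), log_tf_weight(1000))  # 0.0 1.0 4.0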

7
Q

document frequency df_t

A

the number of documents that contain t; an inverse measure of the informativeness of t: the more documents a term occurs in, the less it tells us about the identity of any one document

8
Q

inverse document frequency

A

idf_t = log(N / df_t)
quantifies how surprised we are to see term t in a document
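
a sketch with made-up df values, again assuming base-10 logs:

import math

def idf(df_t, N):
    # idf_t = log(N / df_t): rare terms score high, ubiquitous terms near 0
    return math.log10(N / df_t)

N = 1_000_000  # collection size (made-up)
for term, df in [("calpurnia", 1), ("animal", 10_000), ("the", 1_000_000)]:
    print(term, idf(df, N))  # 6.0, 2.0, 0.0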

9
Q

tf.idf

A

tf.idf(t,d) = (1 + log(tf_t,d)) x log(N / df_t)
so it increases with the number of times t occurs in d and with the rarity of t in the collection => evidence of d being relevant when looking for t

to score a whole query, sum tf.idf over each term in the query
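
combining the two sketches above (same assumed base-10 logs; the toy counts are made up):

import math

def tf_idf(tf, df, N):
    # tf.idf(t,d) = (1 + log(tf_t,d)) * log(N / df_t); 0 if t is absent from d
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

def rsv(query_terms, doc_tf, df, N):
    # sum tf.idf over the query terms
    return sum(tf_idf(doc_tf.get(t, 0), df[t], N) for t in query_terms)

doc_tf = {"caesar": 3, "rome": 1}
df = {"caesar": 100, "rome": 500}
print(rsv(["caesar", "rome"], doc_tf, df, N=10_000))  # ≈ 4.26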

10
Q

document ranking in vector space model

A

represent queries and documents as sparse vectors (dimensions are terms)

rank documents according to their proximity to the query in the vector space

use cosine similarity (not Euclidean distance, which is sensitive to vector length: a long document can end up far from a query it matches well)
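
a small check of the length argument, assuming raw term-count vectors: a document concatenated with itself points in the same direction, so its cosine to the query is unchanged while its Euclidean distance is not.

import math

def norm(u):
    return math.sqrt(sum(x * x for x in u))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

q = [1, 1, 0]
d = [3, 2, 1]
d2 = [x * 2 for x in d]  # d concatenated with itself: same direction, double length
print(euclidean(q, d), euclidean(q, d2))  # distances differ (~2.45 vs ~6.16)
print(cosine(q, d), cosine(q, d2))        # cosines are identical (~0.945)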

11
Q

cosine similarity

A

based on the angle between vectors q and d
cos(q,d) = (q · d) / (|q| |d|)
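
because the vectors are sparse (card 10), a dict-based implementation only touches shared terms; a minimal sketch:

import math

def cosine_sparse(q, d):
    # q, d: dicts mapping term -> weight
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

print(cosine_sparse({"caesar": 1.0}, {"caesar": 2.0, "rome": 2.0}))  # ≈ 0.707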

12
Q

vector space ranking with term weighting

A
  • represent each document and query as weighted tf-idf vector
  • compute cosine similarity score for the query vector and each document vector
  • rank documents with respect to the query by score
  • return the top K documents to the user (see the sketch below)
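
putting the four steps together; a minimal end-to-end sketch (the tokenisation, names, and toy corpus are my own simplifications, with cosine_sparse as in card 11's sketch):

import math
from collections import Counter

def cosine_sparse(q, d):
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def tfidf_vector(tokens, df, N):
    # sparse weighted tf-idf vector: (1 + log tf) x log(N / df)
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if df.get(t)}

def rank(query_tokens, docs, k=10):
    # docs: list of token lists; compute df, vectorise, score, return top K
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    qv = tfidf_vector(query_tokens, df, N)
    scores = [(cosine_sparse(qv, tfidf_vector(d, df, N)), i)
              for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)[:k]

docs = [s.split() for s in
        ["caesar rules rome", "brutus kills caesar", "rome falls"]]
print(rank("caesar rome".split(), docs))  # document 0 matches both terms
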
13
Q

scope vs. verbosity hypothesis

A

scope: a long document consists of a number of unrelated short documents concatenated together (covering more topics)

verbosity: a long document covers a similar scope to a short document, but uses more words (more of the same)
