C4: vector space model Flashcards
ranked retrieval (formal)
for each query Qi and document Dj, compute a score RSV(Qi, Dj): Retrieval Status Value
- at least ordinal scale
- value increases with relevance
- rank documents by RSV
set-based approach
query and document are both a set of terms => use Jaccard coefficient
RSV(Q,D) = Jaccard(terms(Q), terms(D))
Jaccard coefficient
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Jaccard distance = 1 - Jaccard similarity (satisfies triangle inequality, so can be used as a metric/distance function)
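A minimal sketch of set-based scoring with the Jaccard coefficient. The function names, the lowercase whitespace tokenization, and the choice to return 1.0 for two empty sets are assumptions made for illustration, not part of the definition above.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def rsv_jaccard(query, document):
    """RSV(Q, D) = Jaccard(terms(Q), terms(D)), with naive tokenization."""
    return jaccard(query.lower().split(), document.lower().split())


# One shared term ("march") out of six distinct terms in total.
score = rsv_jaccard("ides of march", "caesar died in march")
distance = 1 - score  # Jaccard distance
```

Repeating "march" ten times in the document would leave the score unchanged, which is the term-frequency issue listed in the next card.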
issues with Jaccard for scoring
- we need to normalize for length (of the document or query)
- Jaccard does not consider term frequency
- Jaccard does not consider that rare terms in a collection are more informative than frequent terms
term-weighting approach
measures term importance in documents
log frequency of term t in d
w_t,d = 1 + log(tf_t,d) if tf_t,d > 0, otherwise 0
quantitative evidence of d being relevant for query term t
document score for a query: RSV = sum_(t in q) w_t,d
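A short sketch of log-frequency weighting and the resulting document score. The base of the logarithm is not stated above, so base 10 is an assumption here, as are the function names. The weight is taken to be 0 when the term does not occur in the document.

```python
import math
from collections import Counter


def log_tf_weight(tf):
    """w_t,d = 1 + log10(tf) for tf > 0, and 0 for an absent term."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def rsv_log_tf(query_terms, doc_terms):
    """Sum the log-frequency weights over the distinct query terms."""
    counts = Counter(doc_terms)
    return sum(log_tf_weight(counts[t]) for t in set(query_terms))


doc = "a a a b".split()
score = rsv_log_tf(["a", "b", "c"], doc)  # "c" is absent and contributes 0
```

The logarithm dampens raw counts: a term that occurs ten times gets weight 2, not ten times the weight of a single occurrence.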
document frequency df_t
the number of documents that contain t; an inverse measure of how informative t is for identifying a document
inverse document frequency
idf_t = log(N / df_t), where N is the number of documents in the collection
quantifies how surprised we are to see term t in a document
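To make the effect of idf concrete, the sketch below evaluates it for a hypothetical collection of one million documents. Base-10 logarithms and the function name are assumptions; the collection size and document frequencies are made-up illustration values.

```python
import math


def idf(n_docs, df):
    """idf_t = log10(N / df_t); df must be between 1 and N."""
    if not 0 < df <= n_docs:
        raise ValueError("df must satisfy 0 < df <= n_docs")
    return math.log10(n_docs / df)


N = 1_000_000  # hypothetical collection size
idf_by_df = {df: idf(N, df) for df in (1, 100, 10_000, 1_000_000)}
```

A term that appears in a single document gets the highest idf, while a term that appears in every document gets idf 0 and so carries no evidence.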
tf.idf
tf.idf(t,d) = (1 + log(tf_t,d)) × log(N / df_t)
so it increases with the number of times t occurs in d and with the rarity of t in the collection => evidence of d being relevant when looking for t
to find the relevance for a total query, sum tf.idf for each term in the query
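A minimal sketch of tf.idf scoring over a tiny made-up collection. The three documents, the names `docs`, `tf_idf` and `score`, and the base-10 logarithm are all assumptions for illustration.

```python
import math
from collections import Counter

# Toy collection (hypothetical), already tokenized.
docs = {
    "d1": "new york times".split(),
    "d2": "new york post".split(),
    "d3": "los angeles times".split(),
}

N = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(t for terms in docs.values() for t in set(terms))


def tf_idf(term, doc_terms):
    """(1 + log10 tf) × log10(N / df); 0 if the term is absent."""
    tf = doc_terms.count(term)
    if tf == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])


def score(query_terms, doc_terms):
    """Sum tf.idf over the distinct terms of the query."""
    return sum(tf_idf(t, doc_terms) for t in set(query_terms))


scores = {name: score(["new", "times"], terms) for name, terms in docs.items()}
```

Here d1 contains both query terms and scores highest; d2 and d3 each match one term with the same document frequency, so they tie.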
document ranking in vector space model
represent queries and documents as sparse vectors (dimensions are terms)
rank documents according to their proximity to the query in the vector space
use cosine similarity (not Euclidean distance, which is sensitive to vector length: two documents with the same term distribution but different lengths end up far apart)
cosine similarity
based on the angle between vectors q and d
cos(q,d) = (q · d) / (|q| × |d|)
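The formula can be sketched for sparse vectors stored as term-to-weight dictionaries. That representation, the helper names, and the choice to return 0.0 for a zero-length vector are assumptions, not something the card prescribes.

```python
import math


def dot(q, d):
    """Dot product over the terms of q; terms missing from d count as 0."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def norm(v):
    """Euclidean length |v| of a sparse vector."""
    return math.sqrt(sum(w * w for w in v.values()))


def cosine(q, d):
    """cos(q, d) = (q · d) / (|q| × |d|); 0.0 if either vector is empty."""
    denom = norm(q) * norm(d)
    return dot(q, d) / denom if denom else 0.0


d = {"vector": 1.0, "space": 2.0}
d_doubled = {t: 2 * w for t, w in d.items()}  # same direction, twice the length

same_direction = cosine(d, d_doubled)
euclidean_gap = norm({t: d_doubled[t] - d[t] for t in d})
```

The doubled vector has cosine similarity 1 with the original (up to floating-point rounding) even though the Euclidean distance between them is nonzero, which is why the angle is preferred for ranking.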
vector space ranking with term weighting
- represent each document and query as a weighted tf.idf vector
- compute cosine similarity score for the query vector and each document vector
- rank documents with respect to the query by score
- return top K to the user
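The four steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: whitespace tokenization, base-10 logarithms, the same tf.idf weighting for queries and documents, ties broken by document name, and a made-up three-document collection. The names `tf_idf_vector`, `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vector(terms, df, n_docs):
    """Sparse tf.idf vector; terms unseen in the collection are dropped."""
    vec = {}
    for t, tf in Counter(terms).items():
        if df.get(t, 0) > 0:
            vec[t] = (1 + math.log10(tf)) * math.log10(n_docs / df[t])
    return vec


def cosine(q, d):
    """Cosine similarity of two sparse vectors; 0.0 if either has length 0."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    q_len = math.sqrt(sum(w * w for w in q.values()))
    d_len = math.sqrt(sum(w * w for w in d.values()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0


def rank(query, docs, k):
    """Return the top k (name, score) pairs, best first."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(t for terms in tokenized.values() for t in set(terms))
    q_vec = tf_idf_vector(query.lower().split(), df, n_docs)
    scored = [
        (cosine(q_vec, tf_idf_vector(terms, df, n_docs)), name)
        for name, terms in tokenized.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, s) for s, name in scored[:k]]


docs = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}
top = rank("new new times", docs, k=2)
```

For this toy collection d1 ranks first because it matches both query terms, and d2 comes ahead of d3 because the query weights "new" more heavily than "times". A real system would use an inverted index instead of scoring every document.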
scope vs. verbosity hypothesis
scope: a long document consists of a number of unrelated short documents concatenated together (covering more topics)
verbosity: a long document covers a similar scope to a short document, but uses more words (more of the same)