Week 5 Flashcards
Basic Idea of Vector Space Model
- Relevance is modeled as similarity
- The more similar a document is to the query, the higher its assumed relevance
Vector Space Model Framework
- Represent each document and the query as a term vector
- Query vector q holds the weight of each term in the query
- Document vector d holds the weight of each term in the document
How is relevance measured in VSM
By the similarity between the query vector and the document vector
What is a bit vector representation
Each vector element is 1 if the word is present and 0 if it is absent
What is the dot product instantiation of similarity?
- A common way to measure similarity
- Computes the similarity between the document vector and the query vector
- Yields a score used to rank documents
Dot Product example
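- List every word in the vocabulary V
- Build the query vector over V: 1 for each query word, 0 otherwise
- Build a vector over V for each document: 1 if the word occurs in the document, 0 otherwise
- Compute the dot product of the query vector with each document vector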
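
The steps above can be sketched in a few lines of Python. This is only an illustration: the query, the three documents, and the helper names `bit_vector` and `dot` are made up for the example, and tokenization is a plain whitespace split.

```python
def bit_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy data, not taken from the flashcards.
query = "news about presidential campaign"
docs = {
    "d1": "news about organic food campaign",
    "d2": "news of presidential campaign",
    "d3": "presidential campaign news about presidential candidate",
}

# Vocabulary V: every distinct word in the query and the documents.
vocabulary = sorted({w for text in [query, *docs.values()] for w in text.lower().split()})

q_vec = bit_vector(query, vocabulary)
scores = {name: dot(q_vec, bit_vector(text, vocabulary)) for name, text in docs.items()}

# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

With bit vectors the score is just the number of distinct query words a document contains, so d1 and d2 tie even though d2 matches the rarer word "presidential".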
Problems with the simple VSM
- Should all terms have the same weight?
- Should some terms be more important than others?
- Should term frequency be considered?
Improved VSM instantiation
- Give more weight to more specific terms
- Use term frequency as the vector weight instead of 0/1
Improved VSM example
Rank with TF weighting
Represent the query and each document as term frequency vectors, then compute the dot product
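
A minimal sketch of TF-weighted ranking, assuming whitespace tokenization. The function name `tf_score` and the sample texts are hypothetical. Only words that appear in the query can contribute, so the dot product is computed over the query's words directly.

```python
from collections import Counter


def tf_score(query, doc):
    """Dot product of term frequency vectors: sum of c(w,q) * c(w,d)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    # Counter returns 0 for words missing from the document.
    return sum(q[w] * d[w] for w in q)


# Hypothetical toy data.
query = "news about presidential campaign"
d2 = "news of presidential campaign"
d3 = "presidential campaign news about presidential candidate"

print(tf_score(query, d2), tf_score(query, d3))
```

Here d3 gains an extra point over its bit vector score because "presidential" occurs twice.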
How to penalize popular terms in VSM
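Use IDF (inverse document frequency) weighting, smoothed with a log: IDF(w) = log((M+1)/k), where M = number of docs in the collection and k = number of docs containing the word
- Each matched word now contributes its count multiplied by its IDF to the dot product, so the document weight becomes c(w,d) * IDF(w)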
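
A rough sketch of IDF weighting, reading the formula as IDF(w) = log((M+1)/k). The collection, the helper names, and the use of the natural log are assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def document_frequency(docs):
    """k for every word: the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))  # count each word once per document
    return df


def idf(word, df, M):
    """IDF(w) = log((M + 1) / k); only defined for words seen in the collection."""
    return math.log((M + 1) / df[word])


def tf_idf_score(query, doc, df, M):
    """Sum of c(w,q) * c(w,d) * IDF(w) over query words found in the document."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(doc))
    return sum(q[w] * d[w] * idf(w, df, M) for w in q if d[w] > 0)


# Hypothetical toy collection.
docs = [
    "news about organic food campaign",
    "news of presidential campaign",
    "news about the weather",
    "a story about food",
]
M = len(docs)
df = document_frequency(docs)
query = "news about presidential campaign"

for text in docs[:2]:
    print(round(tf_idf_score(query, text, df, M), 3), text)
```

Both of the first two documents match three query words, but the second one now scores higher because "presidential" appears in only one document while "about" appears in three.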
VSM with BM25
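Transform the term frequency with ((k+1) * c(w,d)) / (c(w,d) + k), then multiply by IDF(w); k here is the BM25 parameter (k >= 0), not the document count k used in IDF
Ranking function = sum over matched query words of c(w,q) * ((k+1) * c(w,d)) / (c(w,d) + k) * IDF(w)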
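
A sketch of the BM25 term frequency transformation and the resulting score. The default k = 1.2 and the IDF values in the usage example are assumptions chosen for illustration, and the IDF weights are passed in as a precomputed dictionary.

```python
from collections import Counter


def bm25_tf(count, k=1.2):
    """BM25 TF transformation: (k + 1) * c / (c + k), for k > 0.

    Equals 0 when the word is absent, 1 for a single occurrence,
    and approaches the upper bound k + 1 as the count grows.
    """
    return (k + 1) * count / (count + k)


def bm25_score(query, doc, idf, k=1.2):
    """Sum of c(w,q) * bm25_tf(c(w,d)) * IDF(w) over the query words."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(q[w] * bm25_tf(d[w], k) * idf.get(w, 0.0) for w in q)


# The transformed value grows ever more slowly as the raw count increases.
for count in (0, 1, 2, 10, 100):
    print(count, round(bm25_tf(count), 3))

# Hypothetical IDF weights for a two-word query.
toy_idf = {"presidential": 2.0, "news": 1.0}
print(bm25_score("presidential news", "presidential presidential news", toy_idf))
```

Because the transformation is bounded by k + 1, repeating one query word many times cannot dominate the score the way raw term frequency can.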
What is the impact of document length on TR
- Long documents have more chances to match a query, so penalize them with a document length normalizer
Why might a document be long
- It uses more words to say the same thing -> needs more penalization
- It has more content -> needs less penalization
How to normalize/penalize doc length
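Use the pivoted length normalizer: Normalizer = 1 - b + b * (document length / average document length), with b between 0 and 1
- Normalizer = 1 if the document length is average
- Normalizer > 1 (penalty) for longer documents, < 1 (reward) for shorter ones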
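
The normalizer is easy to check numerically. In this small sketch the function name `pivoted_normalizer` and the default b = 0.75 are assumptions. Setting b = 0 switches length normalization off, and larger b penalizes long documents more strongly.

```python
def pivoted_normalizer(doc_len, avg_doc_len, b=0.75):
    """Pivoted length normalizer: 1 - b + b * (doc_len / avg_doc_len), 0 <= b <= 1."""
    return 1 - b + b * (doc_len / avg_doc_len)


avg_len = 100  # hypothetical average document length
for length in (50, 100, 200):
    print(length, pivoted_normalizer(length, avg_len, b=0.5))
```

With b = 0.5, a document half the average length gets 0.75, an average-length document gets 1.0, and one twice the average gets 1.5.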