W4 Vector Space Model Flashcards
- What is Ranked Retrieval?
- Formal definition of Ranked Retrieval
- Also called Relevance Ranking. The system returns an ordering of the (top) documents in the collection for a query.
- For each query Qi and doc Dj, compute a score RSV(Qi, Dj)
* RSV: Retrieval Status Value
* RSV is theory-neutral: it could be distance-based, probability-based, etc.
* Ordinal scale (supports >, <, =)
* Value increases with relevance
1 What’s the set-based approach for Ranked Retrieval?
Query and doc are each a set of terms
Give the calculation of Jaccard Coefficient
jaccard (A,B) = |A ∩ B| / |A ∪ B|
|A ∩ B|: Commonality
|A ∪ B|: Length Normalization
A and B don’t have to be the same size
It always assigns a number in [0, 1]
RSV(Q,D) = jaccard(terms(Q), terms(D))
What’s the query-doc match score computed by the Jaccard coefficient for each of the two docs below (common stop words are removed)?
Query: Ides of March -> {ide, march}
Doc 1: Caesar died in March -> {caesar, die, march}
Doc 2: the long march -> {long, march}
jaccard (A,B) = |A ∩ B| / |A ∪ B|
jaccard ({ide, march}, {caesar, die, march}) = 1 / 4
jaccard ({ide, march}, {long, march}) = 1 / 3
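A minimal Python sketch of the set-based score (the helper name is mine, not from the notes); it reproduces the two scores above:

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|; define 0.0 for two empty sets to avoid 0/0
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = {"ide", "march"}
doc1 = {"caesar", "die", "march"}  # "Caesar died in March"
doc2 = {"long", "march"}           # "the long march"

print(jaccard(query, doc1))  # 0.25      = 1/4
print(jaccard(query, doc2))  # 0.3333... = 1/3
```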
Give the calculation of:
Jaccard Distance?
Dice coefficient?
Jaccard Distance = 1 - Jaccard Similarity
Dice(A, B) = 2|A ∩ B| / (|A| + |B|) = 2J / (1 + J), where J = jaccard(A, B)
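A quick sketch checking the Dice formula and the identity Dice = 2J/(1 + J) on the example sets above (helper name is mine):

```python
def dice(a: set, b: set) -> float:
    # 2|A ∩ B| / (|A| + |B|); define 0.0 for two empty sets
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

a, b = {"ide", "march"}, {"caesar", "die", "march"}
j = len(a & b) / len(a | b)         # Jaccard = 1/4
print(1 - j)                        # Jaccard distance = 0.75
print(dice(a, b), 2 * j / (1 + j))  # 0.4 0.4 -- Dice matches 2J/(1+J)
```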
What is a metric?
A distance function d satisfying:
* Identity: d(x, y) = 0 <=> x = y
* Symmetry: d(x, y) = d(y, x)
* Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
* Non-negativity: d(x, y) >= 0
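Jaccard distance is in fact a metric; a quick spot check of the four axioms on the example sets above (a sanity check, not a proof):

```python
from itertools import product

def jdist(a: set, b: set) -> float:
    # Jaccard distance 1 - |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)
    return 0.0 if not (a | b) else 1 - len(a & b) / len(a | b)

sets = [{"ide", "march"}, {"caesar", "die", "march"}, {"long", "march"}]
for x, y, z in product(sets, repeat=3):
    assert jdist(x, y) >= 0                                  # non-negativity
    assert jdist(x, y) == jdist(y, x)                        # symmetry
    assert (jdist(x, y) == 0) == (x == y)                    # identity
    assert jdist(x, z) <= jdist(x, y) + jdist(y, z) + 1e-12  # triangle ineq.
print("all four axioms hold on these examples")
```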
What are the issues with Jaccard for scoring?
- we need to normalize for length (of the document or query)
- Jaccard does not consider term frequency
- Jaccard does not consider that rare terms in a collection are more informative than frequent terms
2 What’s the term-weighting approach for Ranked Retrieval?
Measure term importance in docs
Vector Space Model (VSM)
What is bag of words model?
A multiset of words; we don’t care about word ordering within a doc.
john loves mary
mary loves john
they have the same vectors
A positional index, however, can distinguish the two docs
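A tiny illustration: under the bag-of-words model the two docs are identical (Counter is Python’s multiset):

```python
from collections import Counter

bow1 = Counter("john loves mary".split())
bow2 = Counter("mary loves john".split())
print(bow1 == bow2)  # True: same multiset, so the same vector
```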
What is term frequency (TF)?
The term frequency tf (t,d) of term t in document d is defined as
the number of times that t occurs in d.
Raw term frequency is not what we want, because relevance does not increase proportionally with term frequency, so we use log normalization:
The log frequency weight of term t in d is:
w_(t,d) = 1 + log(tf_(t,d))   if tf_(t,d) > 0
        = 0                   otherwise
w_(t,d) is quantitative evidence of d being relevant for a query containing t
To score a doc for a query: sum w_(t,d) over the terms t shared by the query and the doc
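A sketch of log-tf scoring (log base 10 is an assumption; the notes don’t fix the base, and changing it only rescales all weights):

```python
import math

def log_tf_weight(tf: int) -> float:
    # w_{t,d} = 1 + log10(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    # sum w_{t,d} over query terms; terms absent from the doc contribute 0
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"caesar": 2, "die": 1, "march": 1}  # term -> raw tf in the doc
print(score(["ide", "march"], doc_tf))  # 1.0: only "march" matches (tf = 1)
```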
What is document frequency (DF)? How do we interpret it?
df(t) is the document frequency of t: the number of
documents that contain t
Rare terms are more informative than frequent terms (e.g. compare a stop word with a rare word like arachnocentric). A document containing arachnocentric is more likely to be relevant to a query containing it than a document that does not contain it.
- What is Inverse Document Frequency (IDF)?
- How to calculate TF-IDF?
- idf(t) = log(N / df(t))
N: number of docs in the collection
idf quantifies how surprised we are to see term t in a doc; again, the logarithm is used to dampen the effect of idf
- TF-IDF = TF * IDF
TF-IDF increases with the number of occurrences of the term within the doc, and with the rarity of the term in the collection.
TF-IDF is evidence of d being relevant when looking for t
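A sketch of idf and tf-idf (base-10 logs and the log-tf variant of TF are assumptions; docs are plain term lists):

```python
import math

def idf(term, docs):
    # idf(t) = log10(N / df(t)); df(t) = number of docs containing t
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    # (1 + log10(tf)) * idf(t); 0 when the term is absent from the doc
    tf = doc.count(term)
    return (1 + math.log10(tf)) * idf(term, docs) if tf else 0.0

docs = [["caesar", "die", "march"], ["long", "march"], ["ide", "march"]]
print(idf("march", docs))               # 0.0: occurs in every doc
print(tf_idf("caesar", docs[0], docs))  # ~0.477: rare term, tf = 1
```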
Document ranking in VSM
Docs are vectors of term weights in the space
The vector space is |V|-dimensional (|V|: no. of terms in the vocabulary)
terms are axes of the space
* Key point: vectors are high-dimensional and sparse
Queries are vectors too; how do we rank?
- Represent queries as vectors in the same space, and rank docs according to their proximity to the query in that space, i.e. the similarity of the vectors
The angle between the vectors measures similarity: the smaller the angle, the more similar the vectors
Give the formula for the cosine similarity (normalized dot product) of two vectors
cos(q, d) = (q * d) / (|q| * |d|)
          = \sum_i (q_i * d_i) / (\sqrt(\sum_i q_i^2) * \sqrt(\sum_i d_i^2))
If q and d are length-normalized (||x|| = \sqrt(\sum_i x_i^2) = 1 for both), this reduces to the plain dot product q * d
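A sketch of cosine ranking over sparse dict vectors (the weights here are dummy 1.0 per term; in practice use tf-idf weights):

```python
import math

def cosine(q: dict, d: dict) -> float:
    # cos(q, d) = q . d / (|q| * |d|) over sparse term -> weight vectors
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q  = {"ide": 1.0, "march": 1.0}
d1 = {"caesar": 1.0, "die": 1.0, "march": 1.0}
d2 = {"long": 1.0, "march": 1.0}
print(cosine(q, d1))  # ~0.408
print(cosine(q, d2))  # 0.5 -> d2 ranks higher (smaller angle to q)
```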