W4 Vector Space Model Flashcards
- What is Ranked Retrieval?
- Formalization Definition of Ranked Retrieval
- Also called Relevance Ranking. The system returns an ordering of the (top) documents in the collection for a query.
- For each query Qi and doc Dj, compute a score RSV(Qi, Dj)
* RSV: Retrieval Status Value
* RSV is theory-neutral: it could be distance based or probability based
* Ordinal scale (supports >, <, =)
* Value increases with relevance
1 What’s the set-based approach for Ranked Retrieval?
Query and doc are each a set of terms
Give the calculation of Jaccard Coefficient
jaccard (A,B) = |A ∩ B| / |A ∪ B|
|A ∩ B|: Commonality
|A ∪ B|: Length Normalization
A and B don’t have to be the same size
It always assigns a number in [0, 1]
RSV(Q,D) = jaccard(terms(Q), terms(D))
What’s the query-doc match score computed by the Jaccard coefficient for each of the 2 docs (common stop words are removed)?
Query: Ides of March -> {ide, march}
Doc 1: Caesar died in March -> {caesar, die, march}
Doc 2: the long march -> {long, march}
jaccard (A,B) = |A ∩ B| / |A ∪ B|
jaccard ({ide, march}, {caesar, die, march}) = 1 / 4
jaccard ({ide, march}, {long, march}) = 1 / 3
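A minimal Python sketch (the `jaccard` helper and variable names are mine) that reproduces the two scores above:

```python
def jaccard(a: set, b: set) -> float:
    """Set-based Jaccard: |A ∩ B| / |A ∪ B|, always in [0, 1]."""
    return len(a & b) / len(a | b)

query = {"ide", "march"}             # Ides of March
doc1 = {"caesar", "die", "march"}    # Caesar died in March
doc2 = {"long", "march"}             # the long march

print(jaccard(query, doc1))  # 0.25     (1/4)
print(jaccard(query, doc2))  # 0.333... (1/3)
```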
Give calculation of:
Jaccard Distance?
Dice coefficient?
Jaccard Distance = 1 - Jaccard Similarity
Dice(A, B) = 2|A ∩ B| / (|A| + |B|) = 2J / (J + 1), where J is the Jaccard similarity
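A quick sketch of both measures, including a check of the Dice = 2J/(J+1) identity on the sets from the earlier example:

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - jaccard(a, b)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

a, b = {"ide", "march"}, {"caesar", "die", "march"}
j = jaccard(a, b)                               # 0.25
assert abs(dice(a, b) - 2 * j / (j + 1)) < 1e-12
print(jaccard_distance(a, b), dice(a, b))       # 0.75 0.4
```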
What is a metric?
A distance function d satisfying four axioms:
* Identity: d(x, y) = 0 ⟺ x = y
* Symmetry: d(x, y) = d(y, x)
* Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
* Non-negativity: d(x, y) >= 0
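An illustrative Python check (not a proof) that Jaccard distance behaves like a metric on a few hand-picked sets; Jaccard distance is in fact a proper metric:

```python
from itertools import permutations

def jd(a: set, b: set) -> float:
    return 1 - len(a & b) / len(a | b) if a | b else 0.0

sets = [{"a"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
for x in sets:
    assert jd(x, x) == 0                        # identity
for x, y in permutations(sets, 2):
    assert jd(x, y) == jd(y, x) >= 0            # symmetry, non-negativity
for x, y, z in permutations(sets, 3):
    assert jd(x, z) <= jd(x, y) + jd(y, z)      # triangle inequality
```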
What are the issues with Jaccard for scoring?
- we need to normalize for length (of the document or query)
- Jaccard does not consider term frequency
- Jaccard does not consider that rare terms in a collection are more informative than frequent terms
2 What’s the term-weighting approach for Ranked Retrieval?
Measure term importance in docs
Vector Space Model (VSM)
What is bag of words model?
A multiset of words: we don’t care about word order within a doc.
john loves mary
mary loves john
they have the same vector under the bag of words model
A positional index can distinguish the two docs
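A tiny sketch with `collections.Counter` showing that the two docs collapse to the same bag of words:

```python
from collections import Counter

# Bag of words = multiset of tokens; order is discarded.
bow1 = Counter("john loves mary".split())
bow2 = Counter("mary loves john".split())
print(bow1 == bow2)  # True: identical vectors under the bag-of-words model
```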
What is term frequency (TF)?
The term frequency tf (t,d) of term t in document d is defined as
the number of times that t occurs in d.
Raw term frequency is not what we want, because relevance does not increase proportionally with term frequency, so we use log normalization:
The log frequency weight of term t in d is:
w_{t,d} = 1 + \log(tf_{t,d}) if tf_{t,d} > 0
w_{t,d} = 0 otherwise
w_{t,d} is quantitative evidence of d being relevant for query term t
To score a doc for a query, sum the weights of the terms the query and doc share: score(q, d) = \sum_{t \in q \cap d} w_{t,d}
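A sketch of log-frequency weighting and the resulting score; base 10 and the toy tf counts are my assumptions (any fixed log base yields the same ranking):

```python
import math

def log_tf_weight(tf: int) -> float:
    # w_{t,d} = 1 + log(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms: set, doc_tf: dict) -> float:
    # Sum w_{t,d} over the terms the query and doc share.
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"march": 2, "caesar": 1}        # hypothetical counts
print(score({"ide", "march"}, doc_tf))    # 1 + log10(2) ≈ 1.301
```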
What is document frequency (DF)? How do we interpret it?
df(t) is the document frequency of t: the number of
documents that contain t
Rare terms are more informative than frequent terms (compare a stop word with the word arachnocentric). A document containing arachnocentric is more likely to be relevant to a query mentioning it than a document that does not contain it.
- What is Inverse Document Frequency (IDF)?
- How to calculate TF-IDF?
- idf = log (N/df)
N: number of docs in the collection
idf quantifies how surprised we are to see term t in a doc; as with tf, the logarithm is used to dampen the raw value
- TF-IDF = TF * IDF
TF-IDF increases with the number of occurrences within the doc, and with the rarity of the term in the collection.
TF-IDF is evidence of d being relevant when looking for t
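A sketch of idf and tf-idf with made-up N and df values; here TF is taken as the log-weighted tf from the previous card, which is one common choice:

```python
import math

N = 1_000_000  # hypothetical collection size

def idf(df: int) -> float:
    return math.log10(N / df)

def tf_idf(tf: int, df: int) -> float:
    return (1 + math.log10(tf)) * idf(df) if tf > 0 else 0.0

print(idf(1))          # 6.0: a term in only one doc is maximally informative
print(idf(N))          # 0.0: a term in every doc carries no signal
print(tf_idf(3, 100))  # ≈ 5.91: frequent in the doc, rare in the collection
```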
Document ranking in VSM
Docs are vectors of term weights in the space
The vector space is |V|-dimensional (|V|: no. of terms in the vocabulary)
terms are axes of the space
**Vectors are high-dimensional and sparse**
Queries are vectors too: how do we rank?
- Represent queries as vectors in the same space; rank docs according to their proximity to the query in the space = similarity of the vectors
The angle between vectors shows the similarity
Give formula for cosine similarity (dot product) of two vectors
cos(q, d) = (q \cdot d) / (||q|| \cdot ||d||)
= \sum_{i=1}^{|V|} q_i d_i / (\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2})
If q and d are length-normalized (||x|| = \sqrt{\sum_i x_i^2} = 1), the cosine reduces to the dot product q \cdot d
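A direct transcription of the formula over toy dense vectors (the 3-dimensional example vectors are arbitrary):

```python
import math

def cosine(q: list, d: list) -> float:
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd)

q = [1.0, 0.0, 1.0]
d = [2.0, 1.0, 0.0]
print(cosine(q, d))  # 2 / (√2 · √5) ≈ 0.632
```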
@#@ NOTE @#@
The calculations in this chapter are very important and will appear in the exam, so refer to the slides for examples and do the calculations yourself:
tf, df, idf, TF-IDF, length normalization, dot product, cosine similarity
Summary: Vector Space ranking with term weighting
- represent each document and query as weighted tf-idf vector
- compute cosine similarity score for the query vector and each document vector
- rank documents with respect to the query by score
- return top K to the user
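An end-to-end sketch of this recipe on a toy corpus; the corpus, the whitespace tokenizer, and the (1 + log10 tf) * log10(N/df) weighting are illustrative assumptions, not the only choices:

```python
import math
from collections import Counter

docs = {
    "d1": "caesar died in march",
    "d2": "the long march",
    "d3": "caesar crossed the rubicon",
}
tokens = {d: text.split() for d, text in docs.items()}
N = len(docs)
df = Counter(t for toks in tokens.values() for t in set(toks))

def tf_idf_vec(toks: list) -> dict:
    # Drop terms unseen in the collection: their idf is undefined.
    tf = Counter(t for t in toks if t in df)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items()}

def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

q = tf_idf_vec("ides of march".split())
ranked = sorted(docs, key=lambda d: cosine(q, tf_idf_vec(tokens[d])), reverse=True)
print(ranked)  # e.g. ['d2', 'd1', 'd3']: return the top K of this list
```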
Scope vs. verbosity hypothesis (two explanations of why documents are long)
scope: a long document consists of a number of unrelated short documents concatenated together
verbosity: a long document covers a similar scope to a short document, but uses more words
Criteria for comparing IR models/systems
effectiveness, efficiency, explainability, parsimony (Occam’s Razor)