Vector Space Models Flashcards
Give examples of common information retrieval models
- Boolean model
- Vector space model
- Probabilistic model
- Language model
- Neural network model
- Graph-based model
What is the vector-space model?
Represents documents and queries as vectors in multi-dimensional space, allowing for ranking by calculating similarity between vectors
What is the assumption associated with similarity based models like vector space?
If a document is more similar to the query than another document, it must have higher relevance
What is a term vector?
A vector that represents a query or a doc where each term (word or phrase) defines one dimension of the space.
Ex: q = (x1, …, xn), where xi is the weight of query term i
What is not specified to us by the vector space model?
- How we should define the dimensions
- How we should place doc vectors in the space
- How we should place the query vector in the space
- How we measure the similarity between query and doc vectors
What is Bag of Words?
A model for text representation that treats documents as collections of individual words regardless of their order
How is bag of words used in the vector space model?
Each word in the vocabulary becomes one dimension of the space
What is a bit vector?
A way of representing documents in binary format, primarily for boolean retrieval. A 1 indicates a document contains a vocabulary term
What is a similarity function?
A method of measuring similarity between query and doc vectors
What is the dot product similarity function?
Measures the similarity between two vectors as the sum of the products of their components
Sim(q, d) = q.d = x1y1 + … + xnyn
For a bit-vector representation, xi and yi are each 0 or 1, so the dot product counts the number of overlapping terms between query and document
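The dot product over bit vectors can be sketched as follows (the function name and example vocabulary are illustrative, not from the source):

```python
def dot_product(q, d):
    """Dot-product similarity: sum of element-wise products."""
    return sum(x * y for x, y in zip(q, d))

# Vocabulary: ["news", "about", "presidential", "campaign", "food"]
q = [1, 1, 1, 1, 0]  # query bit vector
d = [1, 0, 1, 1, 0]  # document bit vector
print(dot_product(q, d))  # 3: the number of overlapping terms
```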
What are some issues with the dot product similarity function?
- Matching a word multiple times in a doc deserves more credit
- Matching some words is more important than matching others
How can we improve the bit vector representation?
Turning it into a term frequency vector. Each dimension of the vector now holds the count of a word rather than just its presence
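Building a term frequency vector from a bag of words can be sketched like this (the vocabulary and whitespace tokenization are assumptions for the example):

```python
from collections import Counter

def tf_vector(doc_tokens, vocab):
    """Map a token list to word counts over a fixed vocabulary."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

vocab = ["news", "about", "campaign", "food"]
doc = "news about campaign news".split()
print(tf_vector(doc, vocab))  # [2, 1, 1, 0]
```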
What problem does the term frequency vector solve and what problem is left unsolved?
It solves the problem of the frequency of words being important but it does not solve the issue of some words being more important than others
What is document frequency?
The count of documents that contain a particular term
What is inverse document frequency?
Weighting a word more heavily when it does not occur in many documents
What is the formula for inverse document frequency?
IDF(W) = log((M+1)/k)
Where M is the total number of docs and k is the number of docs containing W
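A direct transcription of the IDF formula above (the function name is illustrative):

```python
import math

def idf(k, M):
    """IDF(W) = log((M+1)/k), where M is the total number of docs
    and k is the number of docs containing W."""
    return math.log((M + 1) / k)

# A word in every document gets a low weight; a rare word a high one
print(idf(k=100, M=100))  # ≈ 0.00995
print(idf(k=1, M=100))    # ≈ 4.615
```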
How can we use inverse document frequency to solve an issue with the dot product function?
Multiply the document's bit vector or term frequency vector element-wise by a vector of inverse document frequencies, one per term, to increase the weight of rare words
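The element-wise IDF weighting can be sketched as follows (the helper name is illustrative):

```python
import math

def tf_idf(tf_vec, df_vec, M):
    """Multiply each term count by IDF(W) = log((M+1)/k),
    where k is the term's document frequency and M the corpus size."""
    return [tf * math.log((M + 1) / df) for tf, df in zip(tf_vec, df_vec)]

# A rare term (df=1) is boosted far more than a ubiquitous one (df=100)
print(tf_idf([2, 1], [1, 100], M=100))
```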
What is the issue with the vector space model using IDF weighting?
Some irrelevant documents get ranked highly because one query word occurs in them very frequently. A high term frequency does not necessarily indicate a relevant document
How can we solve the issue associated with VSM using IDF weighting?
Using a term frequency transformation function to dampen the effect of high TF
What is the BM25 function?
A term frequency transformation function.
y = ((k+1)x)/(x+k)
k increases the dampening effect on TF
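A transcription of the transformation above, showing the dampening (k = 1.2 is a common choice, not prescribed by the source):

```python
def bm25_tf(x, k=1.2):
    """BM25 TF transformation: y = ((k+1)x)/(x+k).
    Bounded above by k+1, so huge raw counts cannot dominate."""
    return ((k + 1) * x) / (x + k)

# Raw counts 1, 10, 100 are all dampened toward the bound k+1 = 2.2
for x in [1, 10, 100]:
    print(round(bm25_tf(x), 3))  # 1.0, 1.964, 2.174
```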
How can you create a ranking function using BM25?
For each query word w, multiply the BM25 transformation of c(w, d), the count of w in the document, by the inverse document frequency of w and by c(w, q), the count of w in the query. Sum over all words in the query
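A sketch of the summed ranking function (function and argument names are illustrative; k is the BM25 dampening constant):

```python
import math

def bm25_score(query_counts, doc_counts, df, M, k=1.2):
    """f(q,d) = sum over query words w of
    c(w,q) * ((k+1)*c(w,d)) / (c(w,d)+k) * log((M+1)/df(w))."""
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)          # c(w,d): count of w in doc
        idf = math.log((M + 1) / df[w])     # rarity weight for w
        score += cwq * ((k + 1) * cwd) / (cwd + k) * idf
    return score

print(bm25_score({"campaign": 1}, {"campaign": 3}, {"campaign": 10}, M=100))
```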
Why is BM25 a good function to use?
It has an upper bound, it is robust, and it is effective
What problem does document length pose to search relevancy?
Long documents are more likely to match any query simply because they contain more words, so we need a way to penalize long documents
Why is it difficult to normalize a document based on length?
It may be long because it uses more words which is meaningless or because it has more content which is meaningful
What is the pivoted length normalizer?
A way of normalizing document length based on the average size of a document within the corpus. Penalizes documents longer than average
What is the equation for the pivoted length normalizer?
Normalizer = 1 - b + b(|d|/avdl)
Where b is a constant between 0 and 1 and determines the degree of penalization
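The normalizer above, transcribed directly (b = 0.75 is a common default, an assumption here):

```python
def pivoted_norm(doc_len, avdl, b=0.75):
    """Pivoted length normalizer: 1 - b + b * (|d|/avdl).
    Greater than 1 for documents longer than average."""
    return 1 - b + b * (doc_len / avdl)

print(pivoted_norm(100, 100))  # 1.0: average-length doc, no penalty
print(pivoted_norm(200, 100))  # 1.75: longer-than-average doc penalized
```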
How can we further improve the vector space model?
Refining the definition of a dimension to reduce dimensionality. Can do this through stemming, clustering, removing stop words
What are some ways of calculating similarity between query and doc vectors?
- Cosine of angle between two vectors (cosine distance)
- Euclidean distance
- Dot product
What does it mean that most document vectors are sparse?
The vector records every term in the vocabulary and since most are not likely to be within the document, most elements will be 0
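Because of this sparsity, implementations often store only the nonzero entries. A minimal sketch (the dict-based layout is one common choice, not prescribed by the source):

```python
# Sparse representation: only nonzero entries are stored as {term: count};
# every vocabulary term absent from the dict is implicitly 0.
doc = {"news": 2, "about": 1, "campaign": 1}

def sparse_dot(a, b):
    """Dot product that touches only terms actually present in a."""
    return sum(v * b.get(t, 0) for t, v in a.items())

query = {"news": 1, "food": 1}
print(sparse_dot(doc, query))  # 2: only "news" overlaps
```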
What is Euclidean distance?
Measures straight line distance between end points of the two vectors in euclidean space
How do we calculate euclidean distance?
d(P, Q) = sqrt((p1-q1)^2 + … + (pn-qn)^2)
Lower distance means more similarity
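A direct transcription of the formula (the function name is illustrative):

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt(sum of squared differences)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([1, 0, 1], [0, 1, 1]))  # sqrt(2) ≈ 1.414
```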
Why is Euclidean distance a bad idea to use for relevancy?
The distance value is large for vectors of very different lengths, so similar documents may be ranked far apart simply because their vectors differ in magnitude
How can we solve the issue of Euclidean distance misrepresenting similarity between documents?
Instead rank by the angle between the two vectors: rank in increasing order of the angle or, equivalently, in decreasing order of the cosine of the angle
How do we calculate the cosine similarity of two vectors?
cos(q, d) = (q.d)/(|q||d|)
Where
q.d = q1d1 + … + qndn
|q| = sqrt(q1^2 + … + qn^2)
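The formula above can be sketched directly; note that scaling a vector leaves the cosine unchanged, which is what fixes the Euclidean-distance length problem:

```python
import math

def cosine(q, d):
    """cos(q,d) = (q . d) / (|q| |d|)."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    return dot / (norm_q * norm_d)

# A doubled vector points the same direction: cosine stays 1.0
print(cosine([1, 2, 0], [2, 4, 0]))  # 1.0
```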
What are some issues with the vector space model?
- There is no semantic information, actual meaning of words is lost
- Missing syntactic information like proximity and phrase structure
- Assumption of term independence
- Lacks control of a boolean model like requiring that some term be present