Vector Space Models Flashcards
Give examples of common information retrieval models
- Boolean model
- Vector space model
- Probabilistic model
- Language model
- Neural network model
- Graph-based model
What is the vector-space model?
A retrieval model that represents documents and queries as vectors in a multi-dimensional space, so that documents can be ranked by the similarity of their vectors to the query vector
What assumption underlies similarity-based models such as the vector space model?
If a document is more similar to the query than another document, it is assumed to be more relevant
What is a term vector?
A vector that represents a query or a doc where each term (word or phrase) defines one dimension of the space.
Ex: q = (x1, …, xn), where xi is the weight of query term i
What does the vector space model leave unspecified?
- How we should define the dimensions
- How we should place doc vectors in the space
- How we should place the query vector in the space
- How we measure the similarity between query and doc vectors
What is Bag of Words?
A model for text representation that treats documents as collections of individual words regardless of their order
How is bag of words used in the vector space model?
Each word in the vocabulary becomes one dimension of the space
What is a bit vector?
A binary representation of a document, used primarily for boolean retrieval, with one element per vocabulary term: 1 if the document contains the term, 0 if it does not
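The bit vector card can be illustrated with a minimal sketch. The function name `bit_vector`, the whitespace tokenization, and the sample vocabulary are assumptions made for illustration only.

```python
def bit_vector(text, vocabulary):
    """Return a 0/1 list with one element per vocabulary term.

    Assumption: tokens are lowercase words split on whitespace.
    """
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for this example.
vocabulary = ["news", "about", "presidential", "campaign", "food"]
doc = "news about the presidential campaign"

print(bit_vector(doc, vocabulary))  # [1, 1, 1, 1, 0]
```

Word order and repeated words have no effect on the result, which is what the bag of words assumption implies.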
What is a similarity function?
A method of measuring similarity between query and doc vectors
What is the dot product similarity function?
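Measures the similarity between two vectors as the sum of the products of their corresponding elements
Sim(q, d) = q · d
= x1y1 + … + xnyn
For a bit vector representation, each xi and yi is 0 or 1, so the dot product equals the number of distinct query terms that also appear in the doc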
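As a rough sketch of the dot product over bit vectors, the example below scores two documents against one query. The helper name `dot_product` and the sample vectors are hypothetical.

```python
def dot_product(q, d):
    """Sim(q, d) = x1*y1 + ... + xn*yn for two equal-length vectors."""
    if len(q) != len(d):
        raise ValueError("query and doc vectors must have the same length")
    return sum(x * y for x, y in zip(q, d))


# Hypothetical bit vectors over a five-term vocabulary.
q = [1, 1, 1, 1, 0]
d1 = [1, 1, 0, 0, 1]
d2 = [1, 1, 1, 1, 0]

print(dot_product(q, d1))  # 2 overlapping terms
print(dot_product(q, d2))  # 4 overlapping terms
```

With 0/1 elements, a product contributes 1 only when both the query and the doc contain the term, so the score is the overlap count and d2 ranks above d1.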
What are some issues with the dot product similarity function over bit vectors?
- It ignores term frequency: matching a word multiple times in a doc deserves more credit
- It treats all terms equally: matching some words is more important than matching others
How can we improve the bit vector representation?
Turn it into a term frequency vector: each dimension now holds the count of the word in the doc rather than only its presence
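A term frequency vector can be sketched as below, assuming whitespace tokenization. The function name `tf_vector` and the sample text are illustrative, not taken from the source.

```python
from collections import Counter


def tf_vector(text, vocabulary):
    """Return the count of each vocabulary term in the text.

    Assumption: tokens are lowercase words split on whitespace.
    """
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for this example.
vocabulary = ["news", "about", "presidential", "campaign"]
doc = "campaign news and more campaign news about the campaign"

print(tf_vector(doc, vocabulary))  # [2, 1, 0, 3]
```

A bit vector for the same doc would be [1, 1, 0, 1], so the three matches on "campaign" would earn no extra credit.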
What problem does the term frequency vector solve and what problem is left unsolved?
It gives more credit to a doc that matches a query word many times, but it still treats every word as equally important
What is document frequency?
The count of documents that contain a particular term
What is inverse document frequency?
A term weight that is higher for words that occur in few documents and lower for words that occur in many, so matching a rare word counts more
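The two cards above can be sketched together. The deck does not give an IDF formula, so the one used here, log((M + 1) / df) with M the number of documents, is an assumption: it is one common choice among several. The function names and the tiny collection are also hypothetical.

```python
import math


def document_frequency(term, docs):
    """Count the documents that contain the term at least once."""
    return sum(1 for doc in docs if term in set(doc.lower().split()))


def idf(term, docs):
    """Assumed formula: log((M + 1) / df), where M = len(docs).

    Returns 0.0 for a term that appears in no document (a choice made
    for this sketch to avoid dividing by zero).
    """
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0
    return math.log((len(docs) + 1) / df)


# Hypothetical three-document collection.
docs = [
    "news about the campaign",
    "news about food",
    "the presidential campaign",
]

print(document_frequency("news", docs))          # 2
print(document_frequency("presidential", docs))  # 1
print(idf("presidential", docs) > idf("news", docs))  # True
```

Multiplying each term frequency by its IDF before taking the dot product is one way to give rarer words more influence on the ranking.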