Vector Space Models Flashcards
Give examples of common information retrieval models
- Boolean model
- Vector space model
- Probabilistic model
- Language model
- Neural network model
- Graph-based model
What is the vector-space model?
Represents documents and queries as vectors in multi-dimensional space, allowing for ranking by calculating similarity between vectors
What is the assumption associated with similarity based models like vector space?
If a document is more similar to the query than another document, it must have higher relevance
What is a term vector?
A vector that represents a query or a doc where each term (word or phrase) defines one dimension of the space.
Ex: q = (x1, …, xn), where xi is the weight of query term i
What is not specified to us by the vector space model?
- How we should define the dimensions
- How we should place doc vectors in the space
- How we should place the query vector in the space
- How we measure the similarity between query and doc vectors
What is Bag of Words?
A model for text representation that treats documents as collections of individual words regardless of their order
How is bag of words used in the vector space model?
Each word in the vocabulary becomes one dimension of the space
What is a bit vector?
A way of representing documents in binary format, primarily for boolean retrieval. A 1 indicates a document contains a vocabulary term
What is a similarity function?
A method of measuring similarity between query and doc vectors
What is the dot product similarity function?
Measures the similarity between two vectors as the sum of the products of their components
Sim(q, d) = q.d = x1y1 + … + xnyn
For a bit-vector representation, xi and yi are each 0 or 1, so the dot product counts the number of overlapping terms between query and document
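The dot product over bit vectors can be sketched as follows (the function name and example vocabulary are illustrative, not from the source):

```python
def dot_product(q, d):
    """Dot-product similarity: sum of element-wise products."""
    return sum(x * y for x, y in zip(q, d))

# Vocabulary: ["news", "about", "presidential", "campaign", "food"]
q = [1, 1, 1, 1, 0]  # query bit vector
d = [1, 0, 1, 1, 0]  # document bit vector
print(dot_product(q, d))  # 3: the number of overlapping terms
```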
What are some issues with the dot product similarity function?
- Matching a word multiple times in a doc deserves more credit
- Matching some words is more important than matching others
How can we improve the bit vector representation?
Turning it into a term frequency vector. Each dimension of the vector now holds the count of a word rather than just its presence
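Building a term frequency vector from a bag of words can be sketched like this (the vocabulary and whitespace tokenization are assumptions for the example):

```python
from collections import Counter

def tf_vector(doc_tokens, vocab):
    """Map a token list to word counts over a fixed vocabulary."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

vocab = ["news", "about", "campaign", "food"]
doc = "news about campaign news".split()
print(tf_vector(doc, vocab))  # [2, 1, 1, 0]
```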
What problem does the term frequency vector solve and what problem is left unsolved?
It solves the problem of the frequency of words being important but it does not solve the issue of some words being more important than others
What is document frequency?
The count of documents that contain a particular term
What is inverse document frequency?
Weighting a word more heavily when it does not occur in many documents
What is the formula for inverse document frequency?
IDF(W) = log((M+1)/k)
Where M is the total number of docs and k is the number of docs containing W
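A direct transcription of the IDF formula above (the function name is illustrative):

```python
import math

def idf(k, M):
    """IDF(W) = log((M+1)/k), where M is the total number of docs
    and k is the number of docs containing W."""
    return math.log((M + 1) / k)

# A word in every document gets a low weight; a rare word a high one
print(idf(k=100, M=100))  # ≈ 0.00995
print(idf(k=1, M=100))    # ≈ 4.615
```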
How can we use inverse document frequency to solve an issue with the dot product function?
Multiply the document's bit vector or term frequency vector element-wise by a vector of inverse document frequencies, one per term, to increase the weight of rare words
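The element-wise IDF weighting can be sketched as follows (the helper name is illustrative):

```python
import math

def tf_idf(tf_vec, df_vec, M):
    """Multiply each term count by IDF(W) = log((M+1)/k),
    where k is the term's document frequency and M the corpus size."""
    return [tf * math.log((M + 1) / df) for tf, df in zip(tf_vec, df_vec)]

# A rare term (df=1) is boosted far more than a ubiquitous one (df=100)
print(tf_idf([2, 1], [1, 100], M=100))
```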
What is the issue with the vector space model using IDF weighting?
Some irrelevant documents get ranked highly because one query word occurs in them very frequently. A high term frequency does not necessarily indicate a relevant document
How can we solve the issue associated with VSM using IDF weighting?
Using a term frequency transformation function to dampen the effect of high TF
What is the BM25 function?
A term frequency transformation function.
y = ((k+1)x)/(x+k)
k increases the dampening effect on TF
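A transcription of the transformation above, showing the dampening (k = 1.2 is a common choice, not prescribed by the source):

```python
def bm25_tf(x, k=1.2):
    """BM25 TF transformation: y = ((k+1)x)/(x+k).
    Bounded above by k+1, so huge raw counts cannot dominate."""
    return ((k + 1) * x) / (x + k)

# Raw counts 1, 10, 100 are all dampened toward the bound k+1 = 2.2
for x in [1, 10, 100]:
    print(round(bm25_tf(x), 3))  # 1.0, 1.964, 2.174
```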
How can you create a ranking function using BM25?
For each query word w, multiply the BM25 transformation of c(w, d), the count of w in the document, by the inverse document frequency of w and by c(w, q), the count of w in the query. Sum over all words in the query
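A sketch of the summed ranking function (function and argument names are illustrative; k is the BM25 dampening constant):

```python
import math

def bm25_score(query_counts, doc_counts, df, M, k=1.2):
    """f(q,d) = sum over query words w of
    c(w,q) * ((k+1)*c(w,d)) / (c(w,d)+k) * log((M+1)/df(w))."""
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)          # c(w,d): count of w in doc
        idf = math.log((M + 1) / df[w])     # rarity weight for w
        score += cwq * ((k + 1) * cwd) / (cwd + k) * idf
    return score

print(bm25_score({"campaign": 1}, {"campaign": 3}, {"campaign": 10}, M=100))
```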
Why is BM25 a good function to use?
It has an upper bound, it is robust, and it is effective
What problem does document length pose to search relevancy?
Long documents are more likely to match any query simply because they contain more words, so we need a way to penalize long documents
Why is it difficult to normalize a document based on length?
It may be long because it uses more words which is meaningless or because it has more content which is meaningful
What is the pivoted length normalizer?
A way of normalizing document length based on the average size of a document within the corpus. Penalizes documents longer than average
What is the equation for the pivoted length normalizer?
Normalizer = 1 - b + b(|d|/avdl)
Where b is a constant between 0 and 1 and determines the degree of penalization
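The normalizer above, transcribed directly (b = 0.75 is a common default, an assumption here):

```python
def pivoted_norm(doc_len, avdl, b=0.75):
    """Pivoted length normalizer: 1 - b + b * (|d|/avdl).
    Greater than 1 for documents longer than average."""
    return 1 - b + b * (doc_len / avdl)

print(pivoted_norm(100, 100))  # 1.0: average-length doc, no penalty
print(pivoted_norm(200, 100))  # 1.75: longer-than-average doc penalized
```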
How can we further improve the vector space model?
Refining the definition of a dimension to reduce dimensionality. Can do this through stemming, clustering, removing stop words
What are some ways of calculating similarity between query and doc vectors?
- Cosine of angle between two vectors (cosine distance)
- Euclidean distance
- Dot product
What does it mean that most document vectors are sparse?
The vector records every term in the vocabulary and since most are not likely to be within the document, most elements will be 0
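Because of this sparsity, implementations often store only the nonzero entries. A minimal sketch (the dict-based layout is one common choice, not prescribed by the source):

```python
# Sparse representation: only nonzero entries are stored as {term: count};
# every vocabulary term absent from the dict is implicitly 0.
doc = {"news": 2, "about": 1, "campaign": 1}

def sparse_dot(a, b):
    """Dot product that touches only terms actually present in a."""
    return sum(v * b.get(t, 0) for t, v in a.items())

query = {"news": 1, "food": 1}
print(sparse_dot(doc, query))  # 2: only "news" overlaps
```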
What is Euclidean distance?
Measures straight line distance between end points of the two vectors in euclidean space
How do we calculate euclidean distance?
d(P, Q) = sqrt((p1-q1)^2 + … + (pn-qn)^2)
Lower distance means more similarity
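A direct transcription of the formula (the function name is illustrative):

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt(sum of squared differences)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([1, 0, 1], [0, 1, 1]))  # sqrt(2) ≈ 1.414
```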
Why is Euclidean distance a bad idea to use for relevancy?
The distance value is large for vectors of very different lengths, so similar documents may be ranked far apart simply because their vectors differ in magnitude
How can we solve the issue of Euclidean distance misrepresenting similarity between documents?
Instead rank by the angle between the two vectors: rank in increasing order of the angle or, equivalently, in decreasing order of the cosine of the angle
How do we calculate the cosine similarity of two vectors?
cos(q, d) = (q.d)/(|q||d|)
Where
q.d = q1d1 + … + qndn
|q| = sqrt(q1^2 + … + qn^2)
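The formula above can be sketched directly; note that scaling a vector leaves the cosine unchanged, which is what fixes the Euclidean-distance length problem:

```python
import math

def cosine(q, d):
    """cos(q,d) = (q . d) / (|q| |d|)."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    return dot / (norm_q * norm_d)

# A doubled vector points the same direction: cosine stays 1.0
print(cosine([1, 2, 0], [2, 4, 0]))  # 1.0
```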
What are some issues with the vector space model?
- There is no semantic information, actual meaning of words is lost
- Missing syntactic information like proximity and phrase structure
- Assumption of term independence
- Lacks control of a boolean model like requiring that some term be present