Vector Space Models Flashcards

1
Q

Give examples of common information retrieval models

A
  1. Boolean model
  2. Vector space model
  3. Probabilistic model
  4. Language model
  5. Neural network model
  6. Graph-based model
2
Q

What is the vector-space model?

A

Represents documents and queries as vectors in multi-dimensional space, allowing for ranking by calculating similarity between vectors

3
Q

What is the assumption associated with similarity based models like vector space?

A

If a document is more similar to the query than another document, it must have higher relevance

4
Q

What is a term vector?

A

A vector that represents a query or a doc, where each term (word or phrase) defines one dimension of the space.
Ex: q = (x1, …, xn), where xi is the weight of query term i

5
Q

What is not specified to us by the vector space model?

A
  1. How we should define the dimensions
  2. How we should place doc vectors in the space
  3. How we should place the query vector in the space
  4. How we measure the similarity between query and doc vectors
6
Q

What is Bag of Words?

A

A model for text representation that treats documents as collections of individual words regardless of their order

7
Q

How is bag of words used in the vector space model?

A

Each word in the vocabulary becomes one dimension of the space

8
Q

What is a bit vector?

A

A way of representing documents in binary format, primarily for boolean retrieval. A 1 indicates a document contains a vocabulary term

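A minimal sketch of a bit vector, using a toy vocabulary and document made up for illustration:

```python
vocab = ["news", "about", "presidential", "campaign", "food"]

def bit_vector(doc_words, vocab):
    """1 if the vocabulary term appears in the document, else 0."""
    words = set(doc_words)
    return [1 if term in words else 0 for term in vocab]

d = "news about food".split()
print(bit_vector(d, vocab))  # [1, 1, 0, 0, 1]
```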
9
Q

What is a similarity function?

A

A method of measuring similarity between query and doc vectors

10
Q

What is the dot product similarity function?

A

Measures the similarity between vectors:
Sim(q, d) = q · d = x1y1 + … + xnyn
where q = (x1, …, xn) and d = (y1, …, yn). For a bit vector representation, each xi and yi is 0 or 1, so the dot product counts the number of overlapping terms

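A short sketch of the dot product on bit vectors (toy vectors for illustration):

```python
def dot(q, d):
    """Dot-product similarity: sum of element-wise products."""
    return sum(x * y for x, y in zip(q, d))

# For bit vectors this counts the overlapping terms:
q = [1, 1, 0, 0]
d = [1, 0, 1, 0]
print(dot(q, d))  # 1 overlapping term
```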
11
Q

What are some issues with the dot product similarity function?

A
  1. Matching a word multiple times in a doc deserves more credit
  2. Matching some words is more important than matching others
12
Q

How can we improve the bit vector representation?

A

Turning it into a term frequency vector. Each dimension of the vector now holds the count of a word rather than just its presence

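A minimal sketch of a term frequency vector, with a made-up vocabulary and document:

```python
from collections import Counter

def tf_vector(doc_words, vocab):
    """Each dimension holds the count of the term, not just presence."""
    counts = Counter(doc_words)
    return [counts[term] for term in vocab]

vocab = ["news", "about", "campaign"]
d = "news about campaign campaign news news".split()
print(tf_vector(d, vocab))  # [3, 1, 2]
```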
13
Q

What problem does the term frequency vector solve and what problem is left unsolved?

A

It solves the problem of the frequency of words being important but it does not solve the issue of some words being more important than others

14
Q

What is document frequency?

A

The count of documents that contain a particular term

15
Q

What is inverse document frequency?

A

Weighting a word more heavily when it does not occur in many documents

16
Q

What is the formula for inverse document frequency?

A

IDF(W) = log((M+1)/k)
Where M is the total number of docs and k is the number of docs containing W
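The formula as a small Python sketch (corpus sizes are toy numbers for illustration):

```python
import math

def idf(k, M):
    """IDF(W) = log((M + 1) / k), M = total docs, k = docs containing W."""
    return math.log((M + 1) / k)

# A rare word gets a much higher weight than a common one:
print(idf(k=10, M=10000))    # ≈ 6.91
print(idf(k=5000, M=10000))  # ≈ 0.69
```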

17
Q

How can we use inverse document frequency to solve an issue with the dot product function?

A

Multiply the document's bit vector or term frequency vector by a vector of inverse document frequencies, one per term, to increase the weight of rare words

18
Q

What is the issue with the vector space model using IDF weighting?

A

Some irrelevant documents get ranked highly because of a single frequently occurring word. A high term frequency does not necessarily indicate a relevant document

19
Q

How can we solve the issue associated with VSM using IDF weighting?

A

Using a term frequency transformation function to dampen the effect of high TF

20
Q

What is the BM25 function?

A

A term frequency transformation function.
y = ((k+1)x)/(x+k)
k increases the dampening effect on TF
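A sketch of the transformation, showing how the gain from repeated occurrences flattens (k = 1.2 is just an illustrative choice):

```python
def bm25_tf(x, k):
    """Sublinear TF transformation: y = (k + 1) x / (x + k).
    Bounded above by k + 1, so no single term can dominate."""
    return (k + 1) * x / (x + k)

# The output grows quickly at first, then saturates toward k + 1:
for x in (1, 2, 10, 100):
    print(x, bm25_tf(x, k=1.2))
```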

21
Q

How can you create a ranking function using BM25?

A

For each word shared by the query and the document, multiply the BM25-transformed term frequency by the inverse document frequency and by c(w, q), the count of the word in the query. Sum over all query terms
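A sketch of this ranking function, combining the pieces from the previous cards; the corpus statistics below are made up for illustration:

```python
import math

def bm25_score(query_counts, doc_counts, df, M, k=1.2):
    """Sum over query terms of
    c(w, q) * (k + 1) c(w, d) / (c(w, d) + k) * IDF(w).
    query_counts / doc_counts map term -> count; df maps term -> doc freq."""
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)
        if cwd == 0 or w not in df:
            continue  # only terms shared by query and doc contribute
        tf = (k + 1) * cwd / (cwd + k)
        idf = math.log((M + 1) / df[w])
        score += cwq * tf * idf
    return score

score = bm25_score({"campaign": 1, "news": 1},
                   {"campaign": 3, "about": 1},
                   df={"campaign": 10, "news": 50}, M=1000)
print(score)  # only "campaign" matches, so only it contributes
```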

22
Q

Why is BM25 a good function to use?

A

It has an upper bound, it is robust, and it is effective

23
Q

What problem does document length pose to search relevancy?

A

Long documents are more likely to match any query by virtue of having more words so we need to find a way to penalize a long document

24
Q

Why is it difficult to normalize a document based on length?

A

A document may be long because it is verbose (more words for the same content), which is meaningless, or because it genuinely has more content, which is meaningful

25
Q

What is the pivoted length normalizer?

A

A way of normalizing document length based on the average size of a document within the corpus. Penalizes documents longer than average

26
Q

What is the equation for the pivoted length normalizer?

A

Normalizer = 1 - b + b(|d|/avdl)
Where b is a constant between 0 and 1 that determines the degree of penalization
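A sketch of the normalizer; the average document length and b = 0.75 are illustrative values:

```python
def pivoted_norm(doc_len, avdl, b=0.75):
    """Normalizer = 1 - b + b * (|d| / avdl).
    > 1 for longer-than-average docs (penalty), < 1 for shorter ones."""
    return 1 - b + b * (doc_len / avdl)

avdl = 100
print(pivoted_norm(200, avdl))  # > 1: penalized
print(pivoted_norm(50, avdl))   # < 1: boosted
print(pivoted_norm(100, avdl))  # exactly 1 at the pivot
```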

27
Q

How can we further improve the vector space model?

A

Refining the definition of a dimension to reduce dimensionality. Can do this through stemming, clustering, removing stop words

28
Q

What are some ways of calculating similarity between query and doc vectors?

A
  1. Cosine of angle between two vectors (cosine distance)
  2. Euclidean distance
  3. Dot product
29
Q

What does it mean that most document vectors are sparse?

A

The vector records every term in the vocabulary and since most are not likely to be within the document, most elements will be 0

30
Q

What is Euclidean distance?

A

Measures the straight-line distance between the end points of the two vectors in Euclidean space

31
Q

How do we calculate euclidean distance?

A

d(P, Q) = sqrt((p1-q1)^2 + … + (pn-qn)^2)
A lower distance means more similarity
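A minimal sketch of the formula (toy vectors for illustration):

```python
import math

def euclidean(p, q):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([0, 3], [4, 0]))  # 5.0
```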

32
Q

Why is Euclidean distance a bad idea to use for relevancy?

A

The distance value is large for vectors of different lengths, so similar documents may be ranked far apart just for being of different lengths

33
Q

How can we solve the issue of Euclidean distance misrepresenting similarity between documents?

A

Instead rank by the angle between the vectors: in increasing order of the angle, or equivalently in decreasing order of the cosine of the angle

34
Q

How do we calculate the cosine similarity of two vectors?

A

cos(q, d) = (q · d) / (|q||d|)
Where
q · d = sum of qi·di over all i
|q| = sqrt(sum of qi^2 over all i)
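The formula as a short Python sketch (toy vectors for illustration):

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(x * y for x, y in zip(q, d))
    nq = math.sqrt(sum(x * x for x in q))
    nd = math.sqrt(sum(y * y for y in d))
    return dot / (nq * nd)

# Scaling a vector does not change the cosine, unlike Euclidean distance:
print(cosine([1, 2], [2, 4]))  # ≈ 1.0 (same direction)
print(cosine([1, 0], [0, 1]))  # 0.0 (orthogonal)
```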

35
Q

What are some issues with the vector space model?

A
  1. There is no semantic information, actual meaning of words is lost
  2. Missing syntactic information like proximity and phrase structure
  3. Assumption of term independence
  4. Lacks control of a boolean model like requiring that some term be present