Information Retrieval Flashcards

1
Q

What are the components of information retrieval?

A

Documents
Index
Query
Matching

2
Q

What is the formula for Zipf’s law?

A

F(r) = C / r^α

log F(r) = log C - α log r
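
A minimal sketch of checking Zipf's law on a toy corpus: count word frequencies, rank them, and estimate α and C with a least-squares fit in log-log space. The corpus string here is an assumed placeholder; any tokenised corpus works.

```python
import math
from collections import Counter

# Assumed toy corpus; replace with any tokenised text.
text = "the cat sat on the mat and the dog sat on the log " * 50
tokens = text.split()

# Rank words by frequency: rank 1 = most frequent word.
freqs = sorted(Counter(tokens).values(), reverse=True)
ranks = range(1, len(freqs) + 1)

# Zipf's law: F(r) = C / r^alpha, i.e. log F(r) = log C - alpha * log r.
# Estimate alpha and log C by ordinary least squares on (log r, log F(r)).
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
alpha = -(sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
log_c = y_bar + alpha * x_bar

print(f"estimated alpha = {alpha:.2f}, C = {math.exp(log_c):.1f}")
```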

3
Q

What are the two parts of text pre-processing?

A

Stop word removal

Stemming

4
Q

What is stop word removal?

A

The removal of common ‘noise words’ from text (e.g. ‘the’, ‘and’)

5
Q

What is stemming?

A

Reducing different inflected or derived forms of the same word to a common stem (e.g. ‘connected’, ‘connection’ → ‘connect’)
This reduces the number of unique words in a corpus but increases the number of instances of each remaining word
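
A minimal sketch of the two pre-processing steps above, using a hand-picked stop-word list and a toy suffix-stripping stemmer. Both are simplified assumptions: real systems use much longer stop lists and proper stemmers such as Porter's.

```python
# Small illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

# Toy suffix-stripping stemmer (assumption; a stand-in for e.g. the Porter stemmer).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [stem(t) for t in tokens]                      # stemming

print(preprocess("The dogs were barking and the cats jumped"))
# -> ['dog', 'were', 'bark', 'cat', 'jump']
```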

6
Q

What is the formula for the inverse document frequency?

A

IDF(t) = log(N_D / N_D_t)

(N_D = total number of documents, N_D_t = number of documents containing term t)

7
Q

What is the formula for the term frequency - inverse document frequency weight?

A

w_td = f_td · IDF(t)

(f_td = frequency of term t in document d)
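
A minimal sketch of the two formulas above, IDF(t) = log(N_D / N_D_t) and w_td = f_td · IDF(t), computed over a tiny assumed corpus of pre-processed documents.

```python
import math
from collections import Counter

# Assumed toy corpus: each document is a list of pre-processed terms.
docs = [
    ["information", "retrieval", "system"],
    ["database", "system", "query"],
    ["information", "query", "expansion"],
]

N_D = len(docs)

def idf(term: str) -> float:
    # IDF(t) = log(N_D / N_D_t), where N_D_t = number of docs containing t.
    n_dt = sum(1 for d in docs if term in d)
    return math.log(N_D / n_dt)

def tf_idf(doc: list[str]) -> dict[str, float]:
    # w_td = f_td * IDF(t), where f_td = frequency of t in d.
    counts = Counter(doc)
    return {t: f * idf(t) for t, f in counts.items()}

print(tf_idf(docs[0]))
```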

8
Q

What is the formula for the similarity between a document and a query?

A

sim(q, d) = [Σ over terms t in both q and d of (w_td · w_tq)] / (||q|| · ||d||)

9
Q

What’s the formula for document length?

A

||d|| = √(∑_t w_td^2)
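
A minimal sketch of the two cards above: cosine similarity between a query and a document, where both are TF-IDF weight dictionaries keyed by term and ||·|| is the length formula from this card. The example weight values are assumptions.

```python
import math

def norm(weights: dict[str, float]) -> float:
    # ||d|| = sqrt(sum of w_td^2)
    return math.sqrt(sum(w * w for w in weights.values()))

def similarity(q: dict[str, float], d: dict[str, float]) -> float:
    # sim(q, d) = [sum over shared terms of w_td * w_tq] / (||q|| * ||d||)
    shared = set(q) & set(d)
    dot = sum(q[t] * d[t] for t in shared)
    return dot / (norm(q) * norm(d))

q = {"information": 1.0, "query": 0.5}
d = {"information": 0.4, "retrieval": 1.1, "query": 0.7}
print(similarity(q, d))
```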

10
Q

What is the formula for recall?

A

recall = |retrieved ∩ relevant| / |relevant|

11
Q

What is the formula for precision?

A

precision = |retrieved ∩ relevant| / |retrieved|
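
A minimal sketch of both evaluation measures, treating the retrieved and relevant results as sets of document IDs (the IDs here are assumptions).

```python
retrieved = {"d1", "d2", "d3", "d4"}   # documents returned by the system
relevant = {"d2", "d4", "d5"}          # ground-truth relevant documents

hits = retrieved & relevant            # retrieved ∩ relevant

recall = len(hits) / len(relevant)     # 2 / 3
precision = len(hits) / len(retrieved) # 2 / 4

print(f"recall={recall:.2f}, precision={precision:.2f}")
```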

12
Q

What is query expansion?

A

Adding terms to a query in order to increase the overlap between the query and relevant documents

13
Q

What is term reweighting?

A

Increasing the weight of query terms that appear in relevant documents and decreasing the weights of terms that don’t appear in relevant documents
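
One standard way to implement term reweighting (and, implicitly, query expansion, since terms from relevant documents gain non-zero weight) is a Rocchio-style update. The sketch below assumes TF-IDF weight dictionaries and the common coefficient values alpha, beta, gamma; these are assumptions, not values from this deck.

```python
from collections import defaultdict

def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update: boost terms from relevant docs, demote the rest.

    query and each doc are dicts mapping term -> TF-IDF weight (assumption).
    """
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant_docs:
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant_docs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "cat": 0.6}]
nonrel = [{"jaguar": 0.5, "car": 0.9}]
print(rocchio(q, rel, nonrel))   # 'cat' enters the query; 'car' is suppressed
```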

14
Q

What is a hyponym?

A

A word whose meaning is a more specific subtype of another word’s meaning (e.g. ‘dog’ is a hyponym of ‘animal’)

15
Q

What is a hypernym?

A

A word whose meaning is a more general supertype of another word’s meaning (e.g. ‘animal’ is a hypernym of ‘dog’)

16
Q

What does the vector representation of a document contain?

A

One entry per term in the corpus vocabulary: the TF-IDF weight of that term in the document (zero if the term does not occur)

17
Q

What does latent semantic analysis do?

A

Discovers relationships between words automatically from the data

18
Q

How is the word-document matrix decomposed in latent semantic analysis?

A

A = U S V^T
U and V are orthogonal
S is a diagonal matrix of singular values

The largest singular value in S gives the strength of the most significant correlation; the corresponding columns of U and V give its direction in word space and document space respectively
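
A minimal sketch of the decomposition above using numpy's SVD on a small assumed word-document count matrix; keeping only the top k singular values gives the reduced latent space used by LSA.

```python
import numpy as np

# Assumed toy word-document matrix A (rows = words, columns = documents).
A = np.array([
    [2, 0, 1, 0],   # "information"
    [1, 0, 2, 0],   # "retrieval"
    [0, 3, 0, 1],   # "database"
    [0, 1, 0, 2],   # "sql"
], dtype=float)

# A = U S V^T, with U and V orthogonal; S is returned as a vector of
# singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest correlations (largest singular values).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 2))
print("rank-%d approximation:\n" % k, np.round(A_k, 2))
```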

19
Q

What is the formula for usefulness?

A

U(t) = P(t|T)log(P(t|T)/P(t))

20
Q

What is the formula for salience?

A

S(t) = P(T|t)log(P(T|t)/P(T))

Or

S(t) = P(T)*U(t)/P(t)
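
A minimal sketch of the usefulness and salience formulas from the two cards above, estimating the probabilities from raw counts. The count names and example values are assumptions: N = total word occurrences in the corpus, N_T = word occurrences in topic T, n_t and n_t_in_T = occurrences of term t overall and within T.

```python
import math

def usefulness_and_salience(n_t_in_T, N_T, n_t, N):
    """Return U(t) and S(t) estimated from raw counts (assumed inputs)."""
    p_t = n_t / N                  # P(t)
    p_t_given_T = n_t_in_T / N_T   # P(t|T)
    p_T = N_T / N                  # P(T)
    p_T_given_t = n_t_in_T / n_t   # P(T|t)

    u = p_t_given_T * math.log(p_t_given_T / p_t)   # U(t)
    s = p_T_given_t * math.log(p_T_given_t / p_T)   # S(t)
    # Equivalently, s == p_T * u / p_t (the second form of the formula).
    return u, s

print(usefulness_and_salience(n_t_in_T=30, N_T=1_000, n_t=60, N=100_000))
```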

21
Q

What are the steps of Latent Dirichlet Allocation?

A
  1. Make an initial estimate of N topics
  2. Decompose each document into its component topics
  3. Use the decomposition to re-estimate the topic word probabilities
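
One common way to implement this estimate/decompose/re-estimate loop is collapsed Gibbs sampling, which repeatedly re-assigns each word occurrence to a topic in proportion to how well the topic fits both the document and the word. This is a minimal sketch with assumed hyperparameters alpha and beta, not a full LDA implementation.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling sketch for LDA (docs = lists of tokens)."""
    vocab_size = len({w for doc in docs for w in doc})
    # Step 1: random initial topic assignment for every word occurrence.
    z = [[random.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]               # topic counts per doc
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the word's current assignment...
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...then re-sample its topic (steps 2 and 3 interleaved),
                # proportional to P(topic | document) * P(word | topic).
                weights = [
                    (doc_topic[d][k2] + alpha)
                    * (topic_word[k2][w] + beta) / (topic_total[k2] + beta * vocab_size)
                    for k2 in range(n_topics)
                ]
                k = random.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

docs = [["cat", "dog", "cat"], ["stock", "market", "stock"], ["dog", "market"]]
print(lda_gibbs(docs, n_topics=2, n_iter=100)[0])
```
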
22
Q

What is the recursive page rank formula?

A

pr_(n+1)(d) = ∑_e pr_n(e) · w_ed

(sum over documents e that link to d, where w_ed is the weight of the link from e to d)

23
Q

What is the Markov chain formula for page rank?

A

pr_(n+1) = W^T · pr_n
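
A minimal sketch of the matrix form above with numpy: W is an assumed row-stochastic link matrix (entry W[e, d] = w_ed), and the PageRank vector is found by repeated multiplication until it stops changing.

```python
import numpy as np

# Assumed link matrix W: W[e, d] = w_ed, each row sums to 1.
W = np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

pr = np.full(3, 1 / 3)          # start from a uniform distribution
for _ in range(100):
    nxt = W.T @ pr              # pr_(n+1) = W^T . pr_n
    if np.allclose(nxt, pr):
        break
    pr = nxt

print(np.round(pr, 3))
```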

24
Q

What is the damping factor?

A

The probability that a user will exit at any given page, denoted by δ

25
Q

What is the formula for page rank including damping?

A

pr(d) = (1-δ)/N + δ ∑_e pr_n(e) · w_ed

N = number of documents
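
A minimal sketch extending the undamped power iteration above with the damping term, applying pr(d) = (1-δ)/N + δ ∑_e pr_n(e) · w_ed at every iteration. The value δ = 0.85 and the link matrix are assumptions.

```python
import numpy as np

# Same assumed row-stochastic link matrix as in the undamped sketch.
W = np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
N = W.shape[0]
delta = 0.85                     # assumed damping factor

pr = np.full(N, 1 / N)
for _ in range(100):
    nxt = (1 - delta) / N + delta * (W.T @ pr)   # damped PageRank update
    if np.allclose(nxt, pr):
        break
    pr = nxt

print(np.round(pr, 3))
```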