Information Retrieval Flashcards
What are the components of information retrieval?
Documents
Index
Query
Matching
What is the formula for Zipf’s law?
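F(r) = C / r^α, where F(r) is the frequency of the word at rank r and C and α are constants
Taking logs gives a straight line: log(F(r)) = log(C) - α log(r)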
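A minimal Python sketch of both forms of Zipf's law. The function names and the values C = 1000 and α = 1 are illustrative assumptions, not taken from the cards.

```python
import math

def zipf_frequency(rank: int, c: float, alpha: float) -> float:
    """Predicted frequency F(r) = C / r^alpha for the word at this rank."""
    return c / rank ** alpha

def zipf_log_frequency(rank: int, c: float, alpha: float) -> float:
    """Log form: log(F(r)) = log(C) - alpha * log(r)."""
    return math.log(c) - alpha * math.log(rank)

# Illustrative constants only (assumed, not fitted to any corpus).
C, ALPHA = 1000.0, 1.0
for r in (1, 2, 10):
    print(r, zipf_frequency(r, C, ALPHA), zipf_log_frequency(r, C, ALPHA))
```

The two functions agree up to taking the logarithm, which is why a log-log plot of frequency against rank is roughly a straight line with slope -α.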
What are the two parts of text pre-processing?
Stop word removal
Stemming
What is stop word removal?
The removal of common ‘noise words’ from text (e.g. ‘the’, ‘and’)
What is stemming?
Removing irrelevant differences between different ‘versions’ of the same word (e.g. ‘walks’, ‘walked’ and ‘walking’ all reduce to ‘walk’)
This reduces the number of unique words in a corpus but increases the number of instances of each word
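The two pre-processing steps can be sketched in Python as follows. The stop word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins: a real system would use a fuller stop list and a proper stemmer such as Porter's.

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "of", "in", "to", "is"}
# Crude suffix list; this is NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def remove_stop_words(tokens):
    """Drop common 'noise words' from a list of tokens."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split on whitespace, remove stop words, then stem."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in remove_stop_words(tokens)]

print(preprocess("the walking and the walked walks"))
```

Here three distinct surface forms collapse to one stem, so the vocabulary shrinks while the count for that stem grows.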
What is the formula for the inverse document frequency?
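IDF(t) = log(ND / ND_t), where ND is the total number of documents and ND_t is the number of documents that contain term t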
What is the formula for the term frequency - inverse document frequency weight?
w_td = f_td · IDF(t), where f_td is the number of times term t occurs in document d
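A short Python sketch of IDF and TF-IDF weighting, assuming each document is a list of tokens. The natural logarithm and the choice to return 0 for a term that appears in no document are assumptions; the cards do not specify a log base or that edge case.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF(t) = log(ND / ND_t) over a collection of tokenised documents."""
    nd = len(documents)
    nd_t = sum(1 for doc in documents if term in doc)
    if nd_t == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(nd / nd_t)

def tf_idf_weights(doc, documents):
    """w_td = f_td * IDF(t) for every term t in the document."""
    counts = Counter(doc)  # f_td: raw term frequency in this document
    return {t: f * idf(t, documents) for t, f in counts.items()}

docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran", "ran"]]
print(tf_idf_weights(docs[2], docs))
```

A term that occurs in every document gets IDF(t) = log(1) = 0, so it contributes nothing to the weight however often it appears.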
What is the formula for the similarity between a document and a query?
sim(q,d) = (∑ w_td · w_tq) / (||q|| · ||d||), where the sum runs over all terms t that occur in both q and d
What’s the formula for document length?
||d|| = √(∑ w_td^2), where the sum runs over all terms t in d
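The similarity and length formulas can be sketched in Python, assuming the query and the document are each given as a dict mapping a term to its weight. The function names and the choice to return 0 when either vector has zero length are assumptions.

```python
import math

def length(weights):
    """||d|| = sqrt(sum of squared term weights)."""
    return math.sqrt(sum(w * w for w in weights.values()))

def similarity(query_weights, doc_weights):
    """Sum w_td * w_tq over shared terms, divided by ||q|| * ||d||."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    norm = length(query_weights) * length(doc_weights)
    return dot / norm if norm else 0.0  # assumption: empty vectors score 0

q = {"cat": 1.0, "dog": 1.0}
d = {"cat": 1.0}
print(similarity(q, d))
```

Dividing by the two lengths stops long documents from scoring highly just because they contain more terms.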
What is the formula for recall?
recall = |retrieved ∩ relevant| / |relevant|
What is the formula for precision?
precision = |retrieved ∩ relevant| / |retrieved|
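A small Python sketch of precision and recall, assuming the retrieved and relevant documents are given as sets of document IDs. Returning 0 when a set is empty is an assumption, since the formulas are undefined in that case.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document IDs."""
    hits = len(retrieved & relevant)  # |retrieved ∩ relevant|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(p, r)
```

In this example precision is 2/4 and recall is 2/3: half of what was returned is relevant, and two thirds of what is relevant was returned.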
What is query expansion?
Adding terms to a query in order to increase the overlap between the query and relevant documents
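One possible way to sketch query expansion in Python is to look up each query term in a table of related terms. The table below is hypothetical, made up for illustration; in practice the extra terms might come from a thesaurus or from relevant documents.

```python
# Hypothetical related-terms table, invented for this example.
RELATED_TERMS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

def expand_query(query_terms, related=RELATED_TERMS):
    """Append related terms to the query, keeping order and avoiding duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

print(expand_query(["cheap", "car"]))
```

A document that only says ‘automobile’ now shares a term with the expanded query, which it did not with the original.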
What is term reweighting?
Increasing the weight of query terms that appear in relevant documents and decreasing the weight of query terms that do not
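A rough Python sketch of term reweighting, using a simple multiplicative scheme. The boost and damp factors are hypothetical values chosen for illustration, and this is not a specific published relevance feedback algorithm.

```python
def reweight_query(query_weights, relevant_docs, boost=1.5, damp=0.5):
    """Scale each query term's weight up if it occurs in any relevant
    document, and down otherwise. boost and damp are assumed values."""
    relevant_terms = set().union(*relevant_docs)  # all terms seen in relevant docs
    return {
        term: weight * (boost if term in relevant_terms else damp)
        for term, weight in query_weights.items()
    }

query = {"car": 1.0, "red": 1.0}
relevant = [["car", "engine"], ["car", "wheel"]]
print(reweight_query(query, relevant))
```

Here ‘car’ appears in the relevant documents and gains weight, while ‘red’ does not and loses weight.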
What is a hyponym?
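A word with a more specific meaning than another word, so the things it names are a subset of the things the other word names (e.g. ‘dog’ is a hyponym of ‘animal’)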
What is a hypernym?
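A word with a more general meaning than another word, so the things it names are a superset of the things the other word names (e.g. ‘animal’ is a hypernym of ‘dog’)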