Information Retrieval Flashcards
Precision
The fraction of retrieved results that are relevant to the information need: P(Relevant | Retrieved)
Recall
The fraction of the relevant documents in the collection that are retrieved by the system: P(Retrieved | Relevant)
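As a quick check on the two definitions above, here is a minimal sketch that computes both from sets of document ids. The function name `precision_recall` and the document ids are made up for illustration, and returning 0.0 for an empty set is an assumption, not part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    # Assumption: report 0.0 when the denominator would be zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant, 6 relevant overall.
p, r = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d1", "d2", "d3", "d7", "d8", "d9"},
)
print(p, r)  # 0.75 0.5
```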
Term Frequency
For each document:
Number of times term i occurs in the document / Total number of terms in the document
Inverse Document Frequency
idf(i) = log_2( |D| / |{d ∈ D : t(i) ∈ d}| )
Log (base 2) of: number of documents in the collection / number of documents the term appears in
Term Weighting
Term Frequency x Inverse Document Frequency
TF-IDF
Basis for assigning weights to terms in documents. The weight rises with how often a term occurs within a document and falls with how many documents in the collection contain the term.
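The three cards above (term frequency, inverse document frequency, and their product) can be sketched in a few lines. This is only an illustration under stated assumptions: documents are already tokenized lists of words, the toy collection `docs` is invented, and returning 0.0 for a term that appears in no document is a choice made here because the formula is undefined in that case.

```python
import math


def term_frequency(term, doc_tokens):
    """Occurrences of the term in the document / total terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def inverse_document_frequency(term, docs):
    """log2(number of documents / number of documents containing the term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # assumption: the formula is undefined here, so return 0.0
    return math.log2(len(docs) / containing)


def tf_idf(term, doc_tokens, docs):
    """Term weight: term frequency x inverse document frequency."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, docs)


# Toy collection of four tokenized documents (made up for this example).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "far"],
    ["a", "bird", "flew"],
]

print(inverse_document_frequency("cat", docs))   # 1.0, "cat" is in 2 of 4 documents
print(inverse_document_frequency("bird", docs))  # 2.0, "bird" is in 1 of 4 documents
print(tf_idf("cat", docs[0], docs))              # (1/3) * 1.0
```

A term such as "the", which appears in three of the four documents, gets a low idf and therefore a low weight even where its term frequency is high.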
Sequence Pointer
Leaves in a B+Tree are linked to each other in a linked list, which makes range queries and ordered iteration through the blocks simple and efficient. This is an advantage over B-Trees and adds no significant space overhead.
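To illustrate why the sequence pointer helps, here is a minimal sketch of the leaf level only. The class `Leaf` and the helpers `link_leaves` and `range_query` are hypothetical names. A real B+Tree would descend through its internal nodes to find the first leaf that can hold the lower bound; this sketch simply starts from the leftmost leaf.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Leaf:
    keys: list = field(default_factory=list)  # sorted keys stored in this leaf block
    next: Optional["Leaf"] = None              # sequence pointer to the right sibling


def link_leaves(blocks):
    """Build leaves from sorted key blocks and chain them left to right."""
    leaves = [Leaf(keys=list(b)) for b in blocks]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0] if leaves else None


def range_query(start_leaf, low, high):
    """Collect keys in [low, high] by following sequence pointers from start_leaf."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result  # keys are sorted, so nothing further can match
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result


first = link_leaves([[1, 4], [7, 9], [12, 15], [20, 21]])
print(range_query(first, 5, 14))  # [7, 9, 12]
```

The scan never goes back up the tree: once the starting leaf is known, each further block is reached with a single pointer hop.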
Logical Query Plan
Abstract algebraic representation of the query. Its operators are taken from relational algebra.
Physical Query Plan
A concrete algorithm is selected for each operator in the logical plan, and the execution order of the operators is specified.
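The difference between the two kinds of plan can be shown with a toy sketch. Everything here is hypothetical: the `LogicalOp` class, the fixed `ALGORITHM_CHOICE` table, and the example query. A real optimizer would compare several candidate algorithms per operator, typically by estimated cost, rather than use a fixed lookup.

```python
from dataclasses import dataclass


@dataclass
class LogicalOp:
    name: str             # relational-algebra operator: "scan", "select", "join", "project"
    children: tuple = ()  # input operators


# Hypothetical fixed choice of one algorithm per logical operator.
ALGORITHM_CHOICE = {
    "scan": "sequential scan",
    "select": "filter",
    "join": "hash join",
    "project": "projection",
}


def to_physical(op):
    """Return (operator, algorithm) steps in execution order: inputs before consumers."""
    steps = []
    for child in op.children:
        steps.extend(to_physical(child))
    steps.append((op.name, ALGORITHM_CHOICE[op.name]))
    return steps


# Logical plan: project(join(scan R, select(scan S)))
scan_r = LogicalOp("scan")
scan_s = LogicalOp("scan")
select_s = LogicalOp("select", (scan_s,))
join = LogicalOp("join", (scan_r, select_s))
plan = LogicalOp("project", (join,))

for step in to_physical(plan):
    print(step)
```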
Rocchio method
Query refinement through relevance feedback. Run the original query, present the results, ask the user to mark them relevant or non-relevant, then ‘push’ the query vector towards the relevant document vectors and away from the non-relevant ones.
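The ‘push’ is usually written as new query = alpha * query + beta * centroid(relevant) - gamma * centroid(non-relevant). A minimal sketch of that update follows. The function names and the example vectors are invented, and the default weights (1.0, 0.75, 0.15) are an assumption chosen for illustration, not values given in these cards.

```python
def centroid(vectors, dim):
    """Component-wise mean of the vectors, or the zero vector if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards the relevant centroid and away from the non-relevant one."""
    dim = len(query)
    rel = centroid(relevant, dim)
    non = centroid(non_relevant, dim)
    return [alpha * query[i] + beta * rel[i] - gamma * non[i] for i in range(dim)]


# Three-term vocabulary; the user marked two results relevant and one non-relevant.
new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
print(new_query)  # approximately [1.75, 0.75, -0.15]
```

The second term, absent from the original query, gains weight because the relevant documents contain it, while the third term is pushed negative because only the non-relevant document contains it.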
Stop Word Removal
Remove extremely common words, as listed in a ‘stop list’, before indexing, e.g. a, of, the.
Stemming
Reduce the variant forms of a word to a common stem, e.g. by suffix-stripping or a lookup table.
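Stop word removal and stemming are often applied one after the other when building index terms. The sketch below is deliberately naive: `STOP_LIST` and `SUFFIXES` are tiny invented lists, and `strip_suffix` is a crude suffix-stripper for illustration, not an established stemming algorithm.

```python
STOP_LIST = {"a", "an", "of", "the", "and", "to", "in"}  # tiny illustrative stop list
SUFFIXES = ("ing", "ed", "es", "s")                      # naive suffix list, checked in order


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_LIST]


def strip_suffix(word):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "the indexing of retrieved documents".split()
terms = [strip_suffix(t) for t in remove_stop_words(tokens)]
print(terms)  # ['index', 'retriev', 'document']
```

Stems such as "retriev" need not be real words. It is enough that "retrieved", "retrieving" and "retrieves" all map to the same index term.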
Group Nouns
Nouns carry the most meaning. Use groups of adjacent nouns as index terms.
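A minimal sketch of noun grouping, assuming the text has already been part-of-speech tagged by some other tool. The `(word, tag)` input format and the tag names "NOUN", "DET" and "VERB" are assumptions made for this example. A lone noun is kept here as a group of one.

```python
def noun_groups(tagged_tokens):
    """Join each run of adjacent nouns into a single index term."""
    groups, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NOUN":
            current.append(word)
        else:
            if current:
                groups.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        groups.append(" ".join(current))
    return groups


# Hypothetical pre-tagged sentence.
tagged = [
    ("the", "DET"),
    ("information", "NOUN"),
    ("retrieval", "NOUN"),
    ("system", "NOUN"),
    ("ranks", "VERB"),
    ("documents", "NOUN"),
]
print(noun_groups(tagged))  # ['information retrieval system', 'documents']
```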