Introduction - Text Representation and Boolean Model Flashcards
Primary task of an IR system
Retrieve documents with content that is relevant to a user’s information need.
What IR is not
An IR system is not a database management system. A DBMS stores and processes well-defined data, and a search within it is exact and deterministic, while search in an IR system is probabilistic.
Bag of Words
A document is represented as a collection of words treated as independent units, with word order ignored.
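As a minimal sketch of this representation, a document can be reduced to a mapping from word to count. The function name `bag_of_words` and the lower-casing plus whitespace splitting are assumptions made for illustration, not part of the card.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map each lower-cased, whitespace-separated word to its count."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True: the two sentences differ only in word order
```

The two sentences mean different things but produce the same bag, which is exactly the information this model throws away.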
Coordinate Matching
Document relevance is measured by the number of query terms appearing in a document. Each term provides a dimension, and the length along a dimension is either 0 or 1. The similarity measure is the dot product of the query and document vectors. This does not consider the frequency of query terms in documents.
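The dot product of two 0/1 vectors is just the number of terms they share, so coordinate matching can be sketched with set intersection. The name `coordinate_match` and the whitespace tokenisation are illustrative assumptions.

```python
def coordinate_match(query: str, document: str) -> int:
    """Count the distinct query terms that appear in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    # Dot product of two 0/1 vectors equals the size of the intersection.
    return len(query_terms & doc_terms)

score_once = coordinate_match("cheap flights", "cheap flights to rome")
score_many = coordinate_match("cheap flights", "cheap cheap cheap flights")
print(score_once, score_many)  # 2 2
```

Both documents score 2, even though the second repeats a query term three times, which illustrates why term frequency is ignored under this measure.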
Term Frequency
Weight each term by how often it occurs in the document.
Inverse Document Frequency
Weight terms proportionally to the reciprocal of the number of documents they appear in.
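Term frequency and inverse document frequency are often multiplied into a single weight. The sketch below assumes a tiny made-up collection and uses log(N / df) for the inverse document frequency. The logarithm is a common damping choice and an assumption here, since the card only says the weight follows the reciprocal of the document count. The names `docs`, `idf` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document collection, tokenised on whitespace.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
    "d3": "the cat chased the dog",
}
tokenised = {doc_id: text.split() for doc_id, text in docs.items()}

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenised.values() for term in set(tokens))
n_docs = len(docs)

def idf(term: str) -> float:
    """log(N / df); only defined for terms that occur in the collection."""
    return math.log(n_docs / df[term])

def tf_idf(term: str, doc_id: str) -> float:
    """Raw count of the term in the document multiplied by the term's IDF."""
    return tokenised[doc_id].count(term) * idf(term)

print(tf_idf("the", "d1"))        # 0.0: "the" occurs in every document
print(idf("mat") > idf("cat"))    # True: "mat" is rarer than "cat"
```

A term that appears in every document gets weight zero no matter how often it occurs, while a term confined to one document gets the largest weight.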
Document Length
The similarity measure should be normalised to prevent a document from getting a high score simply because of its length.
Vector Space Similarity
Cosine of the angle between the query and document vectors.
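A minimal sketch of this measure over raw term-count vectors follows. The function name `cosine_similarity` and the whitespace tokenisation are assumptions, and a real system would typically use weighted vectors rather than raw counts.

```python
import math
from collections import Counter

def cosine_similarity(query: str, document: str) -> float:
    """Cosine of the angle between term-count vectors of query and document."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty query or document matches nothing
    return dot / (norm_q * norm_d)

short_doc = cosine_similarity("cat dog", "cat dog")
long_doc = cosine_similarity("cat dog", "cat dog cat dog")
```

Dividing by the vector lengths means `short_doc` and `long_doc` come out the same (up to floating-point rounding): repeating a document's content makes it longer but does not change the angle.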
Document
An item which may satisfy the user’s information need.
Query
Representation of user’s information need.
Term
Any word or phrase that can serve as a link to a document.
Inverted File
Keep the following information for each term:
- IDs of the documents in which the term occurs
- Frequency of occurrence of the term in each of those documents
- Possibly: offsets of the term within each document
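The three pieces of information above can be held in one nested mapping, sketched here under stated assumptions: offsets are token positions rather than character positions, the frequency is recovered as the length of the offset list, and `build_inverted_file` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_file(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map term -> {doc_id -> list of token offsets of the term in that doc}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return dict(index)

index = build_inverted_file({"d1": "to be or not to be", "d2": "to do"})
postings = index["to"]                  # {"d1": [0, 4], "d2": [0]}
frequency_in_d1 = len(postings["d1"])   # 2
```

Looking up a term returns only the documents that contain it, so a query never has to scan documents that cannot match.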
Tokenisation
Dividing a character stream into a sequence of distinct word forms (tokens). Separate on white-space, end of sentence punctuation, bracketing, hyphenation, apostrophes & slashes.
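One simple way to separate on all of these characters at once is to keep runs of letters and digits and treat everything else as a separator. The pattern below is an assumption for illustration and handles ASCII text only.

```python
import re

# Runs of ASCII letters or digits form tokens; everything else separates them.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")

def tokenise(text: str) -> list[str]:
    """Split a character stream into a list of word-form tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenise("Don't panic (yet): state-of-the-art and/or well-known.")
print(tokens)
# ['Don', 't', 'panic', 'yet', 'state', 'of', 'the', 'art',
#  'and', 'or', 'well', 'known']
```

Splitting on apostrophes and hyphens is a blunt rule: "Don't" becomes two tokens and "state-of-the-art" becomes four, which may or may not be what a given collection needs.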
Stop Word
High-frequency word which is not useful for distinguishing between documents.
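Stop words are usually dropped before indexing. The sketch below uses a tiny hand-picked stop list, which is hypothetical. Real lists are larger and depend on the language and the collection.

```python
# Hypothetical, deliberately tiny stop list for illustration.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens whose lower-cased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words(["The", "history", "of", "the", "Roman", "Empire"])
print(kept)  # ['history', 'Roman', 'Empire']
```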
Equivalence Classes
It can be useful to group tokens into equivalence classes and treat every term in a class as the same term. This reduces the size of the index and may improve retrieval, and the combined frequencies may reflect content better than the individual frequencies.
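As an illustrative sketch, tokens can be mapped to a class representative by case folding and a deliberately crude plural rule. The function `normalise` and its rule are assumptions, and a real system would more likely use an established stemmer or lemmatiser.

```python
from collections import Counter

def normalise(token: str) -> str:
    """Case-fold, then strip one trailing plural 's' (a crude, assumed rule)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

tokens = ["Cats", "cat", "CAT", "glass", "Glass"]
raw_counts = Counter(tokens)                           # 5 distinct terms
class_counts = Counter(normalise(t) for t in tokens)   # 2 distinct terms
print(class_counts["cat"], class_counts["glass"])      # 3 2
```

Five index entries collapse to two, and the combined count of 3 for "cat" says more about the content than three separate counts of 1.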