01 Boolean retrieval Flashcards
Information Retrievel
IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
Index
The way to avoid linearly scanning the text for each query is to index the docs in advance
Terms
Are the indexed units
Boolean retreival model
Is a model for information retrieval in which we can pose any query which is in the from of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR and NOT. The model views each document as just a set of words.
Collection (Corpus)
Group of documents over which we perform retrieval task.
Ad hoc retrieval
An IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary used information need.
Effectiveness
Two key factors:
Precision
What fraction of the returned results are relevant to the information need?
Recall
What fraction of the relevant documents in the collection were returned by the system?
Inverted index (II, inverted file)
We keep a dictionary of terms. For each term, we have a list that records which documents the term occurs in
Posting
Each item in a Inverted Index records the docID of the term.
Posting lists
The list of postings is a posting list
Postings
All posting lists put together.
Dictionary
All terms in the II put together
Sorting
Core idea of II. Terms must be alphabetically ordered.
Posting lists intersection (merging posting list)
Must intersect postings lists to find docs that contain both terms.
Common intersection algorithm:
- Maintain pointers to each PL, starting at the beginning
- Compare the docID pointed to by the pointers
- If they are the same, put posting in result list and advance both pointers 4. Otherwise, advance the pointer pointing to the smaller docID and re- peat from step 2.
O(x + y) operations where x and y are the length of the postinglist, but formally the complexity is O(N) where N is the size of the corpus.
Important! To use this algorithm, the PLs must be sorted by a single global ordering.
Query optimization
Is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the systen