01 Boolean Retreival Flashcards
Why is grep not the solution?
- Slow (for large collections)
- grep is line-oriented, IR is document-oriented
- Other operations (e.g., find the word Romans near
countryman) not feasible
Why incidence matrix is not good?
- Matrix is extremely sparse.
What is a better representation than incidence matrix?
Inverted index: only record the 1s
What is inverted index?
- For each term t, we store a list of all documents that contain t.
- Consists of dictionary and postings
How is inverted index constructed?
1.Collect the documents
2. Tokenize
3. Linguistic preprocessing
4. Index the documents > sort > create posting list and count document frequency
What is the runtime for conjuctive query (AND) ?
linear in the sum length of the postings lists
only works if postings lists are sorted
What are boolean queries?
- queries that use and, or and not to join
query terms - Views each document as a set of terms
- Is precise: Document matches condition or not
Name commercially successful Boolean retrieval systems
- Westlaw
- PubMed
What is the best order for processing a query
that is an and of n terms, n > 2?
Example query: Brutus AND Calpurnia AND Caesar
Process in order of increasing frequency
Start with the shortest postings list, then keep cutting further
How to process :
(madding or crowd) and (ignoble or strife) and. . .
- Get frequencies for all terms
- Estimate the size of each or by the sum of its frequencies (conservative)
- Process in increasing order of or sizes