Query Processing with Inverted Index Flashcards
What is the process for inverted index construction?
- Documents to be indexed are fed to a tokenizer
- Tokenizer creates a list of tokens and passes it to linguistic modules
- Linguistic modules modify the tokens and pass to the indexer
- The indexer creates the index structure using the tokens as dictionary keys
What are the initial stages of text processing?
- Tokenization
- Normalization
- Stemming
- Removing stop words
What is tokenization?
The process of breaking a stream of text into individual units or tokens
What are some common issues for tokenization?
- Ambiguity: Words have multiple meanings
- Contractions: Could be split into multiple tokens
- Punctuation: Placement can affect tokenization
- Compound words: Some languages have long compound words that are hard to tokenize
- Hyphenated words: could be split into multiple tokens
- URLs and email addresses: Can contain special characters that affect tokenization
- Spelling errors: can cause incorrect tokenization
What is normalization?
Mapping text and query terms to the same form
Ex: You want U.S.A and USA to match
What is stemming?
Reducing terms to their roots before indexing
Ex: automate, automatic both go to automat
What are stop words?
Common words we may want to omit like the, a, to, of
What is the first indexer step?
Token sequence
Output is a sequence of (modified token, doc ID) pairs
What is the second indexer step?
Sorting pairs first by term and then by doc ID
What is the third indexer step?
Removing duplicates, creating dictionary and postings list, adding document frequency information if needed.
Number of docs containing term
How can you optimize boolean retrieval queries?
Process postings lists in order of increasing doc frequency to reduce total number of comparisons needed
How can you estimate the output document frequency of a boolean operation?
For AND: minimum document frequency for 2 terms
For OR: sum of document frequencies