Query Processing with Inverted Index Flashcards

Question 1

Q

What is the process for inverted index construction?

Answer

A

Documents to be indexed are fed to a tokenizer
Tokenizer creates a list of tokens and passes it to linguistic modules
Linguistic modules modify the tokens and pass them to the indexer
The indexer creates the index structure using the tokens as dictionary keys
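
The four stages above can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: the function names (tokenize, linguistic_modules, build_index), the regex-based splitting, and the lowercase-only linguistic step are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumption: tokens are runs of ASCII letters and digits.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def linguistic_modules(tokens):
    # Stand-in for normalization and stemming: this sketch only lowercases.
    return [t.lower() for t in tokens]

def build_index(docs):
    # The modified tokens become dictionary keys; values are sorted doc ID lists.
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(linguistic_modules(tokenize(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(index)

docs = {1: "Friends, Romans, countrymen.", 2: "Romans lend me your ears"}
index = build_index(docs)
```

Here index["romans"] lists both documents, while index["friends"] lists only document 1.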

Question 2

Q

What are the initial stages of text processing?

Answer

A

Tokenization
Normalization
Stemming
Removing stop words

Question 3

Q

What is tokenization?

Answer

A

The process of breaking a stream of text into individual units or tokens
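
A minimal sketch of tokenization, assuming that a token is simply a run of word characters. Real tokenizers use more careful rules; the name tokenize is hypothetical.

```python
import re

def tokenize(text):
    # \w+ matches runs of letters, digits, and underscores;
    # punctuation and whitespace act as separators.
    return re.findall(r"\w+", text)

tokens = tokenize("Friends, Romans, countrymen, lend me your ears.")
```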

Question 4

Q

What are some common issues for tokenization?

Answer

A

Ambiguity: Words have multiple meanings
Contractions: Could be split into multiple tokens
Punctuation: Placement can affect tokenization
Compound words: Some languages have long compound words that are hard to tokenize
Hyphenated words: Could be split into multiple tokens
URLs and email addresses: Can contain special characters that affect tokenization
Spelling errors: Can cause incorrect tokenization
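
Several of these issues show up as soon as a naive tokenizer meets real text. The sketch below assumes a deliberately simple rule (split on anything that is not a word character) to show how contractions, hyphenated words, and email addresses get broken apart; naive_tokenize is a hypothetical name.

```python
import re

def naive_tokenize(text):
    # Treats every non-word character as a separator.
    return re.findall(r"\w+", text)

contraction = naive_tokenize("don't")            # apostrophe splits the word
hyphenated = naive_tokenize("state-of-the-art")  # hyphens split the phrase
email = naive_tokenize("user@example.com")       # '@' and '.' split the address
```

Whether "state-of-the-art" should be one token or four depends on the application, which is why tokenization rules are usually tuned per collection.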

Question 5

Q

What is normalization?

Answer

A

Mapping text and query terms to the same form
Ex: You want U.S.A. and USA to match
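
One simple way to get this behavior is to apply the same rule to indexed text and to query terms. The rule below (drop periods, then lowercase) is an assumption chosen for the example, and normalize is a hypothetical name.

```python
def normalize(token):
    # Remove periods and fold case so that variant spellings collapse.
    return token.replace(".", "").lower()

indexed_form = normalize("U.S.A.")
query_form = normalize("USA")
```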

Question 6

Q

What is stemming?

Answer

A

Reducing terms to their roots before indexing
Ex: automate and automatic both reduce to automat
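
A toy suffix stripper is enough to reproduce this example. It is not the Porter stemmer or any other published algorithm: the suffix list and the minimum stem length of 3 are assumptions made purely for illustration.

```python
# Hypothetical suffix list, checked in order; real stemmers use many more rules.
SUFFIXES = ("ion", "ic", "es", "e", "s")

def crude_stem(term):
    for suffix in SUFFIXES:
        # Keep at least 3 characters so short words are not destroyed.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

stems = [crude_stem(t) for t in ("automate", "automatic", "automation")]
```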

Question 7

Q

What are stop words?

Answer

A

Common words we may want to omit from the index, such as the, a, to, of
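
A minimal sketch of stop word removal, assuming the stop list is just the four words from this card. Practical stop lists are longer and collection-specific.

```python
STOP_WORDS = {"the", "a", "to", "of"}  # assumed list, taken from the card

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words(["The", "index", "of", "a", "collection"])
```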

Question 8

Q

What is the first indexer step?

Answer

A

Generating the token sequence
The output is a sequence of (modified token, doc ID) pairs
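
As a sketch, the pairs can be produced by walking each document and emitting one pair per token occurrence. Whitespace splitting and lowercasing stand in for the real tokenizer and linguistic modules here; token_pairs is a hypothetical name.

```python
def token_pairs(docs):
    pairs = []
    for doc_id, text in docs.items():
        # One (term, doc ID) pair per token occurrence, in document order.
        for token in text.lower().split():
            pairs.append((token, doc_id))
    return pairs

docs = {1: "i did enact julius caesar", 2: "so let it be with caesar"}
pairs = token_pairs(docs)
```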

Question 9

Q

What is the second indexer step?

Answer

A

Sorting pairs first by term and then by doc ID
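
In Python, tuples compare element by element, so sorting a list of (term, doc ID) tuples already gives this order. The sample pairs below are made up for the example.

```python
pairs = [("caesar", 2), ("i", 1), ("caesar", 1), ("brutus", 2), ("i", 1)]

# Tuples sort by term first, then by doc ID as the tie-breaker.
sorted_pairs = sorted(pairs)
```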

Question 10

Q

What is the third indexer step?

Answer

A

Removing duplicate pairs, creating the dictionary and postings lists, and adding document frequency information if needed
Document frequency: the number of docs containing the term
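
Once the pairs are sorted, equal terms sit next to each other, so this step can be sketched with a single grouping pass. The layout of each dictionary entry (a dict with "df" and "postings" keys) is an assumption made for the example, not a standard structure.

```python
from itertools import groupby

def build_postings(sorted_pairs):
    index = {}
    # groupby relies on the pairs already being sorted by term.
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        # A set drops duplicate doc IDs; sorting restores doc ID order.
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = {"df": len(postings), "postings": postings}
    return index

sorted_pairs = [("brutus", 2), ("caesar", 1), ("caesar", 2), ("i", 1), ("i", 1)]
index = build_postings(sorted_pairs)
```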

Question 11

Q

How can you optimize boolean retrieval queries?

Answer

A

For AND queries, process postings lists in order of increasing doc frequency: starting with the shortest lists keeps intermediate results small and reduces the total number of comparisons needed
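
For a conjunctive (AND) query this can be sketched as a linear merge of sorted postings lists, applied shortest list first. The function names and the sample postings are assumptions, and list length is used as the document frequency.

```python
def intersect(p1, p2):
    # Linear merge of two postings lists sorted by doc ID.
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_many(postings_lists):
    if not postings_lists:
        return []
    # Shortest list first: the running result can never grow beyond it.
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result stays empty
        result = intersect(result, postings)
    return result

hits = intersect_many([[1, 2, 3, 4, 5, 6, 8], [2, 4, 8], [1, 2, 8, 16]])
```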

Question 12

Q

How can you estimate the output document frequency of a boolean operation?

Answer

A

For AND: at most the smaller of the two terms' document frequencies (an upper bound)
For OR: at most the sum of the two terms' document frequencies (an upper bound)
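
Both estimates are conservative upper bounds on the result size, which is what makes them useful for ordering sub-expressions of a larger query. A minimal sketch, with hypothetical function names and made-up frequencies:

```python
def estimate_and(df1, df2):
    # An intersection cannot be larger than its smaller input.
    return min(df1, df2)

def estimate_or(df1, df2):
    # A union is largest when the two postings lists share no documents.
    return df1 + df2

and_size = estimate_and(100, 30)
or_size = estimate_or(100, 30)
```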

(12 cards)