Query Processing with Inverted Index Flashcards

1
Q

What is the process for inverted index construction?

A
  1. Documents to be indexed are fed to a tokenizer
  2. Tokenizer creates a list of tokens and passes it to linguistic modules
  3. Linguistic modules modify the tokens and pass to the indexer
  4. The indexer creates the index structure using the tokens as dictionary keys
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the initial stages of text processing?

A
  1. Tokenization
  2. Normalization
  3. Stemming
  4. Removing stop words
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is tokenization?

A

The process of breaking a stream of text into individual units or tokens

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are some common issues for tokenization?

A
  1. Ambiguity: Words have multiple meanings
  2. Contractions: Could be split into multiple tokens
  3. Punctuation: Placement can affect tokenization
  4. Compound words: Some languages have long compound words that are hard to tokenize
  5. Hyphenated words: could be split into multiple tokens
  6. URLs and email addresses: Can contain special characters that affect tokenization
  7. Spelling errors: can cause incorrect tokenization
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is normalization?

A

Mapping text and query terms to the same form
Ex: You want U.S.A and USA to match

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is stemming?

A

Reducing terms to their roots before indexing
Ex: automate, automatic both go to automat

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are stop words?

A

Common words we may want to omit like the, a, to, of

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the first indexer step?

A

Token sequence
Output is a sequence of (modified token, doc ID) pairs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the second indexer step?

A

Sorting pairs first by term and then by doc ID

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is the third indexer step?

A

Removing duplicates, creating dictionary and postings list, adding document frequency information if needed.
Number of docs containing term

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

How can you optimize boolean retrieval queries?

A

Process postings lists in order of increasing doc frequency to reduce total number of comparisons needed

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How can you estimate the output document frequency of a boolean operation?

A

For AND: minimum document frequency for 2 terms
For OR: sum of document frequencies

How well did you know this?
1
Not at all
2
3
4
5
Perfectly