02 Term Vocabularies and Postings Lists Flashcards
1
Q
Why can a simple Boolean retrieval system be hard to build in practice?
A
- Documents come in different formats and languages
- The document unit for indexing is a design decision:
paragraph? sentence? XML element?
2
Q
What is the definition of a token?
A
- An instance of a character sequence in a document
- Closely related to a word
3
Q
What is the definition of a type?
A
- An equivalence class of tokens (all tokens with the same character sequence)
- Related to a term: a term is a normalized type
4
Q
I like the coffee, coffees, and the shop.
How many tokens? Types? Terms?
A
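- 11 tokens (8 words plus 3 punctuation marks)
- 9 types ("the" and "," each occur twice)
- 8 terms ("coffee" and "coffees" normalize to one term)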
5
Q
In June, the dog likes to chase the cat in the barn.
How many tokens? Types? Terms?
A
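- 14 tokens (12 words plus 2 punctuation marks)
- 12 types ("the" occurs three times)
- 11 terms ("In" and "in" normalize to one term)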
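The counts on the two counting cards above work out if punctuation marks count as tokens, types are case-sensitive, and terms are produced by case folding plus plural stripping. The sketch below checks that arithmetic. The names `tokenize`, `normalize`, and `count_units` are hypothetical, and the plural stripping is a crude stand-in for a real stemmer, assumed only for illustration.

```python
import re


def tokenize(text):
    # Each word and each individual punctuation mark becomes one token.
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(token):
    # Assumed normalization: case folding plus naive plural stripping.
    # A real system would use a proper stemmer or lemmatizer instead.
    t = token.lower()
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


def count_units(text):
    tokens = tokenize(text)
    types = set(tokens)                      # distinct character sequences
    terms = {normalize(t) for t in tokens}   # distinct normalized types
    return len(tokens), len(types), len(terms)


print(count_units("I like the coffee, coffees, and the shop."))
# (11, 9, 8)
print(count_units("In June, the dog likes to chase the cat in the barn."))
# (14, 12, 11)
```

A different convention (dropping punctuation, or removing stop words) would give different counts, so the answer always depends on the tokenizer and normalizer chosen.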
6
Q
Give examples of tokenization problems.
A
- one word or two? (York University vs. New York University)
- number format
- no whitespace (Chinese, Japanese)
- ambiguous segmentation (Chinese)
- compound words (German)
- bidirectionality (Arabic)
- accents and umlauts
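A few of the problems listed above can be seen with Python's standard library alone. This is a minimal sketch, not a recommended tokenizer: `strip_accents` is a hypothetical helper, and removing diacritics this way is only one possible normalization choice.

```python
import unicodedata


def strip_accents(text):
    # Decompose characters, then drop the combining marks (accents, umlauts).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# One word or two: whitespace splitting breaks the name into three tokens,
# so "York University" also appears inside "New York University".
print("New York University".split())
# ['New', 'York', 'University']

# No whitespace: a Japanese sentence comes back as a single token.
print("東京大学に行く".split())
# ['東京大学に行く']

# Accents and umlauts: stripping diacritics maps variants to one form.
print(strip_accents("résumé Universität"))
# resume Universitat
```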