Tokenization Flashcards
What is a document?
Unit for storage and retrieval in IR systems.
IR systems store and return documents (document identifiers, links, or indices).
The actual document type is chosen by the system designer (e.g., whole books, chapters, pages).
Why is Document Type Relevant?
Scoring: Document length (the total number of words) affects how terms are weighted when scoring documents.
Showing results: Document size matters when returning links or identifiers to users.
Metadata
Files often come with metadata indicating encoding and language.
Useful for scoring and metadata search.
HTML meta tags can provide additional information.
Words, Tokens, and Terms
Word: Delimited string of characters in the document or query.
Term: Unique normalized word, forming an equivalence class of words.
Tokenizer: Tool normalizing words into terms.
Token: Instance of a term in the document or query.
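The four definitions above can be sketched in Python, assuming a toy whitespace tokenizer and lowercasing plus punctuation stripping as the normalizer:

```python
# Sketch: words -> tokens -> terms, using a toy normalizer.
from collections import Counter

def tokenize(text):
    """Split text into words (delimited character strings)."""
    return text.split()

def normalize(word):
    """Map a word to its term (the equivalence-class representative)."""
    return word.lower().strip(".,!?")

doc = "The cat saw the Cat."
words = tokenize(doc)                   # ['The', 'cat', 'saw', 'the', 'Cat.']
tokens = [normalize(w) for w in words]  # each token is an instance of a term
terms = Counter(tokens)                 # unique terms with their token counts

print(terms)  # Counter({'the': 2, 'cat': 2, 'saw': 1})
```

Here "The", "the", and "Cat." are three distinct words, but they map to only two terms ("the", "cat"), each occurring as two tokens.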
Finding Words
In linguistics, a word is defined as an indivisible unit of speech sounds.
Compound words may be open, closed, or hyphenated.
Diacritics may be kept or removed based on user behavior.
ex. “Search engine” is an open compound word, “airport” a closed compound word, and “check-in” a hyphenated compound word.
Normalization
Process of transforming text into a canonical form.
Includes case folding, diacritic removal, contraction substitution, acronym expansion, and abbreviation expansion.
Applied to both documents and queries.
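A minimal sketch of a few of these steps (case folding, diacritic removal, and contraction substitution; the contraction table here is illustrative, not exhaustive):

```python
import unicodedata

# Illustrative contraction table (a real system would use a larger one).
CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def remove_diacritics(text):
    # Decompose characters (é -> e + combining accent), then drop the accents.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text):
    text = text.lower()             # case folding
    text = remove_diacritics(text)  # résumé -> resume
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return text

print(normalize("Don't file the Résumé"))  # do not file the resume
```

Because the same function is applied to documents at index time and to queries at search time, "Résumé" and "resume" end up in the same equivalence class.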
Lemmatization and Stemming
Lemmatization: Reducing a word to its dictionary headword form.
Stemming: Removing suffixes to achieve similar effects.
Both increase recall, typically at some cost to precision.
Porter Stemmer
Well-known rule-based stemmer with five steps, each consisting of suffix-rewriting rules.
Example (Step 1a): rules handling plurals, e.g., SSES → SS, IES → I, S → ε.
Applied in sequence.
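A minimal sketch of Step 1a's plural rules, applied first-match-wins (the full algorithm adds measure conditions and four more steps):

```python
# Simplified Porter Step 1a: ordered plural-suffix rules;
# only the first matching rule fires.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("caress"))    # caress
print(step_1a("cats"))      # cat
```

Note the seemingly redundant SS → SS rule: it exists to stop "caress" from falling through to the bare S → ε rule and losing its final "s".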
Lemmatization vs. Stemming
Lemmatization: Reverses inflection, always produces a word, requires linguistic knowledge.
Stemming: Fast, crude, does not distinguish inflection from derivation.
When Stemming Can Help/Hurt
Help: Works best on short words with short suffixes, where a single rule application suffices.
Hurt: Performs poorly on ambiguous lemmas and on words with many corner cases.
Content Words and Function Words
Content Words: Have semantic content, contribute to sentence meaning.
Function Words: Have little substantive meaning, denote grammatical relationships.
Stopwords are common function words.
Stopwords
Early IR systems had lists of stopwords to save disk space.
Today, stopwords are stored as they can be useful in phrase queries.
Stopwords are mostly function words.
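A sketch of why discarding stopwords hurts phrase queries, assuming a small illustrative stopword list:

```python
# Illustrative stopword list (real lists are longer).
STOPWORDS = {"to", "be", "or", "not", "the", "of", "a"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

query = "to be or not to be".split()
print(remove_stopwords(query))  # [] -- the entire phrase query disappears
```

With stopwords stripped at index time, a famous phrase query like this one cannot be answered at all, which is one reason modern systems keep stopwords in the index.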
Normalization vs. Tokenization
Normalization: Standardizes text representation for consistency.
Tokenization: Breaks text into meaningful units (tokens) for analysis.
In short: tokenization segments the text; normalization makes its representation consistent.