Tokenization Flashcards
What is a document?
Unit for storage and retrieval in IR systems.
IR systems store and return documents (document identifiers, links, or indices).
The actual document type is chosen by the system designer (e.g., whole books, chapters, pages).
Why is Document Type Relevant?
Scoring: Document length (the total number of words) affects how terms are weighted when scoring documents.
Showing results: Document size matters when returning links or identifiers to users.
Metadata
Files often come with metadata indicating encoding and language.
Useful for scoring and metadata search.
HTML meta tags can provide additional information.
Words, Tokens, and Terms
Word: Delimited string of characters in the document or query.
Term: Unique normalized word, forming an equivalence class of words.
Tokenizer: Tool normalizing words into terms.
Token: Instance of a term in the document or query.
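The four definitions above can be sketched in Python, assuming a toy whitespace tokenizer and lowercasing plus punctuation stripping as the normalizer:

```python
# Sketch: words -> tokens -> terms, using a toy normalizer.
from collections import Counter

def tokenize(text):
    """Split text into words (delimited character strings)."""
    return text.split()

def normalize(word):
    """Map a word to its term (the equivalence-class representative)."""
    return word.lower().strip(".,!?")

doc = "The cat saw the Cat."
words = tokenize(doc)                   # ['The', 'cat', 'saw', 'the', 'Cat.']
tokens = [normalize(w) for w in words]  # each token is an instance of a term
terms = Counter(tokens)                 # unique terms with their token counts

print(terms)  # Counter({'the': 2, 'cat': 2, 'saw': 1})
```

Here "The", "the", and "Cat." are three distinct words, but they map to only two terms ("the", "cat"), each occurring as two tokens.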
Finding Words
In linguistics, a word is defined as an indivisible unit of speech sounds.
Compound words may be open, closed, or hyphenated.
Diacritics may be kept or removed based on user behavior.
ex. “Search engine” is an open compound word, “airport” a closed compound word, and “check-in” a hyphenated compound word.
Normalization
Process of transforming text into a canonical form.
Includes case folding, diacritic removal, contraction substitution, acronym expansion, and abbreviation expansion.
Applied to both documents and queries.
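A minimal sketch of a few of these steps (case folding, diacritic removal, and contraction substitution; the contraction table here is illustrative, not exhaustive):

```python
import unicodedata

# Illustrative contraction table (a real system would use a larger one).
CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def remove_diacritics(text):
    # Decompose characters (é -> e + combining accent), then drop the accents.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text):
    text = text.lower()             # case folding
    text = remove_diacritics(text)  # résumé -> resume
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return text

print(normalize("Don't file the Résumé"))  # do not file the resume
```

Because the same function is applied to documents at index time and to queries at search time, "Résumé" and "resume" end up in the same equivalence class.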
Lemmatization and Stemming
Lemmatization: Reducing a word to its dictionary headword form.
Stemming: Removing suffixes to achieve similar effects.
Both increase recall, typically at some cost to precision.
Porter Stemmer
Well-known rule-based stemmer with five steps, each consisting of suffix-rewriting rules.
Example (Step 1a): rules handling plurals, e.g., SSES → SS, IES → I, S → ε.
Applied in sequence.
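A minimal sketch of Step 1a's plural rules, applied first-match-wins (the full algorithm adds measure conditions and four more steps):

```python
# Simplified Porter Step 1a: ordered plural-suffix rules;
# only the first matching rule fires.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("caress"))    # caress
print(step_1a("cats"))      # cat
```

Note the seemingly redundant SS → SS rule: it exists to stop "caress" from falling through to the bare S → ε rule and losing its final "s".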
Lemmatization vs. Stemming
Lemmatization: Reverses inflection, always produces a word, requires linguistic knowledge.
Stemming: Fast, crude, does not distinguish inflection from derivation.
When Stemming Can Help/Hurt
Help: Works best on short words with short suffixes, where a single rule application suffices.
Hurt: Performs poorly on ambiguous lemmas and on words with many corner cases.
Content Words and Function Words
Content Words: Have semantic content, contribute to sentence meaning.
Function Words: Have little substantive meaning, denote grammatical relationships.
Stopwords are common function words.
Stopwords
Early IR systems had lists of stopwords to save disk space.
Today, stopwords are stored as they can be useful in phrase queries.
Stopwords are mostly function words.
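A sketch of why discarding stopwords hurts phrase queries, assuming a small illustrative stopword list:

```python
# Illustrative stopword list (real lists are longer).
STOPWORDS = {"to", "be", "or", "not", "the", "of", "a"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

query = "to be or not to be".split()
print(remove_stopwords(query))  # [] -- the entire phrase query disappears
```

With stopwords stripped at index time, a famous phrase query like this one cannot be answered at all, which is one reason modern systems keep stopwords in the index.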
Normalization vs. Tokenization
Normalization: Standardizes text representation for consistency.
Tokenization: Breaks text into meaningful units (tokens) for analysis.
In short: tokenization segments the text; normalization makes its representation consistent.