Tokenization Flashcards

1
Q

What is a document?

A

Unit for storage and retrieval in IR systems.
IR systems store documents and return them to users, typically as document identifiers, links, or index entries.
The actual document type is chosen by the system designer (e.g., whole books, chapters, pages).

2
Q

Why is Document Type Relevant?

A

Scoring: Document length (the total number of words) affects how terms are scored against a document.
Showing Results: The size of the document matters when returning links or IDs to users.
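To illustrate the scoring point, here is a minimal sketch that assumes a simple length-normalized term frequency. The function name `relative_tf` and the sample texts are hypothetical, and real systems use more elaborate weighting schemes.

```python
def relative_tf(term, document):
    """Occurrences of `term` divided by the document's total word count."""
    words = document.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# The same sentence treated as its own document (a "page") and as a
# small part of a much longer document (a "chapter").
page = "the cat sat on the mat"
chapter = page + " " + "and then many other things happened " * 10

page_score = relative_tf("cat", page)
chapter_score = relative_tf("cat", chapter)

# One occurrence counts for more in the short document than in the long one.
print(page_score > chapter_score)
```

Under this assumption, choosing pages rather than chapters as the document unit changes the score of the same occurrence.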

3
Q

Metadata

A

Files often come with metadata indicating encoding and language.
This metadata is useful for scoring and for metadata search.
HTML meta tags can provide additional information.
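As a rough sketch of how encoding and language metadata might be read from an HTML page, the example below uses Python's standard `html.parser` module. The class name `MetaExtractor` and the sample page are hypothetical, and a production crawler would also consult HTTP headers.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the declared language, charset, and named meta tags."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.charset = None
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and "lang" in attrs:
            self.lang = attrs["lang"]
        elif tag == "meta":
            if "charset" in attrs:
                self.charset = attrs["charset"]
            elif "name" in attrs and "content" in attrs:
                self.meta[attrs["name"].lower()] = attrs["content"]


sample = """<html lang="en"><head>
<meta charset="utf-8">
<meta name="description" content="Notes on tokenization">
</head><body></body></html>"""

extractor = MetaExtractor()
extractor.feed(sample)
print(extractor.lang, extractor.charset, extractor.meta)
```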

4
Q

Words, Tokens, and Terms

A

Word: Delimited string of characters in the document or query.
Term: Unique normalized word, forming an equivalence class of words.
Tokenizer: Tool that segments text into words and normalizes them into terms.
Token: Instance of a term in the document or query.
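The distinction between tokens and terms can be sketched as follows. This assumes a deliberately simple tokenizer (alphabetic runs, lowercased); the function name `tokenize` and the sample sentence are hypothetical.

```python
import re


def tokenize(text):
    """Split text into words and normalize each one by lowercasing."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w.lower() for w in words]


text = "The cat saw the other cat"

tokens = tokenize(text)      # every occurrence counts
terms = sorted(set(tokens))  # each equivalence class counts once

print(len(tokens), tokens)
print(len(terms), terms)
```

Here "The" and "the" are two tokens of the same term, so the sentence has six tokens but only four terms.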

5
Q

Finding Words

A

In linguistics, a word is defined as an indivisible set of speech sounds.
Compound words may be open, closed, or hyphenated.
Diacritics may be kept or removed depending on user behavior (e.g., whether users actually type them in queries).

Example: “search engine” is an open compound word, “airport” is a closed compound word, and “mother-in-law” is a hyphenated compound word.

6
Q

Normalization

A

Process of transforming text into a canonical form.
Includes case folding, diacritic removal, contraction substitution, acronym expansion, and abbreviation expansion.
Applied to both documents and queries.
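A few of the normalization steps listed above can be sketched with the standard library. This is a minimal illustration, not a complete normalizer: the `CONTRACTIONS` table and the function names are hypothetical, and acronym and abbreviation expansion are omitted.

```python
import unicodedata

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}


def strip_diacritics(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    text = text.lower()                    # case folding
    text = strip_diacritics(text)          # diacritic removal
    words = [CONTRACTIONS.get(w, w) for w in text.split()]
    return " ".join(words)                 # contraction substitution


document = "Don't visit the Café"
query = "do not visit the cafe"

# The same function is applied to both sides so that they can match.
print(normalize(document) == normalize(query))
```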

7
Q

Lemmatization and Stemming

A

Lemmatization: Reducing a word to its dictionary headword form.
Stemming: Heuristically stripping suffixes to approximate the effect of lemmatization.
Both tend to increase recall, often at some cost to precision.

8
Q

Porter Stemmer

A

Well-known English stemmer with five steps, each consisting of a set of suffix rules.
Example: Step 1a deals with plurals.
The steps are applied in sequence.
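Step 1a can be sketched in a few lines. The four suffix rules below follow the original Porter algorithm (SSES to SS, IES to I, SS to SS, S to nothing), with the first matching rule winning; the function name `porter_step_1a` is hypothetical, and the remaining steps are not shown.

```python
# Rules are listed longest suffix first; only the first match is applied.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects "ss" from the next rule)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that "poni" is not a dictionary word; a stemmer only needs to map related forms to the same string.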

9
Q

Lemmatization vs. Stemming

A

Lemmatization: Reverses inflection, always produces a word, requires linguistic knowledge.
Stemming: Fast, crude, does not distinguish inflection from derivation.
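The contrast can be made concrete with two toy functions. Both are hypothetical illustrations: the `LEMMAS` lookup table stands in for the linguistic knowledge a real lemmatizer needs, and `crude_stem` is a simple suffix stripper, not the Porter stemmer.

```python
# Hypothetical lookup table standing in for a real lexicon.
LEMMAS = {"was": "be", "mice": "mouse", "studies": "study", "running": "run"}


def lemmatize(word):
    """Return the dictionary headword if known, else the word itself."""
    return LEMMAS.get(word, word)


# Inflectional ("ing", "s") and derivational ("ation") suffixes are mixed
# together: the stemmer does not know the difference.
SUFFIXES = ["ation", "ing", "ies", "es", "s"]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["studies", "running", "mice", "was", "organization"]:
    print(w, "| lemma:", lemmatize(w), "| stem:", crude_stem(w))
```

The lemmatizer maps "mice" to "mouse" and "was" to "be", which no suffix rule could do, while the stemmer returns non-words such as "stud" and "organiz".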

10
Q

When Stemming Can Help/Hurt

A

Help: Better for shorter words with shorter suffixes, single-step applications.
Hurt: Bad for ambiguous lemmas or words with many corner cases.

11
Q

Content Words and Function Words

A

Content Words: Have semantic content, contribute to sentence meaning.
Function Words: Have little substantive meaning, denote grammatical relationships.
Stopwords are common function words.

12
Q

Stopwords

A

Early IR systems used stopword lists to exclude those words from the index and save disk space.
Today, stopwords are usually kept in the index because they can be useful in phrase queries (e.g., “to be or not to be”).
Stopwords are mostly function words.
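The phrase-query problem can be sketched as follows. The `STOPWORDS` set, the function names, and the sample texts are all hypothetical; a real system would use a positional index rather than scanning term lists.

```python
# Hypothetical, deliberately small stopword list.
STOPWORDS = {"to", "be", "or", "not", "the", "of", "a"}


def index_terms(text, drop_stopwords):
    terms = text.lower().split()
    if drop_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    return terms


def contains_phrase(doc_terms, phrase_terms):
    """True if phrase_terms occurs as a contiguous run in doc_terms."""
    n = len(phrase_terms)
    if n == 0:
        return False  # nothing left to match once stopwords are removed
    return any(
        doc_terms[i : i + n] == phrase_terms
        for i in range(len(doc_terms) - n + 1)
    )


doc = "Hamlet asks whether to be or not to be"
query = "to be or not to be"

for drop in (True, False):
    found = contains_phrase(index_terms(doc, drop), index_terms(query, drop))
    print("stopwords dropped:", drop, "| phrase found:", found)
```

With stopwords dropped, the query is reduced to nothing and cannot match; with stopwords kept, the phrase is found.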

13
Q

Normalization vs. Tokenization

A

Normalization: Standardizes text representation for consistency.
Tokenization: Breaks text into meaningful units (tokens) for analysis.

In short, tokenization decides where the units are, and normalization decides which units count as the same term.
