Text Tokenization Flashcards
Tokenization
Segmenting running text into words/tokens
Types of Tokenization
- Whitespace split
- Penn Treebank (splits on whitespace and separates punctuation); both contrasted in the sketch below
Word Type
Distinct word in a corpus
Corpus
Collection of written texts
Word Token
An occurrence of a word in running text; the token count is the total number of occurrences, not of distinct types
Vocabulary
Set of word types in a corpus.
Each word type is represented using a distinct ID
Word (Type) Frequency
Number of occurrences (tokens) of that type in a corpus
Normalization
Converting words to a standard form
Methods of Normalization
- Lemmatization
- Stemming
- Case Folding
Lemmatization
Return the word’s ‘lemma’, i.e. the dictionary root of the word regardless of its surface form. Unlike a stem, a lemma is always an existing word (e.g. ‘am’, ‘is’ → ‘be’)
Stemming
Return the word’s ‘stem’ by chopping off affixes; the result need not be an existing word (e.g. ‘studies’ → ‘studi’)
Case Folding
Mapping every word to its lowercase form
Character Tokenization
Represents each word as a sequence of its characters (sketched below).
+smaller vocabulary
+can process every word (no unknown words)
-long sequence for each word
-characters are not intuitive units for learning meaning
Issue/Solution: Unknown Words
- Special UNK Token
- Character Tokenization
- Subword Tokenization (Byte-Pair Encoding; sketched at the end of this section)
Special UNK Token
Maps all unknown (out-of-vocabulary) words to the same reserved token, making them indistinguishable from one another