Text Tokenization Flashcards
Tokenization
Segmenting running text into words/tokens
Types of Tokenization
- Whitespace split
- Penn Treebank (splits on whitespace and separates punctuation); both contrasted in the sketch below
Word Type
Distinct word in a corpus
Corpus
Collection of written texts
Word Token
An occurrence of a word in running text; the token count is the total number of occurrences, not of distinct types
Vocabulary
Set of word types in a corpus.
Each word type is represented using a distinct ID
Word (Type) Frequency
Number of occurrences (tokens) of that type in a corpus
Normalization
Converting words to a standard form
Methods of Normalization
- Lemmatization
- Stemming
- Case Folding
Lemmatization
Return the word’s ‘lemma’, i.e. the dictionary root of the word regardless of its surface form. Unlike a stem, a lemma is always an existing word (e.g. ‘am’, ‘is’ → ‘be’)
Stemming
Return the word’s ‘stem’ by chopping off affixes; the result need not be an existing word (e.g. ‘studies’ → ‘studi’)
Case Folding
Mapping every word to its lowercase form
Character Tokenization
Represents each word as a sequence of its characters (sketched below).
+smaller vocabulary
+can process every word (no unknown words)
-long sequence for each word
-characters are not intuitive units for learning meaning
Issue/Solution: Unknown Words
- Special UNK Token
- Character Tokenization
- Subword Tokenization (Byte-Pair Encoding; sketched at the end of this section)
Special UNK Token
Maps all unknown (out-of-vocabulary) words to the same reserved token, making them indistinguishable from one another