Text Tokenization Flashcards

1
Q

Tokenization

A

Segmenting running text into words/tokens

2
Q

Types of Tokenization

A
  1. Whitespace split
  2. Penn Treebank (splits on whitespace and punctuation; e.g. contractions like doesn't become separate tokens)
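A rough sketch of both styles (the regex here only approximates Penn Treebank behavior and is not its actual rule set):

```python
import re

sentence = "Mr. Smith doesn't like green eggs."

# 1. Whitespace split: break only on spaces.
whitespace_tokens = sentence.split()

# 2. Punctuation-aware split (a crude approximation of Penn
#    Treebank-style tokenization, not its real rules):
ptb_like_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['Mr.', 'Smith', "doesn't", 'like', 'green', 'eggs.']
print(ptb_like_tokens)    # ['Mr', '.', 'Smith', 'doesn', "'", 't', 'like', 'green', 'eggs', '.']
```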
3
Q

Word Type

A

Distinct word in a corpus

4
Q

Corpus

A

Collection of written texts

5
Q

Word Token

A

A single occurrence of a word in running text; the total number of tokens is the length of the corpus in running words (not the number of types)
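For example, counting tokens versus types in a short sentence:

```python
tokens = "the cat saw the dog".split()

print(len(tokens))       # 5 word tokens (running words)
print(len(set(tokens)))  # 4 word types ('the' counts once)
```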

6
Q

Vocabulary

A

Set of word types.
Each type is assigned a distinct integer ID
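One minimal way to build such a vocabulary (IDs assigned here in order of first appearance; real systems often sort by frequency):

```python
tokens = "the cat sat on the mat".split()

vocab = {}
for tok in tokens:
    if tok not in vocab:        # each word type gets exactly one ID
        vocab[tok] = len(vocab)

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
```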

7
Q

Word (Type) Frequency

A

Number of occurrences (tokens) of that type in a corpus
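Type frequencies can be counted directly with Python's standard `Counter`:

```python
from collections import Counter

tokens = "to be or not to be".split()
freq = Counter(tokens)  # word type -> number of tokens of that type

print(freq["to"])  # 2
print(len(freq))   # 4 word types among 6 word tokens
```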

8
Q

Normalization

A

Converting words to a standard form

9
Q

Methods of Normalization

A
  1. Lemmatization
  2. Stemming
  3. Case Folding
10
Q

Lemmatization

A

Returns the word’s ‘lemma’, i.e. its dictionary root regardless of surface form (e.g. am/are/is → be). Reduces the word to an existing, normalized form.
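A minimal lemmatizer sketch using a hand-written lookup table (real lemmatizers, such as NLTK's WordNetLemmatizer, combine a dictionary with morphological rules; this table is illustrative only):

```python
# Illustrative lemma table; these entries are assumptions for the sketch.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "better": "good", "corpora": "corpus"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word.lower(), word)

print(lemmatize("corpora"))  # corpus
print(lemmatize("cat"))      # cat (unchanged)
```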

11
Q

Stemming

A

Returns the word’s ‘stem’ by chopping off affixes; unlike a lemma, the stem need not be an existing word
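A toy suffix-stripping stemmer (in the spirit of the Porter stemmer, but not its actual rules):

```python
# Suffix list is an illustrative assumption, checked longest-first.
SUFFIXES = ("ization", "ing", "ed", "es", "s")

def stem(word):
    for suf in SUFFIXES:
        # Keep at least 3 characters so short words survive.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(stem("normalization"))  # normal
print(stem("running"))        # runn  (stems need not be real words)
print(stem("docs"))           # doc
```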

12
Q

Case Folding

A

Mapping every word to its lowercase form

13
Q

Character Tokenization

A

Represents each word as a sequence of its characters.
+smaller vocabulary
+can process every word (no unknown words)
-long sequence for each word
-characters are not intuitive units for learning meaning
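In code, character tokenization is just splitting a string into its characters:

```python
word = "tokenization"
char_tokens = list(word)

# The vocabulary is tiny (letters, digits, punctuation) and no word
# is ever out-of-vocabulary, but sequences get long:
print(char_tokens[:4])   # ['t', 'o', 'k', 'e']
print(len(char_tokens))  # 12 symbols for a single word
```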

14
Q

Issue/Solution: Unknown Words

A
  1. Special UNK Token
  2. Character Tokenization
  3. Subword Tokenization (Byte-Pair)
15
Q

Special UNK Token

A

Maps all unknown (out-of-vocabulary) words to the same special token.
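A sketch of UNK handling during ID lookup (the vocabulary and the `<UNK>` symbol name are assumptions; the exact symbol varies by system):

```python
vocab = {"the": 0, "cat": 1, "<UNK>": 2}

def to_ids(tokens):
    # Any word missing from the vocabulary collapses to the <UNK> id.
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

print(to_ids(["the", "axolotl", "cat"]))  # [0, 2, 1]
```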

16
Q

Subword Tokenization

A

Splitting words into smaller subword tokens.
Ex: Byte-Pair Encoding

17
Q

Byte-Pair Encoding

A

Frequent words aren’t split.
Rare words are decomposed into smaller, meaningful subwords. (docs = doc + s).

18
Q

Byte-Pair Encoding Process

A

Represent each word as a sequence of characters, then repeatedly merge the most frequent adjacent symbol pair to form a new symbol, performing k merges (k is a specified parameter).
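The process above can be sketched as a small training loop (the word frequencies and the end-of-word marker `_` are illustrative assumptions):

```python
from collections import Counter

def bpe_merges(word_freqs, k):
    """Learn k merges: repeatedly fuse the most frequent adjacent pair."""
    vocab = dict(word_freqs)  # word (tuple of symbols) -> frequency
    merges = []
    for _ in range(k):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# "_" marks end-of-word; frequencies are made up for the sketch.
corpus = {tuple("doc_"): 10, tuple("docs_"): 5, tuple("new_"): 6}
print(bpe_merges(corpus, 3))  # [('d', 'o'), ('do', 'c'), ('doc', '_')]
```

After these merges, the frequent word "doc" becomes a single symbol while the rarer "docs" stays decomposed as doc + s, matching the behavior described above.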