Text normalization Flashcards

1
Q

Text normalization: what does it depend on?

A

Text normalization is the process of transforming a text into some predefined standard form.

There is no all-purpose normalization procedure. Text normalization depends on:

  • what type of text is being normalized
  • what type of NLP task needs to be carried out afterwards
2
Q

Text normalization: contractions, punctuation and special characters

A
  • Contractions: contracted forms (English: we’ll, don’t) and abbreviations (English: inc., Mr.) should be managed before further normalization
  • Punctuation: punctuation marks need to be isolated and treated as if they were separate words. This is critical for finding sentence boundaries and for identifying some aspects of meaning
  • Special characters: emoticons can be handled with regular expressions, emoji with specialized libraries (see the sketch below)
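
A minimal Python sketch of the ideas above, using only the standard re module; the example sentence and both patterns are illustrative assumptions rather than a complete treatment (emoji handling with a dedicated library is not shown):

```python
import re

text = "Great, we're done!!! See you soon :-)"

# Isolate punctuation marks so they can be treated as separate tokens
# (the apostrophe in the contraction is deliberately left alone).
spaced = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
print(spaced.split())
# ['Great', ',', "we're", 'done', '!', '!', '!', 'See', 'you', 'soon', ':', '-', ')']
# Note how the emoticon gets shredded: a real pipeline would match emoticons first.

# A toy emoticon pattern (eyes, optional nose, mouth) applied to the raw text.
emoticon = re.compile(r"[:;=][-o*']?[()\[\]dDpP/\\]")
print(emoticon.findall(text))   # [':-)']
```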
3
Q

Tokenization and groups of techniques

A

Tokenization is the process of segmenting text into units called tokens.

Tokenization techniques can be grouped into three families:

  • character tokenization
  • word tokenization
  • subword tokenization

Tokens are then organized into a vocabulary and, depending on the specific NLP application, may later be mapped into natural numbers.
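
A small Python sketch contrasting the three families on one sentence; the subword split is hand-picked purely to illustrate the idea, and the final mapping shows word tokens turned into natural numbers via a vocabulary:

```python
sentence = "unhappiness is common"

# Character tokenization: every character (including spaces) becomes a token.
char_tokens = list(sentence)

# Word tokenization: here a naive whitespace split.
word_tokens = sentence.split()

# Subword tokenization: hand-picked subwords, just to illustrate the idea.
subword_tokens = ["un", "happi", "ness", "is", "common"]

# Tokens organized into a vocabulary and mapped to natural numbers.
vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
ids = [vocab[tok] for tok in word_tokens]

print(char_tokens)
print(word_tokens, ids)     # ['unhappiness', 'is', 'common'] [2, 1, 0]
print(subword_tokens)
```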

4
Q

Word tokenization: in which languages is it used? Exceptions…

A

Word tokenization is the most common approach for European languages.

Important exceptions for English:

  • special compound names (white space vs. whitespace)
  • city names (San Francisco, Los Angeles), companies, etc.

There are certain language-independent tokens that require specialized processing (a regex sketch follows the list):

  • phone numbers
  • dates
  • email addresses
  • web URLs
  • hashtags
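
A sketch with deliberately simplified regular expressions for these token types; real-world formats (international phone numbers, all date styles, the full URL grammar) need far more care:

```python
import re

# Deliberately simplified patterns; real-world variants require more care.
patterns = {
    "phone":   r"\+?\d[\d ()-]{7,}\d",
    "date":    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "email":   r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "url":     r"https?://\S+",
    "hashtag": r"#\w+",
}

text = ("Call +1 (555) 123-4567 before 12/03/2024, "
        "write to info@example.com, or visit https://example.com #nlp")

for name, pattern in patterns.items():
    print(name, re.findall(pattern, text))
```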
5
Q

Character tokenization

A

Major East Asian languages (e.g., Chinese, Japanese, and Thai) write text without any spaces between words.

Each character generally represents a single unit of meaning.

Word tokenization would therefore result in a huge vocabulary, with a large number of very rare words.
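
A tiny Python sketch of character tokenization; the Chinese sentence is just an illustrative example:

```python
# Character tokenization: each character is a separate token.
sentence = "我喜欢自然语言处理"   # "I like natural language processing"
tokens = list(sentence)
print(tokens)   # ['我', '喜', '欢', '自', '然', '语', '言', '处', '理']
```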

6
Q

Subword tokenization, token learner & segmenter

A

Many NLP systems need to deal with unknown words. To deal with this problem, modern tokenizers automatically induce sets of tokens that include tokens smaller than words, called subwords.

Subword tokenization schemes have two parts:

  • the token learner takes a raw training corpus and induces a set of tokens, called the vocabulary
  • the token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary (see the sketch below)
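
A sketch of the learner/segmenter split using the HuggingFace tokenizers library (my choice of toolkit, not one named on the card); corpus.txt is a placeholder path for the raw training corpus:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Token learner: induce a vocabulary of (subword) tokens from the raw corpus.
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Token segmenter: segment a raw test sentence into tokens from that vocabulary.
print(tokenizer.encode("an unseen test sentence").tokens)
```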
7
Q

Subword tokenization: Byte-pair encoding (BPE) tokenization

A

The BPE token learner is usually run inside words (not merging across word boundaries).

Starting from a vocabulary composed of all the individual characters in the corpus, the algorithm iterates through the following steps (sketched in code after the list):

  1. choose the two symbols A, B that are most frequently adjacent in the corpus
  2. ADD a new merged symbol AB to the vocabulary
  3. REPLACE every adjacent A, B in the corpus with AB
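
A compact Python sketch of the BPE token learner following these steps; the toy corpus and the number of merges k are arbitrary illustrative choices:

```python
from collections import Counter

def bpe_learner(corpus_words, k):
    """Learn k BPE merges, working inside words only."""
    # Each word is a tuple of symbols, initially its individual characters.
    words = Counter(tuple(w) for w in corpus_words)
    vocab = {ch for w in words for ch in w}
    merges = []
    for _ in range(k):
        # Count how often each pair of symbols is adjacent in the corpus.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]   # most frequent adjacent pair A, B
        merges.append((a, b))
        vocab.add(a + b)                    # ADD the merged symbol AB
        # REPLACE every adjacent A, B in the corpus with AB.
        new_words = Counter()
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges

vocab, merges = bpe_learner(["low", "low", "lower", "newest", "newest", "widest"], k=6)
print(merges)
```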

After k iterations, BPE has learned:

  • entire words (the most frequent ones)
  • the most frequent subword units, useful for tokenizing unknown words

The BPE token segmenter applies to the test data the merges we learned from the training data (see the sketch below). Merges are applied:

  • in the order we learned them
  • greedily on each word
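
A matching sketch of the segmenter; the hard-coded merges list stands in for the learner's output and is purely illustrative:

```python
def bpe_segmenter(word, merges):
    """Segment one test word, applying merges in the order they were learned."""
    symbols = list(word)
    for a, b in merges:                 # in the order we learned them
        i = 0
        while i < len(symbols) - 1:     # greedy left-to-right pass over the word
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Example merges such as a learner might produce (illustrative only).
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"), ("n", "e"), ("ne", "w")]
print(bpe_segmenter("lowest", merges))  # ['low', 'est']
print(bpe_segmenter("newer", merges))   # ['new', 'e', 'r']
```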
8
Q

Stop words removal, stemming and lemmatization

A
  • Stop words removal means getting rid of common articles, pronouns, prepositions and conjunctions
  • Stemming refers to the process of slicing a word with the intention of removing affixes
  • Lemmatization has the objective of reducing a word to its base form, also called the lemma, thereby grouping together different inflected forms of the same word (am, are, is > be); see the sketch after this list
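
A minimal sketch using NLTK (one common choice of library, assumed here); it presupposes the stopwords and WordNet data packages have already been downloaded with nltk.download:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["the", "studies", "are", "showing", "better", "results"]

# Stop words removal: drop very common function words.
stops = set(stopwords.words("english"))
content = [t for t in tokens if t not in stops]   # ['studies', 'showing', 'better', 'results']

# Stemming: slice off affixes (the result may not be a real word, e.g. "studi").
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])         # ['studi', 'show', 'better', 'result']

# Lemmatization: reduce to the base form (lemma), e.g. am, are, is > be.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))       # 'be'
```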
9
Q

Substitution operator

A

In NLP, the substitution operator refers to the process of replacing one word or phrase with another word or phrase.

This is usually done through regular expressions (REs).
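
A short Python sketch of substitution through regular expressions; the contraction and abbreviation rules are toy examples:

```python
import re

text = "We'll meet Mr. Smith at Example inc. tomorrow"

# Replace one word or phrase with another via regular-expression substitution.
text = re.sub(r"\bWe'll\b", "We will", text)      # expand a contraction
text = re.sub(r"\binc\.", "Incorporated", text)   # normalize an abbreviation
print(text)   # We will meet Mr. Smith at Example Incorporated tomorrow
```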
