Text normalization Flashcards
Text normalization: what does it depend on?
Text normalization is the process of transforming a text into some predefined standard form.
There is no all-purpose normalization procedure. Text normalization depends on:
- what type of text is being normalized
- what type of NLP task needs to be carried out afterwards
Text normalization: contractions, punctuation and special characters
- Contractions: contracted forms (English: we’ll, don’t) and abbreviations (English: inc., Mr.) should be handled before further normalization
- Punctuation: punctuation marks need to be isolated and treated as if they were separate words; this is critical for finding sentence boundaries and for identifying some aspects of meaning
- Special characters: emoticons (regular expressions), emoji (specialized libraries); see the regex sketch after this list
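A minimal sketch of these steps with Python regular expressions; the contraction mapping and the emoticon pattern are illustrative assumptions, not complete inventories:

```python
import re

text = "Great movie!!! I'd watch it again :-) #recommended"

# Expand a few common English contractions (toy mapping, not exhaustive).
contractions = {"I'd": "I would", "don't": "do not", "we'll": "we will"}
for short, full in contractions.items():
    text = text.replace(short, full)

# Isolate punctuation marks so they become separate tokens.
text = re.sub(r"([.,!?;:])", r" \1 ", text)

# Match a small set of Western-style emoticons.
emoticon = re.compile(r"[:;=][-^]?[)(DPp]")
print(emoticon.findall(text))   # [':-)']
print(text.split())             # punctuation now appears as separate tokens
```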
Tokenization and groups of techniques
Tokenization is the process of segmenting text into units called tokens.
Tokenization techniques can be grouped into three families:
- character tokenization
- word tokenization
- subword tokenization
Tokens are then organized into a vocabulary and, depending on the specific NLP application, may later be mapped into natural numbers.
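A toy sketch of word tokenization followed by mapping tokens to natural numbers (the corpus and variable names are invented for illustration):

```python
# Whitespace word tokenization followed by mapping tokens to integer ids.
corpus = ["the cat sat on the mat", "the dog sat"]

tokens = [sentence.split() for sentence in corpus]

# Build a vocabulary: every distinct token gets a natural-number id.
vocab = {}
for sentence in tokens:
    for token in sentence:
        vocab.setdefault(token, len(vocab))

encoded = [[vocab[token] for token in sentence] for sentence in tokens]
print(vocab)     # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
print(encoded)   # [[0, 1, 2, 3, 0, 4], [0, 5, 2]]
```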
Word tokenization: in which languages is it used? Exceptions…
Word tokenization is the most common approach for European languages.
Important exceptions for English:
- special compound names (white space vs. whitespace)
- city names (San Francisco, Los Angeles), companies, etc.
There are certain language-independent tokens that require specialized processing (a few illustrative patterns are sketched after this list):
- phone numbers
- dates
- email addresses
- web URLs
- hashtags
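Illustrative, deliberately simplified regular expressions for some of these token types (real systems need far more robust patterns):

```python
import re

# Deliberately simplified patterns; production tokenizers need stricter rules.
patterns = {
    "date":    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "email":   r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "url":     r"https?://\S+",
    "hashtag": r"#\w+",
}

text = "Posted on 12/05/2023 by ada@example.com, see https://example.org #nlp"
for name, pattern in patterns.items():
    print(name, re.findall(pattern, text))
```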
Character tokenization
Several major Asian languages (e.g., Chinese, Japanese, and Thai) write text without any spaces between words.
Each character generally represents a single unit of meaning.
Word tokenization would result in a huge vocabulary with a large number of very rare words, so character tokenization is often used instead.
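A character-tokenization sketch; the example string is Chinese for "natural language processing":

```python
# Character tokenization: each character becomes one token.
text = "自然语言处理"          # "natural language processing" in Chinese
tokens = list(text)
print(tokens)                 # ['自', '然', '语', '言', '处', '理']
```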
Subword tokenization, token learner & segmenter
Many NLP systems need to deal with unknown words. To handle this problem, modern tokenizers automatically induce sets of tokens that include units smaller than words, called subwords.
Subword tokenization schemes have two parts:
- the token learner takes a raw training corpus and induces a set of tokens, called the vocabulary
- the token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary
Subword tokenization: Byte-pair encoding (BPE) tokenization
The BPE token learner is usually run inside words (not merging across word boundaries).
The algorithm iterates through the following steps:
- begin with a corpus and a vocabulary composed of all individual characters
- choose the two symbols A, B that are most frequently adjacent in the corpus
- ADD a new merged symbol AB to the vocabulary
- REPLACE every adjacent A, B in the corpus with AB
After k merge iterations, BPE has learned:
- entire (frequent) words
- the most frequent subword units, useful for tokenizing unknown words
The BPE token segmenter applies the merges learned from the training data to test sentences (a sketch follows this list). Merges are applied:
- in the order we learned them
- greedily on each word
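A compact sketch of a BPE token learner and segmenter, assuming a toy corpus and an end-of-word marker "_"; this illustrates the algorithm above and is not a production implementation:

```python
from collections import Counter

def merge_pair(word, a, b):
    """Replace every adjacent occurrence of (a, b) in a symbol tuple with a+b."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus_words, k):
    """Toy BPE token learner: learn k merges from a list of words."""
    # Each word is a tuple of symbols, starting from its characters,
    # plus an end-of-word marker so merges never cross word boundaries.
    words = Counter(tuple(w) + ("_",) for w in corpus_words)
    vocab = {ch for word in words for ch in word}
    merges = []
    for _ in range(k):
        # Count how often each pair of symbols is adjacent in the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequently adjacent pair
        merges.append((a, b))
        vocab.add(a + b)                        # ADD the merged symbol
        # REPLACE every adjacent a, b in the corpus with the merged symbol.
        words = Counter({merge_pair(w, a, b): f for w, f in words.items()})
    return vocab, merges

def segment(word, merges):
    """Toy BPE token segmenter: apply learned merges to a new word."""
    symbols = tuple(word) + ("_",)
    for a, b in merges:                         # in the order we learned them
        symbols = merge_pair(symbols, a, b)     # greedily on each word
    return list(symbols)

corpus = ("low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3).split()
vocab, merges = learn_bpe(corpus, k=8)
print(merges)                      # e.g. ('e', 'r'), ('er', '_'), ...
print(segment("newest", merges))   # segment an unseen word with learned merges
```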
Stop words removal, stemming and lemmatization
- Stop word removal consists of getting rid of very common words such as articles, pronouns, prepositions and conjunctions
- Stemming refers to the process of chopping off parts of a word with the intention of removing affixes
- Lemmatization has the objective of reducing a word to its base form, also called the lemma, therefore grouping together different forms of the same word (am, are, is > be); see the sketch after this list
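A short sketch of all three steps using NLTK, assuming the library is installed and the stopwords/wordnet corpora have been downloaded (download calls shown as comments):

```python
# Requires: pip install nltk
# and, once: nltk.download("stopwords"); nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["the", "cats", "are", "running", "happily"]

# Stop word removal: drop very common function words.
stop = set(stopwords.words("english"))
content_words = [w for w in words if w not in stop]
print(content_words)                             # ['cats', 'running', 'happily']

# Stemming: chop off affixes (the result need not be a real word).
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in content_words])  # e.g. ['cat', 'run', 'happili']

# Lemmatization: map a form to its lemma.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))      # 'be'
```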
Substitution operator
In NLP, the substitution operator refers to the process of replacing one word or phrase with another word or phrase.
This is usually done with regular expressions (REs), as in the sketch below.
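A minimal substitution sketch with Python's re.sub (pattern and replacement are invented for illustration):

```python
import re

text = "NLP is hard. nlp is also fun."

# Replace every case variant of "nlp" with the expanded phrase.
normalized = re.sub(r"\bnlp\b", "natural language processing",
                    text, flags=re.IGNORECASE)
print(normalized)
# natural language processing is hard. natural language processing is also fun.
```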