Text normalization Flashcards
Text normalization: what does it depend on?
Text normalization is the process of transforming a text into some predefined standard form.
There is no all-purpose normalization procedure. Text normalization depends on:
- what type of text is being normalized
- what type of NLP task needs to be carried out afterwards
Text normalization: contractions, punctuation and special characters
- Contractions: contracted forms (English: we’ll, don’t) and abbreviations (English: inc., Mr.) should be handled before further normalization
- Punctuation: punctuation marks need to be isolated and treated as if they were separate words; this is critical for finding sentence boundaries and for identifying some aspects of meaning
- Special characters: emoticons (regular expressions), emoji (specialized libraries); see the regex sketch after this list
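A minimal sketch of these steps with Python regular expressions; the contraction mapping and the emoticon pattern are illustrative assumptions, not complete inventories:

```python
import re

text = "Great movie!!! I'd watch it again :-) #recommended"

# Expand a few common English contractions (toy mapping, not exhaustive).
contractions = {"I'd": "I would", "don't": "do not", "we'll": "we will"}
for short, full in contractions.items():
    text = text.replace(short, full)

# Isolate punctuation marks so they become separate tokens.
text = re.sub(r"([.,!?;:])", r" \1 ", text)

# Match a small set of Western-style emoticons.
emoticon = re.compile(r"[:;=][-^]?[)(DPp]")
print(emoticon.findall(text))   # [':-)']
print(text.split())             # punctuation now appears as separate tokens
```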
Tokenization and groups of techniques
Tokenization is the process of segmenting text into units called tokens.
Tokenization techniques can be grouped into three families:
- character tokenization
- word tokenization
- subword tokenization
Tokens are then organized into a vocabulary and, depending on the specific NLP application, may later be mapped into natural numbers.
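A toy sketch of word tokenization followed by mapping tokens to natural numbers (the corpus and variable names are invented for illustration):

```python
# Whitespace word tokenization followed by mapping tokens to integer ids.
corpus = ["the cat sat on the mat", "the dog sat"]

tokens = [sentence.split() for sentence in corpus]

# Build a vocabulary: every distinct token gets a natural-number id.
vocab = {}
for sentence in tokens:
    for token in sentence:
        vocab.setdefault(token, len(vocab))

encoded = [[vocab[token] for token in sentence] for sentence in tokens]
print(vocab)     # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
print(encoded)   # [[0, 1, 2, 3, 0, 4], [0, 5, 2]]
```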
Word tokenization: in which languages is it used? Exceptions…
Word tokenization is the most common approach for European languages.
Important exceptions for English:
- special compound names (white space vs. whitespace)
- city names (San Francisco, Los Angeles), companies, etc.
There are certain language-independent tokens that require specialized processing (a few illustrative patterns are sketched after this list):
- phone numbers
- dates
- email addresses
- web URLs
- hashtags
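Illustrative, deliberately simplified regular expressions for some of these token types (real systems need far more robust patterns):

```python
import re

# Deliberately simplified patterns; production tokenizers need stricter rules.
patterns = {
    "date":    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "email":   r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "url":     r"https?://\S+",
    "hashtag": r"#\w+",
}

text = "Posted on 12/05/2023 by ada@example.com, see https://example.org #nlp"
for name, pattern in patterns.items():
    print(name, re.findall(pattern, text))
```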
Character tokenization
Several major Asian languages (e.g., Chinese, Japanese, and Thai) write text without any spaces between words.
Each character generally represents a single unit of meaning.
Word tokenization would result in a huge vocabulary with a large number of very rare words, so character tokenization is often used instead.
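A character-tokenization sketch; the example string is Chinese for "natural language processing":

```python
# Character tokenization: each character becomes one token.
text = "自然语言处理"          # "natural language processing" in Chinese
tokens = list(text)
print(tokens)                 # ['自', '然', '语', '言', '处', '理']
```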
Subword tokenization, token learner & segmenter
Many NLP systems need to deal with unknown words. To handle this problem, modern tokenizers automatically induce sets of tokens that include units smaller than words, called subwords.
Subword tokenization schemes have two parts:
- the token learner takes a raw training corpus and induces a set of tokens, called the vocabulary
- the token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary
Subword tokenization: Byte-pair encoding (BPE) tokenization
The BPE token learner is usually run inside words (not merging across word boundaries).
The algorithm iterates through the following steps:
- begin with a corpus and a vocabulary composed of all individual characters
- choose the two symbols A, B that are most frequently adjacent in the corpus
- ADD a new merged symbol AB to the vocabulary
- REPLACE every adjacent A, B in the corpus with AB
After k merge iterations, BPE has learned:
- entire (frequent) words
- the most frequent subword units, useful for tokenizing unknown words
The BPE token segmenter applies the merges learned from the training data to test sentences (a sketch follows this list). Merges are applied:
- in the order we learned them
- greedily on each word
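A compact sketch of a BPE token learner and segmenter, assuming a toy corpus and an end-of-word marker "_"; this illustrates the algorithm above and is not a production implementation:

```python
from collections import Counter

def merge_pair(word, a, b):
    """Replace every adjacent occurrence of (a, b) in a symbol tuple with a+b."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus_words, k):
    """Toy BPE token learner: learn k merges from a list of words."""
    # Each word is a tuple of symbols, starting from its characters,
    # plus an end-of-word marker so merges never cross word boundaries.
    words = Counter(tuple(w) + ("_",) for w in corpus_words)
    vocab = {ch for word in words for ch in word}
    merges = []
    for _ in range(k):
        # Count how often each pair of symbols is adjacent in the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequently adjacent pair
        merges.append((a, b))
        vocab.add(a + b)                        # ADD the merged symbol
        # REPLACE every adjacent a, b in the corpus with the merged symbol.
        words = Counter({merge_pair(w, a, b): f for w, f in words.items()})
    return vocab, merges

def segment(word, merges):
    """Toy BPE token segmenter: apply learned merges to a new word."""
    symbols = tuple(word) + ("_",)
    for a, b in merges:                         # in the order we learned them
        symbols = merge_pair(symbols, a, b)     # greedily on each word
    return list(symbols)

corpus = ("low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3).split()
vocab, merges = learn_bpe(corpus, k=8)
print(merges)                      # e.g. ('e', 'r'), ('er', '_'), ...
print(segment("newest", merges))   # segment an unseen word with learned merges
```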
Stop words removal, stemming and lemmatization
- Stop word removal consists of getting rid of very common words such as articles, pronouns, prepositions and conjunctions
- Stemming refers to the process of chopping off parts of a word with the intention of removing affixes
- Lemmatization has the objective of reducing a word to its base form, also called the lemma, therefore grouping together different forms of the same word (am, are, is > be); see the sketch after this list
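A short sketch of all three steps using NLTK, assuming the library is installed and the stopwords/wordnet corpora have been downloaded (download calls shown as comments):

```python
# Requires: pip install nltk
# and, once: nltk.download("stopwords"); nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["the", "cats", "are", "running", "happily"]

# Stop word removal: drop very common function words.
stop = set(stopwords.words("english"))
content_words = [w for w in words if w not in stop]
print(content_words)                             # ['cats', 'running', 'happily']

# Stemming: chop off affixes (the result need not be a real word).
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in content_words])  # e.g. ['cat', 'run', 'happili']

# Lemmatization: map a form to its lemma.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))      # 'be'
```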
Substitution operator
In NLP, the substitution operator refers to the process of replacing one word or phrase with another word or phrase.
This is usually done with regular expressions (REs), as in the sketch below.
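A minimal substitution sketch with Python's re.sub (pattern and replacement are invented for illustration):

```python
import re

text = "NLP is hard. nlp is also fun."

# Replace every case variant of "nlp" with the expanded phrase.
normalized = re.sub(r"\bnlp\b", "natural language processing",
                    text, flags=re.IGNORECASE)
print(normalized)
# natural language processing is hard. natural language processing is also fun.
```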