TEXT PREPROCESSING Flashcards
Before we start text preprocessing, what is it important to think about
What genres and domains the text is in
Genres: social media/emails/literature
Domain: Chemistry/politics/entertainment
Because we may need domain-specific resources
Things like format and punctuation will change depending on these
What is tokenisation
Breaking input into individual units of text (tokens)
Easiest way is to use whitespace
Issue with whitespace
Some languages (e.g. Chinese, Japanese) do not separate words with whitespace, so whitespace splitting will not tokenise them correctly
What are the typical tokenisation steps
- Initial segmentation (whitespace)
- Handling abbreviations and apostrophes
- Handling hyphenation
- Dealing with other special expressions (e.g. URLs)
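The steps above can be sketched in Python; the regex patterns and the URL/abbreviation handling below are illustrative assumptions, not a standard algorithm:

```python
import re

def tokenise(text):
    """Rough word tokeniser: whitespace split, then a few special cases.
    The patterns are illustrative only, not a standard algorithm."""
    tokens = []
    for chunk in text.split():
        # Keep URLs intact (peeling any trailing punctuation off first)
        if chunk.startswith(("http://", "https://")):
            url = chunk.rstrip(".,;:!?")
            tokens.append(url)
            tokens.extend(chunk[len(url):])
            continue
        # Keep abbreviations like "e.g." or "U.K." intact
        if re.fullmatch(r"(\w\.){2,}", chunk):
            tokens.append(chunk)
            continue
        # Otherwise split leading/trailing punctuation into their own tokens
        pre, core, post = re.fullmatch(r"(\W*)(.*?)(\W*)", chunk).groups()
        tokens.extend(list(pre) + ([core] if core else []) + list(post))
    return tokens

print(tokenise("See https://example.com, e.g. don't stop!"))
# ['See', 'https://example.com', ',', 'e.g.', "don't", 'stop', '!']
```

Note that the apostrophe in "don't" survives because the core pattern allows internal punctuation; only leading and trailing punctuation is split off.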
Traditional NLP tokenisation
Uses simple word-like tokens
Modern NLP tokenisation
Subword tokenisation
What is normalisation
Process of standardizing and transforming text data to a common, consistent format
Consists of:
Lowercasing
Removing Punctuation
Removing Stop Words: removing common words (e.g., “the,” “and,” “is”)
Stemming
Lemmatization
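A minimal sketch of the lowercasing, punctuation-removal, and stop-word steps (the stop-word list is a tiny illustrative sample, not a standard list):

```python
import string

STOP_WORDS = {"the", "and", "is", "a", "an", "to"}  # tiny illustrative list

def normalise(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stop words
    return [t for t in text.split() if t not in STOP_WORDS]

print(normalise("The cat AND the hat!"))  # ['cat', 'hat']
```

Stemming or lemmatisation would then be applied to the surviving tokens.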
What is lemmatisation
reduction to “dictionary headword” form
(lemma)
{am, are, is} -> be
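A toy lemmatiser sketch; a real lemmatiser (e.g. a WordNet-based one) uses a full dictionary plus part-of-speech information, so the lookup table here is purely illustrative:

```python
# Illustrative lemma table only -- real lemmatisers use a full dictionary
# and part-of-speech tags to pick the right headword.
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
          "better": "good", "mice": "mouse"}

def lemmatise(token):
    # Fall back to the lowercased token if no lemma is known
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatise(w) for w in ["Am", "are", "is"]])  # ['be', 'be', 'be']
```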
What is morphological analysis
Words are not the only units of ‘meaning’
subwords, called morphemes, carry some meaning
(un) (happy) (ness)
(prefix) (stem) (suffix)
So morphological analysis is taking a word and identifying its morphemes: the stem and any affixes
This can depend on the context
What is a ‘derivation’
formation of a word from its stem plus affixes (prefixes and suffixes)
eg Un-Happi-Ness
Friend-ly
What is inflection
The modification of a word to express different grammatical roles
eg come -> came
waiter -> waitress
What is regular inflectional morphology
Changes that occur to express grammatical features like tense, number, gender, case
Easy and predictable
What is derivational morphology
Creation of new words by adding affixes or other modifications
often changes the word class
teach -> teacher
What is stemming
Chops “ends of words”
removes suffixes, sometimes prefixes
quite quick
Can largely neutralise inflection and some derivation
Can yield non-words
What is under-stemming
fails to conflate related terms
divide -> divid
division -> divis
What is over-stemming
conflates unrelated terms
neutron, neutral -> neutr
What is the Porter stemmer
One of the most common stemmers for English
rules based
focuses on suffix stripping
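A crude suffix-stripper in the spirit of Porter's approach; the real Porter stemmer applies staged rules with measure conditions on the stem, so this toy version only illustrates the idea (the suffix list is an illustrative assumption):

```python
def crude_stem(word):
    """Toy suffix-stripper. Strips the first matching suffix if at
    least 3 characters of stem would remain. The real Porter stemmer
    uses staged, measure-conditioned rules instead."""
    for suffix in ("ational", "iveness", "fulness", "ization",
                   "ness", "ment", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(crude_stem("teaching"))   # 'teach'
print(crude_stem("divided"))    # 'divid'  (a non-word, as noted above)
print(crude_stem("happiness"))  # 'happi'
```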
What is character n-gram tokenisation
Breaks down text into “n” length substring tokens
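A minimal sketch:

```python
def char_ngrams(text, n):
    """All length-n character substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```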
What is Byte-pair encoding
We add '_' at the end of every word to mark where words are separated
Let the initial vocabulary {A,B,C…a,b,c…} be the set of individual characters in the corpus
Choose the two symbols which are the most frequently adjacent in the corpus eg ‘e’‘r’
Add a new symbol ‘er’
{…x,y,z,er}
replace all ‘e’‘r’ with ‘er’
Repeat until k merges have been done
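The learner loop above can be sketched as follows (the toy corpus is illustrative, and tie-breaking between equally frequent pairs is arbitrary here):

```python
from collections import Counter

def bpe_learn(corpus_words, k):
    """BPE token learner sketch: start from single characters (plus a
    '_' end-of-word marker) and perform k merges of the most frequent
    adjacent symbol pair."""
    words = [list(w) + ["_"] for w in corpus_words]
    merges = []
    for _ in range(k):
        # Count every adjacent symbol pair across the corpus
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every adjacent occurrence of (a, b) with the symbol a+b
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, segmented = bpe_learn(["er", "er", "era"], k=2)
print(merges)     # [('e', 'r'), ('er', '_')]
print(segmented)  # [['er_'], ['er_'], ['er', 'a', '_']]
```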
What is a token learner
takes a raw training corpus and induces a vocabulary
(inventory of tokens)
What is a token segmenter
takes raw test data (i.e. an input sentence) and tokenises it according to that vocabulary
What is the result of a BPE token learner
Most words and frequent subwords (e.g. affixes) will be represented as full symbols
{ing, ed, er}
Very rare tokens (including unknown words) will be represented by their parts (subwords)
Can ‘control’ k (the number of merges) as a parameter depending on how many symbols we want in the vocabulary
What is a BPE token segmenter
On new data, run each merge learned from the training data greedily, in the order the merges were learned
– Merge every 'e' 'r' to 'er' first, then merge 'er' '_' to 'er_'
Words that have not been seen before will be represented by subtokens, e.g. 'low' 'er'
BPE reduces the number of unseen, out-of-vocabulary (OOV) tokens
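A matching segmenter sketch, applying the learned merges greedily and in order; the merge list below is a hand-picked example rather than one learned from a real corpus:

```python
def bpe_segment(word, merges):
    """BPE token segmenter sketch: split a new word into characters
    (plus the '_' marker), then apply each learned merge greedily,
    in the order the merges were learned."""
    symbols = list(word) + ["_"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Example merges, as if 'e r' and then 'er _' were the most frequent pairs
merges = [("e", "r"), ("er", "_")]
print(bpe_segment("lower", merges))  # ['l', 'o', 'w', 'er_']
```

Note how the unseen word "lower" falls back to single characters plus the learned subtoken 'er_'.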