Text Preprocessing Flashcards
Before we start text preprocessing, what is it important to think about
What genre and domains the text is in
Genres: social media/emails/literature
Domain: Chemistry/politics/entertainment
Because we may need specific resources
Things like format and punctuation will change depending on the genre and domain
What is tokenisation
Breaking input into individual units of text (tokens)
Easiest way is to use whitespace
Issue with whitespace
Some languages (e.g. Chinese, Japanese) do not put whitespace between words, so they will not be tokenised correctly
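A minimal sketch of whitespace tokenisation and where it falls short, using only Python's built-in `str.split`. The function name `whitespace_tokenise` and the example sentences are made up for illustration.

```python
# Whitespace tokenisation: split on runs of spaces, tabs, and newlines.
def whitespace_tokenise(text: str) -> list[str]:
    return text.split()

english = whitespace_tokenise("The cat sat on the mat.")
# Punctuation stays attached to the neighbouring word ("mat.").
print(english)

# Chinese is written without spaces between words, so the whole
# sentence comes back as a single "token".
chinese = whitespace_tokenise("我喜欢自然语言处理")
print(chinese)
```

Even for English the result is only a first pass, since punctuation is not separated from words.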
What are the typical tokenisation steps
- Initial segmentation (whitespace)
- Handling abbreviations and apostrophes
- Handling hyphenations
- Dealing with other special expressions (e.g. URLs)
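The steps above can be roughly sketched as one regular expression. This is only an illustration under simplifying assumptions: the pattern `TOKEN_PATTERN` and the function `tokenise` are hypothetical names, the URL rule only recognises `http://` and `https://`, and a URL directly followed by punctuation would swallow that punctuation.

```python
import re

# Order matters: more specific patterns come first so that a URL or an
# abbreviation is matched whole before the generic word rule can split it.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+               # URLs
  | (?:[A-Za-z]\.){2,}         # abbreviations such as U.S. or e.g.
  | \w+(?:[-']\w+)*            # words, incl. hyphenated words and apostrophes
  | [^\w\s]                    # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenise(text):
    """Return the tokens of `text` in order, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("The U.S. state-of-the-art model isn't at https://example.com yet."))
```

Here hyphenated words and contractions are kept as single tokens. Other tokenisers split them instead; which is right depends on the task.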
Traditional NLP tokenisation
Uses simple word-like tokens
Modern NLP tokenisation
Subword tokenisation
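To give a feel for subword tokenisation, here is a toy sketch that splits a word by greedy longest-match against a small vocabulary. The vocabulary `VOCAB` and the function `subword_tokenise` are invented for this example. Real systems learn their vocabulary from a corpus and may use a different splitting procedure.

```python
# Toy vocabulary, invented for this example. Real subword vocabularies
# are learned from a corpus and hold tens of thousands of entries.
VOCAB = {"un", "happi", "ness", "happy", "token", "is", "ation"}

def subword_tokenise(word, vocab=VOCAB):
    """Greedy longest-match-first split of one word into known pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(subword_tokenise("unhappiness"))
print(subword_tokenise("tokenisation"))
```

A word that is not in the vocabulary as a whole can still be represented by smaller known pieces, down to single characters.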
What is normalisation
Process of standardising and transforming text data to a common, consistent format
Consists of:
Lowercasing
Removing Punctuation
Removing Stop Words: removing common words (e.g., “the,” “and,” “is”)
Stemming
Lemmatisation
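A minimal sketch of the first three normalisation steps (lowercasing, punctuation removal, stop-word removal) using only the standard library. The stop-word list `STOP_WORDS` and the function `normalise` are illustrative assumptions. Stemming and lemmatisation are left out here.

```python
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "on"}

def normalise(text, stop_words=STOP_WORDS):
    # 1. Lowercase.
    text = text.lower()
    # 2. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenise on whitespace and drop stop words.
    return [tok for tok in text.split() if tok not in stop_words]

print(normalise("The cat is on the mat, and the dog barks!"))
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need extra handling.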
What is lemmatisation
Reduction to “dictionary headword” form (lemma)
{am, are, is} -> be
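The mapping above can be sketched as a dictionary lookup. The table `LEMMAS` is hand-written for illustration. Real lemmatisers rely on full dictionaries and usually on part-of-speech information as well.

```python
# Hand-written lookup table, for illustration only.
LEMMAS = {
    "am": "be", "are": "be", "is": "be", "was": "be",
    "came": "come", "mice": "mouse",
}

def lemmatise(token, lemmas=LEMMAS):
    """Return the lemma of `token`; unknown words come back lowercased."""
    token = token.lower()
    return lemmas.get(token, token)

print([lemmatise(t) for t in ["am", "are", "is"]])
```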
What is morphological analysis
Words are not the only units of ‘meaning’
subwords = morphemes, have some meaning
(un) (happy) (ness)
(prefix) (stem) (suffix)
So morphological analysis is taking a word and identifying its morphemes: its stem and affixes
This can depend on the context
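A toy sketch of splitting a word into (prefix) (stem) (suffix) by plain string matching. The affix lists `PREFIXES` and `SUFFIXES` and the function `analyse` are made up for this example. Real morphological analysers use lexicons and rules rather than blind matching.

```python
# Small, hypothetical affix lists for illustration.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ly", "er")

def analyse(word):
    """Return (prefix, stem, suffix); a missing affix is an empty string."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[: len(rest) - len(suffix)]
    return prefix, stem, suffix

print(analyse("unhappiness"))
print(analyse("friendly"))
print(analyse("under"))
```

The last call shows why blind matching is not enough: "under" is wrongly split into "un" + "d" + "er", although it contains neither affix.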
What is a ‘derivation’
Formation of a new word from a stem plus affixes (prefixes and suffixes)
e.g. un-happi-ness
friend-ly
What is inflection
The modification of a word to express different grammatical roles
e.g. come -> came
waiter -> waitress
What is regular inflectional morphology
Changes that occur to express grammatical features like tense, number, gender, case
Follows easy, predictable rules, e.g. walk -> walked, cat -> cats
What is derivational morphology
Creation of new words by adding affixes or other modifications
Often changes the word class
e.g. teach (verb) -> teacher (noun)
What is stemming
Chops “ends of words”
removes suffixes, sometimes prefixes
quite quick
Can largely neutralise inflection and some derivation
Can yield non-words
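A minimal suffix-stripping sketch of the idea. The suffix list and the `min_stem` length check are assumptions made for illustration. They are not the rules of the Porter stemmer or of any other published algorithm.

```python
# Suffixes to strip, longest first. This list is a made-up illustration.
SUFFIXES = ("ation", "ness", "ing", "ies", "ed", "es", "ly", "s")

def stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["walked", "walking", "walks", "happiness"]:
    print(w, "->", stem(w))
```

The inflected forms of "walk" all reduce to the same stem, while "happiness" becomes "happi", which is not a word.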
What is under-stemming
The stemmer fails to conflate related terms, so they end up with different stems
divide -> divid
division -> divis
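The example above can be reproduced with a toy stemmer and a check for whether two words receive the same stem. The suffix list and the helper `conflated` are hypothetical. The list was chosen only so that the two stems above come out as shown.

```python
# Made-up suffix list, checked in this order.
SUFFIXES = ("ion", "ed", "e", "s")

def stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflated(a, b):
    """True when the stemmer maps both words to the same stem."""
    return stem(a) == stem(b)

print(stem("divide"), stem("division"), conflated("divide", "division"))
print(stem("walked"), stem("walks"), conflated("walked", "walks"))
```

"divide" and "division" are related but get different stems (under-stemming), whereas "walked" and "walks" are conflated as intended.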