Week 4 Flashcards
Text pre-processing steps
Document-level preparation
- Document conversion
- Language/domain identification
Tokenisation
- Case folding
Basic lexical pre-processing
- Lemmatization
- Stemming
- Spelling correction
Lexical Processing
Words in a given context
Two main tasks:
Normalisation (stemming, lemmatisation)
POS Tagging (verbs, nouns, etc…)
Normalisation
Map tokens to normalised forms
{walks, walked, walk, walking} -> walk
Two principal approaches:
- Lemmatisation
- Stemming
Lemmatisation
Reduction to “dictionary headword” form (lemma)
Language Dependent
How to do:
Dictionaries - look-up might be slow
Morphological analysis
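The dictionary look-up approach can be sketched in a few lines. This is a toy illustration (the entries are invented, a real lexicon would have many thousands), showing why irregular forms need explicit entries:

```python
# A minimal sketch of dictionary-based lemmatisation (toy data, not a real
# lexicon): map each inflected form to its dictionary headword via look-up.
LEMMA_DICT = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "geese": "goose", "better": "good",  # irregular forms need explicit entries
}

def lemmatise(token: str) -> str:
    """Return the lemma if the token is in the dictionary, else the token itself."""
    return LEMMA_DICT.get(token.lower(), token.lower())

print([lemmatise(w) for w in ["Walking", "geese", "cat"]])
# -> ['walk', 'goose', 'cat']
```

Look-up is O(1) per token with a hash table, but the dictionary itself is large and never complete, which is why morphological analysis is the usual complement.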
Morphological analysis
Subwords (morphemes) each carry some ‘meaning’
un + happy + ness = unhappiness
Take a word and work out what its stem and affixes are
Typically rule-based approach
Regular inflectional morphology is “easy” (in English)
Nouns are simple
Verbs are only slightly more complex
Irregular words are difficult (may still need a dictionary)
Depends on context
Any approach is quite computationally expensive
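A rule-based analysis can be sketched by stripping known affixes. The affix lists below are illustrative only, nowhere near a full English morphology, and note the output keeps the raw stem (`happi`, not `happy`): handling such spelling alternations is exactly the extra work that makes real morphological analysis expensive:

```python
# Toy rule-based morphological analysis: split a word into prefix + stem +
# suffix using small, illustrative affix lists (not a real English morphology).
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def segment(word: str) -> list[str]:
    """Return a list of morphemes, e.g. unhappiness -> ['un', 'happi', 'ness']."""
    morphemes = []
    for p in PREFIXES:
        # Length check stops short words like 'sing' losing a fake affix
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[:-len(s)]
            break
    morphemes.append(word)  # whatever remains is the stem
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unhappiness"))  # -> ['un', 'happi', 'ness']
print(segment("walked"))       # -> ['walk', 'ed']
```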
Morpheme
The smallest constituent part of a word
Could be “dog” or “-s”, but not “dogs”; a morpheme cannot be broken down any further
Derivational morphology is messy
Quasi-systematicity
Irregular meaning change
Changes of word class
Stemming
Chop ends of words (typically) before indexing
- Remove suffixes, possibly prefixes
studies -> studi
studying -> study
Language dependent, often heuristic, crude
May yield forms that are not words
automate, automatic -> automat
Neutralises inflections and some derivation
Merge related words into conflation groups
Useful for some applications, but may also be aggressive and conflate some unrelated groups
e.g. experiment, experience -> experi
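A deliberately crude stemmer makes both effects above visible. The suffix list here is invented for illustration (this is not the Porter algorithm): it conflates the related forms, but also merges experiment/experience, and leaves studies/studying in different groups:

```python
# A crude suffix-stripping stemmer (toy rules, not Porter): chop listed
# endings, keeping at least 4 characters of stem.
SUFFIX_RULES = ["ences", "ments", "ence", "ment", "ing", "ion", "ed", "es", "s"]

def crude_stem(word: str) -> str:
    for suf in SUFFIX_RULES:  # longest suffixes listed first
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[:-len(suf)]
    return word

for w in ["studies", "studying", "experiment", "experience"]:
    print(w, "->", crude_stem(w))
# studies -> studi, studying -> study   (under-stemming: not conflated)
# experiment -> experi, experience -> experi   (over-stemming: conflated)
```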
conflation groups
The groups stemming merges related words into
Inflection
The modification of a word to express different grammatical categories / roles:
tense
person
number
…
Under-stemming
Failure to conflate related forms
divide -> divid
division -> divis
Over-stemming
conflates unrelated forms
neutron, neutral -> neutr
Porter Stemmer
One of the most common algorithms for stemming in English
Rule-based = implement specific patterns
readily available in many languages
Main idea: “model” what the endings look like and strip them
In English, patterns depend in particular on ‘syllables’ at the end of the word and how long the word is
Step 1: Get rid of plurals and -ed or -ing suffixes
(m > 0) eed -> ee, agreed -> agree
Step 2: Turn terminal y to i when there is another vowel in the stem
Step 3: Map double suffixes to single ones (if m > 0)
Step 4: Deal with suffixes, -ful, -ness, etc… (if m > 0)
Step 5: Take off -ant, -ence, etc… (if m > 1)
Step 6: Tidy up (remove -e and double letters)
if m > 1 and the last letter is e, remove it
probate -> probat
if m > 1 and the word ends in a double l, make it single
controll -> control
Other rules can be applied
- conflict resolution: if more than one rule applies, use the one with the longest matching suffix for the given word
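The measure m counts vowel-consonant sequences in the stem, and gates most rules. A sketch of m and of the single (m > 0) eed -> ee rule above (simplified: the real Porter algorithm treats y context-dependently, which is ignored here):

```python
def measure(stem: str) -> int:
    """Porter's m: count of vowel-consonant sequences, as in [C](VC)^m[V]."""
    vowels = "aeiou"
    m, prev_vowel = 0, False
    for ch in stem:
        is_vowel = ch in vowels  # simplification: 'y' treated as a consonant
        if prev_vowel and not is_vowel:
            m += 1  # a vowel run just ended: one more VC sequence
        prev_vowel = is_vowel
    return m

def step1_eed(word: str) -> str:
    """Porter rule (m > 0) EED -> EE: agreed -> agree, but feed stays feed."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word

print(measure("tree"), measure("trouble"))   # 0 1
print(step1_eed("agreed"), step1_eed("feed"))  # agree feed
```

The m > 0 condition is what protects short words: the stem of "feed" is just "f" (m = 0), so the rule does not fire.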
Spelling error probabilities
0.05% in carefully edited newswire
26% in web queries
On average, 1 word in every tweet is misspelled
80% of errors are single-error mistakes (one insertion, deletion, substitution, or transposition)
Almost all errors within edit distance 2
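Edit distance here means Levenshtein distance: the minimum number of insertions, deletions, and substitutions turning one string into the other. A standard dynamic-programming sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("graffe", "giraffe"))  # 1: insert the missing 'i'
print(edit_distance("piece", "peace"))     # 2: two substitutions
```

Since almost all spelling errors are within distance 2, a corrector can restrict candidate corrections to that radius.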
Types of spelling error
Non-word errors
graffe -> giraffe
Real-word errors
piece -> peace
Detecting these needs context
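The distinction can be shown with a simple vocabulary check (the tiny vocabulary here is invented for illustration): a non-word error fails the look-up, but a real-word error passes it, which is why context (e.g. a language model over the surrounding words) is needed to catch the latter:

```python
# Toy vocabulary for illustration only.
VOCAB = {"a", "giraffe", "piece", "peace", "of", "cake"}

def flag_non_words(tokens: list[str]) -> list[str]:
    """Return tokens that fail the vocabulary look-up (non-word errors)."""
    return [t for t in tokens if t.lower() not in VOCAB]

print(flag_non_words(["a", "graffe"]))          # ['graffe'] -- detected
print(flag_non_words(["peace", "of", "cake"]))  # [] -- real-word error slips through
```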