Week 4 Flashcards
Text pre-processing steps
Document-level preparation
- Document conversion
- Language/domain identification
Tokenisation
- Case folding
Basic lexical pre-processing
- Lemmatisation
- Stemming
- Spelling correction
Lexical Processing
Words in a given context
Two main tasks:
- Normalisation (stemming, lemmatisation)
- POS Tagging (verbs, nouns, etc…)
Normalisation
Map tokens to normalised forms
{walks, walked, walk, walking} -> walk
Two principal approaches:
- Lemmatisation
- Stemming
Lemmatisation
Reduction to “dictionary headword” form (lemma)
Language Dependent
How to do:
Dictionaries - look-up might be slow
Morphological analysis
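A minimal sketch of dictionary-based lemmatisation using NLTK's WordNet lemmatiser (assumes nltk is installed and the WordNet data has been downloaded; the example words are illustrative):

```python
# Minimal lemmatisation sketch using NLTK's WordNet lemmatiser.
# Assumes: pip install nltk, then nltk.download("wordnet") has been run.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The lemmatiser needs a part-of-speech hint; pos="v" marks these as verbs.
for token in ["walks", "walked", "walking", "walk"]:
    print(token, "->", lemmatizer.lemmatize(token, pos="v"))
# All four forms map to the dictionary headword (lemma) "walk".
```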
Morphological analysis
Subwords (morphemes) have some ‘meaning’
un + happy + ness = unhappiness
Take a word and work out what its stem and affixes are
Typically rule-based approach
Regular inflectional morphology is “easy” (in English)
Nouns are simple
Verbs are only slightly more complex
Irregular words are difficult (may still need a dictionary)
Depends on context
Any approach is quite computationally expensive
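A toy sketch of rule-based morpheme splitting; the affix lists and the single spelling-repair rule are illustrative assumptions, not a real morphological analyser:

```python
# Toy rule-based morphological split: strip known affixes from a word.
# The affix lists and the "i -> y" repair rule are illustrative only.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def split_morphemes(word):
    prefixes, suffixes = [], []
    for p in PREFIXES:
        if word.startswith(p):
            prefixes.append(p)
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s):
            suffixes.append(s)
            word = word[: -len(s)]
            break
    # Crude spelling repair: "happi" (from "happiness") back to "happy".
    if word.endswith("i"):
        word = word[:-1] + "y"
    return prefixes, word, suffixes

print(split_morphemes("unhappiness"))   # (['un'], 'happy', ['ness'])
```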
Morpheme
The smallest constituent part of a word
Could be “dog” or “-s”, but not “dogs”; a morpheme cannot be broken down further
Derivational morphology is messy
Quasi-systematicity
Irregular meaning change
Changes of word class
Stemming
Chop ends of words (typically) before indexing
- Remove suffixes, possibly prefixes
studies -> studi
studying -> study
Language dependent, often heuristic, crude
May yield forms that are not words
automate, automatic -> automat
Neutralises inflections and some derivation
Merge related words into conflation groups
Useful for some applications, but may also be aggressive and conflate some unrelated groups
e.g. experiment, experience -> experi
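A short sketch of stemming and conflation groups using NLTK's Porter stemmer (assumes nltk is installed; exact outputs depend on the stemmer implementation):

```python
# Stemming sketch with NLTK's Porter stemmer (assumes: pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["studies", "studying", "automate", "automatic",
             "experiment", "experience"]:
    print(word, "->", stemmer.stem(word))
# Related words land in the same conflation group (automate/automatic),
# but unrelated words can be conflated too (experiment/experience).
```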
Conflation groups
The groups stemming merges related words into
Inflection
The modification of a word to express different grammatical categories / roles:
tense
person
number
…
Under-stemming
Failure to conflate related forms
divide -> divid
division -> divis
Over-stemming
conflates unrelated forms
neutron, neutral -> neutr
Porter Stemmer
One of the most common algorithms for stemming in English
Rule-based = implement specific patterns
readily available in many languages
Main idea: “model” what the endings look like and strip them
In English, patterns depend in particular on ‘syllables’ at the end of the word and how long the word is
Step 1: Get rid of plurals and -ed or -ing suffixes
eed -> ee, agreed -> agree
If m > 0
Step 2: Turn terminal y to i when there is another vowel in the stem
If m > 0
Step 3: Map double suffixes to single ones
If m > 0
Step 4: Deal with suffixes, -ful, -ness, etc…
If m > 0
Step 5: Take off -ant, -ence, etc…
If m > 1
Step 6: Tidy up (remove -e and double letters)
if m > 1 and last letter is e, remove.
probate -> probat
if m > 1 and the word ends in a double consonant whose last letter is l, reduce to a single letter
controll -> control
Other rules can be applied
- conflict resolution: if several rules apply, use the one with the longest matching suffix for the given word
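A toy sketch of the rule-application idea, including the longest-matching-suffix conflict resolution; the rule table is a tiny illustrative subset, not the full Porter rule set:

```python
# Toy Porter-style suffix rewriting with conflict resolution: when several
# rules match, apply the one with the longest matching suffix.
# The rule table below is an illustrative subset, not the real rule set.
RULES = {
    "sses": "ss",   # caresses -> caress
    "ies":  "i",    # ponies   -> poni
    "ss":   "ss",   # caress   -> caress
    "s":    "",     # cats     -> cat
    "eed":  "ee",   # agreed   -> agree
}

def apply_rules(word):
    matches = [suffix for suffix in RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix = max(matches, key=len)              # longest matching suffix wins
    return word[: len(word) - len(suffix)] + RULES[suffix]

for w in ["caresses", "ponies", "caress", "cats", "agreed"]:
    print(w, "->", apply_rules(w))
```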
Spelling error probabilities
0.05% in carefully edited newswire
26% in web queries
On average, 1 word in every tweet is misspelled
80% of errors are single-error mistakes
Almost all errors within edit distance 2
Types of spelling error
Non-word errors
graffe -> giraffe
Real-word errors
piece -> peace (detecting this needs context)
Non-word spelling errors
Errors that produce a string that is not a real word
graffe -> giraffe
Real-word spelling errors
Errors that produce a different real word
piece -> peace (detecting this needs context)
Spelling error edit operations (examples for “the”)
Insertion (ther)
Deletion (th)
Substitution (thw)
Transposition (teh)
Causes of spelling errors
Typographical
- Keyboard related errors (80% novice, 51% overall)
- Immediately adjacent keys (50% of novice, 31% of all)
Cognitive
- Phonetic - seperate -> separate
- Homophones - there -> their, piece -> peace
Optical character recognition
- “D” -> “O”, “ri” -> “n”
Spelling error detection
Non-word errors
Any word not in a dictionary is an error
- The larger the dictionary the better … up to a point
- (The web is full of mis-spellings, so the web isn’t necessarily a good option)
Generating spelling error candidates
Take the four common errors
Insertion (ther)
Deletion (th)
Substitution (thw)
Transposition (teh)
And generate candidates that are <= k edit distance from the error
Can compute these fast using a Levenshtein finite state transducer
Or by having a precomputed list of possible corrections
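A sketch of edit-distance-1 candidate generation filtered against a precomputed word list, loosely in the style of Norvig's well-known spell corrector; the tiny word list is an illustrative assumption:

```python
# Generate correction candidates within edit distance 1 of a misspelling
# using the four edit operations, then keep only those in the word list.
# The tiny WORDS set is illustrative; a real system uses a full dictionary.
import string

WORDS = {"the", "there", "their", "then", "they"}

def edits1(word):
    splits      = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes     = [l + r[1:] for l, r in splits if r]
    transposes  = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r
                   for c in string.ascii_lowercase]
    inserts     = [l + c + r for l, r in splits
                   for c in string.ascii_lowercase]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word, vocabulary=WORDS):
    return edits1(word) & vocabulary        # only keep real words

print(candidates("teh"))                    # {'the'} with this word list
```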
Information Retrieval paradigm
We’d like the best answer, but instead of searching for the single best we:
Find a (very) good candidate set (spelling candidates within a small edit distance k)
Find the K best amongst them and return them as the best
These may not actually be the best
Methods to estimate P(x|w) for noisy channel / error probability model
Simple local factors - consider the most important factors predicting an insertion, deletion, or transposition
- Unequal weightings attached to different editing operations
- Insertion and deletion probabilities are conditioned on context. The probability of inserting or deleting a character is conditioned on the letter appearing immediately to the left of that character
- Positional Information, the position in the string where the edit occurs
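A toy sketch of the noisy channel choice, picking the candidate w that maximises P(x|w) · P(w); the candidate set and all probabilities below are made-up illustrative numbers, not real estimates:

```python
# Toy noisy channel: choose the candidate w maximising log P(x|w) + log P(w).
# The candidate set and all probabilities are made-up illustrative numbers.
import math

x = "acress"                                   # the observed misspelling

# P(x | w): probability of typing x when w was intended, in practice
# estimated from confusion matrices of insertions/deletions/etc.
error_model = {"actress": 1e-4, "across": 9e-5, "cress": 1e-6}

# P(w): prior probability of the intended word, e.g. from corpus counts.
prior = {"actress": 2e-5, "across": 3e-4, "cress": 5e-7}

def score(w):
    return math.log(error_model[w]) + math.log(prior[w])

print(max(error_model, key=score))             # best-scoring correction of x
```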
How does Google do spellchecking?
Google figures out possible misspellings and their likely correct spellings by using words it finds while searching the web and processing user queries
How to find candidates for real-word spelling errors
For each word w, generate a candidate set:
- similar pronunciations
- similar spellings
- include w in the set
Choose best candidate
- Noisy Channel view of spelling errors
- Context-sensitive - so have to consider whether the surrounding words “make sense”
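A toy sketch of the context-sensitive choice: every candidate (including the original word) is scored by how well it fits its neighbours under a bigram model; the candidate set and bigram probabilities are illustrative assumptions:

```python
# Toy context-sensitive correction of a possible real-word error.
# The candidate set and bigram probabilities are illustrative assumptions.
import math

sentence = ["a", "piece", "of", "cake"]
target_index = 1                               # position of the word to check
candidate_set = {"piece", "peace"}             # includes the word itself

# P(next | previous) for the bigrams we need (made-up numbers).
bigram = {
    ("a", "piece"): 0.002, ("piece", "of"): 0.30,
    ("a", "peace"): 0.001, ("peace", "of"): 0.01,
}

def context_score(candidate):
    prev_word = sentence[target_index - 1]
    next_word = sentence[target_index + 1]
    return (math.log(bigram[(prev_word, candidate)])
            + math.log(bigram[(candidate, next_word)]))

print(max(candidate_set, key=context_score))   # "piece" fits this context best
```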
Porter Stemmer Notation - C
A list of one or more consonants (c)
Porter Stemmer Notation - V
A list of one or more vowels (v)
Porter Stemmer Notation - []
[C] denotes the optional (arbitrary) presence of a consonant sequence
Porter Stemmer Notation - (condition)m
The condition repeated m times
Porter Stemmer Pattern words are decomposed into
[C](VC)m[V]
m is then referred to as the word’s measure
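A small sketch of computing the measure m, i.e. the number of VC sequences in the [C](VC)m[V] decomposition; treating y as a consonant throughout is a simplification of Porter's actual rules:

```python
# Compute the Porter "measure" m of a word: the number of VC sequences in
# its [C](VC)m[V] decomposition.  Treating "y" as a consonant throughout is
# a simplification of Porter's context-dependent vowel rule.
def measure(word):
    vowels = set("aeiou")
    pattern = []                       # collapsed V/C pattern, e.g. "CVCV"
    for ch in word.lower():
        kind = "V" if ch in vowels else "C"
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")

for w in ["tree", "trouble", "oats", "private"]:
    print(w, measure(w))               # tree 0, trouble 1, oats 1, private 2
```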