Week 4 Flashcards
Text pre-processing steps
Document-level preparation
- Document conversion
- Language/domain identification
Tokenisation
- Case folding
Basic lexical pre-processing
- Lemmatization
- Stemming
- Spelling correction
Lexical Processing
Words in a given context
Two main tasks:
Normalisation (stemming, lemmatisation)
POS Tagging (verbs, nouns, etc…)
Normalisation
Map tokens to normalised forms
{walks, walked, walk, walking} -> walk
Two principal approaches:
- Lemmatisation
- Stemming
Lemmatisation
Reduction to “dictionary headword” form (lemma)
Language Dependent
How to do:
Dictionaries - look-up might be slow
Morphological analysis
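The dictionary look-up approach can be sketched in a few lines. This is a toy illustration (the entries are invented, a real lexicon would have many thousands), showing why irregular forms need explicit entries:

```python
# A minimal sketch of dictionary-based lemmatisation (toy data, not a real
# lexicon): map each inflected form to its dictionary headword via look-up.
LEMMA_DICT = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "geese": "goose", "better": "good",  # irregular forms need explicit entries
}

def lemmatise(token: str) -> str:
    """Return the lemma if the token is in the dictionary, else the token itself."""
    return LEMMA_DICT.get(token.lower(), token.lower())

print([lemmatise(w) for w in ["Walking", "geese", "cat"]])
# -> ['walk', 'goose', 'cat']
```

Look-up is O(1) per token with a hash table, but the dictionary itself is large and never complete, which is why morphological analysis is the usual complement.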
Morphological analysis
Subwords (morphemes) each carry some ‘meaning’
un + happy + ness = unhappiness
Take a word and work out what its stem and affixes are
Typically rule-based approach
Regular inflectional morphology is “easy” (in English)
Nouns are simple
Verbs are only slightly more complex
Irregular words are difficult (may still need a dictionary)
Depends on context
Any approach is quite computationally expensive
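A rule-based analysis can be sketched by stripping known affixes. The affix lists below are illustrative only, nowhere near a full English morphology, and note the output keeps the raw stem (`happi`, not `happy`): handling such spelling alternations is exactly the extra work that makes real morphological analysis expensive:

```python
# Toy rule-based morphological analysis: split a word into prefix + stem +
# suffix using small, illustrative affix lists (not a real English morphology).
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def segment(word: str) -> list[str]:
    """Return a list of morphemes, e.g. unhappiness -> ['un', 'happi', 'ness']."""
    morphemes = []
    for p in PREFIXES:
        # Length check stops short words like 'sing' losing a fake affix
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[:-len(s)]
            break
    morphemes.append(word)  # whatever remains is the stem
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unhappiness"))  # -> ['un', 'happi', 'ness']
print(segment("walked"))       # -> ['walk', 'ed']
```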
Morpheme
The smallest constituent part of a word
Could be “dog” or “-s”, but not “dogs”; a morpheme cannot be broken down any further
Derivational morphology is messy
Quasi-systematicity
Irregular meaning change
Changes of word class
Stemming
Chop ends of words (typically) before indexing
- Remove suffixes, possibly prefixes
studies -> studi
studying -> study
Language dependent, often heuristic, crude
May yield forms that are not words
automate, automatic -> automat
Neutralises inflections and some derivation
Merge related words into conflation groups
Useful for some applications, but may also be aggressive and conflate some unrelated groups
e.g. experiment, experience -> experi
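A deliberately crude stemmer makes both effects above visible. The suffix list here is invented for illustration (this is not the Porter algorithm): it conflates the related forms, but also merges experiment/experience, and leaves studies/studying in different groups:

```python
# A crude suffix-stripping stemmer (toy rules, not Porter): chop listed
# endings, keeping at least 4 characters of stem.
SUFFIX_RULES = ["ences", "ments", "ence", "ment", "ing", "ion", "ed", "es", "s"]

def crude_stem(word: str) -> str:
    for suf in SUFFIX_RULES:  # longest suffixes listed first
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[:-len(suf)]
    return word

for w in ["studies", "studying", "experiment", "experience"]:
    print(w, "->", crude_stem(w))
# studies -> studi, studying -> study   (under-stemming: not conflated)
# experiment -> experi, experience -> experi   (over-stemming: conflated)
```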
conflation groups
The groups stemming merges related words into
Inflection
The modification of a word to express different grammatical categories / roles:
tense
person
number
…
Under-stemming
Failure to conflate related forms
divide -> divid
division -> divis
Over-stemming
conflates unrelated forms
neutron, neutral -> neutr
Porter Stemmer
One of the most common algorithms for stemming in English
Rule-based = implement specific patterns
readily available in many languages
Main idea: “model” what the endings look like and strip them
In English, patterns depend in particular on ‘syllables’ at the end of the word and how long the word is
Step 1: Get rid of plurals and -ed or -ing suffixes
(m > 0) eed -> ee, agreed -> agree
Step 2: Turn terminal y to i when there is another vowel in the stem
Step 3: Map double suffixes to single ones (if m > 0)
Step 4: Deal with suffixes, -ful, -ness, etc… (if m > 0)
Step 5: Take off -ant, -ence, etc… (if m > 1)
Step 6: Tidy up (remove -e and double letters)
if m > 1 and the last letter is e, remove it
probate -> probat
if m > 1 and the word ends in a double l, make it single
controll -> control
Other rules can be applied
- conflict resolution: if more than one rule applies, use the one with the longest matching suffix for the given word
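The measure m counts vowel-consonant sequences in the stem, and gates most rules. A sketch of m and of the single (m > 0) eed -> ee rule above (simplified: the real Porter algorithm treats y context-dependently, which is ignored here):

```python
def measure(stem: str) -> int:
    """Porter's m: count of vowel-consonant sequences, as in [C](VC)^m[V]."""
    vowels = "aeiou"
    m, prev_vowel = 0, False
    for ch in stem:
        is_vowel = ch in vowels  # simplification: 'y' treated as a consonant
        if prev_vowel and not is_vowel:
            m += 1  # a vowel run just ended: one more VC sequence
        prev_vowel = is_vowel
    return m

def step1_eed(word: str) -> str:
    """Porter rule (m > 0) EED -> EE: agreed -> agree, but feed stays feed."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word

print(measure("tree"), measure("trouble"))   # 0 1
print(step1_eed("agreed"), step1_eed("feed"))  # agree feed
```

The m > 0 condition is what protects short words: the stem of "feed" is just "f" (m = 0), so the rule does not fire.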
Spelling error probabilities
0.05% in carefully edited newswire
26% in web queries
On average, 1 word in every tweet is misspelled
80% of errors are single-error mistakes (one insertion, deletion, substitution, or transposition)
Almost all errors within edit distance 2
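Edit distance here means Levenshtein distance: the minimum number of insertions, deletions, and substitutions turning one string into the other. A standard dynamic-programming sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("graffe", "giraffe"))  # 1: insert the missing 'i'
print(edit_distance("piece", "peace"))     # 2: two substitutions
```

Since almost all spelling errors are within distance 2, a corrector can restrict candidate corrections to that radius.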
Types of spelling error
Non-word errors
graffe -> giraffe
Real-word errors
piece -> peace
Detecting these needs context
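The distinction can be shown with a simple vocabulary check (the tiny vocabulary here is invented for illustration): a non-word error fails the look-up, but a real-word error passes it, which is why context (e.g. a language model over the surrounding words) is needed to catch the latter:

```python
# Toy vocabulary for illustration only.
VOCAB = {"a", "giraffe", "piece", "peace", "of", "cake"}

def flag_non_words(tokens: list[str]) -> list[str]:
    """Return tokens that fail the vocabulary look-up (non-word errors)."""
    return [t for t in tokens if t.lower() not in VOCAB]

print(flag_non_words(["a", "graffe"]))          # ['graffe'] -- detected
print(flag_non_words(["peace", "of", "cake"]))  # [] -- real-word error slips through
```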