Text Preprocessing Flashcards
Corpus
A computer-readable collection of text or speech.
Lemma
A set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
Word-form
The full inflected or derived form of the word.
Word type
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
Word token
Tokens are the total number N of running words.
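As a quick illustration, a minimal Python sketch (the toy corpus below is made up) that counts word tokens and word types:

corpus = "the cat sat on the mat and the dog sat too"  # hypothetical toy corpus
tokens = corpus.split()     # naive whitespace tokenization
types = set(tokens)         # distinct word forms
print("N =", len(tokens))   # N = 11 word tokens
print("|V| =", len(types))  # |V| = 8 word types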
Herdan’s Law
The larger the corpus, the more word types we find.
|V| = kN^β
The value of β depends on the corpus size and the genre, but typically ranges from .67 to .75.
A.k.a. Heaps’ Law
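A small Python sketch of the formula; the values of k and β below are hypothetical, chosen only to fall in the typical range:

k, beta = 40, 0.70           # illustrative constants, not fitted to any real corpus
for N in (10**6, 10**7, 10**8):
    V = k * N ** beta        # Herdan's / Heaps' law: |V| = kN^beta
    print(f"N = {N:>11,}  predicted |V| ≈ {V:,.0f}")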
datasheet or data statement
Specifies properties of a dataset, like:
- Motivation
- Situation
- Language variety
- Speaker demographics
- Collection process
- Annotation process
- Distribution
Datasheet properties
Motivation
Why was the corpus collected, by whom, and who funded it?
Datasheet properties
Situation
When and in what situation was the text written / spoken?
E.g. was there a task? Was the language originally spoken conversation, edited text, social media communication, monologue vs dialogue?
Datasheet properties
Language variety
What language (including dialect / region) was the corpus in?
Datasheet properties
Speaker demographics
What was, e.g., the age or gender of the authors of the text?
Datasheet properties
Collection process
How big is the data?
If it is a subsample how was it sampled?
Was the data collected with consent?
How was the data pre-processed, and what metadata is available?
Datasheet properties
Annotation process
What are the annotations, what are the demographics of the annotators, how were they trained, how was the data annotated?
Datasheet properties
Distribution
Are there copyright or other intellectual property restrictions?
3 common tasks associated with Text Normalisation
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
Tokenization
The task of segmenting running text into words.
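A minimal sketch of a rule-based tokenizer using a regular expression; a real tokenizer handles many more cases (abbreviations, numbers, URLs, etc.):

import re

# keep contractions together and split punctuation into its own tokens
pattern = r"\w+(?:'\w+)?|[^\w\s]"
print(re.findall(pattern, "We're happy, aren't we?"))
# ["We're", 'happy', ',', "aren't", 'we', '?']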
Clitic contractions
Contractions marked by apostrophes, e.g. what're, we're.
Clitic
A part of a word that can’t stand on its own, and can only occur when it is attached to another word.
E.g. we're, j'ai, l'homme.
Subwords
Subwords can be arbitrary substrings, or they can be meaning-bearing units like the morphemes -est or -er.
Morpheme
The smallest meaning-bearing unit of a language.
E.g. the word unlikeliest has the morphemes un-, likely, and -est.
2 parts of most tokenization schemes
A token learner and a token segmenter.
token learner
takes a raw training corpus (sometimes roughly separated into words, e.g. by whitespace) and induces a vocabulary, a set of tokens.
token segmenter
takes a raw test sentence and segments it into the tokens in the vocabulary.
Byte-pair encoding algorithm
A token learner.
It begins with a vocabulary that is just the set of all individual characters.
It then examines the training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’, ‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent ‘A’ ‘B’ in the corpus with the new ‘AB’.
It continues to count and merge, creating new longer and longer character strings, until k merges have been done, creating k novel tokens.
k is thus a parameter of the algorithm.
The resulting vocabulary consists of the original set of characters plus k new symbols.
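A toy Python sketch of the BPE token learner described above (the segmenter is not shown); the end-of-word marker "_" is a common convention rather than part of the core algorithm:

from collections import Counter

def learn_bpe(corpus_words, k):
    # word frequency table; each word is a tuple of characters plus an end-of-word marker
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word) + ("_",)] += 1

    merges = []
    for _ in range(k):
        # count how often each pair of adjacent symbols occurs in the corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequently adjacent pair
        merges.append(best)
        # replace every adjacent occurrence of the pair with the new merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# hypothetical tiny corpus; learn 5 merges and print the merge rules
print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 5))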
word normalisation
The task of putting words / tokens in a standard format, choosing a single normal form for words with multiple forms like USA and US or uh-huh and uhhuh.
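A minimal sketch using a hand-written lookup table; the table entries here are just examples, and real systems differ in which forms they collapse:

# toy normalization table mapping variant forms to a single normal form
NORMALIZE = {"usa": "US", "u.s.a.": "US", "uh-huh": "uhhuh"}

def normalize(token):
    return NORMALIZE.get(token.lower(), token)

print([normalize(t) for t in ["USA", "uh-huh", "cat"]])  # ['US', 'uhhuh', 'cat']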
lemmatization
the task of determining that two words have the same root, despite their surface differences.
E.g. am, are and is have the shared lemma be.
dinner and dinners both have the lemma dinner.
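One way to try this in practice is NLTK's WordNet-based lemmatizer; this sketch assumes NLTK is installed and the WordNet data has been fetched with nltk.download("wordnet"):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dinners"))       # treated as a noun by default -> "dinner"
print(lemmatizer.lemmatize("are", pos="v"))  # as a verb -> "be"
print(lemmatizer.lemmatize("is", pos="v"))   # as a verb -> "be"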
Morphology
The study of the way words are built up from smaller meaning-bearing units called morphemes.
2 broad classes of morphemes
“stems” - the central morpheme of the word, supplying the main meaning
affixes - morphemes that add “additional” meanings of various kinds.
Stemming
A naive version of morphological analysis.
This mainly consists of chopping off word-final affixes.
Porter Algorithm
Simple and efficient way to do stemming, stripping off affixes.
It does not have high accuracy but may be useful for some tasks.
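For example, NLTK ships a Porter stemmer that can be used like this (assumes NLTK is installed; the stems it produces are often not real words):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "motoring"]:
    # prints crude stems obtained by rule-based suffix stripping
    print(word, "->", stemmer.stem(word))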
Minimum edit distance
The minimum edit distance between two strings is defined as the minimum number of editing operations (e.g. insertion, deletion, substitution) needed to transform one string into another.
We can also assign a weight / cost to each of these operations. Levenshtein distance is the simplest, with each of the operations having a cost of 1.
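A standard dynamic-programming sketch with configurable operation costs; unit costs give the Levenshtein distance:

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    # D[i][j] = minimum cost of transforming source[:i] into target[:j]
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,                       # deletion
                D[i][j - 1] + ins_cost,                       # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs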