Text Preprocessing Flashcards
Corpus
A computer-readable collection of text or speech.
Lemma
A set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
Word-form
The full inflected or derived form of the word.
Word type
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
Word token
Tokens are the total number N of running words.
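As a quick illustration, a minimal Python sketch (the toy corpus below is made up) that counts word tokens and word types:

corpus = "the cat sat on the mat and the dog sat too"  # hypothetical toy corpus
tokens = corpus.split()     # naive whitespace tokenization
types = set(tokens)         # distinct word forms
print("N =", len(tokens))   # N = 11 word tokens
print("|V| =", len(types))  # |V| = 8 word types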
Herdan’s Law
The larger the corpus, the more word types we find.
|V| = kN^β
The value of β depends on the corpus size and the genre, but typically ranges from .67 to .75.
A.k.a. Heaps’ Law
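A small Python sketch of the formula; the values of k and β below are hypothetical, chosen only to fall in the typical range:

k, beta = 40, 0.70           # illustrative constants, not fitted to any real corpus
for N in (10**6, 10**7, 10**8):
    V = k * N ** beta        # Herdan's / Heaps' law: |V| = kN^beta
    print(f"N = {N:>11,}  predicted |V| ≈ {V:,.0f}")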
datasheet or data statement
Specifies properties of a dataset, like:
- Motivation
- Situation
- Language variety
- Speaker demographics
- Collection process
- Annotation process
- Distribution
Datasheet properties
Motivation
Why was the corpus collected, by whom, and who funded it?
Datasheet properties
Situation
When and in what situation was the text written / spoken?
E.g. was there a task? Was the language originally spoken conversation, edited text, social media communication, monologue vs dialogue?
Datasheet properties
Language variety
What language (including dialect / region) was the corpus in?
Datasheet properties
Speaker demographics
What was, e.g., the age or gender of the authors of the text?
Datasheet properties
Collection process
How big is the data?
If it is a subsample how was it sampled?
Was the data collected with consent?
How was the data pre-processed, and what metadata is available?
Datasheet properties
Annotation process
What are the annotations, what are the demographics of the annotators, how were they trained, how was the data annotated?
Datasheet properties
Distribution
Are there copyright or other intellectual property restrictions?
3 common tasks associated with Text Normalisation
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
Tokenization
The task of segmenting running text into words.
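A minimal sketch of a rule-based tokenizer using a regular expression; a real tokenizer handles many more cases (abbreviations, numbers, URLs, etc.):

import re

# keep contractions together and split punctuation into its own tokens
pattern = r"\w+(?:'\w+)?|[^\w\s]"
print(re.findall(pattern, "We're happy, aren't we?"))
# ["We're", 'happy', ',', "aren't", 'we', '?']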
Clitic contractions
Contractions marked by apostrophes, e.g. what're, we're.
Clitic
A part of a word that can’t stand on its own, and can only occur when it is attached to another word.
E.g. we're, j'ai, l'homme.
Subwords
Subwords can be arbitrary substrings, or they can be meaning-bearing units like the morphemes -est or -er.
Morpheme
The smallest meaning-bearing unit of a language.
E.g. the word unlikeliest has the morphemes un-, likely, and -est.
2 parts of most tokenization schemes
A token learner and a token segmenter.
token learner
takes a raw training corpus (sometimes roughly separated into words, e.g. by whitespace) and induces a vocabulary, a set of tokens.
token segmenter
takes a raw test sentence and segments it into the tokens in the vocabulary.
Byte-pair encoding algorithm
A token learner.
It begins with a vocabulary that is just the set of all individual characters.
It then examines the training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’, ‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent ‘A’ ‘B’ in the corpus with the new ‘AB’.
It continues to count and merge, creating new longer and longer character strings, until k merges have been done, creating k novel tokens.
k is thus a parameter of the algorithm.
The resulting vocabulary consists of the original set of characters plus k new symbols.
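A toy Python sketch of the BPE token learner described above (the segmenter is not shown); the end-of-word marker "_" is a common convention rather than part of the core algorithm:

from collections import Counter

def learn_bpe(corpus_words, k):
    # word frequency table; each word is a tuple of characters plus an end-of-word marker
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word) + ("_",)] += 1

    merges = []
    for _ in range(k):
        # count how often each pair of adjacent symbols occurs in the corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequently adjacent pair
        merges.append(best)
        # replace every adjacent occurrence of the pair with the new merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# hypothetical tiny corpus; learn 5 merges and print the merge rules
print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 5))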
word normalisation
The task of putting words / tokens in a standard format, choosing a single normal form for words with multiple forms like USA and US or uh-huh and uhhuh.
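A minimal sketch using a hand-written lookup table; the table entries here are just examples, and real systems differ in which forms they collapse:

# toy normalization table mapping variant forms to a single normal form
NORMALIZE = {"usa": "US", "u.s.a.": "US", "uh-huh": "uhhuh"}

def normalize(token):
    return NORMALIZE.get(token.lower(), token)

print([normalize(t) for t in ["USA", "uh-huh", "cat"]])  # ['US', 'uhhuh', 'cat']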
lemmatization
the task of determining that two words have the same root, despite their surface differences.
E.g. am, are and is have the shared lemma be.
dinner and dinners both have the lemma dinner.
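One way to try this in practice is NLTK's WordNet-based lemmatizer; this sketch assumes NLTK is installed and the WordNet data has been fetched with nltk.download("wordnet"):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dinners"))       # treated as a noun by default -> "dinner"
print(lemmatizer.lemmatize("are", pos="v"))  # as a verb -> "be"
print(lemmatizer.lemmatize("is", pos="v"))   # as a verb -> "be"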
Morphology
The study of the way words are built up from smaller meaning-bearing units called morphemes.
2 broad classes of morphemes
“stems” - the central morpheme of the word, supplying the main meaning
affixes - morphemes that add “additional” meanings of various kinds.
Stemming
A naive version of morphological analysis.
This mainly consists of chopping off word-final affixes.
Porter Algorithm
Simple and efficient way to do stemming, stripping off affixes.
It does not have high accuracy but may be useful for some tasks.
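For example, NLTK ships a Porter stemmer that can be used like this (assumes NLTK is installed; the stems it produces are often not real words):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "motoring"]:
    # prints crude stems obtained by rule-based suffix stripping
    print(word, "->", stemmer.stem(word))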
Minimum edit distance
The minimum edit distance between two strings is defined as the minimum number of editing operations (e.g. insertion, deletion, substitution) needed to transform one string into another.
We can also assign a weight / cost to each of these operations. Levenshtein distance is the simplest, with each of the operations having a cost of 1.
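A standard dynamic-programming sketch with configurable operation costs; unit costs give the Levenshtein distance:

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    # D[i][j] = minimum cost of transforming source[:i] into target[:j]
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,                       # deletion
                D[i][j - 1] + ins_cost,                       # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs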