Text Preprocessing Flashcards
Corpus
A computer-readable collection of text or speech.
Lemma
A set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
Word-form
The full inflected or derived form of the word.
Word type
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
Word token
Tokens are the total number N of running words.
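A minimal sketch of counting types and tokens, assuming words are lowercased and split on runs of letters and apostrophes. That tokenization rule and the function name `count_types_and_tokens` are illustrative choices, not part of the definitions above; whether "The" and "the" count as one type depends on the normalization you pick.

```python
import re
from collections import Counter

def count_types_and_tokens(text):
    """Return (number of types |V|, number of tokens N) for a text."""
    # Assumption: lowercase everything and treat runs of letters or
    # apostrophes as words; punctuation is discarded.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)  # one entry per distinct word (type)
    return len(counts), len(tokens)

num_types, num_tokens = count_types_and_tokens(
    "The cat sat on the mat. The mat was flat."
)
print(num_types, num_tokens)  # 7 types, 10 tokens
```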
Herdan’s Law
The larger the corpus, the more word types we find.
|V| = kN^β, where k and β are positive constants and 0 < β < 1.
The value of β depends on the corpus size and the genre, but typically ranges from .67 to .75.
A.k.a. Heaps’ Law
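A rough sketch of using the relation |V| = kN^β: one function predicts the vocabulary size, and another estimates k and β from observed (N, |V|) pairs by least squares in log-log space, since log |V| = log k + β log N. The function names and the sample values k = 30 and β = 0.7 are assumptions for illustration only, not measurements from any corpus.

```python
import math

def herdan_vocab_size(n_tokens, k, beta):
    """Predicted number of types |V| for a corpus of n_tokens tokens."""
    return k * n_tokens ** beta

def fit_herdan(points):
    """Estimate (k, beta) from (N, |V|) pairs via a log-log linear fit."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line is beta; the intercept is log k.
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    k = math.exp(mean_y - beta * mean_x)
    return k, beta

# Synthetic points generated from hypothetical constants, then recovered.
points = [(n, herdan_vocab_size(n, 30.0, 0.7)) for n in (10_000, 100_000, 1_000_000)]
k_hat, beta_hat = fit_herdan(points)
```

With real corpora the points will not lie exactly on a line, so the fitted values are only estimates.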
Datasheet or data statement
Specifies properties of a dataset, like:
- Motivation
- Situation
- Language variety
- Speaker demographics
- Collection process
- Annotation process
- Distribution
Datasheet properties
Motivation
Why was the corpus collected, by whom, and who funded it?
Datasheet properties
Situation
When and in what situation was the text written / spoken?
E.g., was there a task? Was the language originally spoken conversation, edited text, or social media communication? Was it monologue or dialogue?
Datasheet properties
Language variety
What language (including dialect / region) was the corpus in?
Datasheet properties
Speaker demographics
What was, e.g., the age or gender of the authors of the text?
Datasheet properties
Collection process
How big is the data?
If it is a subsample, how was it sampled?
Was the data collected with consent?
How was the data pre-processed, and what metadata is available?
Datasheet properties
Annotation process
What are the annotations, what are the demographics of the annotators, how were they trained, how was the data annotated?
Datasheet properties
Distribution
Are there copyright or other intellectual property restrictions?
3 common tasks associated with text normalization
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
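The three tasks above can be sketched with simple regular expressions. This is a deliberately naive illustration, assuming English text: the function names are hypothetical, lowercasing stands in for word-format normalization, and the sentence splitter will wrongly break on abbreviations such as "Dr.".

```python
import re

def segment_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Split into words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

def normalize(tokens):
    """Assumption: case folding is the only normalization applied here."""
    return [t.lower() for t in tokens]

text = "The cat sat. It didn't move!"
for sentence in segment_sentences(text):
    print(normalize(tokenize(sentence)))
```

Real pipelines typically use trained or rule-rich tokenizers and sentence segmenters; the regexes here only show how the three steps fit together.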