Corpus Annotation Flashcards
Why annotate?
- abstract from individual cases
• e.g. research on a specific word is facilitated by lemma annotations, which abstract
from the different word forms: the lemma give stands for give, gives, gave, given, giving
• e.g. research on word order is facilitated by part-of-speech annotations:
a man who … VERB vs. a man VERB who …, as in a man entered who I had not seen before
- disambiguate ambiguous forms, e.g. walk as a verb or noun
- facilitate quantitative investigations
• e.g. how long (#characters) is an average verb in English?
Note: annotations are interpretations
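The quantitative question above (how long is an average verb?) becomes a short computation once tokens carry POS tags. The sketch below is only an illustration: the tagged sample and the simplified tagset (DET, NOUN, VERB, ...) are assumptions made for this example, not taken from a real corpus. Tagging "had" as VERB is itself an interpretation.

```python
# Hypothetical POS-tagged sample: (token, tag) pairs with a simplified,
# made-up tagset. A real corpus would supply these annotations.
tagged = [
    ("a", "DET"), ("man", "NOUN"), ("entered", "VERB"),
    ("who", "PRON"), ("I", "PRON"), ("had", "VERB"),
    ("not", "ADV"), ("seen", "VERB"), ("before", "ADV"),
]


def average_length(tagged_tokens, tag):
    """Return the mean length in characters of all tokens carrying `tag`."""
    lengths = [len(token) for token, t in tagged_tokens if t == tag]
    if not lengths:
        raise ValueError(f"no tokens tagged {tag!r}")
    return sum(lengths) / len(lengths)


# entered (7) + had (3) + seen (4) = 14 characters over 3 verbs
avg_verb_length = average_length(tagged, "VERB")
print(round(avg_verb_length, 2))  # 4.67
```

Without the VERB tags, the same question would require deciding for every token whether it is a verb, which is exactly the work the annotation stores.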
Leech’s 7 maxims of annotation
- It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
- It should be possible to extract the annotations by themselves from the text.
- The annotation scheme should be based on guidelines which are available to the end user.
- It should be made clear how and by whom the annotation was carried out.
- The end user should be made aware that the corpus annotation is not error-free or infallible, but simply a potentially useful tool.
- Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
- No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
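The first two maxims (the raw corpus can be recovered, and the annotations can be extracted on their own) can be illustrated with a simple inline format. The sketch assumes a hypothetical `word_TAG` encoding in which every token carries exactly one tag after the last underscore. Real corpora use many different encodings.

```python
# Hypothetical inline format: each token is written as word_TAG.
annotated = "The_DT snake_N killed_V the_DT rat_N"


def split_annotation(annotated_text, sep="_"):
    """Separate an inline-annotated text into (raw text, list of tags).

    Assumes every whitespace-separated item contains `sep` at least once;
    the part after the last `sep` is taken to be the tag.
    """
    pairs = [item.rsplit(sep, 1) for item in annotated_text.split()]
    words = [word for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    return " ".join(words), tags


raw_text, tags = split_annotation(annotated)
print(raw_text)  # The snake killed the rat
print(tags)      # ['DT', 'N', 'V', 'DT', 'N']
```

Because the separation is lossless here, both the raw corpus and the bare tag sequence are available from the same annotated file.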
Types of annotations across linguistic levels
Phonological level
• Syllable boundaries (phonetic/phonemic annotation)
• Prosodic or suprasegmental features (prosodic annotation, e.g. pitch,
loudness, intonation)
Morphological level
• Prefixes, suffixes, stems (morphological annotation)
Lexical level
• Tokenization (essential for languages written without spaces between words, such as Chinese)
• Parts of speech (POS tagging), e.g. present: NN1, VVB, JJ
• Lemmas (lemmatization), e.g. stop, stopped, stops, stopping → stop
• Semantic fields (semantic annotation), e.g. cricket: sport, insect
Basic annotations on word level
• Plain text (also raw text): only sequences of characters without explicit
information about words or sentences
• Tokenization: segmentation of raw text into words/tokens and sentences
• sequences of characters are divided into words/tokens
• sequences of tokens are divided into sentences
• Stemming and lemmatization
• stemming: cutting off suffixes (no lexicon involved)
• lemmatization: base form taken from a lexicon
• POS-tagging
• labeling each word in a sequence of words with the appropriate part of speech (POS)
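The difference between tokenization, stemming and lemmatization can be made concrete with a toy implementation. Everything below is a minimal sketch under stated assumptions: the regular expression, the suffix list and the tiny lexicon are invented for illustration and are far simpler than what real tools use.

```python
import re

# Assumed suffix list for the toy stemmer (no lexicon involved).
SUFFIXES = ("ing", "ed", "s")

# Assumed mini-lexicon mapping word forms to their base forms.
LEXICON = {
    "stops": "stop", "stopped": "stop", "stopping": "stop",
    "gives": "give", "gave": "give", "given": "give", "giving": "give",
}


def tokenize(text):
    """Split raw text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(tokens):
    """Group tokens into sentences, ending a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


def stem(word):
    """Cut off the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the base form up in the lexicon; fall back to the word itself."""
    word = word.lower()
    return LEXICON.get(word, word)


tokens = tokenize("He stopped. She gave up!")
print(tokens)                   # ['He', 'stopped', '.', 'She', 'gave', 'up', '!']
print(split_sentences(tokens))  # two sentences
print(stem("stopped"), lemmatize("stopped"))  # stopp stop
print(stem("gave"), lemmatize("gave"))        # gave give
```

The stemmer produces "stopp" and leaves "gave" untouched because it only cuts suffixes, whereas the lexicon lookup returns the actual base forms. The regex tokenizer relies on spaces and punctuation, so it would not segment unspaced scripts such as Chinese.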
Annotation strategies
- manual annotation
• problem: consistency (annotation guidelines)
• inter-annotator agreement
• time-consuming and costly
- automatic annotation
• consistent
• fast and inexpensive
• false annotations / ambiguous annotations
- (semi-)automatic annotation
• automatic preprocessing
• manual correction
• manual disambiguation
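Inter-annotator agreement is usually reported as a number. The notes do not name a measure. One common choice is Cohen's kappa, which corrects the observed agreement between two annotators for the agreement expected by chance. The sketch below assumes two annotators who labelled the same items. The example labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical decisions on four occurrences of "walk" (noun or verb).
annotator_a = ["N", "V", "N", "V"]
annotator_b = ["N", "V", "V", "V"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.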
Syntactic annotation
Tree diagram
Labelled bracketing
Sentence level:
[s1 The snake killed the rat and swallowed it]
Clause level:
[s1 [c1 The snake killed the rat] and [c2 swallowed it]]
Phrase level:
[s1 [c1 [NP The snake] [VP killed [NP the rat]]] and [c2 [VP swallowed [NP it]]]]
Word level:
[s1 [c1 [NP [DT The] [N snake]] [VP [V killed] [NP [DT the] [N rat]]]] [Conj and] [c2 [VP [V swallowed] [NP [PP it]]]]]
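Labelled bracketing encodes the same information as a tree diagram, so it can be read back into a nested structure. The following is a minimal sketch of such a reader. It assumes that each label is separated from what follows by whitespace, and it represents a node as a `(label, children)` tuple, which is a choice made here for illustration.

```python
import re


def parse_brackets(text):
    """Parse '[LABEL child ...]' into nested (label, children) tuples.

    Children are either further tuples (sub-constituents) or plain strings
    (words). Assumes labels and words are separated by whitespace.
    """
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in "[]":
            raise ValueError("expected a label after '['")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume the closing ']'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the sentence bracket")
    return tree


def leaves(node):
    """Return the words of a parsed tree in their original order."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


phrase_level = ("[S1 [C1 [NP The snake] [VP killed [NP the rat]]] "
                "and [C2 [VP swallowed [NP it]]]]")
tree = parse_brackets(phrase_level)
print(tree[0])                 # S1
print(" ".join(leaves(tree)))  # The snake killed the rat and swallowed it
```

Collecting the leaves recovers the raw sentence, which also shows that this kind of annotation can be removed again without loss.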
Semantic annotation
Synonymy -> similar contexts
Homonymy, polysemy -> different contexts
semantic fields: sense relations (word senses) and some other kinds
of relations (e.g., part-of, related-to etc.)
• annotation (cf. PoS tagging):
• definition of tagging scheme (labels and their meanings)
• tagging scheme: guidelines for application
• in semantics: this is not as easy and straightforward as for PoS
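The idea that different senses of a word occur in different contexts can be turned into a very rough tagging procedure: define the labels of the tagging scheme, then pick the label whose typical context words overlap most with the actual context. The sketch below applies this to cricket (sport vs. insect). The cue word lists are invented for this example, and the approach is deliberately naive, which illustrates why semantic tagging is less straightforward than PoS tagging.

```python
# Hypothetical tagging scheme: sense labels and made-up cue words for each.
SENSE_CUES = {
    "sport": {"bat", "ball", "match", "team", "wicket"},
    "insect": {"chirp", "chirping", "grass", "legs", "night"},
}


def tag_sense(context_tokens, sense_cues):
    """Return the sense whose cue words overlap most with the context.

    Returns None when no cue word occurs or when two senses tie, i.e. when
    this simple scheme cannot decide.
    """
    context = {token.lower() for token in context_tokens}
    scores = {sense: len(cues & context) for sense, cues in sense_cues.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


print(tag_sense("the team won the cricket match".split(), SENSE_CUES))       # sport
print(tag_sense("a cricket was chirping in the grass".split(), SENSE_CUES))  # insect
print(tag_sense("I like cricket".split(), SENSE_CUES))                       # None
```

The undecidable third case is where guidelines for applying the tagging scheme, or manual disambiguation, would be needed.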
Discourse annotation
• coherence: what makes a text hang together in terms of content
• cohesion: the means of making a text hang together:
reference, lexical cohesion, substitution/ellipsis, conjunctive relations (cause, result, effect etc.), thematic development
Anaphoric relations
Links between a proform and an antecedent
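An anaphoric link can be stored separately from the text as a pair of token spans, one for the proform and one for its antecedent. The sketch below is one possible representation: the class names `Span` and `AnaphoricLink` are hypothetical, and reading "it" as referring to "the rat" is an interpretation made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span


@dataclass(frozen=True)
class AnaphoricLink:
    proform: Span     # the referring expression, e.g. a pronoun
    antecedent: Span  # the expression it refers back to


def span_text(tokens, span):
    """Return the words covered by a span as a single string."""
    return " ".join(tokens[span.start:span.end])


tokens = "The snake killed the rat and swallowed it".split()
# Assumed annotation: "it" (token 7) refers back to "the rat" (tokens 3-4).
link = AnaphoricLink(proform=Span(7, 8), antecedent=Span(3, 5))

print(span_text(tokens, link.proform))     # it
print(span_text(tokens, link.antecedent))  # the rat
```

Keeping the links apart from the token sequence leaves the raw text untouched, so the annotation can be removed or extracted independently.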