Corpus Annotation Flashcards

1
Q

Why annotate?

A
  1. abstract from individual cases
    • e.g. research on a specific word is facilitated by lemma annotations, which abstract
    from the different word forms: the lemma give stands for give, gives, gave, given, giving
    • e.g. research on word order is facilitated by part-of-speech annotations:
    a man who … VERB vs. a man VERB who …, as in "a man entered who I had not seen before"
  2. disambiguate ambiguous forms, e.g. walk as a verb or noun
  3. facilitate quantitative investigations
    • e.g. how long (#characters) is an average verb in English?
  Note: annotations are interpretations
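
The lemma abstraction in point 1 and the quantitative question in point 3 can be sketched in a few lines of Python. This is only an illustration: it assumes the corpus is already available as (word form, lemma, POS) triples, and the toy token list and the tag names VERB and NOUN are made up rather than taken from a particular corpus or tagset.

```python
from collections import defaultdict

# Hypothetical annotated tokens: (word form, lemma, part of speech).
# This is a toy list, not a real sentence.
tokens = [
    ("gave", "give", "VERB"),
    ("gives", "give", "VERB"),
    ("walk", "walk", "VERB"),
    ("walk", "walk", "NOUN"),   # same form, disambiguated by its tag
    ("given", "give", "VERB"),
]

def average_length(tokens, pos):
    """Mean number of characters of all word forms tagged with `pos`."""
    lengths = [len(word) for word, _, tag in tokens if tag == pos]
    return sum(lengths) / len(lengths) if lengths else 0.0

def forms_by_lemma(tokens):
    """Group word forms under their lemma (abstraction from individual cases)."""
    groups = defaultdict(set)
    for word, lemma, _ in tokens:
        groups[lemma].add(word)
    return groups

print(average_length(tokens, "VERB"))          # 4.5 for this toy list
print(sorted(forms_by_lemma(tokens)["give"]))  # ['gave', 'given', 'gives']
```

Without the POS annotation, the noun walk would be counted among the verbs, which is why disambiguation (point 2) matters for counts like this.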
2
Q

Leech’s 7 maxims of annotation

A
  1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
  2. It should be possible to extract the annotations by themselves from the text.
  3. The annotation scheme should be based on guidelines which are available to the end user.
  4. It should be made clear how and by whom the annotation was carried out.
  5. The end user should be made aware that the corpus annotation is not error-free or infallible, but simply a potentially useful tool.
  6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
  7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
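
Maxims 1 and 2 describe a separability property that is easy to check in code. The sketch below assumes a hypothetical inline format in which every token is written as word_TAG; the format, the sample sentence, and the tags DT, N and V are illustrative choices, not part of the maxims.

```python
# Hypothetical inline format: each token is written as word_TAG.
annotated = "The_DT snake_N killed_V the_DT rat_N"

def split_token(token):
    """Split 'word_TAG' at the last underscore into (word, tag)."""
    word, _, tag = token.rpartition("_")
    return word, tag

pairs = [split_token(t) for t in annotated.split()]

# Maxim 1: removing the annotation gives back the raw text.
raw_text = " ".join(word for word, _ in pairs)

# Maxim 2: the annotations can be extracted on their own.
tags_only = [tag for _, tag in pairs]

print(raw_text)   # The snake killed the rat
print(tags_only)  # ['DT', 'N', 'V', 'DT', 'N']
```

Splitting at the last underscore keeps words that themselves contain an underscore intact, as long as the tags do not contain one.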
3
Q

Types of annotations across linguistic levels

A

Phonological level
• Syllable boundaries (phonetic/phonemic annotation)
• Prosodic or suprasegmental features (prosodic annotation, e.g. pitch, loudness, intonation)
Morphological level
• Prefixes, suffixes, stems (morphological annotation)
Lexical level
• Tokenization (essential for Chinese)
• Parts of speech (POS tagging) e.g. present: NN1, VVB, JJ
• Lemmas (lemmatization), e.g. stop, stopped, stops, stopping → stop
• Semantic fields (semantic annotation), e.g. cricket: sport, insect

4
Q

Basic annotations on word level

A

• Plain text (also raw text): only sequences of characters without explicit
information about words or sentences
• Tokenization: segmentation of raw text into words/tokens and sentences
  • sequences of characters are divided into words/tokens
  • sequences of tokens are divided into sentences
• Stemming and lemmatization
  • stemming: cutting off suffixes (no lexicon involved)
  • lemmatization: base form taken from a lexicon
• POS tagging
  • labeling each word in a sequence of words with the appropriate part of speech (POS)
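
The difference between tokenization, stemming and lemmatization can be made concrete with a small Python sketch. Everything here is deliberately naive and hypothetical: the regular expression, the three-entry suffix list and the tiny lexicon are illustrative assumptions, not how a production tokenizer, stemmer or lemmatizer works.

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens (naive regex)."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(tokens):
    """Group tokens into sentences, closing one at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

SUFFIXES = ("ing", "ed", "s")  # hypothetical, far from complete

def stem(word):
    """Cut off the first matching suffix; no lexicon is consulted."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon mapping word forms to base forms.
LEXICON = {"stopped": "stop", "stops": "stop", "stopping": "stop", "gave": "give"}

def lemmatize(word):
    """Look the base form up in the lexicon; fall back to the lowercased word."""
    return LEXICON.get(word.lower(), word.lower())

toks = tokenize("The bus stopped. It stops here!")
print(toks)
print(len(split_sentences(toks)))            # 2
print(stem("stopped"), lemmatize("stopped")) # stopp stop
print(stem("gave"), lemmatize("gave"))       # gave give
```

The stemmer turns stopped into stopp and leaves gave untouched, while the lexicon lookup returns stop and give: suffix stripping needs no lexicon but cannot handle doubled consonants or irregular forms.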

5
Q

Annotation strategies

A
  • manual annotation
    • problem: consistency (annotation guidelines)
    • inter-annotator agreement
    • time-consuming and costly
  • automatic annotation
    • consistent
    • fast and inexpensive
    • false annotations / ambiguous annotations
  • (semi-)automatic
    • automatic preprocessing
    • manual correction
    • manual disambiguation
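
Inter-annotator agreement is usually reported as a number. The card does not name a measure, so the sketch below uses Cohen's kappa as one common choice for two annotators: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The label sequences are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one and the same label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical POS decisions by two annotators for four tokens.
annotator_a = ["N", "V", "N", "V"]
annotator_b = ["N", "V", "V", "V"]
print(cohen_kappa(annotator_a, annotator_b))  # 0.5
```

Here the annotators agree on 3 of 4 items (p_o = 0.75), but half of that agreement would be expected by chance (p_e = 0.5), so kappa is 0.5.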
6
Q

Syntactic annotation

A

Tree diagram
Labelled bracketing

Sentence level:
[s1 The snake killed the rat and swallowed it]
Clause level:
[s1 [c1 The snake killed the rat] and [c2 swallowed it]]
Phrase level:
[s1 [c1 [NP The snake] [VP killed [NP the rat]]] and [c2 [VP swallowed [NP it]]]]
Word level:
[s1 [c1 [NP [DT The] [N snake]] [VP [V killed] [NP [DT the] [N rat]]]] [Conj and] [c2 [VP [V swallowed] [NP [PP it]]]]]
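
Labelled bracketing and a tree diagram carry the same information, so one can be converted into the other. The following sketch reads a bracketed string into nested (label, children) tuples. It assumes that each label is separated from its material by a space and that the brackets are well formed; the function names are hypothetical.

```python
import re

def parse_brackets(text):
    """Parse '[LABEL child child ...]' into nested (label, children) tuples.

    Assumes well-formed input with a space between a label and its content.
    """
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])   # plain word
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return parse_node()

def leaves(node):
    """Collect the words of a tree from left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

tree = parse_brackets(
    "[s1 [c1 [NP The snake] [VP killed [NP the rat]]] and [c2 [VP swallowed [NP it]]]]"
)
print(tree[0])                # s1
print(" ".join(leaves(tree))) # The snake killed the rat and swallowed it
```

Reading off the leaves recovers the unannotated sentence, which is the separability that annotation formats are generally expected to offer.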

7
Q

Semantic annotation

A

Synonymy -> similar context
Homonymy, polysemy -> different context

• semantic fields: sense relations (word senses) and some other kinds
of relations (e.g. part-of, related-to)
• annotation (cf. PoS tagging):
  • definition of tagging scheme (labels and their meanings)
  • tagging scheme: guidelines for application
  • in semantics: this is not as easy and straightforward as for PoS

8
Q

Discourse annotation

A

• coherence: what makes a text hang together in terms of content
• cohesion: the means of making a text hang together
  • reference
  • lexical cohesion
  • substitution/ellipsis
  • conjunctive relations (cause, result, effect etc.)
  • thematic development

9
Q

Anaphoric relations

A

Links between a proform and an antecedent
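
One possible way to store such a link is as a stand-off annotation that points at token positions instead of changing the text. The sketch below is an assumption-laden illustration: the class name AnaphoricLink, the (start, end) span convention, and the reading that "it" refers back to "the rat" are all choices made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnaphoricLink:
    """Stand-off link between two token spans (end index is exclusive)."""
    proform: tuple
    antecedent: tuple

tokens = "The snake killed the rat and swallowed it".split()

# Assumed reading: "it" (token 7) refers back to "the rat" (tokens 3-4).
link = AnaphoricLink(proform=(7, 8), antecedent=(3, 5))

def span_text(tokens, span):
    """Return the words covered by a (start, end) span."""
    start, end = span
    return " ".join(tokens[start:end])

print(span_text(tokens, link.proform), "->", span_text(tokens, link.antecedent))
# it -> the rat
```

Because the link lives outside the token list, the raw text stays untouched and several links can refer to the same antecedent.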
