Week 5 Flashcards
what is a corpus?
A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
we can classify corpora by (6)
- mode (written/spoken)
- representativeness (balanced/specialized)
- time (diachronic/synchronic)
- language (monolingual/..)
- sampling (full docs/samples (curated))
- mark-up (raw/POS-tagged)
problem with tokenization
& solved by:
there is ambiguity (the same token can have more than one reading); solved by POS-tagging
what is POS tagging?
identification of word class
word class determined by:
- semantics
- context of use (nouns after determiners)
- possible affixations
(thus context needed for tagging)
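A toy sketch of why context is needed for tagging. The lexicon, the tag names, and the fallback rule are all made up for illustration; this is not how a real tagger works, only the "nouns after determiners" cue from the card above turned into code.

```python
# Toy lexicon (hypothetical): each word lists its possible word classes.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "book": ["NOUN", "VERB"],  # ambiguous without context
    "flight": ["NOUN"],
}

def tag(tokens):
    """Tag tokens left to right, using the previous tag to resolve ambiguity."""
    tagged = []
    prev_tag = None
    for tok in tokens:
        options = LEXICON.get(tok.lower(), ["UNK"])
        if len(options) == 1:
            choice = options[0]            # unambiguous word
        elif prev_tag == "DET" and "NOUN" in options:
            choice = "NOUN"                # context cue: nouns follow determiners
        elif "VERB" in options:
            choice = "VERB"                # arbitrary fallback for this toy example
        else:
            choice = options[0]
        tagged.append((tok, choice))
        prev_tag = choice
    return tagged

print(tag("book a flight".split()))
print(tag("the book".split()))
```

The same word "book" gets a different tag in the two inputs, purely because of what precedes it.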
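Zipf's law is f(z) = |C| / z^a
what are these things?
f(z) is the frequency of the word (token) with rank z, |C| is a constant for the corpus (it equals the frequency of the rank-1 word, since f(1) = |C|), and a is an exponent. with a = 1, the word of rank 2 should occur 1/2 times as often as the word of rank 1.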
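A minimal sketch of the formula above, assuming a = 1 and a made-up constant of 1000 (so the rank-1 word is assumed to occur 1000 times). The function name and the numbers are illustrative only.

```python
def zipf_frequency(rank, c, a=1.0):
    """Predicted frequency of the word at `rank` under f(z) = c / z**a."""
    return c / rank ** a

# Hypothetical values: constant c = 1000, exponent a = 1.
c = 1000
predicted = [zipf_frequency(z, c) for z in range(1, 6)]

for z, f in enumerate(predicted, start=1):
    print(f"rank {z}: predicted frequency {f:.1f}")
```

With these assumptions the rank-2 word comes out at half the frequency of the rank-1 word, and the rank-4 word at a quarter.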
what are tokens?
the statistical units of study (usually words); every occurrence counts
what are types?
the unique words (each distinct word counts once)
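A small sketch of the token/type difference on a made-up sentence. Splitting on whitespace is a simplifying assumption here, not a proper tokenizer.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"

tokens = text.lower().split()  # naive whitespace tokenization (assumption)
types = set(tokens)            # each distinct word once
counts = Counter(tokens)       # frequency per type

print(len(tokens))             # number of tokens
print(len(types))              # number of types
print(counts.most_common(2))   # the two most frequent types
```

Here "the" is one type but contributes three tokens.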
give 2 purposes of corpora
- applicative: development of NLP tools
- analytical: empirical basis on the distribution of constructions and language phenomena
a balanced corpus needs to be (5)
- big
- mixed (spoken and written language)
- general (includes many different genres)
- well documented
- wide-ranging in text categories (long and short texts)
what is the link level in parallel corpora?
the level at which the languages are linked: sentence level or word level.
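A rough sketch of the two link levels, using one made-up English/Dutch sentence pair. The variable names and the index-pair representation are assumptions for illustration, not a standard corpus format.

```python
# Sentence-level link: each entry pairs a source sentence with its translation.
sentence_aligned = [
    ("the cat sleeps", "de kat slaapt"),
]

# Word-level link: pairs of (source token position, target token position).
word_links = [(0, 0), (1, 1), (2, 2)]

src, tgt = sentence_aligned[0]
src_tokens, tgt_tokens = src.split(), tgt.split()

# Resolve the index pairs into the actual linked words.
word_aligned = [(src_tokens[i], tgt_tokens[j]) for i, j in word_links]
print(word_aligned)
```

A sentence-level corpus only stores the first structure; a word-level corpus also stores links like the second.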