Week 5 Flashcards

Question 1

Q

What is a corpus?

Answer

A

A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Question 2

Q

We can classify corpora by (6):

Answer

A

mode (written/spoken)
representativeness (balanced/specialized)
time (diachronic/synchronic)
language (monolingual/..)
sampling (full documents/curated samples)
mark-up (raw/POS-tagged)

Question 3

Q

What is the problem with tokenization, and what solves it?

Answer

A

There is ambiguity; it is resolved by POS tagging.

Question 4

Q

what is POS tagging?

Answer

A

Identification of the word class (part of speech) of each word.

Question 5

Q

word class determined by:

Answer

A

semantics
context of use (nouns after determiners)
possible affixations

(thus context needed for tagging)
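
The role of context can be illustrated with a toy sketch. It is not a real tagger: the word lists, the tag names, and the function `tag_ambiguous` are all hypothetical, and the only rule applied is the one from the card (a word directly after a determiner is tagged as a noun).

```python
# Hypothetical word lists, chosen only for this illustration.
AMBIGUOUS = {"book", "run", "walk"}  # forms that can be a noun or a verb
DETERMINERS = {"the", "a", "an", "this", "that"}


def tag_ambiguous(tokens: list[str]) -> list[tuple[str, str]]:
    """Tag ambiguous words as NOUN after a determiner, otherwise as VERB."""
    tagged = []
    for i, word in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        if word in DETERMINERS:
            tag = "DET"
        elif word in AMBIGUOUS:
            # Context decides: the same form gets a different tag.
            tag = "NOUN" if prev in DETERMINERS else "VERB"
        else:
            tag = "OTHER"
        tagged.append((word, tag))
    return tagged


result = tag_ambiguous("i book the book".split())
print(result)
```

The word "book" appears twice in the example sentence and receives a different tag each time, which is only possible because the preceding word is inspected.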

Question 6

Q

Zipf's law is F(z) = |C| / z^a

what are these things?

Answer

A

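F(z) is the frequency of the word (token) with rank z, where z is the word's position when all words are ordered from most to least frequent. |C| is a constant; by the formula it equals the predicted frequency of the rank-1 word. a is an exponent, close to 1 for natural language. So frequency is inversely proportional to rank: with a = 1, the word of rank 2 should occur about 1/2 as often as the word of rank 1.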
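
As a rough numerical sketch of the formula, the snippet below computes the predicted frequency for a few ranks. The function name `zipf_frequency` and the constant 1000 are assumptions made for illustration; they do not come from the course material.

```python
def zipf_frequency(rank: int, c: float, a: float = 1.0) -> float:
    """Predicted frequency F(z) = c / z**a for the word of the given rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return c / rank ** a


# Hypothetical constant: assume the top-ranked word occurs 1000 times.
predicted = [zipf_frequency(z, c=1000.0) for z in (1, 2, 4)]
print(predicted)  # [1000.0, 500.0, 250.0]
```

With a = 1, doubling the rank halves the predicted frequency.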

Question 7

Q

what are tokens

Answer

A

statistical units of study (usually words)

Question 8

Q

what are types

Answer

A

the unique words (each distinct word form is counted once)
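
The difference between tokens and types can be sketched in a few lines. The example sentence is made up, and splitting on whitespace is a simplifying assumption; real tokenizers also deal with punctuation and similar cases.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"

# Naive tokenization (an assumption): lowercase, then split on whitespace.
tokens = text.lower().split()
type_counts = Counter(tokens)

num_tokens = len(tokens)       # every occurrence counts
num_types = len(type_counts)   # each distinct word counts once

print(num_tokens, num_types)   # 11 8
```

Here "the" contributes three tokens but only one type.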

Question 9

Q

give 2 purposes of corpora

Answer

A

applicative: development of NLP tools
analytical: empirical basis for studying the distribution of constructions and language phenomena

Question 10

Q

a balanced corpus needs to be (5)

Answer

A

big
mixed (spoken and written words)
general (includes many different genres)
well documented
wide-ranging in text categories (long and short texts)

Question 11

Q

what is the link level in parallel corpora?

Answer

A

The level at which the texts in the different languages are linked to each other: sentence level or word level.
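
The two link levels can be pictured as two data structures. This is a minimal sketch: the English-Dutch sentence pair and the variable names are hypothetical, and real parallel corpora use their own alignment formats.

```python
# Hypothetical English-Dutch sentence pair, used only for illustration.
source = ["the", "cat", "sleeps"]
target = ["de", "kat", "slaapt"]

# Sentence-level link: the two sentences are paired as wholes.
sentence_link = (" ".join(source), " ".join(target))

# Word-level link: pairs of (source index, target index).
word_links = [(0, 0), (1, 1), (2, 2)]
aligned_words = [(source[i], target[j]) for i, j in word_links]

print(sentence_link)
print(aligned_words)
```

A sentence-level link only says which sentences correspond; a word-level link additionally records which words inside those sentences correspond.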