Week 5 Flashcards

1
Q

What is a corpus?

A

A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

2
Q

Corpora can be classified by (6):

A
  • mode (written/spoken)
  • representativeness (balanced/specialized)
  • time (diachronic/synchronic)
  • language (monolingual/..)
  • sampling (full documents/curated samples)
  • mark-up (raw/POS-tagged)
3
Q

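What is the problem with tokenization, and how is it solved?

A

Tokens can be ambiguous (the same word form can belong to more than one word class). POS tagging resolves this ambiguity.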

4
Q

What is POS tagging?

A

Identification of the word class (part of speech) of each token.

5
Q

Word class is determined by:

A
  • semantics
  • context of use (e.g. nouns occur after determiners)
  • possible affixations

(Tagging therefore requires context.)
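
To illustrate why context is needed, here is a toy sketch (not a real tagger). The word lists and the single rule "after a determiner, tag as noun, otherwise as verb" are assumptions made only for this example.

```python
# Hypothetical mini-lexicons, invented for this illustration only.
DETERMINERS = {"the", "a", "an"}
AMBIGUOUS = {"book", "run", "walk"}  # forms that can be a noun or a verb

def tag_ambiguous(tokens):
    """Tag tokens with a single context rule: noun after a determiner, else verb."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in DETERMINERS:
            tags.append("DET")
        elif word in AMBIGUOUS:
            previous = tokens[i - 1].lower() if i > 0 else None
            tags.append("NOUN" if previous in DETERMINERS else "VERB")
        else:
            tags.append("?")  # outside the toy lexicon, left untagged
    return tags

print(tag_ambiguous("I read the book".split()))
print(tag_ambiguous("They book a flight".split()))
```

The same form "book" gets a different tag in each sentence, which only the neighbouring words can decide.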

6
Q

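Zipf's law is f(z) = |C| / z^a

What do these symbols mean?

A

f(z) is the frequency of the word (token) of rank z, |C| is a constant (the predicted frequency of the rank-1 word, since f(1) = |C|), and a is an exponent, close to 1. Frequency is thus inversely proportional to rank: with a = 1, the word of rank 2 should occur 1/2 as often as the word of rank 1.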
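
A minimal sketch of the formula, with assumed values: the constant |C| is set to 1000 (the frequency of the rank-1 word) and a = 1. The function name `zipf_frequency` is made up for this example.

```python
def zipf_frequency(rank, c, a=1.0):
    """Predicted frequency f(z) = |C| / z^a for the word of rank z."""
    return c / rank ** a

# Assumed values for illustration: |C| = 1000, a = 1.
top_frequency = 1000
predicted = [zipf_frequency(z, top_frequency) for z in range(1, 5)]

for rank, frequency in enumerate(predicted, start=1):
    print(rank, round(frequency, 1))
```

With these values the rank-2 word is predicted to occur 500 times, half as often as the rank-1 word.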

7
Q

What are tokens?

A

The statistical units of study (usually words); every occurrence counts.

8
Q

What are types?

A

The unique words: each distinct word counts once, however often it occurs.
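
The difference between tokens and types can be sketched as follows. The example sentence is made up, and splitting on whitespace is an assumed, deliberately naive tokenization.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"

tokens = text.split()     # every occurrence is a token
types = set(tokens)       # each distinct word is one type
counts = Counter(tokens)  # frequency of each type

print(len(tokens), "tokens")
print(len(types), "types")
print(counts.most_common(2))
```

Here "the" contributes three tokens but only one type.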

9
Q

Give 2 purposes of corpora.

A
  • applicative: development of NLP tools
  • analytical: an empirical basis for studying the distribution of constructions and language phenomena

10
Q

A balanced corpus needs to be (5):

A
  • big
  • mixed (spoken and written language)
  • general (covering many different genres)
  • well documented
  • wide-ranging in text categories (long and short texts)
11
Q

What is the link level in parallel corpora?

A

The level at which the languages are linked: sentence level or word level.
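
One possible way to picture the two link levels, as a sketch only: the English/Dutch sentence pair and the index-pair representation are assumptions for illustration, not a standard corpus format.

```python
# Sentence-level link: whole sentences are paired with each other.
sentence_links = [("the cat sleeps", "de kat slaapt")]

# Word-level link: token positions are paired inside an aligned sentence pair.
source = "the cat sleeps".split()
target = "de kat slaapt".split()
word_links = [(0, 0), (1, 1), (2, 2)]  # (source index, target index)

aligned_words = [(source[i], target[j]) for i, j in word_links]
print(aligned_words)
```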
