Word Level Analysis 3 Flashcards

1
Q

Definition of “Text Corpus”

A

A large body of text

2
Q

What type of corpus is Brown Corpus?

A

Categorised

one category per document, categories do not overlap

3
Q

What type of corpus is Reuters Corpus?

A

Overlapping

multiple categories per document, categories overlap
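
The difference between the Brown and Reuters layouts can be checked programmatically. The sketch below assumes `corpus` is an NLTK categorised corpus reader (for example `nltk.corpus.brown` or `nltk.corpus.reuters`, with the data already downloaded); the helper name `has_overlapping_categories` is hypothetical and not part of NLTK.

```python
def has_overlapping_categories(corpus):
    """Return True if any file in the corpus belongs to more than one category.

    `corpus` is assumed to expose fileids() and categories(fileids=...),
    as NLTK's categorised corpus readers do.
    """
    return any(
        len(corpus.categories(fileids=[fileid])) > 1
        for fileid in corpus.fileids()
    )


# Usage (assumes NLTK is installed and both corpora have been downloaded):
# from nltk.corpus import brown, reuters
# has_overlapping_categories(brown)    # expected False: one category per file
# has_overlapping_categories(reuters)  # expected True: files can carry several
```

Going by the card definitions, a categorised corpus should yield False here and an overlapping one True.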

4
Q

What type of corpora are gutenberg, webtext and udhr?

A

Isolated

5
Q

What type of corpus is inaugural?

A

Temporal
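
What makes the inaugural corpus temporal is that each file corresponds to a point in time. In NLTK the file identifiers begin with the year of the address (for example `1789-Washington.txt`), so the time dimension can be recovered from the name. A minimal sketch, where `years_from_fileids` is a hypothetical helper and the sample list stands in for `nltk.corpus.inaugural.fileids()`:

```python
def years_from_fileids(fileids):
    """Extract the leading four-digit year from names like '1789-Washington.txt'."""
    return [int(fileid[:4]) for fileid in fileids]


# Sample identifiers following the naming pattern of the inaugural corpus.
sample = ["1789-Washington.txt", "1861-Lincoln.txt", "2009-Obama.txt"]
print(years_from_fileids(sample))  # [1789, 1861, 2009]
```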

6
Q

Name three text corpora available in NLTK.

A

gutenberg, brown and inaugural

7
Q

What does fileids() get you?

A

the file identifiers (file names) of the corpus

8
Q

What does fileids([categories]) get you?

A

the files of the corpus corresponding to these categories

9
Q

What does categories() get you?

A

the categories of the corpus

10
Q

What does categories([fileids]) get you?

A

the categories of the corpus corresponding to these files
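
The four calls from the cards above can be tried side by side. This is a sketch, assuming `corpus` is an NLTK categorised corpus reader such as `nltk.corpus.brown`; the function name `category_overview` and its return keys are made up for illustration.

```python
def category_overview(corpus, category, fileid):
    """Collect the results of the fileids()/categories() lookups in one dict.

    `corpus` is assumed to behave like an NLTK categorised corpus reader.
    """
    return {
        # fileids(): every file in the corpus
        "all_files": corpus.fileids(),
        # fileids(categories=...): only the files in the given categories
        "files_in_category": corpus.fileids(categories=[category]),
        # categories(): every category in the corpus
        "all_categories": corpus.categories(),
        # categories(fileids=...): the categories the given files belong to
        "categories_of_file": corpus.categories(fileids=[fileid]),
    }


# Usage (assumes NLTK is installed and nltk.download("brown") has been run):
# from nltk.corpus import brown
# info = category_overview(brown, "news", brown.fileids()[0])
```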

11
Q

What does raw(fileids=[f1, f2, f3]) get you?

A

the raw content of the specified files

12
Q

What does raw(categories=[c1, c2]) get you?

A

the raw content of the specified categories

13
Q

What does raw() get you?

A

the raw content of the corpus

14
Q

What does words() get you?

A

the words of the whole corpus

15
Q

What does words(fileids=[f1, f2, f3]) get you?

A

the words of the specified files

16
Q

What does words(categories=[c1, c2]) get you?

A

the words of the specified categories

17
Q

What does sents() get you?

A

the sentences of the whole corpus

18
Q

What does sents(fileids=[f1, f2, f3]) get you?

A

the sentences of the specified files

19
Q

What does sents(categories=[c1, c2]) get you?

A

the sentences of the specified categories
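
The raw(), words() and sents() accessors can be compared by measuring what each returns for the same category. A minimal sketch, assuming `corpus` is an NLTK categorised corpus reader such as `nltk.corpus.brown`; `corpus_sizes` is a hypothetical helper name.

```python
def corpus_sizes(corpus, category):
    """Measure one category at three granularities.

    `corpus` is assumed to accept the `categories` keyword on raw(),
    words() and sents(), as NLTK's categorised corpus readers do.
    """
    raw_text = corpus.raw(categories=[category])      # one long string
    words = corpus.words(categories=[category])       # sequence of tokens
    sentences = corpus.sents(categories=[category])   # sequence of token lists
    return {
        "characters": len(raw_text),
        "words": len(words),
        "sentences": len(sentences),
    }


# Usage (assumes NLTK is installed and nltk.download("brown") has been run):
# from nltk.corpus import brown
# print(corpus_sizes(brown, "news"))
```

Passing `fileids=[...]` instead of `categories=[...]` restricts the same calls to specific files, and calling them with no arguments covers the whole corpus.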

20
Q

Simplest way to analyse a corpus?

A

Count how many words it contains

21
Q

What is a token?

A

A sequence of characters that is treated as a single group

i.e., words and punctuation

22
Q

What is a type?

A

A “type” is the form or spelling of a token (a word or punctuation mark), independent of its specific occurrences in a text. For example, “the cat saw the dog” has five tokens but only four types.

23
Q

What is the vocabulary of a piece of text?

A

the set of distinct tokens it uses, a.k.a. its types
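
The token/type/vocabulary distinction is easy to see on a small string. The sketch below uses a simple regular-expression tokenizer as a stand-in for a real one (an assumption; NLTK's tokenizers behave differently on edge cases), and `tokenize` is a hypothetical helper.

```python
import re


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "The cat sat. The dog sat."
tokens = tokenize(text)     # every occurrence counts as a token
vocabulary = set(tokens)    # each distinct form (type) counts once

print(len(tokens))      # 8
print(len(vocabulary))  # 5
```

Note that the comparison is case-sensitive here, so "The" and "the" would be two different types unless the text is lowercased first.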

24
Q

What is the Type-Token Ratio (TTR) often used to measure?

A

Lexical Diversity

or: Describes the range of a speaker’s vocabulary
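
The TTR is the number of types divided by the number of tokens. A minimal sketch, assuming the text has already been tokenized into a list; `type_token_ratio` is a hypothetical helper name, and returning 0.0 for empty input is a choice made here, not a standard convention.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0  # assumption: treat an empty text as having no diversity
    return len(set(tokens)) / len(tokens)


tokens = "the cat sat on the mat".split()
print(round(type_token_ratio(tokens), 3))  # 0.833 (5 types / 6 tokens)
```

Because longer texts repeat common words more often, TTR values are only comparable between texts of similar length.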

25
Q

What tool does NLTK provide to count the frequency of types?

A

It provides FreqDist (frequency distribution).
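
A short sketch of counting types with FreqDist. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the code falls back to `Counter` when NLTK is not installed; for the calls shown the two behave the same. The wrapper name `type_frequencies` is hypothetical.

```python
from collections import Counter


def type_frequencies(tokens):
    """Count how often each type occurs in a list of tokens."""
    try:
        from nltk import FreqDist  # needs NLTK installed, but no corpus data
    except ImportError:
        FreqDist = Counter  # fallback: FreqDist subclasses Counter
    return FreqDist(tokens)


tokens = ["the", "cat", "the", "dog", "the", "cat"]
fd = type_frequencies(tokens)
print(fd["the"])          # 3
print(fd.most_common(2))  # [('the', 3), ('cat', 2)]
```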

26
Q

What is a Hapax?

A

A word that occurs only once in a text/corpus.
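
Hapaxes fall out of a frequency count directly: they are the types whose count is 1. NLTK's `FreqDist` offers a `hapaxes()` method for this; the sketch below does the same with the standard library so it runs without NLTK, using a hypothetical helper named `find_hapaxes`.

```python
from collections import Counter


def find_hapaxes(tokens):
    """Return the tokens that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [token for token, count in counts.items() if count == 1]


tokens = "the cat saw the dog".split()
print(find_hapaxes(tokens))  # ['cat', 'saw', 'dog']
```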

27
Q

Given text from multiple categories/genres, how do you create separate frequency distributions (FreqDist) for each category?

A

Use a conditional frequency distribution (ConditionalFreqDist in NLTK)

• A collection of FreqDists, one for each “condition” (e.g., a category)
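
The idea of one frequency distribution per condition can be sketched with the standard library. NLTK's `ConditionalFreqDist` is built from the same kind of input, an iterable of (condition, token) pairs; the helper `conditional_counts` and the sample pairs below are made up for illustration.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Build one frequency table per condition from (condition, token) pairs."""
    tables = defaultdict(Counter)
    for condition, token in pairs:
        tables[condition][token] += 1
    return tables


pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "love"),
]
tables = conditional_counts(pairs)
print(tables["news"]["the"])      # 2
print(tables["romance"]["love"])  # 1

# With NLTK and the Brown corpus downloaded, the equivalent would be roughly:
# import nltk
# from nltk.corpus import brown
# cfd = nltk.ConditionalFreqDist(
#     (genre, word)
#     for genre in brown.categories()
#     for word in brown.words(categories=genre)
# )
```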