Word Level Analysis 3 Flashcards
Definition of “Text Corpus”
A large body of text
What type of corpus is the Brown Corpus?
Categorised
one category per document, categories do not overlap
What type of corpus is the Reuters Corpus?
Overlapping
multiple categories per document, categories overlap
What type of corpora are gutenberg, webtext and udhr?
Isolated
What type of corpus is the inaugural corpus?
Temporal
Name three text corpora available in NLTK.
gutenberg, brown and inaugural
What does fileids() get you?
the files of the corpus
What does fileids([categories]) get you?
the files of the corpus corresponding to these categories
What does categories() get you?
the categories of the corpus
What does categories([fileids]) get you?
the categories of the corpus corresponding to these files
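A minimal sketch of the fileids() and categories() cards above. To avoid downloading any NLTK data, it builds a throwaway categorised corpus in a temporary directory. The file names, texts and the cat_pattern regular expression are made up for illustration, and the category is assumed to be the name of the folder a file sits in.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical toy documents: file id -> text. The folder name is the category.
DOCS = {
    "news/a.txt": "The market rose today.",
    "news/b.txt": "The vote was close.",
    "fiction/c.txt": "The dragon slept.",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in DOCS.items():
        path = os.path.join(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)

    # cat_pattern: the first regex group of each file id becomes its category.
    corpus = CategorizedPlaintextCorpusReader(
        root, r".+\.txt", cat_pattern=r"(\w+)/.*"
    )

    all_files = corpus.fileids()                       # every file in the corpus
    news_files = corpus.fileids(categories=["news"])   # files in the given categories
    all_categories = corpus.categories()               # every category in the corpus
    c_categories = corpus.categories(fileids=["fiction/c.txt"])  # categories of the given files

print(all_files)
print(news_files)
print(all_categories)
print(c_categories)
```

With a built-in corpus such as brown, the same four calls would be made on the imported corpus object instead of the toy reader.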
What does raw(fileids=[f1,f2,f3]) get you?
the raw content of the specified files
What does raw(categories=[c1,c2]) get you?
the raw content of the specified categories
What does raw() get you?
the raw content of the corpus
What does words() get you?
the words of the whole corpus
What does words(fileids=[f1,f2,f3]) get you?
the words of the specified files
What does words(categories=[c1,c2]) get you?
the words of the specified categories
What does sents() get you?
the sentences of the whole corpus
What does sents(fileids=[f1,f2,f3]) get you?
the sentences of the specified files
What does sents(categories=[c1,c2]) get you?
the sentences of the specified categories
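The raw(), words() and sents() cards can be tried out on a small made-up corpus, as in the sketch below. Two things here are assumptions and not part of the cards: the toy files themselves, and the use of LineTokenizer as the sentence splitter (one sentence per line), chosen only so that the example runs without downloading NLTK's default sentence model.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Hypothetical toy documents, written with one sentence per line.
DOCS = {
    "news/a.txt": "Prices rose.\nShares fell.",
    "fiction/b.txt": "The dragon slept.",
}

with tempfile.TemporaryDirectory() as root:
    for name, text in DOCS.items():
        path = os.path.join(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)

    corpus = CategorizedPlaintextCorpusReader(
        root,
        r".+\.txt",
        cat_pattern=r"(\w+)/.*",         # category = folder name (assumption)
        sent_tokenizer=LineTokenizer(),  # assumption: treat each line as a sentence
    )

    # The reader returns lazy views, so copy them into lists
    # before the temporary directory is removed.
    raw_fiction = corpus.raw(fileids=["fiction/b.txt"])      # one string
    news_words = list(corpus.words(categories=["news"]))     # flat list of tokens
    all_sents = [list(s) for s in corpus.sents()]            # list of token lists
    news_sents = [list(s) for s in corpus.sents(categories=["news"])]

print(raw_fiction)
print(news_words)
print(all_sents)
```

Note the shapes: raw() gives a single string, words() a flat sequence of tokens, and sents() a sequence of sentences where each sentence is itself a list of tokens.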
Simplest way to analyse a corpus?
Count how many words it contains
What is a token?
A sequence of characters that is treated as a single unit,
e.g., a word or a punctuation mark
What is a type?
A “type” is the form or spelling of a token (a word or a punctuation mark), independent of its specific occurrences in a text.
What is the vocabulary of a piece of text?
the set of distinct tokens it uses, a.k.a. its types
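The difference between tokens, types and vocabulary is easiest to see by counting them. The sketch below uses plain Python with a deliberately simple regular-expression tokenizer (an assumption for illustration, not NLTK's tokenizer) on a made-up sentence pair.

```python
import re

text = "The cat sat on the mat. The cat slept."

# Tokens: every occurrence counts, punctuation included.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Vocabulary: the set of types, so each distinct form counts once.
# Without case folding, "The" and "the" are two different types.
vocabulary = set(tokens)

print(len(tokens))      # 11 tokens
print(len(vocabulary))  # 8 types
```

Lowercasing the tokens before building the set would merge "The" and "the" and reduce the number of types by one.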
What is the Type-Token Ratio (TTR) often used to measure?
Lexical Diversity
or: Describes the range of a speaker’s vocabulary
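The TTR is the number of types divided by the number of tokens. A minimal sketch in plain Python follows; the function name type_token_ratio and the example sentences are made up for illustration, and returning 0.0 for empty input is a choice made here, not a convention from the cards.

```python
def type_token_ratio(tokens):
    """Return (number of distinct tokens) / (number of tokens), or 0.0 if empty."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


repetitive = "to be or not to be".split()             # 4 types, 6 tokens
varied = "all six words here are different".split()   # 6 types, 6 tokens

print(type_token_ratio(repetitive))
print(type_token_ratio(varied))
```

A ratio close to 1 means few repeated words. The value tends to drop as a text gets longer, so it is best compared across samples of similar size.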
What tool does NLTK provide to count the frequency of types?
It provides FreqDist (frequency distribution).
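A short sketch of FreqDist on a hand-made token list (the tokens are illustrative, and no corpus download is needed for this):

```python
from nltk.probability import FreqDist

tokens = "the cat sat on the mat . the end .".split()

fdist = FreqDist(tokens)

the_count = fdist["the"]        # how often the type "the" occurs
total_tokens = fdist.N()        # total number of tokens counted
total_types = fdist.B()         # number of distinct types
top = fdist.most_common(1)      # most frequent (type, count) pairs

print(the_count, total_tokens, total_types, top)
```

Passing corpus.words() in place of the hand-made list would count the types of a whole corpus the same way.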
What is a Hapax?
A word that occurs only once in a text/corpus.
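FreqDist has a hapaxes() method that lists the types with a count of exactly one. A small sketch on made-up tokens:

```python
from nltk.probability import FreqDist

tokens = "the cat sat on the mat".split()

fdist = FreqDist(tokens)

# Every type that occurs exactly once; "the" occurs twice, so it is excluded.
hapaxes = sorted(fdist.hapaxes())

print(hapaxes)
```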
Given text from multiple categories/genres, how do you create separate frequency distributions (FreqDist) for each category?
Use a conditional frequency distribution (ConditionalFreqDist in NLTK)
• a collection of FreqDists, one for each “condition” (e.g., a category)
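NLTK's ConditionalFreqDist is built from (condition, sample) pairs. The sketch below uses a hand-written list of (category, word) pairs as a stand-in. With a real categorised corpus the pairs would typically be generated from categories() and words(categories=...).

```python
from nltk.probability import ConditionalFreqDist

# Hypothetical (category, word) pairs standing in for a categorised corpus.
pairs = [
    ("news", "the"),
    ("news", "market"),
    ("news", "the"),
    ("fiction", "the"),
    ("fiction", "dragon"),
]

cfd = ConditionalFreqDist(pairs)

conditions = sorted(cfd.conditions())       # one condition per category
news_the = cfd["news"]["the"]               # count of "the" within news only
fiction_top = cfd["fiction"].most_common()  # each condition is its own FreqDist

print(conditions, news_the, fiction_top)
```

Indexing the distribution by a condition returns an ordinary FreqDist, so everything that works on a single frequency distribution also works per category.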