lecture 3 Flashcards

1
Q

corpus

A
  1. large/complete collection of writing
  2. linguistics: a body of utterances, such as words or sentences, assumed to be representative of a language and used for lexical, grammatical, or other linguistic analysis
2
Q

properties of corpora

A
  1. naturally occurring corpora serve as realistic samples of language
  2. metadata: side information about where the samples come from
  3. annotations: some corpora additionally carry linguistic annotations
3
Q

markup (formatting)

A

common formats for structuring linguistic data

→ e.g., XML, CoNLL, JSON
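
a minimal Python sketch of one annotated token in such formats; the field names are illustrative only, loosely echoing CoNLL-U's FORM/LEMMA/UPOS columns:

```python
import json

# one annotated token as JSON; in a CoNLL-style format the same
# information would be one tab-separated line per token,
# e.g. "1\tDogs\tdog\tNOUN"
token = {"id": 1, "form": "Dogs", "lemma": "dog", "pos": "NOUN"}
print(json.dumps(token))  # {"id": 1, "form": "Dogs", "lemma": "dog", "pos": "NOUN"}
```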

4
Q

preprocessing

A

transforming text into a useful format

→ e.g., tokenization

5
Q

splitting sentences into words/tokens

A
  1. using spaces as delimiters
  2. tokenization
  3. stemming
  4. lemmatization
6
Q

tokenization

A

each word and each punctuation mark is treated as a separate token and assigned a sequential position
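
a minimal sketch in Python, with a simple regular expression standing in for a real tokenizer (an illustrative simplification; proper tokenizers handle cases like clitics more carefully):

```python
import re

def tokenize(text):
    # a run of word characters, or any single non-space punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

# enumerate() supplies each token's sequential position
for position, token in enumerate(tokenize("Dogs bark, cats meow.")):
    print(position, token)
# 0 Dogs / 1 bark / 2 , / 3 cats / 4 meow / 5 .
```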

7
Q

stemming + process

A

words within a sentence often share a common root but appear in different forms

stemming reduces each word to its base/root form
→ this lets us treat the different forms as a single unit for analysis

process:
1. tokenize the sentences
2. use stemming to reduce the words to common stems (see the sketch below)
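
a minimal sketch of step 2, assuming NLTK's Porter stemmer is available; note that stems are not guaranteed to be real dictionary words:

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

# step 1 would produce a token list like this one
tokens = ["running", "runs", "run", "was"]
# step 2: reduce each token to a common stem
print([stemmer.stem(t) for t in tokens])
# ['run', 'run', 'run', 'wa']  -- the run-forms collapse; 'wa' is not a word
```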

8
Q

lemmatization

A

reduce words to common lemmas

takes context into account to convert words to their base dictionary form

results in more linguistically correct forms than stemming
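
a minimal sketch with NLTK's WordNet lemmatizer (assumes NLTK plus its WordNet data); the part-of-speech argument plays the role of context here:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("was", pos="v"))      # 'be'  -- vs. the stem 'wa'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```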

9
Q

bag of words

A

text is represented by the frequency of its words, disregarding grammar and word order

word occurrences serve as the features for analysis

simplifies text into numerical data

10
Q

BOW process

A
  1. text is converted to a list of words
  2. words are placed into a metaphorical bag
  3. each word’s frequency is counted
  4. the frequency counts are represented in table format (see the sketch below)
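
a minimal standard-library sketch of these four steps, with whitespace splitting standing in for real tokenization:

```python
from collections import Counter

text = "the dog chased the cat"
words = text.split()   # step 1: text -> list of words
bag = Counter(words)   # steps 2-3: unordered bag with frequency counts
print(bag)             # Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})

# step 4: the counts as a simple word/count table
for word, count in bag.items():
    print(f"{word}\t{count}")
```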
11
Q

n-gram

A
  • word sequence of length n
  • the longer the sequence, the more information it reveals
  • compute the probabilities of these sequences (see the sketch below)
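
a minimal sketch of extracting all n-grams from a token list by sliding a window of length n:

```python
def ngrams(tokens, n):
    # every contiguous window of n tokens in order
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog chased the cat".split()
print(ngrams(tokens, 2))
# [('the', 'dog'), ('dog', 'chased'), ('chased', 'the'), ('the', 'cat')]
```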
12
Q

n-gram probabilities

A

calculation: occurrence of the sequence / size of the corpus

words in a corpus can appear multiple times; this affects the probability calculations and the overall model we build

an n-gram model must therefore define the probability distribution over all possible token sequences in the corpus
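
the calculation above in symbols (notation assumed here: C(·) for the occurrence count of a sequence, N for the size of the corpus):

```latex
% relative-frequency estimate of an n-gram's probability
\hat{P}(w_1, \dots, w_n) = \frac{C(w_1 \dots w_n)}{N}
```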

13
Q

what do n-gram models evaluate

A

the probability of a sequence of n tokens

this is useful for determining the likelihood of specific word sequences and is represented by the joint probability
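
in symbols, the quantity being evaluated is the joint probability of the whole sequence:

```latex
% joint probability of a sequence of n tokens
P(w_1, w_2, \dots, w_n)
```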

14
Q

chain rule

A

product of conditional probabilities

indicates that the probability of the entire sequence is the product of the individual probabilities of each word given all the preceding words

this allows us to systematically compute the probability of a sequence by considering one word at a time, given the context of all previous words in the sequence
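
the rule in symbols (standard notation, not taken from the slides):

```latex
% chain rule: the joint probability factorizes into a product of
% conditional probabilities, one word given all preceding words
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```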
