lecture 3 Flashcards
corpus
- large/complete collection of writing
- linguistics: a body of utterances (words or sentences) assumed to be representative of a language and used for lexical, grammatical, or other linguistic analysis
properties of corpora
- naturally occurring corpora serve as realistic samples of language
- metadata: side information about where the samples come from
- corpora with linguistic annotations
markup (formatting)
common formats for structuring linguistic data
–> e.g., XML, CoNLL, JSON
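A minimal, hand-written sketch of what a CoNLL-U-style annotation can look like (the sentence and tags are illustrative, not from the lecture): one token per line, with tab-separated columns for ID, form, lemma, POS tags, morphological features, and dependency information.

```
# text = The cat sleeps.
1	The	the	DET	DT	_	2	det	_	_
2	cat	cat	NOUN	NN	_	3	nsubj	_	_
3	sleeps	sleep	VERB	VBZ	_	0	root	_	_
4	.	.	PUNCT	.	_	3	punct	_	_
```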
preprocessing
transforming text into a useful format
–> e.g., tokenization
splitting sentences into words/tokens
- naive approach: using spaces as delimiters
common preprocessing techniques:
- tokenization
- stemming
- lemmatization
tokenization
each word and punctuation mark is treated as a separate token and assigned a sequential position
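A minimal sketch in Python, assuming a simple regex-based tokenizer (the lecture does not prescribe a specific tool):

```python
import re

def tokenize(text):
    # match runs of word characters, or single punctuation marks;
    # enumerate assigns each token a sequential position
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return list(enumerate(tokens))

print(tokenize("The cat sat, then ran."))
# [(0, 'The'), (1, 'cat'), (2, 'sat'), (3, ','), (4, 'then'), (5, 'ran'), (6, '.')]
```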
stemming + process
consider words within sentences that share a common root but appear in different forms
reduce each word to its base/root form (the stem)
–> this enables us to treat them as a single unit for analysis
process:
1. tokenize sentences
2. use stemming to reduce words into common stems
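A minimal sketch of this process using NLTK's PorterStemmer (one common stemmer; the word list is illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# step 1: tokens sharing a common root but in different forms
tokens = ["connect", "connected", "connecting", "connection"]

# step 2: reduce each token to its common stem
stems = [stemmer.stem(t) for t in tokens]
print(stems)  # ['connect', 'connect', 'connect', 'connect']
```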
lemmatization
reduce words to a common lemma
takes context into account to convert words to their base dictionary form
results in more linguistically correct forms than stemming
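A minimal sketch using NLTK's WordNetLemmatizer (one possible tool; it needs the WordNet data, and a POS tag supplies the context):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of WordNet data
lemmatizer = WordNetLemmatizer()

# the part-of-speech tag provides context for the correct dictionary form
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good (a stemmer would not produce this)
print(lemmatizer.lemmatize("mice"))              # mouse (default POS is noun)
```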
bag of words
text is represented by the frequency of its words, disregarding grammar and word order
word occurrences serve as features for analysis
simplifies text into numerical data
BOW process
- text is converted to list of words
- words are placed into metaphorical bag
- each word’s frequency is counted
- frequency count is represented in table format
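A minimal sketch of these four steps, assuming whitespace tokenization and a made-up example sentence:

```python
from collections import Counter

text = "the cat sat on the mat"

# 1. convert text to a list of words
words = text.split()

# 2-3. place the words in a "bag" and count each word's frequency
bag = Counter(words)

# 4. represent the frequency counts in table format
for word, count in bag.most_common():
    print(f"{word}\t{count}")
# the   2
# cat   1
# sat   1
# on    1
# mat   1
```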
n-gram
- word sequence of length n
- longer sequences (larger n) reveal more contextual information
- compute probabilities of these sequences
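A minimal sketch of extracting all n-grams from a token list (whitespace tokenization assumed):

```python
def ngrams(tokens, n):
    # slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```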
n-gram probabilities
calculation: number of occurrences of the sequence / total number of sequences in the corpus (roughly the corpus size in tokens)
words in a corpus can appear multiple times; this affects the probability estimates and the overall model we create
an n-gram model must therefore define a probability distribution over all possible token sequences in the corpus
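A minimal sketch of the relative-frequency calculation above, on a made-up toy corpus:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
n = 2

# collect all n-grams and count repeated occurrences
grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
counts = Counter(grams)
total = len(grams)  # total number of n-gram sequences in the corpus

# probability of a sequence = its occurrence count / total sequences
print(counts[("the", "cat")] / total)  # 2 / 8 = 0.25
```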
what do n-gram models evaluate
the probability of a sequence of n tokens
this is useful for determining the likelihood of specific word sequences and is represented by the joint probability P(w1, w2, …, wn)
chain rule
product of conditional probabilities:
P(w1, w2, …, wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · … · P(wn | w1, …, wn-1)
the probability of the entire sequence is the product of the probabilities of each word given all the preceding words
this allows us to compute the probability of a sequence systematically, one word at a time, given the context of all previous words in the sequence
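A minimal sketch on a made-up toy corpus; note that it uses a bigram (Markov) approximation of the chain rule, conditioning each word only on its immediate predecessor rather than the full history:

```python
from collections import Counter

# toy corpus with sentence-boundary markers
corpus = "<s> the cat sat </s> <s> the cat ran </s> <s> the dog sat </s>".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(tokens):
    # chain rule under a bigram approximation:
    # P(w1..wn) ≈ product of P(wi | w(i-1))
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sequence_prob("<s> the cat sat </s>".split()))
# P(the|<s>) · P(cat|the) · P(sat|cat) · P(</s>|sat)
# = 3/3 · 2/3 · 1/2 · 2/2 = 1/3 ≈ 0.333
```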