LANGUAGE MODELS AND REPRESENTATION Flashcards
What is a model
an abstract representation of something,
often in a computational form
– e.g. tossing a coin
What is the BoW representation
The simplest model
Unordered bag of words including repetitions
Disregards grammar and word order but keeps track of the frequency of each word
Can also have bag of terms, bag of tokens, bag of stems, etc
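A minimal sketch of building a bag-of-words with Python's collections.Counter (the tokenizer and example document are illustrative assumptions, not a prescribed implementation):

```python
from collections import Counter

def bag_of_words(text):
    # Naive lowercase/whitespace tokenizer; disregards grammar and word order,
    # keeps only the frequency of each token.
    tokens = text.lower().split()
    return Counter(tokens)

doc = "the cat sat on the mat"
print(bag_of_words(doc))  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```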
Enhancing the BoW representation
Ranking words by using weights
skipping stop words
What is Zipf’s law
The frequency of any word in a given collection is inversely proportional to its rank in the frequency table
The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, etc
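A small sketch of how Zipf's law could be checked: rank words by frequency and look at rank × frequency, which should stay roughly constant if the law holds (the toy text below is an assumption; a real check needs a large corpus):

```python
from collections import Counter

def zipf_table(tokens, top=10):
    # Rank words by frequency; under Zipf's law frequency ~ C / rank,
    # so rank * frequency should stay roughly constant.
    ranked = Counter(tokens).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

tokens = "the cat sat on the mat and the dog sat on the log".lower().split()
for rank, word, freq, product in zipf_table(tokens):
    print(rank, word, freq, product)
```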
What is Luhn’s Hypothesis
Words above the upper cut-off are too common, and words below the lower cut-off too rare, to contribute significantly to the content of the article
The significant words lie between the two cut-offs
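A minimal sketch of applying Luhn-style cut-offs, keeping only mid-frequency words; the cut-off values here are arbitrary assumptions, not values from the source:

```python
from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_cutoff=100):
    # Words above the upper cut-off are treated as too common, words below
    # the lower cut-off as too rare; only mid-frequency words are kept.
    counts = Counter(tokens)
    return {w for w, c in counts.items() if lower_cutoff <= c <= upper_cutoff}
```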
What is vector representation
A way to implement the BoW model
Each document is represented as a vector
d = [w1,w2,w3,…]
Where w1 is the weight of term 1
This is a very high-dimensional representation
Millions of dimensions
Very sparse vectors - most entries are zero
0 if a word does not exist, 1 if it does (if not weighted)
What does it mean to represent a term as an incidence
1 if the word exists in the doc, 0 if it does not
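A sketch of turning a document into a vector over a fixed vocabulary, with both incidence (0/1) and raw-count weighting; the vocabulary and document are illustrative assumptions:

```python
from collections import Counter

vocab = ["the", "cat", "sat", "hat", "mat"]           # illustrative vocabulary
doc = "the cat sat on the mat".lower().split()
counts = Counter(doc)

incidence_vector = [1 if term in counts else 0 for term in vocab]
count_vector = [counts[term] for term in vocab]

print(incidence_vector)  # [1, 1, 1, 0, 1]
print(count_vector)      # [2, 1, 1, 0, 1]
# In practice the vocabulary has millions of terms, so these vectors are
# stored sparsely (only the non-zero entries are kept).
```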
What is document frequency (dft)
The number of documents that contain term t
Inverse measure of the ‘informativeness’ of t
What is the inverse document frequency
idft = log10 ( N / dft )
– High for terms that appear in few documents
– Low for terms that appear in many documents
N = total number of documents in the collection
Why do we use log in idf
N / dft can be very large for rare terms; log scaling dampens this so that rare terms do not completely dominate the weight
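A minimal sketch of computing idf over a toy collection (the documents are assumptions):

```python
import math

docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "ran"},
]  # each document represented as its set of terms
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d)   # document frequency of the term
    return math.log10(N / df)                # high for rare terms, low for common ones

print(idf("the"))  # 0.0   (appears in every document)
print(idf("dog"))  # ~0.48 (appears in one of three documents)
```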
What is tf*idf weighting
Most common method for weighting
tf.idf = (1 + log10 tf) × log10(N / dft)
Note: still a very sparse model
When does tf.idf increase
With the number of occurrences in a document(tf)
with the rarity of the term in the collection(idf)
Why do we adapt tf in the tf*idf equation
The 1+ ensures a term that appears in the document gets a positive weight (log10 1 = 0, so a term occurring once would otherwise score 0); terms that do not appear get weight 0
log10 scaling is used to dampen the effect of very high term frequencies
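A sketch putting the pieces together: log-scaled tf times idf, following the formula above (the toy documents are illustrative assumptions):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequencies

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)
    if tf == 0 or df[term] == 0:
        return 0.0                                      # absent term: weight 0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

print(tf_idf("cat", docs[0]))  # appears in 2 of 3 docs: small positive weight
print(tf_idf("the", docs[0]))  # appears in every doc: idf = 0, so weight 0
```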
Unigrams/bigrams/trigrams
uni: this, is, a, sentence
bi: this is, is a, a sentence
tri: this is a, is a sentence
character unigram/ bigram
t, h, i, s
th, hi, is
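A small sketch generating word and character n-grams (a generic helper, not a specific library API):

```python
def ngrams(items, n):
    # Slide a window of size n over the sequence (works for word lists and strings).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "this is a sentence".split()
print(ngrams(words, 1))   # unigrams: ('this',), ('is',), ('a',), ('sentence',)
print(ngrams(words, 2))   # bigrams:  ('this', 'is'), ('is', 'a'), ('a', 'sentence')
print(ngrams(words, 3))   # trigrams: ('this', 'is', 'a'), ('is', 'a', 'sentence')

print(ngrams("this", 2))  # character bigrams: ('t', 'h'), ('h', 'i'), ('i', 's')
```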
What is a probabilistic language model
a function that assigns a probability over a piece of text so that ‘natural’ pieces have a larger probability
measures the probability of natural language utterances, giving higher scores to those that are more common
What is a language model for words that appear independently
P(S) = p(w1) × p(w2) × … × p(wn)
Simplest unigram language model
p(w) = #(w) / Σt #(t)
Where #(w) is the count of occurrences of the word w
and
Σt #(t) is the total number of words in the training data
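A minimal sketch of estimating a unigram model from counts and scoring a sentence under the independence assumption (the training text is a toy assumption):

```python
from collections import Counter

train_tokens = "the cat sat on the mat the cat ran".split()  # toy training data
counts = Counter(train_tokens)
total = sum(counts.values())

def p_unigram(word):
    return counts[word] / total        # #(w) / sum over t of #(t)

def p_sentence(sentence):
    prob = 1.0
    for w in sentence.split():
        prob *= p_unigram(w)           # words treated as independent
    return prob

print(p_sentence("the cat sat"))       # note: any unseen word makes this 0
```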
How to deal with OOV in the language model
OOV = 0 is too harsh because then we will never be able to model something unseen
Add-one (+1 / Laplace) smoothing: add 1 to every count so unseen words get a small non-zero probability
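A sketch of add-one smoothing over the same kind of counts; the extra vocabulary slot for unknown words is a common convention assumed here, not something stated in the source:

```python
from collections import Counter

train_tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(train_tokens)
total = sum(counts.values())
V = len(set(train_tokens)) + 1    # +1 slot for unknown/OOV words (assumed convention)

def p_smoothed(word):
    # Add 1 to every count so no word, seen or unseen, gets probability 0.
    return (counts[word] + 1) / (total + V)

print(p_smoothed("the"))   # seen word
print(p_smoothed("dog"))   # OOV word: small but non-zero probability
```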
What is the main issue of unigram language modelling
Complete loss of word order - different orderings of a sentence have the same probability
Using the Chain rule
S = the cat in the hat
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|the cat in) · p(hat|the cat in the)
Markov assumption - take only the one (or two) previous words
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)
Chain rule: unigram vs bigram vs trigram
Unigram
P(S) = p(the) · p(cat)· p(in) · p(the) · p(hat)
Bigram
P(S) = p(the) · p(cat|the) · p(in|cat) ·
p(the|in) · p(hat|the)
Trigram
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)
Count-based estimation for bigrams
P(w2 | w1) = count(w1 w2) / count(w1): the number of times w2 follows w1, divided by the total occurrences of w1
<s> I ...
<s> To ...
<s> I ...
P ( I | <s>) = 2/3
P(To |<s>) = 1/3
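A sketch of count-based bigram estimation with sentence-start markers; the full sentences below are hypothetical (the source truncates them), but the start-of-sentence counts mirror the example above:

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",                      # hypothetical continuations
    "<s> To be or not to be </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
unigram_counts = Counter()
for s in sentences:
    tokens = s.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(w2, w1):
    # count(w1 w2) / count(w1): how often w2 follows w1, over all occurrences of w1
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_bigram("I", "<s>"))   # 2/3
print(p_bigram("To", "<s>"))  # 1/3
```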
Why are n-grams insufficient language models
because language has long-distance dependencies
“The computer which I had just put into the machine room on the fifth floor crashed.”
But often we can get away with N-gram models