LANGUAGE MODELS AND REPRESENTATION Flashcards
What is a model
an abstract representation of something,
often in a computational form
– e.g. tossing a coin
What is the BoW representation
The simplest model
Unordered bag of words including repetitions
Disregards grammar and word order but keeps track of the frequency of each word
Can also have a bag of terms, bag of tokens, bag of stems, etc.
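A minimal sketch of building a BoW count in Python (the whitespace tokenisation and example sentence are illustrative assumptions):

```python
# Bag of words: unordered word counts, word order discarded.
from collections import Counter

doc = "the cat sat on the mat"
bow = Counter(doc.lower().split())   # naive whitespace tokenisation
print(bow)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```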
Enhancing the BoW representation
Weighting words so that more informative terms count for more
Skipping stop words
What is Zipf’s law
The frequency of any word in a given collection is inversely proportional to its rank in the frequency table
The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on
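A rough illustrative check of Zipf's law (the counts below are made up, not from a real corpus): if frequency is proportional to 1/rank, then rank x frequency should stay roughly constant.

```python
# Zipf's law sketch: rank * frequency is roughly constant for a Zipfian collection.
from collections import Counter

counts = Counter({"the": 1000, "of": 500, "and": 333, "to": 250})   # made-up counts
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(word, rank, freq, rank * freq)   # rank * freq stays near 1000
```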
What is Luhn’s Hypothesis
Words above the upper cut-off are too common to be useful
Words below the lower cut-off are too rare, and therefore not considered to contribute significantly to the content of the article
Words between the two cut-offs are the most significant for describing the content
What is vector representation
A way to implement the BoW model
Each document is represented as a vector
d = [w1,w2,w3,…]
Where w1 is the weight of term 1
This is a very high-dimensional representation
Millions of dimensions
Very sparse vectors - most entries are zero
If unweighted: 1 if the word occurs in the document, 0 if it does not
What does it mean to represent a term as an incidence
1 if the word occurs in the document, 0 if it does not
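A small sketch of turning documents into incidence vectors over a shared vocabulary (the documents and vocabulary are toy examples):

```python
# Incidence vectors: one dimension per vocabulary term, entry 1 if the term occurs.
docs = ["the cat sat", "the dog barked"]                 # toy documents
vocab = sorted({w for d in docs for w in d.split()})     # shared vocabulary

def incidence_vector(doc, vocab):
    words = set(doc.split())
    return [1 if term in words else 0 for term in vocab]

for d in docs:
    print(incidence_vector(d, vocab))   # mostly zeros in a realistic collection
```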
What is document frequency (dft)
The number of documents that contain term t
Inverse measure of the ‘informativeness’ of t
What is the inverse document frequency
idft = log10 ( N / dft )
– High for terms that appear in few documents
– Low for terms that appear in many documents
N = total number of documents in the collection
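A sketch of computing df and idf over a toy collection of three documents (the documents are assumptions for illustration):

```python
# Document frequency df_t and idf_t = log10(N / df_t) over a toy collection.
import math

docs = [{"the", "cat", "sat"}, {"the", "dog"}, {"the", "cat"}]   # toy documents as term sets
N = len(docs)                                                    # number of documents

def df(term):
    return sum(1 for d in docs if term in d)

def idf(term):
    return math.log10(N / df(term))

print(df("cat"), idf("cat"))   # 2, log10(3/2) ~ 0.18  -> somewhat informative
print(df("the"), idf("the"))   # 3, log10(1)   = 0     -> uninformative
```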
Why do we use log in idf
Log scaling dampens the effect of very large N/dft ratios, so extremely rare terms do not completely dominate the weighting
What is tf*idf weighting
Most common method for weighting
tf.idf = (1 + log10 tf) x log10(N / dft)
Note: still a very sparse model
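A minimal sketch of the weighting formula above (the example numbers are illustrative):

```python
# tf.idf = (1 + log10 tf) * log10(N / df_t), with weight 0 when the term is absent.
import math

def tf_idf(tf, df_t, N):
    if tf == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df_t)

# A term appearing 10 times in a document, and in 100 of 1,000,000 documents:
print(tf_idf(tf=10, df_t=100, N=1_000_000))   # (1 + 1) * 4 = 8.0
```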
When does tf.idf increase
With the number of occurrences in a document(tf)
with the rarity of the term in the collection(idf)
Why do we adapt tf in the tf*idf equation
The 1+ ensures that a term appearing exactly once (log10 1 = 0) still gets a non-zero weight; terms that do not appear at all get weight 0
log10 log scaling is used to dampen the effect of very high term frequencies
Unigrams/bigrams/trigrams
uni: this, is, a, sentence
bi: this is, is a, a sentence
tri: this is a, is a sentence
character unigram/ bigram
t, h, i, s
th, hi, is
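A short sketch of extracting word and character n-grams (using the example sentence from the cards):

```python
# Word and character n-grams as sliding windows of size n.
def ngrams(items, n):
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "this is a sentence".split()
print(ngrams(words, 2))          # word bigrams: this is / is a / a sentence
print(ngrams(words, 3))          # word trigrams: this is a / is a sentence
print(ngrams(list("this"), 2))   # character bigrams: th / hi / is
```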
What is a probabilistic language model
a function that assigns a probability over a piece of text so that ‘natural’ pieces have a larger probability
measures the probability of natural language utterances, giving higher scores to those that are more common
What is a language model for words that appear independently
P(S) = p(w1) x p(w2) x …x p(wn)
Simplest unigram language model
p(w) = #(w) / Σt #(t)
Where #(w) is the count of occurrences of the word w
and
Σt #(t) is the total number of word tokens in the training data
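A sketch of the maximum-likelihood unigram model and the independence assumption (the training text is a toy example):

```python
# Unigram LM: p(w) = #(w) / total tokens; P(S) = product of p(w_i).
from collections import Counter

tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(tokens)
total = len(tokens)

def p(word):
    return counts[word] / total             # 0 for unseen words (see the OOV card)

def sentence_prob(sentence):
    prob = 1.0
    for w in sentence.split():
        prob *= p(w)
    return prob

print(p("the"))                  # 2/6
print(sentence_prob("the cat"))  # (2/6) * (1/6)
```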
How to deal with OOV in the language model
Assigning OOV words probability 0 is too harsh: any sentence containing them gets probability 0, so we can never model unseen text
Use +1 (add-one) smoothing
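A sketch of add-one smoothing for the unigram model (treating the vocabulary size as the seen words plus one extra slot for unseen words is an assumption of this example):

```python
# Add-one (Laplace) smoothing: (count + 1) / (total + V), so unseen words are non-zero.
from collections import Counter

tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(tokens)
total = len(tokens)
V = len(counts) + 1                         # vocabulary size, +1 slot for OOV (assumption)

def p_smoothed(word):
    return (counts[word] + 1) / (total + V)

print(p_smoothed("the"))   # (2 + 1) / (6 + 6)
print(p_smoothed("dog"))   # (0 + 1) / (6 + 6), no longer zero
```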
What is the main issue of unigram language modelling
Complete loss of word order - different orderings of the same words get the same probability
Using the Chain rule
S = the cat in the hat
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|the cat in) · p(hat|the cat in the)
Markov assumption - take only the one (or two) previous words
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)
Chain rule: unigram vs bigram vs trigram
Unigram
P(S) = p(the) · p(cat)· p(in) · p(the) · p(hat)
Bigram
P(S) = p(the) · p(cat|the) · p(in|cat) ·
p(the|in) · p(hat|the)
Trigram
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)
Count-based estimation for bigrams
The probability that a word w1 appears after another word w2 is the count of the bigram 'w2 w1' divided by the total count of w2
<s> I ...
<s> To ...
<s> I ...
P ( I | <s>) = 2/3
P(To |<s>) = 1/3
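A sketch of count-based bigram estimation with <s> and </s> boundary markers; the sentence continuations are invented, but the start-of-sentence counts mirror the card (two sentences start with I, one with To):

```python
# Count-based bigram estimation: P(w | prev) = count(prev, w) / count(prev).
from collections import Counter

sentences = [["I", "am", "Sam"], ["To", "be"], ["I", "do"]]   # invented continuations
bigrams, unigrams = Counter(), Counter()
for s in sentences:
    padded = ["<s>"] + s + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))    # 2/3
print(p("To", "<s>"))   # 1/3
```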
Why are n-grams insufficient language models
because language has long-distance dependencies
“The computer which I had just put into the machine room on the fifth floor crashed.”
But often we can get away with N-gram models
What is an evaluation metric
tells us how well our model does on the test set
What is extrinsic evaluation of LMs
Best evaluation
Run Models A and B on a task
compare their accuracy
But it is time-consuming
What is intrinsic evaluation of LMs
Don't apply the model to a task
just assess internal qualities
e.g. which model assigns the higher probability to a held-out test set
But this is often a bad approximation unless the test and training data are very similar
The overfitting of Ngrams
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- We need to generalise and avoid overfitting
Maximum likelihood estimates
If we see the word ‘computer’ 30 times in a 10,000 word corpus
MLE = 0.003
This may be a bad estimate for other corpora
Laplace smoothing
Add 1 to every count before normalising (add-one smoothing), so unseen events get a small non-zero probability
How do we train for unknown words
Create word token <UNK>
At training time, replace unknown/rare words with <UNK>
Use the <UNK> probabilities for any word not seen in the training data
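A sketch of mapping rare and unseen words to <UNK> (the frequency threshold of 1 and the toy data are arbitrary illustrative choices):

```python
# Replace rare training words with <UNK>; map unseen test words to <UNK> too.
from collections import Counter

train_tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c > 1}   # keep only words seen more than once

def map_unk(tokens):
    return [w if w in vocab else "<UNK>" for w in tokens]

print(map_unk(train_tokens))             # rare training words become <UNK>
print(map_unk("the dog sat".split()))    # unseen test words become <UNK>
```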