LANGUAGE MODELS AND REPRESENTATION Flashcards

1
Q

What is a model

A

an abstract representation of something,
often in a computational form
– e.g. tossing a coin

2
Q

What is the BoW representation

A

The simplest model
Unordered bag of words including repetitions
Disregards grammar and word order but keeps track of the frequency of each word
Can also have a bag of terms, bag of tokens, bag of stems, etc.
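
A minimal Python sketch (not part of the card) of a bag-of-words count; the lower-casing and whitespace tokenisation are assumed choices:

```python
from collections import Counter

def bag_of_words(text):
    # Naive tokenisation: lowercase and split on whitespace (an assumed choice).
    tokens = text.lower().split()
    # The "bag" keeps repetitions as counts but discards grammar and word order.
    return Counter(tokens)

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```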

3
Q

Enhancing the BoW representation

A

Ranking words by using weights
skipping stop words

4
Q

What is Zipf’s law

A

The frequency of any word in a given collection is inversely proportional to its rank in the frequency table
The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, etc.
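
A quick illustrative calculation (the starting count of 1,000 is an assumption, not from the card): under Zipf's law the word of rank r occurs roughly top_count / r times.

```python
top_count = 1000  # assumed count of the most frequent word
for rank in range(1, 6):
    # Zipf: frequency is inversely proportional to rank.
    print(rank, round(top_count / rank))
# 1 1000, 2 500, 3 333, 4 250, 5 200
```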

5
Q

What is Luhn’s Hypothesis

A

The words exceeding the upper cut-off are considered to be common
Those below the lower cut-off are rare, and therefore not considered to contribute significantly to the content of the article
The words in between are the most significant for representing the content

6
Q

What is vector representation

A

A way to implement the BoW model
Each document is represented as a vector
d = [w1,w2,w3,…]
Where w1 is the weight of term 1

This is a very high-dimensional representation
Millions of dimensions
Very sparse vectors - most entries are zero
0 if a word does not occur in the document, 1 if it does (when unweighted)
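
A minimal sketch of building such vectors over a toy collection (the documents are assumptions); it shows the unweighted 0/1 case, and a real collection would use sparse data structures rather than plain lists:

```python
docs = ["the cat sat", "the dog sat on the mat"]  # toy documents (assumed)

# The vocabulary defines the dimensions of the vector space.
vocab = sorted({w for d in docs for w in d.split()})

def doc_vector(doc):
    # Unweighted case: 1 if the word occurs in the document, 0 otherwise.
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

print(vocab)
for d in docs:
    print(doc_vector(d))
```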

7
Q

What does it mean to represent a term as an incidence

A

1 if the word occurs in the document, 0 if it does not

8
Q

What is document frequency (dft)

A

The number of documents that contain term t
Inverse measure of the ‘informativeness’ of t

9
Q

What is the inverse document frequency

A

idft = log10(N / dft)
– High for terms that appear in few documents
– Low for terms that appear in many documents

N = the total number of documents in the collection
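
A minimal sketch of computing dft and idft over a toy collection of three documents (the documents are assumptions); the common word gets idf 0, the rarer word a higher value:

```python
import math

# Toy collection: each document is represented as its set of terms (assumed example).
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "slept"},
]
N = len(docs)  # total number of documents in the collection

def idf(term):
    # df_t = number of documents that contain the term.
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df)

print(idf("the"))  # 0.0   -> appears in every document, uninformative
print(idf("dog"))  # ~0.48 -> appears in only one document, more informative
```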

10
Q

Why do we use log in idf

A

Log scaling dampens the effect of the ratio N / dft, so that very rare terms are not weighted overwhelmingly more than moderately rare ones

11
Q

What is tf*idf weighting

A

The most common weighting method
tf.idf = (1 + log10(tf)) × log10(N / dft)

Note: still a very sparse model
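
A minimal sketch of this weighting as a function (the counts in the example call are assumptions):

```python
import math

def tf_idf(tf, df, N):
    # tf.idf = (1 + log10(tf)) * log10(N / df); a term that is absent gets weight 0.
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# Assumed example: a term occurring 10 times in a document,
# and appearing in 100 documents out of a collection of 1,000,000.
print(tf_idf(tf=10, df=100, N=1_000_000))  # (1 + 1) * 4 = 8.0
```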

12
Q

When does tf.idf increase

A

With the number of occurrences in a document (tf)
With the rarity of the term in the collection (idf)

13
Q

Why do we adapt tf in the tf*idf equation

A

The 1+ prevents a weight of 0 for a term that occurs exactly once (since log10(1) = 0)
The log10 scaling dampens the effect of very high term frequencies

14
Q

Unigrams/bigrams/trigrams

A

uni: this, is, a, sentence
bi: this is, is a, a sentence
tri: this is a, is a sentence
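
A minimal sketch of extracting n-grams by sliding a window over the tokens; the same function yields the character n-grams of the next card when given a list of characters:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "this is a sentence".split()
print(ngrams(words, 1))        # unigrams
print(ngrams(words, 2))        # bigrams
print(ngrams(words, 3))        # trigrams
print(ngrams(list("this"), 2)) # character bigrams: ('t','h'), ('h','i'), ('i','s')
```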

15
Q

Character unigram/bigram (e.g. of the word ‘this’)

A

t, h, i, s
th, hi, is

16
Q

What is a probabilistic language model

A

a function that assigns a probability to a piece of text, so that ‘natural’ pieces have a larger probability
measures the probability of natural language utterances, giving higher scores to those that are more common

17
Q

What is a language model for words that appear independently

A

P(S) = p(w1) × p(w2) × … × p(wn)

18
Q

Simplest unigram language model

A

p(w) = #(w) / Σt #(t)
Where #(w) is the number of occurrences of the word w
and Σt #(t) is the total number of word tokens in the training data
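
A minimal sketch of this unigram model combined with the independence assumption from card 17 (the training text is an assumption):

```python
from collections import Counter

training = "the cat sat on the mat the cat slept".split()  # assumed training data
counts = Counter(training)
total = len(training)  # Sum_t #(t): total number of word tokens in the training data

def p(word):
    # p(w) = #(w) / Sum_t #(t)
    return counts[word] / total

def p_sentence(sentence):
    # Independence assumption: P(S) = p(w1) * p(w2) * ... * p(wn)
    prob = 1.0
    for w in sentence.split():
        prob *= p(w)
    return prob

print(p("the"))                   # 3/9
print(p_sentence("the cat sat"))  # (3/9) * (2/9) * (1/9)
```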

19
Q

How to deal with OOV in the language model

A

Assigning OOV words a probability of 0 is too harsh, because the model could then never handle anything unseen
Use +1 smoothing

20
Q

What is the main issue with unigram language modelling

A

Complete loss of word order - different orderings of the same sentence get the same probability

21
Q

Using the Chain rule

A

S = the cat in the hat
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|the cat in) · p(hat|the cat in the)

Markov assumption - condition only on the previous one (or two) words
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)

22
Q

Chain rule: unigram vs bigram vs trigram

A

Unigram
P(S) = p(the) · p(cat)· p(in) · p(the) · p(hat)

Bigram
P(S) = p(the) · p(cat|the) · p(in|cat) ·
p(the|in) · p(hat|the)

Trigram
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)

23
Q

Count-based estimation for bigrams

A

P(w1 | w2) = the number of times w1 appears immediately after w2, divided by the total occurrences of w2

<s> I ...
<s> To ...
<s> I ...

P(I | <s>) = 2/3
P(To | <s>) = 1/3
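
A minimal sketch of this count-based estimation, reproducing the sentence-start example above (the sentence continuations are assumptions):

```python
from collections import Counter

# Toy corpus: each sentence is padded with a start marker <s>, as in the card.
sentences = [["<s>", "I", "am"], ["<s>", "To", "be"], ["<s>", "I", "do"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1

def p(w2, w1):
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p("I", "<s>"))   # 2/3
print(p("To", "<s>"))  # 1/3
```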

24
Q

Why are n-grams insufficient language models

A

because language has long-distance dependencies

“The computer which I had just put into the machine room on the fifth floor crashed.”

But often we can get away with N-gram models

25
Q

What is an evaluation metric

A

tells us how well our model does on the test set

26
Q

What is extrinsic evaluation of LMs

A

The best form of evaluation
Run models A and B on a real task
Compare their accuracy

But it is time-consuming

27
Q

What is intrinsic evaluation of LMs

A

Don't apply the model to a task
Just assess its internal qualities
e.g. which model assigns the higher probability to a test set

But this is often a bad approximation unless the test and training data are very similar

28
Q

The overfitting of Ngrams

A
N-grams only work well for word prediction if the test corpus looks like the training corpus
We need to generalise and avoid overfitting

29
Q

Maximum likelihood estimates

A

If we see the word ‘computer’ 30 times in a 10,000-word corpus
MLE = 30 / 10,000 = 0.003
This may be a bad estimate for other corpora

30
Q

Laplace smoothing

A

Add 1 to every count (‘+1 smoothing’) so that unseen words or n-grams never get probability 0; the denominator is adjusted so the probabilities still sum to 1
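
A minimal sketch of add-1 smoothing for unigram probabilities (the training text and the single extra vocabulary slot for unseen words are assumed design choices):

```python
from collections import Counter

training = "the cat sat on the mat".split()  # assumed training data
counts = Counter(training)
N = len(training)
V = len(counts) + 1  # vocabulary size, with one extra slot reserved for unseen words

def p_laplace(word):
    # Add 1 to every count, seen or unseen, so nothing gets probability 0;
    # the denominator grows by V so the probabilities still sum to 1.
    return (counts[word] + 1) / (N + V)

print(p_laplace("the"))       # (2 + 1) / (6 + 6) = 0.25
print(p_laplace("computer"))  # (0 + 1) / (6 + 6) -> unseen, but no longer zero
```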

31
Q

How do we train for unknown words

A

Create a word token <UNK>
At training time, change unknown words (e.g. words below a frequency cut-off) to <UNK>
Use the <UNK> probabilities for any word not in the training data
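
A minimal sketch of the <UNK> procedure (the frequency threshold and the toy corpus are assumptions):

```python
from collections import Counter

training = "the cat sat on the mat the cat sat by the dog".split()  # assumed data
counts = Counter(training)
MIN_COUNT = 2  # assumed cut-off for keeping a word in the vocabulary

# Words below the cut-off are replaced by <UNK> at training time,
# so the model learns a probability for <UNK> itself.
vocab = {w for w, c in counts.items() if c >= MIN_COUNT}
train_unk = [w if w in vocab else "<UNK>" for w in training]

def normalise(tokens):
    # At test time, any word outside the training vocabulary also becomes <UNK>.
    return [w if w in vocab else "<UNK>" for w in tokens]

print(train_unk)
print(normalise("the cat chased a bird".split()))
# ['the', 'cat', '<UNK>', '<UNK>', '<UNK>']
```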