LANGUAGE MODELS AND REPRESENTATION Flashcards
What is a model
an abstract representation of something,
often in a computational form
– e.g. tossing a coin
What is the BoW representation
The simplest model
Unordered bag of words including repetitions
Disregards grammar and word order but keeps track of the frequency of each word
Can also have a bag of terms, bag of tokens, bag of stems, etc.
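A minimal sketch of building a BoW count in Python (the whitespace tokenisation and example sentence are illustrative assumptions):

```python
# Bag of words: unordered word counts, word order discarded.
from collections import Counter

doc = "the cat sat on the mat"
bow = Counter(doc.lower().split())   # naive whitespace tokenisation
print(bow)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```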
Enhancing the BoW representation
Weighting words so that more informative terms count for more
Skipping stop words
What is Zipf’s law
The frequency of any word in a given collection is inversely proportional to its rank in the frequency table
The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on
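A rough illustrative check of Zipf's law (the counts below are made up, not from a real corpus): if frequency is proportional to 1/rank, then rank x frequency should stay roughly constant.

```python
# Zipf's law sketch: rank * frequency is roughly constant for a Zipfian collection.
from collections import Counter

counts = Counter({"the": 1000, "of": 500, "and": 333, "to": 250})   # made-up counts
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(word, rank, freq, rank * freq)   # rank * freq stays near 1000
```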
What is Luhn’s Hypothesis
Words above the upper cut-off are too common to be useful
Words below the lower cut-off are too rare, and therefore not considered to contribute significantly to the content of the article
Words between the two cut-offs are the most significant for describing the content
What is vector representation
A way to implement the BoW model
Each document is represented as a vector
d = [w1,w2,w3,…]
Where w1 is the weight of term 1
This is a very high-dimensional representation
Millions of dimensions
Very sparse vectors - most entries are zero
If unweighted: 1 if the word occurs in the document, 0 if it does not
What does it mean to represent a term as an incidence
1 if the word occurs in the document, 0 if it does not
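A small sketch of turning documents into incidence vectors over a shared vocabulary (the documents and vocabulary are toy examples):

```python
# Incidence vectors: one dimension per vocabulary term, entry 1 if the term occurs.
docs = ["the cat sat", "the dog barked"]                 # toy documents
vocab = sorted({w for d in docs for w in d.split()})     # shared vocabulary

def incidence_vector(doc, vocab):
    words = set(doc.split())
    return [1 if term in words else 0 for term in vocab]

for d in docs:
    print(incidence_vector(d, vocab))   # mostly zeros in a realistic collection
```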
What is document frequency (dft)
The number of documents that contain term t
Inverse measure of the ‘informativeness’ of t
What is the inverse document frequency
idft = log10 ( N / dft )
– High for terms that appear in few documents
– Low for terms that appear in many documents
N = total number of documents in the collection
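A sketch of computing df and idf over a toy collection of three documents (the documents are assumptions for illustration):

```python
# Document frequency df_t and idf_t = log10(N / df_t) over a toy collection.
import math

docs = [{"the", "cat", "sat"}, {"the", "dog"}, {"the", "cat"}]   # toy documents as term sets
N = len(docs)                                                    # number of documents

def df(term):
    return sum(1 for d in docs if term in d)

def idf(term):
    return math.log10(N / df(term))

print(df("cat"), idf("cat"))   # 2, log10(3/2) ~ 0.18  -> somewhat informative
print(df("the"), idf("the"))   # 3, log10(1)   = 0     -> uninformative
```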
Why do we use log in idf
Log scaling dampens the effect of very large N/dft ratios, so extremely rare terms do not completely dominate the weighting
What is tf*idf weighting
Most common method for weighting
tf.idf = (1 + log10 tf) x log10(N / dft)
Note: still a very sparse model
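A minimal sketch of the weighting formula above (the example numbers are illustrative):

```python
# tf.idf = (1 + log10 tf) * log10(N / df_t), with weight 0 when the term is absent.
import math

def tf_idf(tf, df_t, N):
    if tf == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df_t)

# A term appearing 10 times in a document, and in 100 of 1,000,000 documents:
print(tf_idf(tf=10, df_t=100, N=1_000_000))   # (1 + 1) * 4 = 8.0
```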
When does tf.idf increase
With the number of occurrences in a document(tf)
with the rarity of the term in the collection(idf)
Why do we adapt tf in the tf*idf equation
The 1+ ensures that a term appearing exactly once (log10 1 = 0) still gets a non-zero weight; terms that do not appear at all get weight 0
log10 log scaling is used to dampen the effect of very high term frequencies
Unigrams/bigrams/trigrams
uni: this, is, a, sentence
bi: this is, is a, a sentence
tri: this is a, is a sentence
character unigram/ bigram
t, h, i, s
th, hi, is
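A short sketch of extracting word and character n-grams (using the example sentence from the cards):

```python
# Word and character n-grams as sliding windows of size n.
def ngrams(items, n):
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "this is a sentence".split()
print(ngrams(words, 2))          # word bigrams: this is / is a / a sentence
print(ngrams(words, 3))          # word trigrams: this is a / is a sentence
print(ngrams(list("this"), 2))   # character bigrams: th / hi / is
```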
What is a probabilistic language model
a function that assigns a probability over a piece of text so that ‘natural’ pieces have a larger probability
measures the probability of natural language utterances, giving higher scores to those that are more common
What is a language model for words that appear independently
P(S) = p(w1) x p(w2) x …x p(wn)
Simplest unigram language model
p(w) = #(w) / Σt #(t)
Where #(w) is the count of occurrences of the word w
and
Σt #(t) is the total number of word tokens in the training data
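A sketch of the maximum-likelihood unigram model and the independence assumption (the training text is a toy example):

```python
# Unigram LM: p(w) = #(w) / total tokens; P(S) = product of p(w_i).
from collections import Counter

tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(tokens)
total = len(tokens)

def p(word):
    return counts[word] / total             # 0 for unseen words (see the OOV card)

def sentence_prob(sentence):
    prob = 1.0
    for w in sentence.split():
        prob *= p(w)
    return prob

print(p("the"))                  # 2/6
print(sentence_prob("the cat"))  # (2/6) * (1/6)
```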
How to deal with OOV in the language model
Assigning OOV words probability 0 is too harsh: any sentence containing them gets probability 0, so we can never model unseen text
Use +1 (add-one) smoothing
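A sketch of add-one smoothing for the unigram model (treating the vocabulary size as the seen words plus one extra slot for unseen words is an assumption of this example):

```python
# Add-one (Laplace) smoothing: (count + 1) / (total + V), so unseen words are non-zero.
from collections import Counter

tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(tokens)
total = len(tokens)
V = len(counts) + 1                         # vocabulary size, +1 slot for OOV (assumption)

def p_smoothed(word):
    return (counts[word] + 1) / (total + V)

print(p_smoothed("the"))   # (2 + 1) / (6 + 6)
print(p_smoothed("dog"))   # (0 + 1) / (6 + 6), no longer zero
```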
What is the main issue of unigram language modelling
Complete loss of word order - different orderings of the same words get the same probability
Using the Chain rule
S = the cat in the hat
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|the cat in) · p(hat|the cat in the)
Markov assumption - take only the one (or two) previous words
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)
Chain rule: unigram vs bigram vs trigram
Unigram
P(S) = p(the) · p(cat)· p(in) · p(the) · p(hat)
Bigram
P(S) = p(the) · p(cat|the) · p(in|cat) ·
p(the|in) · p(hat|the)
Trigram
P(S) = p(the) · p(cat|the) · p(in|the cat) ·
p(the|cat in) · p(hat|in the)
Count-based estimation for bigrams
The probability that a word w1 appears after another word w2 is the count of the bigram 'w2 w1' divided by the total count of w2
<s> I ...
<s> To ...
<s> I ...
P ( I | <s>) = 2/3
P(To |<s>) = 1/3
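A sketch of count-based bigram estimation with <s> and </s> boundary markers; the sentence continuations are invented, but the start-of-sentence counts mirror the card (two sentences start with I, one with To):

```python
# Count-based bigram estimation: P(w | prev) = count(prev, w) / count(prev).
from collections import Counter

sentences = [["I", "am", "Sam"], ["To", "be"], ["I", "do"]]   # invented continuations
bigrams, unigrams = Counter(), Counter()
for s in sentences:
    padded = ["<s>"] + s + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))    # 2/3
print(p("To", "<s>"))   # 1/3
```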
Why are n-grams insufficient language models
because language has long-distance dependencies
“The computer which I had just put into the machine room on the fifth floor crashed.”
But often we can get away with N-gram models
What is an evaluation metric
tells us how well our model does on the test set
What is extrinsic evaluation of LMs
Best evaluation
Run Models A and B on a task
compare their accuracy
But it is time-consuming
What is intrinsic evaluation of LMs
Don't apply the model to a task
just assess internal qualities
e.g. which model assigns the higher probability to a held-out test set
But this is often a bad approximation unless the test and training data are very similar
The overfitting of Ngrams
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- We need to generalise and avoid overfitting
Maximum likelihood estimates
If we see the word ‘computer’ 30 times in a 10,000 word corpus
MLE = 0.003
This may be a bad estimate for other corpora
Laplace smoothing
Add 1 to every count before normalising (add-one smoothing), so unseen events get a small non-zero probability
How do we train for unknown words
Create word token <UNK>
At training time, replace unknown/rare words with <UNK>
Use the <UNK> probabilities for any word not seen in the training data
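A sketch of mapping rare and unseen words to <UNK> (the frequency threshold of 1 and the toy data are arbitrary illustrative choices):

```python
# Replace rare training words with <UNK>; map unseen test words to <UNK> too.
from collections import Counter

train_tokens = "the cat sat on the mat".split()   # toy training data
counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c > 1}   # keep only words seen more than once

def map_unk(tokens):
    return [w if w in vocab else "<UNK>" for w in tokens]

print(map_unk(train_tokens))             # rare training words become <UNK>
print(map_unk("the dog sat".split()))    # unseen test words become <UNK>
```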