LANGUAGE MODELS AND REPRESENTATION Flashcards
What is a model
an abstract representation of something,
often in a computational form
– e.g. tossing a coin
What is the BoW representation
The simplest model
Unordered bag of words including repetitions
Disregards grammar and word order but keeps track of the frequency of each word
Can also have a bag of terms, bag of tokens, bag of stems, etc.
Enhancing the BoW representation
Weighting words instead of treating them all equally
Skipping stop words
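A minimal sketch in Python of a BoW built with collections.Counter, skipping stop words; the stop-word list and sample text here are tiny illustrative assumptions:
```python
from collections import Counter

# Illustrative stop-word list; real systems use a much larger one
STOP_WORDS = {"the", "is", "a", "of", "and"}

def bag_of_words(text):
    # Lowercase and split on whitespace; a real tokenizer would do more
    tokens = text.lower().split()
    # Keep repetitions but drop stop words; Counter tracks frequencies
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(bag_of_words("the cat sat on the mat and the cat slept"))
# Counter({'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, 'slept': 1})
```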
What is Zipf’s law
The frequency of any word in a given collection is inversely proportional to its rank in the frequency table
The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, etc.
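A small sketch that prints rank × frequency for the top words; under Zipf's law the product stays roughly constant. The file corpus.txt is a hypothetical plain-text corpus, and a large one is needed for the pattern to emerge:
```python
from collections import Counter

def zipf_check(tokens, top=5):
    # Rank words by frequency; under Zipf's law, rank * frequency
    # should stay roughly constant down the frequency table
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(top), start=1):
        print(f"rank {rank}: {word!r} freq={freq} rank*freq={rank * freq}")

# corpus.txt is a placeholder; substitute any large text collection
zipf_check(open("corpus.txt").read().lower().split())
```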
What is Luhn’s Hypothesis
Words above the upper cut-off are too common to be useful
Words below the lower cut-off are rare, and therefore not considered to contribute significantly to the content of the article
Words between the two cut-offs are the most significant for characterising the document
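A sketch of Luhn-style filtering that keeps only mid-frequency terms; the lower and upper cut-offs here are arbitrary assumptions, since in practice they are tuned per collection:
```python
from collections import Counter

def significant_terms(tokens, lower=2, upper=50):
    # Hypothetical cut-offs: terms occurring more than `upper` times are
    # treated as too common, fewer than `lower` times as too rare
    counts = Counter(tokens)
    return {t: c for t, c in counts.items() if lower <= c <= upper}
```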
What is vector representation
A way to implement the BoW model
Each document is represented as a vector
d = [w1,w2,w3,…]
Where w1 is the weight of term 1
This is a very high-dimensional representation
Potentially millions of dimensions
Very sparse vectors - most entries are zero
If unweighted: 1 if a word occurs in the document, 0 if it does not
What does it mean to represent a term as an incidence
1 if the word occurs in the document, 0 if it does not
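A minimal sketch of an incidence vector over a toy vocabulary; real vocabularies run to millions of terms, so in practice these vectors are stored sparsely:
```python
def incidence_vector(doc_tokens, vocabulary):
    # 1 if the vocabulary word occurs in the document, 0 otherwise;
    # with millions of vocabulary terms most entries would be 0
    present = set(doc_tokens)
    return [1 if word in present else 0 for word in vocabulary]

vocab = ["cat", "dog", "mat", "sat"]
print(incidence_vector("the cat sat on the mat".split(), vocab))
# [1, 0, 1, 1]
```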
What is document frequency (dft)
The number of documents that contain term t
Inverse measure of the ‘informativeness’ of t
What is the inverse document frequency
idft = log10(N / dft)
– High for terms that appear in few documents
– Low for terms that appear in many documents
N = total number of documents in the collection
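A sketch computing dft and idft over a toy collection of tokenised documents, using the formula above; the documents are illustrative:
```python
import math

def idf(term, documents):
    # documents: list of token lists; N = number of documents
    N = len(documents)
    # df_t = number of documents that contain the term
    df = sum(1 for doc in documents if term in doc)
    # Guard against df = 0 (term absent from the whole collection)
    return math.log10(N / df) if df else 0.0

docs = [["cat", "sat"], ["cat", "mat"], ["dog", "ran"]]
print(idf("cat", docs))  # log10(3/2) ≈ 0.176
print(idf("dog", docs))  # log10(3/1) ≈ 0.477 (rarer term, higher idf)
```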
Why do we use log in idf
Log scaling dampens the effect of very large N/dft ratios, so extremely rare terms do not completely dominate the weighting
What is tf*idf weighting
Most common method for weighting
tf.idf = (1 + log10 tf) × log10(N / dft)
Note: still a very sparse model
When does tf.idf increase
With the number of occurrences of the term in a document (tf)
With the rarity of the term in the collection (idf)
Why do we adapt tf in the tf*idf equation
1+ is a smoothing factor: log10 1 = 0, so without it a term that appears exactly once would get a weight of 0
log10: log scaling is used to dampen the effect of very high term frequencies
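A sketch of the tf.idf weighting from the card above on a toy collection; treating an absent term as weight 0 is the usual convention:
```python
import math

def tf_idf(term, doc, documents):
    # tf = raw count of the term in this document
    tf = doc.count(term)
    if tf == 0:
        return 0.0  # convention: weight is 0 when the term is absent
    N = len(documents)
    df = sum(1 for d in documents if term in d)
    # (1 + log10 tf) x log10(N / df), as in the formula above
    return (1 + math.log10(tf)) * math.log10(N / df)

docs = [["cat", "cat", "sat"], ["cat", "mat"], ["dog", "ran"]]
print(tf_idf("cat", docs[0], docs))  # (1 + log10 2) * log10(3/2) ≈ 0.229
```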
Unigrams/bigrams/trigrams
uni: this, is, a, sentence
bi: this is, is a, a sentence
tri: this is a, is a sentence
character unigram/ bigram
t, h, i, s
th, hi, is
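A sketch of a generic n-gram function; the same sliding window works for word lists and character strings:
```python
def ngrams(items, n):
    # Sliding window of size n over a sequence (list of words or a string)
    return [items[i:i + n] for i in range(len(items) - n + 1)]

words = "this is a sentence".split()
print(ngrams(words, 2))   # [['this', 'is'], ['is', 'a'], ['a', 'sentence']]
print(ngrams("this", 2))  # ['th', 'hi', 'is']
```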