Week 2 - Word Vectors, Language Modelling Flashcards
What are embeddings
learned representations of the meanings of words
Represented as numerical vectors
What is meaning in terms of vector semantics
Meaning of a word is determined by how it is used in context within a language
Words with similar contexts tend to have similar meanings
What are the two types of sparse vectors
tf*idf
PPMI
What is tf*idf
tf (term frequency) = number of times a term appears in a document
calculated as tf(t, d) = log10(count(t, d) + 1)
idf (inverse document frequency) = inverse measure of how many documents contain the term
calculated as idf(t) = log10(N / df(t))
N = total number of documents
df(t) = number of documents containing t
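A minimal Python sketch of the tf and idf formulas above; the toy documents and function names are invented for illustration:

```python
import math

# Toy corpus: each document is a list of tokens (invented example data)
docs = [
    "sweet sorrow sweet love".split(),
    "poison hath residence".split(),
    "sweet poison love".split(),
]

def tf(term, doc):
    # tf(t, d) = log10(count(t, d) + 1)
    return math.log10(doc.count(term) + 1)

def idf(term, docs):
    # idf(t) = log10(N / df(t)), df(t) = number of docs containing t
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "sweet" appears in 2 of 3 docs, so its idf (and weight) is lower
print(tfidf("sweet", docs[0], docs))
print(tfidf("poison", docs[1], docs))
```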
What is PPMI
Positive Pointwise Mutual Information
Based on a term-term (word co-occurrence) matrix
Compares how often a target word w and context word c co-occur
With how often they occur independently
Why the "positive" in PPMI
PMI gives a score ranging from negative infinity to positive infinity
PPMI replaces all negative values with 0
Calculate PPMI(i, j) = max(log2(p_ij / (p_i* × p_*j)), 0) from the table
p_ij = co-occurrence count of target i with context j divided by the whole table total
p_i* = sum of the target word's row divided by the whole table total
p_*j = sum of the context word's column divided by the whole table total
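A minimal numpy sketch of the PPMI computation above; the co-occurrence counts are invented:

```python
import numpy as np

# Toy target x context co-occurrence counts
# (rows = target words, columns = context words)
counts = np.array([
    [2.0, 8.0, 0.0],
    [1.0, 1.0, 6.0],
])

total = counts.sum()
p_ij = counts / total                   # joint probabilities
p_i = p_ij.sum(axis=1, keepdims=True)   # row marginals  (p_i*)
p_j = p_ij.sum(axis=0, keepdims=True)   # column marginals (p_*j)

with np.errstate(divide="ignore"):      # log2(0) -> -inf, clipped below
    pmi = np.log2(p_ij / (p_i * p_j))

ppmi = np.maximum(pmi, 0.0)             # replace all negatives with 0
print(ppmi)
```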
What are the 3 types of dense vectors
word2vec
GloVe
fastText
What is word2vec
2 implementations:
Skip-gram model
Predict the context words given target word
Continuous Bag of Words (CBOW)
Predict the most likely current word(target), given the context
The two are the reverse of each other
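A minimal sketch of training both variants, assuming gensim 4.x is installed; the toy sentences are made up. The sg flag switches between the two models:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 -> skip-gram (predict context words from the target);
# sg=0 -> CBOW (predict the target word from its context)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)      # (50,) dense vector
print(cbow.wv.most_similar("cat"))   # nearest neighbours by cosine
```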
What is GloVe
Uses the number of times (frequency) that a word appears in another word's context (word2vec only uses whether words co-occur, not how often)
1) constructs a word co-occurrence matrix from a large corpus of text - relies on global (corpus level) statistics
2) computes probability ratios
compares P(k | ice) / P(k | steam) for a probe word k
high ratio -> k is much more likely to occur with ice than with steam
What does a ratio score of 1 imply (GloVe)
Likely to be non-discriminative words like "water" or "fashion", which can be cancelled out
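A small sketch of the probability-ratio idea with invented counts (echoing the ice/steam example from the GloVe paper, not real corpus statistics):

```python
import numpy as np

# Invented co-occurrence counts: columns are probe words k,
# rows are the two words being compared.
probes = ["solid", "gas", "water"]
counts = {
    "ice":   np.array([40.0, 2.0, 30.0]),
    "steam": np.array([2.0, 35.0, 28.0]),
}

# P(k | w) ~ count(w, k) / total counts for w
# (restricted to the probe columns in this toy example)
p_ice = counts["ice"] / counts["ice"].sum()
p_steam = counts["steam"] / counts["steam"].sum()

for k, ratio in zip(probes, p_ice / p_steam):
    print(k, round(ratio, 2))
# "solid" -> ratio >> 1 (discriminates ice), "gas" -> ratio << 1,
# "water" -> ratio near 1 (non-discriminative, cancelled out)
```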
What are asymmetric vs symmetric context windows
Asymmetric - only looks at context before or after the target word
Symmetric - looks at both sides
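A minimal sketch of the two window types; the function name and window sizes are illustrative:

```python
def context(tokens, i, before=2, after=2):
    # Symmetric window: before == after; asymmetric: set one side to 0
    return tokens[max(0, i - before):i] + tokens[i + 1:i + 1 + after]

tokens = "the quick brown fox jumps over".split()
print(context(tokens, 3))                     # symmetric: 2 words each side
print(context(tokens, 3, before=2, after=0))  # asymmetric: preceding only
```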
What is fastText
Handles subword representations: a word is a bag of constituent character n-grams
"eating": <ea, eat, ati, tin, …
A skip-gram model is learned for each n-gram
A word is represented as the sum of its n-grams
Can handle unknown words because of this
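A minimal sketch of the constituent n-gram idea (fastText pads words with boundary markers and actually extracts n-grams over a range of lengths, typically 3-6):

```python
def char_ngrams(word, n=3):
    # Pad with boundary markers before extracting character n-grams
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("eating"))
# ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']

# An unknown word still decomposes into known n-grams, so its vector
# can be built as the sum of its n-gram vectors even if the full word
# was never seen in training.
print(char_ngrams("eatings"))
```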
What does each dimension (element) in a sparse vector correspond to
Either:
- a document in a large corpus (in a term-document matrix, e.g., in TF-IDF)
- a word in the vocabulary (in a term-term matrix, e.g., in PPMI)
How do sparse and dense vectors differ in dimension
Sparse vectors can have a very large number of dimensions (tens of thousands or more, scaling with vocabulary size or the number of documents)
Dense vectors are typically in the hundreds (100-500)
What type of vectors do training models prefer
Dense vectors
Models learn better when there are far fewer weights to train