Week 7 - Distributional Semantics Flashcards
Semantic Processing
The computer needs to “understand” what words mean in a given context
Distributional Hypothesis
The hypothesis that we can infer the meaning of a word from the context it occurs in
Assumes contextual information alone constitutes a viable representation of linguistic items, in contrast to formal linguistics and the formal theory of grammar
Distributional Semantic Model
Generate a high-dimensional feature vector to characterise a linguistic item
Subsequently, the semantic similarity between the linguistic items can be quantified in terms of vector similarity
Linguistic Items
words (or word senses), phrases, text pieces (windows of words), sentences, documents, etc…
Semantic space
The high-dimensional space computed by the distributional semantic model, also called the embedding space, (latent) representation space, etc…
Vector distance function
Used to measure how dissimilar the linguistic items corresponding to two vectors are
Vector similarity function
Used to measure how similar the linguistic items corresponding to two vectors are
Examples of vector distance/similarity functions
Euclidean Distance
Cosine Similarity
Inner Product Similarity
Euclidean Distance
Given two d-dimensional vectors p and q:
sqrt( sum( (p_i - q_i)^2 for i=1..d ) )
Inner Product Function
Given two d-dimensional vectors p and q:
sum( p_i * q_i for i=1..d )
Cosine Function
Given two d-dimensional vectors p and q:
sum( p_i * q_i for i=1..d )
divided by
sqrt( sum( p_i^2 for i=1..d ) ) * sqrt( sum( q_i^2 for i=1..d ) )
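A minimal sketch of the three functions above, assuming NumPy is available (the vectors p and q are purely illustrative):

    import numpy as np

    def euclidean_distance(p, q):
        # square root of the sum of squared element-wise differences
        return np.sqrt(np.sum((p - q) ** 2))

    def inner_product_similarity(p, q):
        # sum of element-wise products
        return np.dot(p, q)

    def cosine_similarity(p, q):
        # inner product normalised by the lengths of the two vectors
        return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

    p = np.array([1.0, 2.0, 3.0])
    q = np.array([2.0, 0.0, 1.0])
    print(euclidean_distance(p, q), inner_product_similarity(p, q), cosine_similarity(p, q))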
Vector Space Model
count based
Algebraic model for representing a piece of text (referred to as a document) as a vector of indexed terms (e.g. words, phrases)
In the document vector, each feature value represents the count of an indexed term appearing in a relevant piece of text
Collecting many document vectors and storing them as matrix rows (or columns) results in the document-term matrix.
Might treat the context of a word as a mini-document
VSM term weighting schemes
Binary Weight
Term Frequency (tf)
Term Frequency Inverse Document Frequency (tf-idf)
VSM binary weighting
Each element in the document-term matrix is the binary presence (or absence) of a word in a document
VSM Term Frequency Weighting
Each element in the document-term matrix is the number of times a word appears in a document, called the term frequency (tf)
Inverse Document Frequency
Considers how much information the word provides, i.e. if it’s common or rare across all documents
idf(k) = log(M / m(k))
Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k
Term Frequency Inverse Document Frequency
For document i and word k
t(i,k) = tf(i,k) * idf(k)
idf(k) = log(M / m(k))
Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k
idf considers how much information the word provides, i.e. if it’s common or rare across all documents
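A minimal sketch of the three weighting schemes in Python with NumPy, using a small made-up count matrix (the documents and counts are purely illustrative):

    import numpy as np

    # Document-term count matrix: rows = documents, columns = indexed terms
    X = np.array([[2, 0, 1, 0],
                  [0, 1, 1, 0],
                  [1, 1, 0, 3]], dtype=float)

    binary = (X > 0).astype(float)            # binary weighting
    tf = X                                    # term frequency weighting
    M = X.shape[0]                            # total number of documents
    m_k = np.count_nonzero(X > 0, axis=0)     # number of documents containing each term k
    idf = np.log(M / m_k)                     # idf(k) = log(M / m(k))
    tfidf = tf * idf                          # t(i,k) = tf(i,k) * idf(k)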
VSM for word similarity
Construct two vectors using the VSM (Vector space model)
Use cosine (or inner product) similarity to compute the similarity between the word vectors
Two approaches for getting word vectors:
Based on documents
Based on local context
Context based word similarity
Instead of using a document-term matrix, use a word-term matrix, populating it with:
the co-occurrence of a term and a word within the given context windows of the word, as observed in a text collection.
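A small sketch of building such a co-occurrence matrix from a toy corpus and comparing two word vectors with cosine similarity (the corpus and window size are purely illustrative):

    import numpy as np

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
    window = 2

    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))    # word-term co-occurrence counts

    for sent in corpus:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    C[index[word], index[sent[j]]] += 1

    def cosine(p, q):
        return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

    # "cat" and "dog" occur in near-identical contexts, so their similarity is high
    print(cosine(C[index["cat"]], C[index["dog"]]))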
Context Engineering
How to choose the context for context based word similarity
Options:
- Whole document that contains a word
- All words in a wide window of the target word
- Content words only (no stop words) in a wide window of the target word
- Content words only in a narrower window of the target word
- Content words in a narrower window of the target word, which are further selected through using some lexical processing tools
Document-Term Matrix
A matrix with columns of terms, and rows of documents
Context Window Size
In the context of Context Engineering and Context based word similarity:
- Instead of the entire document, use smaller contexts, e.g. a paragraph, or a window of ±4 words
Shorter window (1-3 words) - focus on syntax
Longer window (4-10 words) - capture semantics
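A brief sketch combining the two ideas above: a ±4 word window around the target, keeping content words only (the stop-word list here is a tiny illustrative one):

    stop_words = {"the", "a", "an", "of", "on", "in", "and", "to", "is"}

    def context_window(tokens, position, size=4):
        # take up to `size` tokens either side of the target, then drop stop words
        left = tokens[max(0, position - size):position]
        right = tokens[position + 1:position + 1 + size]
        return [w for w in left + right if w not in stop_words]

    tokens = ["the", "cat", "sat", "on", "the", "soft", "mat", "in", "the", "sun"]
    print(context_window(tokens, tokens.index("mat")))   # ['sat', 'soft', 'sun']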
Benefits of low-dimensional dense vectors
- Easier to use as features in machine learning models
- Less noisy
Latent Semantic Indexing
Mathematically, it is a (truncated) singular value decomposition of the document-term matrix
From the SVD result, we compute:
document vectors: UD
term vectors: VD
Between-document similarity: (UD)(UD)^T = U D^2 U^T
Between-term similarity: (VD)(VD)^T = V D^2 V^T
The dimension k is normally set as a low value e.g. 50-1000 given 20k-50k terms
Singular Value Decomposition
Decomposing a matrix into the product of three matrix components U, D, and V^T.
If there are m documents and n terms, and X is an mxn matrix:
X = U D V^T
Each row of V is a k-dimensional vector related to a term; V has n rows (one per term)
Each row of U is a k-dimensional vector related to a document
The dimension k is normally set as a low value e.g. 50-1000 given 20k-50k terms
Truncated SVD
Choose a number of k and truncate the SVD
Each row of UD provides a k-dimensional feature vector to characterise the row object
Each row of VD provides a k-dimensional feature vector to characterise the column object
Can be applied to any data matrix, not necessarily only the document-term matrix
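A minimal sketch of truncated SVD on a document-term matrix with NumPy (the matrix is random and k is kept tiny purely for illustration):

    import numpy as np

    X = np.random.rand(4, 5)    # stand-in document-term matrix: 4 documents, 5 terms
    k = 2                       # number of latent dimensions to keep

    U, s, Vt = np.linalg.svd(X, full_matrices=False)        # X = U D V^T
    U_k, D_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # truncate to rank k

    doc_vectors = U_k @ D_k     # rows of UD: k-dimensional document vectors
    term_vectors = V_k @ D_k    # rows of VD: k-dimensional term vectors

    # Between-document similarity = (UD)(UD)^T = U D^2 U^T
    doc_similarity = doc_vectors @ doc_vectors.T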
Predictive word embedding models
Perform prediction tasks based on word co-occurrence information, e.g.:
- Whether a word appears in a context of a target word
- How many times a word appears in the context texts of a target word
Examples for training are words and their context in a text corpus
Include:
- (general) continuous bag-of-words model
- skip-gram model
- GloVe model
- …
Continuous-bag-of-words model
Assuming there are V words in the vocabulary, we are dealing with a V-class classification task.
The input of each sample contains C context words
Objective is to learn a word embedding for the vocabulary, i.e. a word embedding matrix called W with V rows and N columns (N is a hyperparameter)
Inputs to the model are one-hot encodings of the vocabulary where the non-zero element represents the context word being input.
Feature extraction component (h, the output of the hidden layer):
Copies the word embedding vectors for the context words from the rows of the embedding matrix, and averages them
Multi-class classification component:
Takes h as the feature input and assigns it to one of the word classes in the vocabulary using logistic regression (a linear classification model trained using cross-entropy loss)
W’ is an N x V matrix which denotes the multi-class classification weight matrix of the logistic regression model
Like the skip-gram model, it is based on whether two words appear in each other’s context; it does not directly take into account the number of times two words appear in each other’s context.
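A minimal sketch of the CBOW forward pass only (randomly initialised matrices and made-up context indices; the training step with cross-entropy loss is omitted):

    import numpy as np

    V, N = 10, 4                         # vocabulary size and embedding size (illustrative)
    W = np.random.rand(V, N)             # word embedding matrix (V x N)
    W_prime = np.random.rand(N, V)       # multi-class classification weight matrix (N x V)

    context_ids = [1, 5, 7]              # indices of the C context words in the vocabulary
    h = W[context_ids].mean(axis=0)      # average the context word embeddings (hidden layer output)
    scores = h @ W_prime                 # one score per word class in the vocabulary
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the V classes
    predicted_target = int(np.argmax(probs))        # most likely target word index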
Skip-gram model
Like the continuous bag-of-words model but flipped: it predicts the context words of a target word, instead of predicting the target word from its context
Like CBOW, it is based on whether two words appear in each other’s context; it does not directly take into account the number of times two words appear in each other’s context.
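A brief usage sketch, assuming the gensim library is available (its Word2Vec class implements both architectures; sg=1 selects skip-gram, sg=0 selects CBOW); the toy sentences are purely illustrative:

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv.most_similar("cat", topn=3))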
GloVe Model
Unlike CBOW and Skip-gram, it utilises the frequency with which a word appears in another word’s context in a given text corpus
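For reference, the published GloVe objective is a weighted least-squares loss over word pairs; a minimal sketch of the per-pair term (following Pennington et al., 2014, with x_ij the number of times word j appears in word i’s context):

    import numpy as np

    def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
        # weighting function f(x) caps the influence of very frequent co-occurrences
        weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
        # squared error between the embedding dot product (plus biases) and the log co-occurrence count
        return weight * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2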
What to do with a semantic space
Clustering
- grouping similar words together
Data visualisation
- mapping the semantic space to a two (or three) dimensional space
Support other NLP tasks
- To be used as the input of machine learning models. E.g. Neural networks for solving NLP tasks.
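A hedged sketch of the first two uses, assuming scikit-learn is available (the embeddings here are random stand-ins for learned word vectors):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    embeddings = np.random.rand(100, 50)    # stand-in for 100 word vectors from a semantic space

    labels = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)   # cluster similar words
    coords_2d = PCA(n_components=2).fit_transform(embeddings)          # project to 2D for plotting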
Advantages and disadvantages of distributional semantics
Advantages:
- Very practical in terms of processing
- Effective in capturing word meaning and relations, and in supporting the training of neural language models
Disadvantages:
- It is still an open issue whether statistical co-occurrences alone are enough to address deeper semantic questions
- Semantic similarity is still a vague notion. For instance, the association between “car” and “van” is different from that between “car” and “wheel” (semantic similarity vs semantic relatedness)
- What type of semantic information can be captured from context, and what part of the meaning of words remains unreachable without complementary knowledge?