MEANING VIA DISTRIBUTIONAL SEMANTICS Flashcards
What is the distributional semantic model
Basic idea:
- Generate a multi-dimensional feature vector to characterise the meaning of a linguistic item
- Subsequently, the semantic similarity between linguistic items can be quantified in terms of vector similarity, using measures like cosine similarity or the inner product between vectors
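A minimal sketch of the idea in Python (the word vectors here are made-up toy values, not real embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between the vectors: (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-dimensional feature vectors for two words
cat = np.array([0.8, 0.1, 0.6])
dog = np.array([0.7, 0.2, 0.5])

print(cosine_similarity(cat, dog))  # close to 1.0 -> similar meanings
print(np.dot(cat, dog))             # inner product: unnormalised alternative
```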
What are types of linguistic items
words or sub-words, phrases, text pieces (windows of words), sentences, documents, etc
What is the distributional hypothesis
suggests that one can infer the meaning of a word just by looking at the context it occurs in
How does distributional semantics assess linguistic items
assumes that contextual information alone constitutes a viable representation of linguistic items, in contrast to formal linguistics and the formal theory of grammar
What is the Vector space model (VSM)
The simplest distributional semantic model
Can build:
-A document-term matrix
-A term-context occurrence matrix
What is a document-term matrix
Columns : documents
Rows : terms
Each cell holds how many times the term appears in that document
Then calculate the similarity between words using the row vectors
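A toy sketch of building such a matrix in Python (the corpus is invented for illustration):

```python
import numpy as np

# Toy corpus: three tiny "documents"
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for d in docs for w in d.split()})

# Rows: terms, columns: documents; each cell counts the term in that document
X = np.array([[d.split().count(t) for d in docs] for t in vocab])

print(vocab)                     # ['cat', 'dog', 'ran', 'sat', 'the']
print(X)                         # 5 terms x 3 documents
cat = X[vocab.index("cat")]      # row vector representing "cat"
dog = X[vocab.index("dog")]      # compare rows (e.g., cosine) for similarity
```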
What is a term-context occurrence matrix
First define a term's context (e.g., 3 words before and 3 words after the term)
Columns : terms
Rows : words from the general vocabulary
Each cell holds how many times the vocabulary word appears in the context of the term
(co-occurrences of a term and a word from vocab within the context window)
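A minimal sketch of counting co-occurrences within a window (toy sentence, window size 3):

```python
from collections import defaultdict

tokens = "the cat sat on the mat near the dog".split()
window = 3  # 3 words before and 3 words after the term

# cooc[term][word] = times `word` occurs within the window around `term`
cooc = defaultdict(lambda: defaultdict(int))
for i, term in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[term][tokens[j]] += 1

print(dict(cooc["cat"]))  # context counts for the term "cat"
```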
How do we decide on a context window size
Shorter windows (1-3 words) - indicate we are focused on syntax
Larger windows - indicate we want to capture semantics
Handling a large context window requires a more expressive model, more data, and more computing
What are Sparse word vectors
VSM creates high-dimensional sparse word vectors
-Very high dimensions, e.g., 20,000-50,000.
-Expensive to store.
-Very sparse with most elements equal to zero.
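As an illustration of why this matters for storage, a sketch using scipy's compressed sparse row format (a common choice, not prescribed by these notes):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Mostly-zero toy count matrix (stand-in for a 20,000 x 50,000 VSM matrix)
dense = np.zeros((1000, 5000))
dense[0, 3] = 2.0
dense[7, 42] = 1.0

sparse = csr_matrix(dense)  # stores only the non-zero entries
print(sparse.nnz)           # 2 non-zeros out of 5,000,000 cells
print(dense.nbytes)         # 40,000,000 bytes as a dense array
```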
What are dense word vectors
We can also create low-dimensional dense vectors
-Comparatively lower dimensions, e.g., 50-1000.
-Mostly non-zero elements.
This gives the benefit of being:
-easier to use as features in machine learning models,
-less noisy
What is Latent semantic indexing
A classical method for low-dimensional dense representation from a document-term matrix
Mathematically (in linear algebra), the method is just the singular value decomposition (SVD)
How to carry out latent semantic indexing
We decompose our document-term matrix into 3 matrices (SVD)
X = U D V^T
D is a square diagonal matrix: its only non-zero values (the singular values) are on the diagonal
U and V have orthonormal columns; the rows of U correspond to terms and the rows of V to documents
We can calculate term vectors : rows of UD
We can calculate document vectors : rows of VD
The dimension of these vectors is k, and we can choose which k to use - usually a low value, e.g., 50-1000 given 20,000-50,000 terms
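A minimal numpy sketch of the decomposition on a toy term-document matrix (the counts are invented):

```python
import numpy as np

# Toy term-document matrix X (rows: terms, columns: documents)
X = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt
k = 2                                             # chosen latent dimension

term_vecs = U[:, :k] * s[:k]     # one k-dimensional row vector per term
doc_vecs = Vt[:k, :].T * s[:k]   # one k-dimensional row vector per document

print(term_vecs.shape, doc_vecs.shape)  # (4, 2) (3, 2)
```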
What is Truncated SVD
reduces the dimensionality of the original matrix by selecting a k value lower than the rank of the matrix
Useful when dealing with large, sparse matrices, as it allows for a more compact representation while retaining significant information
(can be applied to any sparse matrix not just document-term)
-may lose information (the result is an approximation)
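A sketch using scikit-learn's TruncatedSVD, one common implementation (the matrix here is random filler, just to show the shapes):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Large, sparse stand-in matrix: only 1% of entries are non-zero
X = sparse_random(2000, 10000, density=0.01, random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)  # each row compressed to 100 dense dims

print(X_reduced.shape)                      # (2000, 100)
print(svd.explained_variance_ratio_.sum())  # information retained
```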
Choosing k (degrees of freedom)
If we get 3 repeating vector patterns in our document-term matrix, this means we have 3 DOF and should choose k=3
If we choose a lower k value we will lose information (doing an approximation)
We can choose a higher value to be more accurate, but that means more computation and more storage space
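A sketch of reading off the degrees of freedom from the singular values (the matrix is constructed to have exactly 3 patterns):

```python
import numpy as np

# Construct a matrix with exactly 3 underlying patterns (rank 3)
rng = np.random.default_rng(0)
X = rng.random((50, 3)) @ rng.random((3, 80))

s = np.linalg.svd(X, compute_uv=False)  # singular values, largest first
k = int(np.sum(s > 1e-10 * s[0]))       # count the non-negligible ones
print(k)  # 3 -> choose k = 3; lower k approximates, higher k costs more
```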
What are predictive word embedding models
Calculate word embeddings by performing prediction tasks based on word co-occurrence information
1) Define your context.
2) Define prediction task
E.g., predict whether a word appears in the context of a target word (word2vec, which has two versions: CBOW and skip-gram)
Or predict how many times a word appears in the context texts of a target word (GloVe)
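A minimal sketch of training a skip-gram word2vec model with the gensim library (assuming gensim 4.x; the toy corpus is far too small for meaningful embeddings):

```python
from gensim.models import Word2Vec

# Toy tokenised corpus; real training needs millions of sentences
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 -> skip-gram (predict context words from the target word);
# sg=0 -> CBOW (predict the target word from its context)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["cat"].shape)            # (50,) dense word embedding
print(model.wv.similarity("cat", "dog"))
```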