lecture 6 Flashcards
Different forms of (static) word embeddings
(representing words mathematically)
- One-hot vector: Vector length = vocabulary size; value of each word is 1 at its index, 0 elsewhere
- TF-IDF: Multiply the # of occurrences of a word in a document (term frequency, TF) by the inverse document frequency (IDF, typically the log of the total # of documents divided by the # of documents containing the word)
- Skip-Gram: Predict surrounding/context words for a given input/target word (Word2Vec)
- CBOW (Continuous Bag of Words): Predict the target word given surrounding/context words (Word2Vec)
- FastText: Similar to Word2Vec but using n-gram variations of (sub)words
- GloVe: Similar to Word2Vec, but uses corpus statistics and a co-occurrence matrix to capture more global context
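A minimal sketch of the TF-IDF idea from this list, assuming a toy three-document corpus; the `tf_idf` helper and the data are illustrative, not from the lecture:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative data).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

def tf_idf(term, doc, docs):
    """Term frequency in one document times inverse document frequency."""
    tf = Counter(doc)[term]                        # raw count in this document
    df = sum(1 for d in docs if term in d)         # documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0  # log(N / df); 0 if term is unseen
    return tf * idf

print(tf_idf("cat", docs[0], docs))   # "cat" appears in 2 of 3 docs -> lower IDF
print(tf_idf("mat", docs[0], docs))   # "mat" appears in 1 of 3 docs -> higher IDF
```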
universal dependencies tagset classes
- open class
- closed class words
- other
UDT: open class
- adj: adjective - noun modifier describing properties (bnw)
- adv: adverb - verb modifiers of time, place, manner (bijwoord)
- noun: words for persons, places, things (znw)
- verb: words for actions and processes (werkwoord)
- propn: proper noun - name of a person, organization, place, etc.
- intj: interjection - exclamation, greeting, yes/no response
UDT: closed class words
- adp: adposition - spatial, temporal, etc. relation (voorzetsel)
- aux: auxiliary - helping verb marking tense (can, may, should, are)
- cconj: coordinating conjunction - joins two phrases (and, or, but)
- num: numeral (one, two, first)
- part: particle - preposition-like word used together with a verb (up, down, on, off)
- pron: pronoun - shorthand for referring to an entity or event (she, who, I, others)
- sconj: subordinating conjunction - joins a main clause with a subordinate clause (that, which)
UDT: other
- punct - punctuation
- sym - symbols like $
- x - other (asdf, qwfg)
Why might it be useful to predict upcoming words or assign probabilities to sentences?
predicting upcoming words and assigning probabilities to sentences lets a system prefer the most likely interpretation or output, which underpins applications such as speech recognition, machine translation, spelling and grammar correction, and autocomplete, making these technologies more intuitive, accurate, and efficient
one-hot vector
- localist
- uniquely identifies each word with a sparse vector of zeros and a single one
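A tiny sketch of this construction, assuming an illustrative four-word vocabulary; the `one_hot` helper name is made up:

```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]            # toy vocabulary (illustrative)
index = {w: i for i, w in enumerate(vocab)}     # each word gets a unique index

def one_hot(word):
    """Sparse vector of zeros with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("dog"))   # [0. 1. 0. 0.]  (vector length = vocabulary size)
```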
word2vec + process
- learns word embeddings by predicting word contexts
- skip-gram algorithm
- start with large collection of text, essentially a vast list of words
- every word is represented by a vector
- go through each position t in a text, which has a center c and context words o
- use similarity of the word vectors for c and o to calculate the probability of c given o/ o given c
- keep adjusting word vectors to maximize this probability
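A hedged sketch of the prediction step above: turning dot-product similarity between a center vector and every context vector into P(o|c) with a softmax. The random initialisation, dimensionality, and helper name are illustrative assumptions, not the lecture's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8                                          # embedding dimensionality (illustrative)

# Two randomly initialised vectors per word: v (as center) and u (as context).
V = rng.normal(size=(len(vocab), dim))           # center vectors
U = rng.normal(size=(len(vocab), dim))           # context vectors

def p_context_given_center(center_idx):
    """Softmax over dot products: P(o | c) for every possible context word o."""
    scores = U @ V[center_idx]                   # similarity of the center with every context vector
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return exp / exp.sum()

probs = p_context_given_center(vocab.index("cat"))
for word, p in zip(vocab, probs):
    print(f"P({word} | cat) = {p:.3f}")
# Training would keep adjusting U and V so true context words get higher probability.
```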
word embeddings
- distributed
- map words to dense vectors in a continuous vector space
cosine similarity
quantifies similarity between two vectors by calculating the cosine of the angle between them
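A small sketch of this definition; the example vectors are arbitrary:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a
c = np.array([-3.0, 0.0, 1.0])    # orthogonal to a

print(cosine_similarity(a, b))    # 1.0 (identical direction)
print(cosine_similarity(a, c))    # 0.0 (orthogonal directions)
```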
representing words as discrete symbols
- localist, e.g., one-hot
- vector dimension = number of words in the vocabulary
problem with representing words as discrete symbols + solution
- this method makes words distinct from each other, but doesn't capture relationships between them
- any two different one-hot vectors are orthogonal (dot product = 0), so there is no notion of similarity
- bad solution: wordnet synonyms for similarity
- good solution: encode similarity in the vectors themselves
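A tiny demo of the orthogonality problem, assuming a toy four-word vocabulary:

```python
import numpy as np

# Two different words as one-hot vectors in a 4-word vocabulary (toy example).
cat = np.array([1.0, 0.0, 0.0, 0.0])
dog = np.array([0.0, 1.0, 0.0, 0.0])

# The 1s never line up, so the dot product is always 0:
print(np.dot(cat, dog))   # 0.0 -> "cat" and "dog" look completely unrelated
```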
what drives semantic similarity
- meaning: two concepts are close in terms of meaning (semantic closeness)
–> accidental & inadvertent
- world knowledge: two concepts have similar properties, often occur together, or occur in similar contexts
–> UPS & FedEx
- psychology: two concepts fit together in an overarching psychological schema or framework
–> millennial & avocado
how do we approximate semantic similarity
representing words by their context: distributional semantics
distributional semantics
a word’s meaning is given by the words that appear frequently close-by
‘you shall know a word by the company it keeps’
context in distributional semantics
the set of words that appear nearby a target word within a fixed-size window
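A small sketch of extracting a fixed-size context window around a target word; the sentence and the `context` helper are illustrative:

```python
# Toy illustration of "context": the words within a fixed-size window around the target.
tokens = "you shall know a word by the company it keeps".split()

def context(tokens, t, window=2):
    """Return the words within `window` positions of the target at position t."""
    left = tokens[max(0, t - window):t]
    right = tokens[t + 1:t + 1 + window]
    return left + right

t = tokens.index("word")
print(context(tokens, t))   # ['know', 'a', 'by', 'the']
```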
word vectors (word embeddings/word representations)
- distributed representation
- we build a dense word vector for each word, chosen so that it’s similar to the vectors of words that appear in similar contexts
- measures similarity as the dot product
–> by contrast, the dot product between two different one-hot vectors = 0
properties of dense word embeddings
- they encode semantic and syntactic relationships
- we can probe relations between words using vector arithmetic
–> king - male + female ≈ queen
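A toy sketch of probing relations with vector arithmetic; the three-dimensional embeddings are hand-made so the analogy works and are not real learned vectors:

```python
import numpy as np

# Tiny hand-made embeddings chosen so the analogy works (purely illustrative).
emb = {
    "king":   np.array([1.0, 0.0, 1.0]),
    "queen":  np.array([0.0, 1.0, 1.0]),
    "male":   np.array([1.0, 0.0, 0.0]),
    "female": np.array([0.0, 1.0, 0.0]),
}

def nearest(vec, exclude=()):
    """Word whose embedding has the highest cosine similarity to `vec`."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["male"] + emb["female"]
print(nearest(target, exclude={"king", "male", "female"}))   # queen
```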
problem with word embeddings
they can reinforce and propagate biases present in the data they are trained on
skip-gram algorithm
- learning word vectors to predict the surrounding words
- randomly initialize word vector for each word in the vocabulary
- go through each position t (with c and o)
–> to identify context words, we define a window of size j, which means our model will look at words in position t +/- j as the context
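A minimal sketch of the pair-extraction step described above; the sentence and window size are illustrative:

```python
# Sketch of the pair-extraction step: for every position t, pair the center word
# with each context word at positions t-j .. t+j (window size j is a hyperparameter).
tokens = "the quick brown fox jumps over the lazy dog".split()
j = 2   # window size (illustrative)

pairs = []
for t, center in enumerate(tokens):
    for offset in range(-j, j + 1):
        if offset == 0 or not (0 <= t + offset < len(tokens)):
            continue
        pairs.append((center, tokens[t + offset]))   # (center, context) training pair

print(pairs[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'),
#  ('quick', 'brown'), ('quick', 'fox'), ('brown', 'the')]
```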
word2vec objective function
- likelihood L(theta): maximizing the likelihood of context words (u, o) given the center word (v, c)
–> P(o|c)
- objective function J(theta): average negative log-likelihood
–> the log of the likelihood, multiplied by -1, averaged over all positions
–> we want to minimize the objective function
minimizing objective function <–> maximizing predictive accuracy
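The likelihood and objective from the cards above, written out in one standard formulation of the skip-gram objective (T = number of positions, m = window size, V = vocabulary); this is the usual textbook form, not necessarily the lecture's exact notation:

```latex
% Likelihood of the context words within a window of size m around each position t:
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)

% Objective: the average negative log-likelihood, which we minimize:
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)

% With center vector v_c and context vector u_o, the probability is a softmax over dot products:
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```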