lecture 6 Flashcards
Different forms of (static) word embeddings
(representing words mathematically)
- One-hot vector: Vector length = vocabulary size; value of each word is 1 at its index, 0 elsewhere
- TF-IDF: Multiply # of occurrences of a word in a document (term frequency, TF) by its inverse document frequency (IDF: total # of documents divided by # of documents containing the word, usually log-scaled); see the sketch after this list
- Skip-Gram: Predict surrounding/context words for a given input/target word (Word2Vec)
- CBOW: Predict target word given surrounding/context words (Word2Vec)
- FastText: Similar to Word2Vec but using n-gram variations of (sub)words
- GloVe: Similar to Word2Vec, but uses corpus statistics and a co-occurrence matrix to capture more global context
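A minimal pure-Python sketch of the TF-IDF computation described above (using the log-scaled IDF variant; the toy documents are made up for illustration):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (made-up example)
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

def tf_idf(term, doc, docs):
    # term frequency: how often the term occurs in this document
    tf = Counter(doc)[term]
    # document frequency: in how many documents the term occurs
    df = sum(1 for d in docs if term in d)
    # inverse document frequency (log-scaled to dampen the ratio)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # "cat" appears in 2 of 3 docs -> low IDF
print(tf_idf("mat", docs[0], docs))  # "mat" appears in 1 doc      -> higher score
```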
universal dependencies tagset classes
- open class
- closed class words
- other
UDT: open class
- adj: noun modifier describing properties (bnw)
- adv: adverb - verb modifiers of time, place, manner (bijwoord)
- noun: words for persons, places, things (znw)
- verb: words for actions and processes (werkwoord)
- propn: proper noun - name of a person, organization, place, etc.
- intj: interjection - exclamation, greeting, yes/no response
UDT: closed class words
- adp: adposition - spatial, temporal, etc. relation (voorzetsel)
- aux: auxiliary - helping verb marking tense (can, may, should, are)
- cconj: coordinating conjunction - joins two phrases (and, or, but)
- num: numeral (one, two, first)
- part: particle - preposition-like word used together with a verb (up, down, on, off)
- pron: pronoun - shorthand for referring to an entity or event (she, who, I, others)
- sconj: subordinating conjunction - joins a main clause with a subordinate clause (that, which)
UDT: other
- punct - punctuation
- sym - symbols like $
- x - other (asdf, qwfg)
Why might it be useful to predict upcoming words or assign probabilities to sentences?
predicting upcoming words and assigning probabilities to sentences underlies many language applications (speech recognition, spelling and grammar correction, machine translation, autocomplete): choosing the more probable word sequence makes these systems more accurate and natural to interact with
one-hot vector
- localist
- uniquely identifies each word with a sparse vector of zeros and a single one
word2vec + process
- learns word embeddings by predicting word contexts
- skip-gram algorithm
- start with large collection of text, essentially a vast list of words
- every word is represented by a vector
- go through each position t in a text, which has a center c and context words o
- use the similarity of the word vectors for c and o to calculate the probability of o given c (or c given o)
- keep adjusting word vectors to maximize this probability
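A minimal sketch of this process using the gensim library's skip-gram implementation (assuming gensim 4.x is installed; the toy sentences are made up, so the learned vectors will not be meaningful):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (invented for illustration)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=1 selects the skip-gram algorithm (sg=0 would be CBOW);
# window is the context size, vector_size the embedding dimension d
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

vec = model.wv["cat"]                # the learned 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```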
word embeddings
- distributed
- map words to dense vectors in a continuous vector space
cosine similarity
quantifies similarity between two vectors by calculating the cosine of the angle between them
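A minimal NumPy sketch of cosine similarity as defined above:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))   # 1.0: same direction
print(cosine_similarity(a, -b))  # -1.0: opposite direction
```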
representing words as discrete symbols
- localist, e.g., one-hot
- vector dimension = number of words in the vocabulary
problem with representing words as discrete symbols + solution
- this method makes words distinct from each other, but doesn't capture relationships between them
- any two one-hot vectors are orthogonal, so their dot product is 0 and they express no similarity
- bad solution: wordnet synonyms for similarity
- good solution: encode similarity in the vectors themselves
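A small NumPy illustration of the orthogonality problem with one-hot vectors (the toy vocabulary is made up for the example):

```python
import numpy as np

vocab = ["hotel", "motel", "cat"]  # toy vocabulary

def one_hot(word, vocab):
    # sparse vector: 1 at the word's index, 0 elsewhere
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel = one_hot("hotel", vocab), one_hot("motel", vocab)
# "hotel" and "motel" are semantically close, but their one-hot
# vectors are orthogonal: the dot product is always 0
print(np.dot(hotel, motel))  # 0.0
```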
what drives semantic similarity
- meaning: two concepts are close in terms of meaning (semantic closeness)
–> accidental & inadvertent
- world knowledge: two concepts have similar properties, often occur together, or occur in similar contexts
–> UPS & FedEx
- psychology: two concepts fit together in an overarching psychological schema or framework
–> millennial & avocado
how do we approximate semantic similarity
representing words by their context: distributional semantics
distributional semantics
a word’s meaning is given by the words that appear frequently close-by
‘you shall know a word by the company it keeps’
context in distributional semantics
the set of words that appear nearby a target word within a fixed-size window
word vectors (word embeddings/word representations)
- distributed representation
- we build a dense word vector for each word, chosen so that it’s similar to the vectors of words that appear in similar contexts
- measures similarity as the dot product
–> dot products for one-hot vectors = 0
properties of dense word embeddings
- they encode semantic and syntactic relationships
- we can probe relations between words using vector arithmetic
–> king - man + woman ≈ queen
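A hedged sketch of probing such relations with pre-trained vectors via gensim's downloader (assuming internet access, that the `glove-wiki-gigaword-100` model is available, and that the queried words are in its vocabulary):

```python
import gensim.downloader as api

# Downloads pre-trained GloVe vectors on first use (about 128 MB)
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# cosine similarity between two words (echoing the UPS & FedEx example above)
print(wv.similarity("ups", "fedex"))
```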
problem with word embeddings
they can reinforce and propagate biases present in the data they are trained on
skip-gram algorithm
- learning word vectors to predict the surrounding words
- randomly initialize word vector for each word in the vocabulary
- go through each position t (with c and o)
–> to identify context words, we define a window of size j, which means the model looks at the words at positions t - j up to t + j (excluding t itself) as the context (see the sketch below)
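A small pure-Python sketch of extracting (center, context) training pairs with a window of size j = 2 (toy sentence chosen for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every position t in the text."""
    pairs = []
    for t, center in enumerate(tokens):
        # context = words within `window` positions of t, excluding t itself
        start, end = max(0, t - window), min(len(tokens), t + window + 1)
        for o in range(start, end):
            if o != t:
                pairs.append((center, tokens[o]))
    return pairs

tokens = ["problems", "turning", "into", "banking", "crises"]
for center, context in skipgram_pairs(tokens, window=2):
    print(center, "->", context)
```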
word2vec objective function
- likelihood L(theta): maximize the likelihood of the context words (u, o) given the center word (v, c)
–> P(O|C)
- objective function J(theta): the average negative log-likelihood
–> the log of the likelihood, averaged over the corpus and multiplied by -1
–> we want to minimize the objective function
minimizing objective function <–> maximizing predictive accuracy
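In symbols (a standard formulation with corpus length T and window size m; the notation is assumed rather than quoted from the slides):

```latex
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t ; \theta)
\qquad
J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t ; \theta)
```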
calculating L(theta) = P(O|C)
softmax function: maps all arbitrary values Xi to a probability distribution Pi (see the formula below)
- numerator: exp(dot product of the context vector u_o and the center vector v_c)
- divided by a normalization term (a sum over the whole vocabulary)
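The resulting probability (the standard skip-gram softmax, with u_o the context vector and v_c the center vector):

```latex
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```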
P(O|C): dot product
compares similarity of O and C.
a larger dot product means higher probability = higher similarity
P(O|C): exponentiation
exp of the dot product ensures a positive result
P(O|C): normalization
converts results into valid probabilities
= sum of all exponentiated dot products for all word pairs in the vocabulary
training the W2V skip-gram model
- gradually adjust all parameters in theta to minimize loss
- theta represents all model parameters (i.e., all word vectors) in one long vector. each word has 2 vectors: one as a center word (v, c) and one as a context word (u, o).
- with d-dimensional vectors, and V-many words, theta = 2dV parameters.
- optimize parameters with gradient descent
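The gradient-descent update (with learning rate alpha; standard form, not quoted from the slides):

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```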
redefining context windows
- maximum size of context window: different sizes can affect the quality and nature of the embedding
- weighting scheme: the model weights context words based on their distance from the target word
- relative position: symmetric, or left-/right-side only
- linguistic boundaries: e.g., sentence endings
sequence labeling
- assigning a categorical label to each element in a sequence of inputs (POS tags)
- pattern recognition task
- input: sequence of length n (x = x1…xn) (words)
- output: sequence of length n (y = y1…yn). each yi is a label of xi. (POS tags)
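A tiny concrete instance of this input/output format (the tags were chosen by hand for illustration):

```python
# input: sequence of words x = x1..xn
x = ["I", "love", "natural", "language", "processing"]
# output: one label yi per token xi (Universal Dependencies POS tags)
y = ["PRON", "VERB", "ADJ", "NOUN", "NOUN"]

for word, tag in zip(x, y):
    print(f"{word}\t{tag}")
```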
POS
- tags that allow us to label sequences in ways that mimic human understanding of them
- first step towards syntactic analysis
POS key questions
- given a sequence of tokens, how can we predict linguistic tags for each one
- how can we exploit the sequential nature of this data to model hidden features and structural dependencies that will help downstream tasks
motivation for POS tagging
- humans produce and process natural language sequentially
- but these sequences possess hidden structure (i.e., lexical, semantic, and syntactic structure) that constrains the possible sequences
- so, merely having the right words and tags isn't sufficient for proper interpretation and meaningful communication
- hidden structures allow humans to generalize to new infinite sequences (recursion)
how can we replicate human linguistic comprehension in NLP
- primary goal: given a sequence of tokens, predict (hidden) linguistic tags that allow for generalization across categories
- secondary goal: exploit additional hidden linguistic structure to make this task more accurate
–> i.e., the use of additional contextual and syntactic information that is not immediately apparent from the sequence of tokens alone
tasks that build on POS tagging
- dependency parsing
- semantic parsing
- coreference resolution
- information extraction
- question answering
sequence labelling tasks
- POS tagging
- named entity recognition (NER)
- BIO chunking
two approaches for mimicking human language
- machine learning: input –> feature extraction –> classification –> output
- deep learning: input –> feature extraction + classification (both automated) –> output
3 kinds of ML algorithms
- supervised
- unsupervised
- semi-supervised
feature engineering (ML)
- technique that leverages the data to create new features (variables) that aren't explicitly in the training set (applies to both supervised and unsupervised learning)
- goal: simplifying and speeding up data transformations and enhancing model accuracy
- use domain knowledge of the data to extract important linguistic information (features) and train a model on these
ML in practice: 2 tasks
- describing data with features a computer can understand
–> requires expert knowledge
- learning algorithm
–> optimizing weights on the features
sequence labelling classification can be:
- independent: each member is treated independently
- dependent: each member is dependent on other members for its label
data in sequence labelling
open and closed classes
- closed class
–> english uses closed class words to express syntactic relations between words. these give clues to open class words.
–> fixed membership
- open class
–> membership is larger and can grow
POS is a disambiguation task
though word classes do share semantic tendencies, POS is primarily defined by
- grammatical relationship with neighboring words
- morphological properties of affixes: these help distinguish between words that are morphologically related but serve different roles in sentences
extent of POS ambiguity
many high frequency words have more than one POS tag
around 50% of tokens are ambiguous
methods for POS
- lexical-based: assign each word the POS tag it occurs with most frequently in the training data
- rule-based: assign POS tags based on rules
- probabilistic
- deep learning
hidden markov models (HMM)
- based on augmenting the markov chain
- allow us to talk about observed events (words) and hidden events (POS tags)
- markov assumption: P(yi = a | yi-1)
–> strong assumption: if we want to make a prediction about a future state, only the current state matters
–> this simplifies computation by disregarding the entire history of previous states
- equivalent to a bigram model
markov chain
a model that tells us something about the probabilities of sequences of random variables (states), which take values from some set
- states are words (nodes)
- probabilities of transitioning from one state to another are transitions (edges)
- multiply starting probability by transition probabilities
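A minimal NumPy sketch of scoring a state sequence with a Markov chain (start distribution times transition probabilities; the states and numbers are invented for illustration):

```python
import numpy as np

states = ["HOT", "COLD"]
pi = np.array([0.8, 0.2])          # start distribution (sums to 1)
# transition matrix A: A[i, j] = P(next state = j | current state = i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def sequence_probability(seq):
    """Multiply the starting probability by the transition probabilities."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, nxt in zip(idx, idx[1:]):
        p *= A[prev, nxt]
    return p

print(sequence_probability(["HOT", "HOT", "COLD"]))  # 0.8 * 0.7 * 0.3
```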
start distribution pi
vector representing the probability of starting in each possible state (i.e., its entries sum to 1)
HMM =
- hidden MC + observed variables
- transition matrix + emission matrix
viterbi algorithm (HMM)
- find optimal sequence of hidden states
- most likely weather (T) sequence for the observed mood (E) sequence
- find the hidden-state sequence with maximum joint probability; dynamic programming avoids explicitly computing the probability of every possible permutation (see the sketch below)
–> argmax P(x1, y1, ..., xn, yn)
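A compact NumPy sketch of the Viterbi algorithm for the weather/mood example (the transition and emission probabilities are invented for illustration):

```python
import numpy as np

hidden = ["HOT", "COLD"]            # hidden states (weather)
observed = ["happy", "grumpy"]      # observations (mood)

pi = np.array([0.6, 0.4])           # start distribution over hidden states
A = np.array([[0.7, 0.3],           # transition matrix: P(next | current)
              [0.4, 0.6]])
B = np.array([[0.8, 0.2],           # emission matrix: P(observation | state)
              [0.3, 0.7]])

def viterbi(obs_seq):
    obs = [observed.index(o) for o in obs_seq]
    n, k = len(obs), len(hidden)
    v = np.zeros((n, k))            # v[t, s] = best prob of a path ending in s at t
    back = np.zeros((n, k), dtype=int)

    v[0] = pi * B[:, obs[0]]
    for t in range(1, n):
        for s in range(k):
            scores = v[t - 1] * A[:, s] * B[s, obs[t]]
            back[t, s] = np.argmax(scores)
            v[t, s] = np.max(scores)

    # follow backpointers from the best final state
    path = [int(np.argmax(v[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [hidden[s] for s in reversed(path)]

print(viterbi(["happy", "happy", "grumpy"]))  # most likely weather sequence
```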
from ML to DL
- ML
- HMMs let us observe a sequence (x) and predict the labels of the sequence tokens (y)
- only considers the previous state, and often relies on hand-crafted features or language-specific resources for performance
- can be combined with other algorithms to consider the full context (backward/forward)
- DL
- can automatically learn complex features of the data and model multiple dependencies between tokens
- generalizable to out-of-vocabulary (OOV) words and languages
reasons for exploring deep learning
- large amounts of training data favor deep learning
- faster machines and multicore CPU/GPUs favor DL
- new models, algorithms, ideas
- better, more flexible learning of intermediate representations
- effective end-to-end joint system learning
- effective learning methods for using contexts and transferring between tasks
- improved performance
perceptron
- basic building block of a neural network
- takes one or more inputs and produces a single output
- each input is multiplied by a weight, and the products are summed together
- the sum is then passed through an activation function, which determines the output of the perceptron
- output = activation_function(weighted_sum_of_inputs + bias)
- weighted sum of inputs = dot product of input vector and weight vector
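A minimal NumPy sketch of a single perceptron's forward pass as described (a step activation is chosen for illustration):

```python
import numpy as np

def step(z):
    # activation function: outputs 1 if the input exceeds 0, else 0
    return 1.0 if z > 0 else 0.0

def perceptron(x, w, b):
    # weighted sum of inputs = dot product of input and weight vectors, plus bias
    return step(np.dot(x, w) + b)

# Example: a perceptron computing logical AND
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```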
feed-forward network
- multiple layers of neurons
- with non-linear activation functions, can also solve problems that are not linearly separable (e.g., XOR), unlike a single perceptron
- the weights are what training optimizes; they store the network's knowledge in a distributed, unlabeled form
feed forward network applications
- text classification: sentiment analysis, language detection
- unsupervised learning: word2vec, dimension reduction
to train a NN you need
- training set: ordered pairs, each consisting of an input and its target output
- loss function: a function to be optimized (e.g., cross entropy)
- optimizer: a method for adjusting weights (e.g., gradient descent)
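A minimal sketch of these three ingredients in PyTorch (assuming PyTorch is installed; the random data here stands in for a real training set):

```python
import torch
from torch import nn

# training set: ordered (input, target) pairs -- random stand-in data here
X = torch.randn(100, 10)            # 100 examples, 10 features each
y = torch.randint(0, 2, (100,))     # binary class labels

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()                          # loss function to optimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)     # compare predictions with targets
    loss.backward()                 # compute gradients
    optimizer.step()                # adjust the weights
```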