lecture 6 Flashcards

1
Q

Different forms of (static) word embeddings

(representing words mathematically)

A
  1. One-hot vector: Vector length = vocabulary size; a word's vector is 1 at its own index and 0 everywhere else
  2. TF-IDF: Multiply the number of occurrences of a word in a document (term frequency, TF) by the inverse document frequency (IDF, the log of the total number of documents divided by the number of documents containing the word); see the sketch after this list
  3. Skip-Gram: Predict surrounding/context words for a given input/target word (Word2Vec)
  4. CBOW: Predict the target word given surrounding/context words (Word2Vec)
  5. FastText: Similar to Word2Vec, but represents words with character n-grams (subwords)
  6. GloVe: Similar to Word2Vec, but uses corpus statistics and a co-occurrence matrix to capture more global context
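A minimal sketch of the TF-IDF weighting above, assuming a toy tokenized corpus and the common log-scaled IDF variant (other variants exist):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical example).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term]                        # term frequency in this document
    df = sum(1 for d in docs if term in d)         # number of documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0  # log-scaled inverse document frequency
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # common word -> low IDF -> small weight
print(tf_idf("mat", docs[0], docs))  # rare word   -> higher weight
```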
1
Q

Universal Dependencies tagset classes

A
  1. open class
  2. closed class words
  3. other
2
Q

UDT: open class

A
  1. adj: adjective - noun modifier describing properties (Dutch: bnw)
  2. adv: adverb - verb modifier of time, place, manner (Dutch: bijwoord)
  3. noun: words for persons, places, things (Dutch: znw)
  4. verb: words for actions and processes (Dutch: werkwoord)
  5. propn: proper noun - name of a person, organization, place, etc.
  6. intj: interjection - exclamation, greeting, yes/no response
3
Q

UDT: closed class words

A
  1. adp: adposition - spatial, temporal, etc. relation (Dutch: voorzetsel)
  2. aux: auxiliary - helping verb marking tense (can, may, should, are)
  3. cconj: coordinating conjunction - joins two phrases (and, or, but)
  4. num: numeral (one, two, first)
  5. part: particle - preposition-like form used together with a verb (up, down, on, off)
  6. pron: pronoun - shorthand for referring to an entity or event (she, who, I, others)
  7. sconj: subordinating conjunction - joins a main clause with a subordinate clause (that, which)
4
Q

UDT: other

A
  1. punct - punctuation
  2. sym - symbols like $
  3. x - other (asdf, qwfg)
5
Q

Why might it be useful to predict upcoming words or assign probabilities to sentences?

A

the ability to predict upcoming words or assign probabilities to sentences lets machines handle language more effectively across many applications, leading to more intuitive, accurate, and efficient solutions

6
Q

one-hot vector

A
  • localist
  • uniquely identifies each word with a sparse vector of zeros and a single one
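A minimal sketch, assuming a hypothetical four-word vocabulary:

```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]               # toy vocabulary (illustrative)
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Sparse vector of length |V|: a single 1 at the word's index, 0 elsewhere.
    v = np.zeros(len(vocab))
    v[word2idx[word]] = 1.0
    return v

print(one_hot("dog"))                    # [0. 1. 0. 0.]
print(one_hot("cat") @ one_hot("dog"))   # 0.0 -- different words are orthogonal
```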
7
Q

word2vec + process

A
  • learns word embeddings by predicting word contexts
  • skip-gram algorithm
  1. start with a large collection of text, essentially a vast list of words
  2. every word is represented by a vector
  3. go through each position t in the text, which has a center word c and context words o
  4. use the similarity of the word vectors for c and o to calculate the probability of o given c (or c given o)
  5. keep adjusting the word vectors to maximize this probability
8
Q

word embeddings

A
  • distributed
  • map words to dense vectors in a continuous vector space
9
Q

cosine similarity

A

quantifies similarity between two vectors by calculating the cosine of the angle between them
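The standard formula, with u and v the two word vectors:

```latex
\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
             = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
```

It ranges from -1 (opposite directions) to 1 (same direction), with 0 meaning orthogonal.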

10
Q

representing words as discrete symbols

A
  • localist, e.g., one-hot
  • vector dimension = number of words in the vocabulary
11
Q

problem with representing words as discrete symbols + solution

A
  • this method makes words distinct from each other, but doesn't capture relationships between them
  • any two one-hot vectors are orthogonal, so they encode no notion of similarity
  • bad solution: wordnet synonyms for similarity
  • good solution: encode similarity in the vectors themselves
12
Q

what drives semantic similarity

A
  1. meaning: two concepts are close in terms of meaning (semantic closeness)
    –> accidental & inadvertent
  2. world knowledge: two concepts have similar properties, often occur together, or occur in similar contexts
    –> UPS & FedEx
  3. psychology: two concepts fit together in an overarching psychological schema or framework
    –> millennial & avocado
13
Q

how do we approximate semantic similarity

A

representing words by their context: distributional semantics

14
Q

distributional semantics

A

a word’s meaning is given by the words that appear frequently close-by

‘you shall know a word by the company it keeps’

15
Q

context in distributional semantics

A

the set of words that appear nearby a target word within a fixed-size window

16
Q

word vectors (word embeddings/word representations)

A
  • distributed representation
  • we build a dense word vector for each word, chosen so that it’s similar to the vectors of words that appear in similar contexts
  • similarity is measured with the dot product
    –> the dot product of two different one-hot vectors is 0
17
Q

properties of dense word embeddings

A
  1. they encode semantic and syntactic relationships
  2. we can probe relations between words using vector arithmetic (see the sketch below)
    –> king - man + woman ≈ queen
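A minimal sketch of this vector arithmetic, using tiny hand-made 2-dimensional vectors purely for illustration (real embeddings would come from a trained model such as word2vec or GloVe):

```python
import numpy as np

# Hypothetical, hand-made embeddings chosen so the analogy works exactly.
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

def most_similar(vec, exclude=()):
    # Word with the highest cosine similarity to `vec`, ignoring excluded words.
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(most_similar(target, exclude={"king", "man", "woman"}))  # queen
```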
18
Q

problem with word embeddings

A

they can reinforce and propagate biases present in the data they are trained on

19
Q

skip-gram algorithm

A
  • learning word vectors to predict the surrounding words
  1. randomly initialize word vector for each word in the vocabulary
  2. go through each position t (with c and o)

–> to identify context words, we define a window of size j, which means our model will look at words in position t +/- j as the context
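A minimal sketch of how (center, context) training pairs fall out of such a window, assuming window size 2 and ignoring sentence boundaries:

```python
def skipgram_pairs(tokens, window=2):
    # For each position t, pair the center word with every word at
    # positions t - window .. t + window (excluding t itself).
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
```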

20
Q

word2vec objective function

A
  1. likelihood L(theta): maximizing the likelihood of context words (u, o) given the center word (v, c)
    –> P(O|C)
  2. objective function J(theta): average negative log-likelihood
    –> log of the likelihood, averaged over the corpus and multiplied by -1 (see the formulas below)
    –> we want to minimize the objective function

minimizing objective function <–> maximizing predictive accuracy
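In formulas, with T the corpus length and m the window size (symbols follow the usual skip-gram write-up):

```latex
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t ; \theta)

J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t ; \theta)
```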

21
Q

calculating L(theta) = P(O|C)

A

softmax function: maps arbitrary values x_i to a probability distribution p_i (see the formula below)

  • numerator: exp of the dot product of the outside word vector u_o and the center word vector v_c
  • divided by a normalization term: the sum of these exponentiated dot products over the whole vocabulary
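The resulting formula (u_o = outside/context vector, v_c = center vector, V = vocabulary):

```latex
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```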
22
Q

P(O|C): dot product

A

compares similarity of O and C.

a larger dot product means higher similarity and thus higher probability

23
Q

P(O|C): exponentiation

A

exp of the dot product ensures a positive result

24
Q

P(O|C): normalization

A

converts results into valid probabilities

= the sum of the exponentiated dot products over all words in the vocabulary (for the given center word)

25
Q

training the W2V skip-gram model

A
  • gradually adjust all parameters in theta to minimize the loss
  • theta represents all model parameters (i.e., all word vectors) in one long vector; each word has 2 vectors: a center vector (v) and an outside/context vector (u)
  • with d-dimensional vectors and V words, theta has 2dV parameters
  • optimize the parameters with gradient descent (see the update rule below)
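The generic gradient descent update applied here (α is the learning rate; the symbol choice is mine):

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```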
26
Q

redefining context windows

A
  1. maximum size of context window: different sizes can affect the quality and nature of the embedding
  2. weighting scheme: the model weights context words based on their distance from the target word
  3. relative position: symmetric, left-side only, or right-side only
  4. linguistic boundaries: e.g., sentence endings
27
Q

sequence labeling

A
  • assigning a categorical label to each element in a sequence of inputs (POS tags)
  • pattern recognition task
  • input: sequence of length n (x = x1…xn) (words)
  • output: sequence of length n (y = y1…yn). each yi is a label of xi. (POS tags)
28
Q

POS

A
  • tags that allow us to label sequences in ways that mimic human understanding of them
  • first step towards syntactic analysis
29
Q

POS key questions

A
  1. given a sequence of tokens, how can we predict linguistic tags for each one
  2. how can we exploit the sequential nature of this data to model hidden features and structural dependencies that will help downstream tasks
30
Q

motivation for POS tagging

A
  • humans produce and process natural language sequentially
  • but these sequences possess a hidden structure (i.e., lexical, semantic, syntactic structure) that constrains possible sequences
  • so, merely having the right words and tags isn't sufficient for proper interpretation and meaningful communication
  • hidden structures allow humans to generalize to new infinite sequences (recursion)
31
Q

how can we replicate human linguistic comprehension in NLP

A
  1. primary goal: given a sequence of tokens, predict (hidden) linguistic tags that allow for generalization across categories
  2. secondary goal: exploit additional hidden linguistic structure to make this task more accurate
    –> i.e., the use of additional contextual and syntactic information that is not immediately apparent from the sequence of tokens alone
32
Q

POS tagging tasks

A
  1. dependency parsing
  2. semantic parsing
  3. coreference resolution
  4. information extraction
  5. question answering
33
Q

sequence labelling tasks

A
  1. POS tagging
  2. named entity recognition (NER)
  3. BIO chunking
34
Q

two approaches for mimicking human language

A
  1. machine learning
    - input
    - feature extraction
    - classification
    - output
  2. deep learning
    - input
    - FE + classification (both automatized)
    - output
35
Q

3 kinds of ML algorithms

A
  1. supervised
  2. unsupervised
  3. semi-supervised
36
Q

feature engineering (ML)

A
  • technique that leverages data to create new variables that aren’t in the training set. (both for supervised and unsupervised)
  • goal: simplifying and speeding up data transformations and enhancing model accuracy
  • use domain knowledge of the data to extract important linguistic information (features) and train a model on these
37
Q

ML in practice: 2 tasks

A
  1. describing data with features a computer can understand
    –> requires expert knowledge
  2. learning algorithm
    –> optimizing weights on features
38
Q

sequence labelling classification can be:

A
  1. independent: each member is treated independently
  2. dependent: each member is dependent on other members for its label
39
Q

data in sequence labelling

A

open and closed classes

  • closed class
    –> English uses closed-class words to express syntactic relations between words; these give clues about the open-class words around them
    –> fixed membership
  • open class
    –> membership is larger and can grow
40
Q

POS is a disambiguation task

A

though word classes do share semantic tendencies, POS is primarily defined by

  1. grammatical relationship with neighboring words
  2. morphological properties of affixes: these help distinguish between words that are morphologically related but serve different roles in sentences
41
Q

extent of POS ambiguity

A

many high frequency words have more than one POS tag

around 50% of tokens are ambiguous

42
Q

methods for POS

A
  1. lexical based: assign POS tags that occur most frequently with words in training
  2. rule-based: assign POS tags based on rules
  3. probabilistic
  4. deep learning
43
Q

hidden markov models (HMM)

A
  • based on augmenting the markov chain
  • allow us to talk about observed events (words) and hidden events (POS tags)
  • strong assumption: if we want to make a prediction about a future state, only the current state matters
    –> this simplifies computation by disregarding the entire history of previous states
  • markov assumption: P(y_i = a | y_1, ..., y_{i-1}) = P(y_i = a | y_{i-1})
  • equivalent to bigram model
44
Q

markov chain

A

a model that tells us something about the probabilities of sequences of random variables (states) which take the value of some set

  • states are words (nodes)
  • probabilities of transitioning from one state to another are transitions (edges)
  • multiply starting probability by transition probabilities
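As a formula, with π the start distribution and A the transition matrix (notation mine):

```latex
P(y_1, y_2, \ldots, y_n) = \pi(y_1) \prod_{i=2}^{n} A(y_{i-1}, y_i)
```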
45
Q

start distribution pi

A

vector representing the probability of starting in each possible state (entries sum to 1)

46
Q

HMM =

A
  • hidden MC + observed variables
  • transition matrix + emission matrix
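Put together, an HMM assigns a joint probability to a word sequence x and tag sequence y by multiplying transition and emission probabilities (with P(y_1 | y_0) read as the start distribution π(y_1); notation mine):

```latex
P(x_1, \ldots, x_n, y_1, \ldots, y_n) = \prod_{i=1}^{n} \underbrace{P(y_i \mid y_{i-1})}_{\text{transition}} \; \underbrace{P(x_i \mid y_i)}_{\text{emission}}
```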
47
Q

viterbi algorithm (HMM)

A
  • finds the optimal sequence of hidden states
  • e.g., the most likely hidden weather sequence (via the transition matrix T) for an observed mood sequence (via the emission matrix E)
  • instead of scoring every possible state sequence, it uses dynamic programming to find the one with maximum probability
    –> argmax P(x1, y1, ..., xn, yn)
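A minimal sketch of Viterbi for a made-up two-state weather/mood HMM in the spirit of the card; all probabilities, names, and the log-space formulation are illustrative choices of mine:

```python
import numpy as np

# Hypothetical two-state example: hidden weather states, observed moods.
states = ["sunny", "rainy"]
obs_symbols = ["happy", "grumpy"]

pi = np.array([0.6, 0.4])      # start distribution over states
T = np.array([[0.7, 0.3],      # transition matrix T[i, j] = P(state_j | state_i)
              [0.4, 0.6]])
E = np.array([[0.8, 0.2],      # emission matrix E[i, k] = P(obs_k | state_i)
              [0.3, 0.7]])

def viterbi(observations):
    obs_idx = [obs_symbols.index(o) for o in observations]
    n, k = len(obs_idx), len(states)
    # delta[t, s]: max log-probability of any state path ending in state s at step t
    delta = np.full((n, k), -np.inf)
    backptr = np.zeros((n, k), dtype=int)
    delta[0] = np.log(pi) + np.log(E[:, obs_idx[0]])
    for t in range(1, n):
        for s in range(k):
            scores = delta[t - 1] + np.log(T[:, s]) + np.log(E[s, obs_idx[t]])
            backptr[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[backptr[t, s]]
    # Trace back the best path from the most probable final state.
    best = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):
        best.append(backptr[t, best[-1]])
    return [states[s] for s in reversed(best)]

print(viterbi(["happy", "happy", "grumpy"]))  # ['sunny', 'sunny', 'rainy']
```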
48
Q

from ML to DL

A
  1. ML
    - HMMs let us observe a sequence (x) and predict the labels of its tokens (y)
    - only considers previous state, and often relies on hand-crafted features or language-specific resources for performance
    - can be combined with algorithms to consider full context (backward/forward)
  2. DL
    - can automatically learn complex features of data and model multiple dependencies between tokens
    - generalizable to out of vocabulary (OOV) words and languages.
49
Q

reasons for exploring deep learning

A
  1. large amounts of training data favor deep learning
  2. faster machines and multicore CPU/GPUs favor DL
  3. new models, algorithms, ideas
  4. better, more flexible learning of intermediate representations
  5. effective end-to-end joint system learning
  6. effective learning methods for using contexts and transferring between tasks
  7. improved performance
50
Q

perceptron

A
  • basic building block of a neural network
  • takes one or more inputs and produces a single output
  • each input is multiplied by a weight, and the products are summed together
  • the sum is then passed through an activation function, which determines the output of the perceptron
  • output = activation_function(weighted_sum_of_inputs + bias)
  • weighted sum of inputs = dot product of input vector and weight vector
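A minimal sketch of a single perceptron with a step activation; the weights and bias are hypothetical values chosen to implement a logical AND:

```python
import numpy as np

def perceptron(inputs, weights, bias):
    # Weighted sum = dot product of the input and weight vectors, plus a bias,
    # passed through a step activation function.
    weighted_sum = np.dot(inputs, weights) + bias
    return 1 if weighted_sum > 0 else 0   # step activation

# Hypothetical weights/bias implementing a logical AND of two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(np.array(x), w, b))   # only [1, 1] fires
```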
51
Q

feed-forward network

A
  • multiple layers of neurons
  • with multiple layers and non-linear activations, it can solve problems that are not linearly separable (a single perceptron handles only linearly separable ones)
  • weights optimize the NN performance and represent unlabeled, distributed knowledge
52
Q

feed forward network applications

A
  1. text classification: sentiment analysis, language detection
  2. unsupervised learning: word2vec, dimension reduction
53
Q

to train a NN you need

A
  1. training set: ordered pairs each with input and targeted output
  2. loss function: a function to be optimized (e.g., cross entropy)
  3. optimizer: a method for adjusting weights (e.g., gradient descent)
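A minimal sketch showing the three ingredients together for a single sigmoid neuron on a toy OR dataset; the data, learning rate, and epoch count are illustrative choices of mine:

```python
import numpy as np

# 1. training set: input/target pairs (toy OR problem)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w, b, lr = np.zeros(2), 0.0, 0.5   # weights, bias, learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    p = sigmoid(X @ w + b)                                   # forward pass
    # 2. loss function: binary cross-entropy
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # 3. optimizer: gradient descent on the cross-entropy gradient
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss: {loss:.3f}")
print(np.round(sigmoid(X @ w + b)))   # approx. [0. 1. 1. 1.]
```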