Week 2 - Word Vectors, Language modelling Flashcards

1
Q

What are embeddings

A

Learned representations of the meanings of words
Numerical, vector-based

2
Q

What is meaning in terms of vector semantics

A

Meaning of a word is determined by how it is used in context within a language
Words with similar contexts tend to have similar meanings

3
Q

What are the two types of sparse vectors

A

Tf*idf
PPMI

4
Q

What is tf*idf

A

tf (term frequency) = how often a term t appears in a document d

calculated as tf(t,d) = log10(count(t,d) + 1)

idf (inverse document frequency) = how rare the term is across the document collection

calculated as idf(t) = log10(N / df(t))

N = total number of documents
df(t) = number of documents containing t

The tf-idf weight of t in d is tf(t,d) x idf(t)
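
As a worked illustration, here is a minimal Python sketch of this weighting (the toy corpus and helper names are my own, not from the lecture):

    import math

    # Toy document collection (made up for illustration)
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]

    def tf(term, doc):
        # log-scaled term frequency: log10(count(t,d) + 1)
        return math.log10(doc.split().count(term) + 1)

    def idf(term, docs):
        # inverse document frequency: log10(N / df(t))
        df = sum(1 for d in docs if term in d.split())
        return math.log10(len(docs) / df) if df else 0.0

    def tf_idf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    for term in ["the", "cat", "mat"]:
        print(term, round(tf_idf(term, docs[0], docs), 3))
    # "mat" appears in only one document, so it gets the highest idf
    # (and here the highest tf-idf weight in doc 0)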

5
Q

What is PPMI

A

Positive Pointwise Mutual Information
Based on a term-term (word co-occurrence) matrix
Compares how often a target word w and a context word c actually co-occur
with how often they would be expected to co-occur if they were independent

6
Q

Why the positive (+ve) in PPMI

A

PMI scores range from negative infinity to positive infinity
PPMI replaces all negative values with 0

7
Q

Calculate PPMI = max(log2(pij / (pi* p*j)), 0) from a co-occurrence table

A

pij = co-occurrence count of target word i with context word j, divided by the whole-table total

pi* = row total for target word i (its counts summed over all context words), divided by the whole-table total

p*j = column total for context word j (its counts summed over all target words), divided by the whole-table total
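
A small Python sketch of this calculation on a made-up co-occurrence table (counts are illustrative only):

    import numpy as np

    # Rows = target words, columns = context words (toy counts)
    counts = np.array([[2., 1., 0.],
                       [1., 4., 1.],
                       [0., 1., 3.]])

    total = counts.sum()
    p_ij = counts / total                 # joint probabilities
    p_i = counts.sum(axis=1) / total      # pi*: row (target word) totals / table total
    p_j = counts.sum(axis=0) / total      # p*j: column (context word) totals / table total

    with np.errstate(divide="ignore"):    # log2(0) -> -inf, clipped to 0 below
        pmi = np.log2(p_ij / np.outer(p_i, p_j))
    ppmi = np.maximum(pmi, 0)             # replace all negative values with 0
    print(np.round(ppmi, 3))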

8
Q

What are the 3 types of dense vectors

A

word2vec
GloVe
fastText

9
Q

What is word2vec

A

Two implementations:

Skip-gram
Predicts the context words given the target word

Continuous Bag of Words (CBOW)
Predicts the most likely current (target) word, given the context

The two prediction tasks are the reverse of each other
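
A rough sketch (toy sentence, not the actual word2vec training code) of how the two models frame their training examples, assuming a symmetric context window of 2:

    sentence = "the quick brown fox jumps".split()
    window = 2

    for i, target in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        # Skip-gram: predict each context word from the target word
        skipgram_pairs = [(target, c) for c in context]
        # CBOW: predict the target word from all of its context words
        cbow_pair = (context, target)
        print(skipgram_pairs, cbow_pair)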

10
Q

What is GloVe

A

Uses the number of times (frequency) that a word appears in another word’s context (word2vec only uses a yes/no co-occurrence signal)
1) Constructs a word co-occurrence matrix from a large corpus of text - relies on global (corpus-level) statistics
2) Computes probability ratios
e.g. compares p(k | ice) / p(k | steam)
a high ratio -> k is much more likely to occur with "ice" than with "steam"
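
A sketch of the probability ratio with made-up co-occurrence counts (the numbers are illustrative, not GloVe's real statistics):

    import numpy as np

    vocab = ["solid", "gas", "water", "fashion"]
    # Rows: counts of each context word k next to "ice" and "steam" (invented)
    counts = {
        "ice":   np.array([80., 5., 300., 1.]),
        "steam": np.array([4., 60., 290., 1.]),
    }

    def p(k, word):
        # p(k | word) = count(word, k) / total count for word
        row = counts[word]
        return row[vocab.index(k)] / row.sum()

    for k in vocab:
        print(f"p({k}|ice) / p({k}|steam) = {p(k, 'ice') / p(k, 'steam'):.2f}")
    # "solid" -> ratio >> 1, "gas" -> ratio << 1, "water"/"fashion" -> close to 1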

11
Q

What does a ratio score of 1 imply (GloVe)

A

The context word k is likely to be a non-discriminative word (like “water” or “fashion”) and can be cancelled out

12
Q

What are asymmetric vs symmetric context windows

A

Asymmetric - only looks at context on one side (before or after) of the target word
Symmetric - looks at context on both sides

13
Q

What is fastText

A

Handles subword representations: a word is treated as a bag of its constituent character n-grams
“eating” (trigrams, with boundary markers): <ea, eat, ati, tin, ing, ng>

A skip-gram embedding is learned for each n-gram
A word is represented as the sum of its n-gram embeddings

Can therefore handle unknown (out-of-vocabulary) words
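
A small sketch of the character n-gram decomposition (trigram case, with "<" and ">" marking the word boundaries):

    def char_ngrams(word, n=3):
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("eating"))
    # ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']
    # The word vector is the sum of the learned vectors of these n-grams,
    # so an unseen word still gets a vector from its (already seen) n-grams.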

14
Q

What does each dimension (element) in a sparse vector correspond to

A

Either:
- a document in a large corpus (in a term-document matrix, e.g., in TF-IDF)
- a word in the vocabulary (in a term-term matrix, e.g., in PPMI)

15
Q

How do sparse and dense vectors differ in dimension

A

Sparse vectors can have a very large number of dimensions (10k+, since dimensions correspond to large vocabularies or numbers of documents)
Dense vectors are typically in the hundreds (100-500 dimensions)

16
Q

What type of vectors do machine learning models prefer for training

A

Dense vectors
Models learn better when there are far fewer weights to train

17
Q

What is a static embedding

A

Each word is mapped to one fixed embedding - a single vector for each unique word

18
Q

What are contextual embeddings

A

They address the case where a word has different meanings depending on the context
Word representations that capture the meaning of each occurrence of a word based on its surrounding context

19
Q

What are Language Models

A

Statistical models or neural network architectures designed to predict the likelihood of a sequence occurring in a given context
Assigns a probability to each possible next word

20
Q

What are N-gram Language models

A

Models built over sequences of n words/tokens
We want to predict the next word w given its history h
P(w | h) is estimated as C(h,w) / C(h) - the count of the history followed by w, divided by the count of the history

21
Q

Why is it hard to compute C(h)

A

A specific word history may occur very rarely, or never, in the corpus

22
Q

How do we approximate C(h)

A

By conditioning on only the most recent preceding word(s) (Markov assumption)
e.g. unigram: P(wi)
bigram: P(wi | wi-1)
trigram: P(wi | wi-1, wi-2)

23
Q

How do we calculate MLE for n-gram probabilities

A

For a bigram:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

C(wn-1 wn) = number of times the bigram appears
C(wn-1) = number of times the context word appears
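
A minimal sketch of bigram MLE on a toy corpus (sentences and counts are made up):

    from collections import Counter

    corpus = [
        "<s> i want chinese food </s>",
        "<s> i want to eat </s>",
        "<s> chinese food is great </s>",
    ]
    tokens = [s.split() for s in corpus]

    unigram_counts = Counter(w for sent in tokens for w in sent)
    bigram_counts = Counter((sent[i], sent[i + 1])
                            for sent in tokens for i in range(len(sent) - 1))

    def p_mle(w, prev):
        # P(w | prev) = C(prev w) / C(prev)
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle("want", "i"))        # C(i want) / C(i) = 2/2 = 1.0
    print(p_mle("chinese", "want"))  # C(want chinese) / C(want) = 1/2 = 0.5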

24
Q

MLE sentence example using bigrams

A

P(<s> I want chinese food </s>) = P(I | <s>) x P(want | I) x ... x P(</s> | food)

25
Q

How can we use n-gram models for generation

A

First, randomly sample the first word (conditioned on the start symbol <s>)
Continue sampling words, each conditioned on the preceding choice (more probable successors are more likely to be picked)
Stop when the end symbol </s> is generated or a predetermined length is reached

26
Q

How do we carry out random sampling for word generation

A

Order all potential succeeding words along the interval from 0 to 1
e.g. after "increase": [the / of / a / to / in ...]
The probabilities of all these words sum to 1
Pick a random number between 0 and 1 and choose the word whose probability interval it falls into
(succeeding words therefore have different likelihoods of being chosen)
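
A sketch of this sampling step (the probabilities are invented for illustration):

    import random

    # Distribution over possible words following "increase" (made-up values)
    next_word_probs = {"the": 0.4, "of": 0.25, "a": 0.15, "to": 0.12, "in": 0.08}

    def sample_next_word(probs):
        r = random.random()          # uniform random number in [0, 1)
        cumulative = 0.0
        for word, p in probs.items():
            cumulative += p
            if r < cumulative:       # r falls inside this word's interval
                return word
        return word                  # guard against floating-point rounding

    print(sample_next_word(next_word_probs))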

27
Q

What is a recurrent neural network

A

A network that contains a cycle
i.e. it uses its own earlier output as input

The hidden layer has a recurrent connection: the activation value of the hidden layer depends on the current input and on the activation value of the hidden layer from the previous time step

In diagrams, the hidden state ht is drawn with a loop back to itself

28
Q

What is the hidden layer in an RNN

A

The layer where the cycle exists
Its activation is a combination of the current input and the activation values of the same hidden neurons from the previous time step

29
Q

What is a simple RNN

A

At each timestep t, one element of the input sequence is presented to the RNN as input vector xt
xt is multiplied by weight matrix W, and the previous hidden state ht-1 by a recurrent weight matrix U
The sum W.xt + U.ht-1 is passed through a non-linear activation function to obtain the hidden layer values ht
ht is passed to the output layer: multiplied by weight matrix V to get vocabulary scores
The scores are converted (via softmax) to a probability distribution yt
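
A sketch of one forward step with made-up sizes and random weights (U is the recurrent weight matrix; all names and numbers here are assumptions for illustration):

    import numpy as np

    vocab_size, emb_size, hidden_size = 10, 8, 6
    rng = np.random.default_rng(0)
    W = rng.normal(size=(hidden_size, emb_size))     # input -> hidden
    U = rng.normal(size=(hidden_size, hidden_size))  # hidden(t-1) -> hidden(t)
    V = rng.normal(size=(vocab_size, hidden_size))   # hidden -> vocabulary scores

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev):
        h_t = np.tanh(W @ x_t + U @ h_prev)  # hidden state from input + memory
        y_t = softmax(V @ h_t)               # distribution over the next word
        return h_t, y_t

    x_t = rng.normal(size=emb_size)          # embedding of the current word
    h_prev = np.zeros(hidden_size)           # initial hidden state
    h_t, y_t = rnn_step(x_t, h_prev)
    print(y_t.sum())                         # probabilities sum to 1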

30
Q

What is a unit in an RNN

A

a timestep

31
Q

What does the input ht-1 in the hidden layer provide

A

A form of memory
allows the network to make use of past context

32
Q

How can RNNs be used as LMs

A

They
1) process an input sequence one word at a time
2) predict the next word from the current word and the previous hidden state

33
Q

What happens in the Vh layer

A

The hidden state ht is multiplied by the output weight matrix V
Each resulting neuron holds the score or “likelihood” of the corresponding vocabulary word being the next word in the sequence

34
Q

What is the form of yt

A

Holds a probability distribution over the vocabulary
a vector with the same dimensionality as the vocabulary (one element per word)
yt[i] = probability that word i is the next word

35
Q

How do you train an RNN as a language model

A

1) Train on a text corpus as training data
2) Compute the error between the predicted distribution yt and the actual next word
3) Adjust the weights to minimise this error

36
Q

For cross entropy, what is the correct distribution (ground truth) of yt

A

A one-hot vector (1 for the actual next word, 0 everywhere else)

37
Q

What is teacher forcing in LMs

A

During training, the model is given the correct history sequence: instead of conditioning on the RNN’s own prediction from time t-1, the RNN receives the actual ground-truth output yt-1 as its next input
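
A sketch of the training loop structure with teacher forcing and cross-entropy loss; model_step is a hypothetical stand-in for one RNN step (it just returns a fake next-word distribution):

    import numpy as np

    def model_step(word_id, h_prev, vocab_size=5):
        rng = np.random.default_rng(word_id)      # fake, deterministic "model"
        scores = rng.normal(size=vocab_size)
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs, h_prev

    sequence = [0, 3, 1, 4, 2]                    # ground-truth word ids
    h = np.zeros(1)
    losses = []
    for t in range(len(sequence) - 1):
        # Teacher forcing: the input at step t is the ground-truth word
        # sequence[t], not whatever the model predicted at step t-1
        probs, h = model_step(sequence[t], h)
        target = sequence[t + 1]                  # index of the one-hot ground truth
        losses.append(-np.log(probs[target]))     # cross-entropy for this step
    print(np.mean(losses))                        # LCE averaged over the sequence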

38
Q

What is gradient descent

A

An optimization algorithm used to adjust the weights of the RNN, moving them in the direction that reduces the cross-entropy loss (LCE) averaged over the sequence

39
Q

What is autoregressive generation

A

Each word generated at time t is conditioned on the word the RNN itself selected at the previous time step t-1
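
A sketch of the generation loop; as above, model_step is a hypothetical stand-in for one RNN step:

    import numpy as np

    vocab = ["<s>", "the", "cat", "sat", "</s>"]

    def model_step(word_id, h_prev):
        rng = np.random.default_rng(word_id + 7)  # fake, deterministic "model"
        scores = rng.normal(size=len(vocab))
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs, h_prev

    sampler = np.random.default_rng(0)
    h = np.zeros(1)
    current = vocab.index("<s>")                  # start symbol
    generated = []
    for _ in range(10):                           # predetermined max length
        probs, h = model_step(current, h)
        # The next input is the word the model itself just selected
        current = int(sampler.choice(len(vocab), p=probs))
        if vocab[current] == "</s>":              # stop at the end symbol
            break
        generated.append(vocab[current])
    print(" ".join(generated))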

40
Q

What is often used instead of <s> (the first input to the RNN)

A

In other applications like machine translation, a richer context is input instead

41
Q

What are the applications of autoregressive generation

A

Text generation, machine translation, summarisation

42
Q

What are the key differences between n-grams and RNNs as LMs

A
  • n-gram language models incorporate information from a limited context only (the preceding n-1 tokens)
  • RNN language models: the hidden state can incorporate information from all the preceding words