Week 2 - Word Vectors, Language modelling Flashcards

1
Q

What are embeddings

A

Learned representations of the meanings of words
Numerical, vector-based

2
Q

What is meaning in terms of vector semantics

A

Meaning of a word is determined by how it is used in context within a language
Words with similar contexts tend to have similar meanings

3
Q

What are the two types of sparse vectors

A

Tf*idf
PPMI

4
Q

What is tf*idf

A

tf (term frequency) = how often a term t appears in a document d

calculated as tf(t,d) = log10(count(t,d) + 1)

idf (inverse document frequency) = how rare the term is across the document collection

calculated as idf(t) = log10(N / df(t))

N = total number of documents
df(t) = number of documents containing t

The tf-idf weight of t in d is tf(t,d) x idf(t)
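
As a worked illustration, here is a minimal Python sketch of this weighting (the toy corpus and helper names are my own, not from the lecture):

    import math

    # Toy document collection (made up for illustration)
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]

    def tf(term, doc):
        # log-scaled term frequency: log10(count(t,d) + 1)
        return math.log10(doc.split().count(term) + 1)

    def idf(term, docs):
        # inverse document frequency: log10(N / df(t))
        df = sum(1 for d in docs if term in d.split())
        return math.log10(len(docs) / df) if df else 0.0

    def tf_idf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    for term in ["the", "cat", "mat"]:
        print(term, round(tf_idf(term, docs[0], docs), 3))
    # "mat" appears in only one document, so it gets the highest idf
    # (and here the highest tf-idf weight in doc 0)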

5
Q

What is PPMI

A

Positive Pointwise Mutual Information
Based on a term-term (word co-occurrence) matrix
Compares how often a target word w and a context word c actually co-occur
with how often they would be expected to co-occur if they were independent

6
Q

Why the positive (+ve) in PPMI

A

PMI scores range from negative infinity to positive infinity
PPMI replaces all negative values with 0

7
Q

Calculate PPMI = max(log2(pij / (pi* p*j)), 0) from a co-occurrence table

A

pij = co-occurrence count of target word i with context word j, divided by the whole-table total

pi* = row total for target word i (its counts summed over all context words), divided by the whole-table total

p*j = column total for context word j (its counts summed over all target words), divided by the whole-table total
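
A small Python sketch of this calculation on a made-up co-occurrence table (counts are illustrative only):

    import numpy as np

    # Rows = target words, columns = context words (toy counts)
    counts = np.array([[2., 1., 0.],
                       [1., 4., 1.],
                       [0., 1., 3.]])

    total = counts.sum()
    p_ij = counts / total                 # joint probabilities
    p_i = counts.sum(axis=1) / total      # pi*: row (target word) totals / table total
    p_j = counts.sum(axis=0) / total      # p*j: column (context word) totals / table total

    with np.errstate(divide="ignore"):    # log2(0) -> -inf, clipped to 0 below
        pmi = np.log2(p_ij / np.outer(p_i, p_j))
    ppmi = np.maximum(pmi, 0)             # replace all negative values with 0
    print(np.round(ppmi, 3))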

8
Q

What are the 3 types of dense vectors

A

word2vec
GloVe
fastText

9
Q

What is word2vec

A

Two implementations:

Skip-gram
Predicts the context words given the target word

Continuous Bag of Words (CBOW)
Predicts the most likely current (target) word, given the context

The two prediction tasks are the reverse of each other
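
A rough sketch (toy sentence, not the actual word2vec training code) of how the two models frame their training examples, assuming a symmetric context window of 2:

    sentence = "the quick brown fox jumps".split()
    window = 2

    for i, target in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        # Skip-gram: predict each context word from the target word
        skipgram_pairs = [(target, c) for c in context]
        # CBOW: predict the target word from all of its context words
        cbow_pair = (context, target)
        print(skipgram_pairs, cbow_pair)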

10
Q

What is GloVe

A

Uses the number of times (frequency) that a word appears in another word’s context (word2vec only uses a yes/no co-occurrence signal)
1) Constructs a word co-occurrence matrix from a large corpus of text - relies on global (corpus-level) statistics
2) Computes probability ratios
e.g. compares p(k | ice) / p(k | steam)
a high ratio -> k is much more likely to occur with "ice" than with "steam"
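
A sketch of the probability ratio with made-up co-occurrence counts (the numbers are illustrative, not GloVe's real statistics):

    import numpy as np

    vocab = ["solid", "gas", "water", "fashion"]
    # Rows: counts of each context word k next to "ice" and "steam" (invented)
    counts = {
        "ice":   np.array([80., 5., 300., 1.]),
        "steam": np.array([4., 60., 290., 1.]),
    }

    def p(k, word):
        # p(k | word) = count(word, k) / total count for word
        row = counts[word]
        return row[vocab.index(k)] / row.sum()

    for k in vocab:
        print(f"p({k}|ice) / p({k}|steam) = {p(k, 'ice') / p(k, 'steam'):.2f}")
    # "solid" -> ratio >> 1, "gas" -> ratio << 1, "water"/"fashion" -> close to 1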

11
Q

What does a ratio score of 1 imply (GloVe)

A

The context word k is likely to be a non-discriminative word (like “water” or “fashion”) and can be cancelled out

12
Q

What are asymmetric vs symmetric context windows

A

Asymmetric - only looks at context on one side (before or after) of the target word
Symmetric - looks at context on both sides

13
Q

What is fastText

A

Handles subword representations: a word is treated as a bag of its constituent character n-grams
“eating” (trigrams, with boundary markers): <ea, eat, ati, tin, ing, ng>

A skip-gram embedding is learned for each n-gram
A word is represented as the sum of its n-gram embeddings

Can therefore handle unknown (out-of-vocabulary) words
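
A small sketch of the character n-gram decomposition (trigram case, with "<" and ">" marking the word boundaries):

    def char_ngrams(word, n=3):
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("eating"))
    # ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']
    # The word vector is the sum of the learned vectors of these n-grams,
    # so an unseen word still gets a vector from its (already seen) n-grams.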

14
Q

What does each dimension (element) in a sparse vector correspond to

A

Either:
- a document in a large corpus (in a term-document matrix, e.g., in TF-IDF)
- a word in the vocabulary (in a term-term matrix, e.g., in PPMI)

15
Q

How do sparse and dense vectors differ in dimension

A

Sparse vectors can have a very large number of dimensions (10k+, since dimensions correspond to large vocabularies or numbers of documents)
Dense vectors are typically in the hundreds (100-500 dimensions)

16
Q

What type of vectors do machine learning models prefer for training

A

Dense vectors
Models learn better when there are far fewer weights to train

17
Q

What is a static embedding

A

Each word is mapped to one fixed embedding - a single vector for each unique word

18
Q

What are contextual embeddings

A

They address the case where a word has different meanings depending on the context
Word representations that capture the meaning of each occurrence of a word based on its surrounding context

19
Q

What are Language Models

A

Statistical models or neural network architectures designed to predict the likelihood of a sequence occurring in a given context
Assigns a probability to each possible next word

20
Q

What are N-gram Language models

A

Models built over sequences of n words/tokens
We want to predict the next word w given its history h
P(w | h) is estimated as C(h,w) / C(h) - the count of the history followed by w, divided by the count of the history

21
Q

Why is it hard to compute C(h)

A

A specific word history may occur very rarely, or never, in the corpus

22
Q

How do we approximate C(h)

A

By conditioning on only the most recent preceding word(s) (Markov assumption)
e.g. unigram: P(wi)
bigram: P(wi | wi-1)
trigram: P(wi | wi-1, wi-2)

23
Q

How do we calculate MLE for n-gram probabilities

A

For a bigram:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

C(wn-1 wn) = number of times the bigram appears
C(wn-1) = number of times the context word appears
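
A minimal sketch of bigram MLE on a toy corpus (sentences and counts are made up):

    from collections import Counter

    corpus = [
        "<s> i want chinese food </s>",
        "<s> i want to eat </s>",
        "<s> chinese food is great </s>",
    ]
    tokens = [s.split() for s in corpus]

    unigram_counts = Counter(w for sent in tokens for w in sent)
    bigram_counts = Counter((sent[i], sent[i + 1])
                            for sent in tokens for i in range(len(sent) - 1))

    def p_mle(w, prev):
        # P(w | prev) = C(prev w) / C(prev)
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle("want", "i"))        # C(i want) / C(i) = 2/2 = 1.0
    print(p_mle("chinese", "want"))  # C(want chinese) / C(want) = 1/2 = 0.5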

24
Q

MLE sentence example using bigrams

A

P(<s> I want chinese food </s>) = P(I | <s>) x P(want | I) x ... x P(</s> | food)

25
Q

How can we use n-gram models for generation

A

First, randomly sample the first word (conditioned on the start symbol <s>)
Continue sampling words, each conditioned on the preceding choice (more probable successors are more likely to be picked)
Stop when the end symbol </s> is generated or a predetermined length is reached

26
Q

How do we carry out random sampling for word generation

A

Order all potential succeeding words along the interval from 0 to 1
e.g. after "increase": [the / of / a / to / in ...]
The probabilities of all these words sum to 1
Pick a random number between 0 and 1 and choose the word whose probability interval it falls into
(succeeding words therefore have different likelihoods of being chosen)
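
A sketch of this sampling step (the probabilities are invented for illustration):

    import random

    # Distribution over possible words following "increase" (made-up values)
    next_word_probs = {"the": 0.4, "of": 0.25, "a": 0.15, "to": 0.12, "in": 0.08}

    def sample_next_word(probs):
        r = random.random()          # uniform random number in [0, 1)
        cumulative = 0.0
        for word, p in probs.items():
            cumulative += p
            if r < cumulative:       # r falls inside this word's interval
                return word
        return word                  # guard against floating-point rounding

    print(sample_next_word(next_word_probs))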

27
Q

What is a recurrent neural network

A

A network that contains a cycle
i.e. it uses its own earlier output as input

The hidden layer has a recurrent connection: the activation value of the hidden layer depends on the current input and on the activation value of the hidden layer from the previous time step

In diagrams, the hidden state ht is drawn with a loop back to itself

28
Q

What is the hidden layer in an RNN

A

The layer where the cycle exists
Its activation is a combination of the current input and the activation values of the same hidden neurons from the previous time step

29
Q

What is a simple RNN

A

At each timestep t, one element of the input sequence is presented to the RNN as input vector xt
xt is multiplied by weight matrix W, and the previous hidden state ht-1 by a recurrent weight matrix U
The sum W.xt + U.ht-1 is passed through a non-linear activation function to obtain the hidden layer values ht
ht is passed to the output layer: multiplied by weight matrix V to get vocabulary scores
The scores are converted (via softmax) to a probability distribution yt
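
A sketch of one forward step with made-up sizes and random weights (U is the recurrent weight matrix; all names and numbers here are assumptions for illustration):

    import numpy as np

    vocab_size, emb_size, hidden_size = 10, 8, 6
    rng = np.random.default_rng(0)
    W = rng.normal(size=(hidden_size, emb_size))     # input -> hidden
    U = rng.normal(size=(hidden_size, hidden_size))  # hidden(t-1) -> hidden(t)
    V = rng.normal(size=(vocab_size, hidden_size))   # hidden -> vocabulary scores

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev):
        h_t = np.tanh(W @ x_t + U @ h_prev)  # hidden state from input + memory
        y_t = softmax(V @ h_t)               # distribution over the next word
        return h_t, y_t

    x_t = rng.normal(size=emb_size)          # embedding of the current word
    h_prev = np.zeros(hidden_size)           # initial hidden state
    h_t, y_t = rnn_step(x_t, h_prev)
    print(y_t.sum())                         # probabilities sum to 1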

30
Q

What is a unit in an RNN

A

a timestep

31
Q

What does the input ht-1 in the hidden layer provide

A

A form of memory
allows the network to make use of past context

32
Q

How can RNNs be used as LMs

A

They
1) process an input sequence one word at a time
2) predict the next word from the current word and the previous hidden state

33
Q

What happens in the Vh layer

A

The hidden state ht is multiplied by the output weight matrix V
Each resulting neuron holds the score or “likelihood” of the corresponding vocabulary word being the next word in the sequence

34
Q

What is the form of yt

A

Holds a probability distribution over the vocabulary
a vector with the same dimensionality as the vocabulary (one element per word)
yt[i] = probability that word i is the next word

35
Q

How do you train an RNN as a language model

A

1) Train on a text corpus as training data
2) Compute the error between the predicted distribution yt and the actual next word
3) Adjust the weights to minimise this error

36
Q

For cross entropy, what is the correct distribution (ground truth) of yt

A

A one-hot vector (1 for the actual next word, 0 everywhere else)

37
Q

What is teacher forcing in LMs

A

During training, the model is given the correct history sequence: instead of conditioning on the RNN’s own prediction from time t-1, the RNN receives the actual ground-truth output yt-1 as its next input
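
A sketch of the training loop structure with teacher forcing and cross-entropy loss; model_step is a hypothetical stand-in for one RNN step (it just returns a fake next-word distribution):

    import numpy as np

    def model_step(word_id, h_prev, vocab_size=5):
        rng = np.random.default_rng(word_id)      # fake, deterministic "model"
        scores = rng.normal(size=vocab_size)
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs, h_prev

    sequence = [0, 3, 1, 4, 2]                    # ground-truth word ids
    h = np.zeros(1)
    losses = []
    for t in range(len(sequence) - 1):
        # Teacher forcing: the input at step t is the ground-truth word
        # sequence[t], not whatever the model predicted at step t-1
        probs, h = model_step(sequence[t], h)
        target = sequence[t + 1]                  # index of the one-hot ground truth
        losses.append(-np.log(probs[target]))     # cross-entropy for this step
    print(np.mean(losses))                        # LCE averaged over the sequence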

38
Q

What is gradient descent

A

An optimization algorithm used to adjust the weights of the RNN, moving them in the direction that reduces the cross-entropy loss (LCE) averaged over the sequence

39
Q

What is autoregressive generation

A

Each word generated at time t is conditioned on the word the RNN itself selected at the previous time step t-1
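
A sketch of the generation loop; as above, model_step is a hypothetical stand-in for one RNN step:

    import numpy as np

    vocab = ["<s>", "the", "cat", "sat", "</s>"]

    def model_step(word_id, h_prev):
        rng = np.random.default_rng(word_id + 7)  # fake, deterministic "model"
        scores = rng.normal(size=len(vocab))
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs, h_prev

    sampler = np.random.default_rng(0)
    h = np.zeros(1)
    current = vocab.index("<s>")                  # start symbol
    generated = []
    for _ in range(10):                           # predetermined max length
        probs, h = model_step(current, h)
        # The next input is the word the model itself just selected
        current = int(sampler.choice(len(vocab), p=probs))
        if vocab[current] == "</s>":              # stop at the end symbol
            break
        generated.append(vocab[current])
    print(" ".join(generated))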

40
Q

What is often used instead of <s> (the first input to the RNN)

A

In other applications like machine translation, a richer context is input instead

41
Q

What are the applications of autoregressive generation

A

Text generation, machine translation, summarisation

42
Q

What are the key differences between n-grams and RNNs as LMs

A
  • n-gram language models incorporate information from a limited context only (the preceding n-1 tokens)
  • RNN language models: the hidden state can incorporate information from all the preceding words