NLP_DL (Oxford) Flashcards
Distributional hypothesis
words that occur in the same contexts tend to have similar meanings
Negative sampling
- related to the word2vec paper (Mikolov et al., 2013):
- in the softmax, the denominator (the normalising sum over contexts) is approximated using only a small random sample of ‘negative’ contexts, different from the positive context in the numerator
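A minimal numpy sketch of this sampled-denominator idea (toy sizes and the names W_in / W_out are assumptions; Mikolov et al.'s actual negative-sampling objective is a closely related binary logistic formulation rather than exactly this):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 10_000, 100, 20                       # vocab size, embedding dim, number of negatives
W_in = rng.normal(scale=0.1, size=(V, D))       # target-word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))      # context embeddings

def sampled_logprob(target, context, k=K):
    """Approximate log p(context | target) with a sampled denominator."""
    neg = rng.integers(0, V, size=k)            # random 'negative' contexts
    pos_score = W_out[context] @ W_in[target]   # numerator term (true context)
    neg_scores = W_out[neg] @ W_in[target]      # sampled denominator terms
    denom = np.exp(pos_score) + np.exp(neg_scores).sum()
    return pos_score - np.log(denom)

print(sampled_logprob(target=42, context=7))
```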
Being ‘grounded in the task’
= ‘meaning’ (as a word, in the context of the course’s task-specific features)
CBoW steps (Mikolov, 2013)
- embed words to vectors
- sum vectors
- project the result back to vocabulary size
- apply softmax
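A minimal sketch of that forward pass (toy sizes and weight names are assumptions, not the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 128
E = rng.normal(scale=0.1, size=(V, D))   # word embedding matrix
W = rng.normal(scale=0.1, size=(D, V))   # projection back to vocabulary size

def cbow_forward(context_ids):
    h = E[context_ids].sum(axis=0)       # 1. embed context words  2. sum vectors
    logits = h @ W                       # 3. project to vocabulary size
    p = np.exp(logits - logits.max())    # 4. softmax (numerically stabilised)
    return p / p.sum()

probs = cbow_forward([12, 57, 311, 9])   # distribution over the centre word
```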
n-gram model ~ k-th order Markov model
- only the immediate history counts
- the history is limited to the previous n - 1 words (i.e. k = n - 1)
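A hedged worked form of the assumption (the notation here is mine, not from the slides):

```latex
% Markov assumption behind an n-gram LM (history limited to k = n-1 previous words):
P(w_1,\dots,w_N) \approx \prod_i P(w_i \mid w_{i-k},\dots,w_{i-1})
% e.g. the trigram case (n = 3, k = 2):
P(w_i \mid w_{1:i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})
```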
linear interpolation of probability of a 3-gram
a linear combination of the trigram, bigram, and unigram probabilities, with coefficients that sum to 1
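A small sketch of the interpolation (the lambda values are illustrative; in practice they are estimated on held-out data so that they sum to 1):

```python
def interp_trigram(p_tri, p_bi, p_uni, l3=0.6, l2=0.3, l1=0.1):
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9   # coefficients must sum to 1
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# P(w_i | w_{i-2}, w_{i-1}) stays non-zero even if the trigram was never seen:
p = interp_trigram(p_tri=0.0, p_bi=0.02, p_uni=0.001)
```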
training objective for feedforward NN
- cross-entropy of the data given the model:
-(1/N) * sum_n cost(w_n, p_hat_n)
with the cost being the log-probability assigned to the target (w_n the one-hot target word, p_hat_n the predicted distribution):
cost(a, b) = a^T log b
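A direct sketch of that objective for N predictions over a vocabulary of size V (array names w / p_hat are assumed):

```python
import numpy as np

def cross_entropy(w, p_hat):
    # w, p_hat: arrays of shape (N, V); cost(a, b) = a^T log b per example
    return -np.mean(np.sum(w * np.log(p_hat), axis=1))
```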
NNLM comparison with N-Gram LM - good
- good generalization on unseen n-grams, poorer on seen ones; solution: use n-gram features
- smaller memory footprint
NNLM comparison with N-Gram LM - bad
- number of parameters scales with n-gram size
- doesn’t take into account the frequencies of words
- the length of the dependencies captured is limited to the n-gram size
RNNLM comparison with N-Gram LM - good
- can represent unbounded dependencies
- compresses the history into a fixed-size vector
- the number of parameters grows only with the amount of information stored in the hidden layer
RNNLM comparison with N-Gram LM - bad
- difficult to learn
- memory and computation cost grow quadratically with the size of the hidden layer
- doesn’t take into account the frequencies of words
Statistical text classification (STC) - key questions
a two-step process to calculate P(c|d):
- process text for representation: how to represent d
- classify the document using the text representation: how to calculate P(c|d)
STC - generative models
- joint model P(c, d)
- model the distribution of individual classes
- place probabilities over both hidden vars (classes) and observed data
STC - discriminative models
- conditional model P(c|d)
- learn boundaries between classes
- treat the data as given and place probabilities over the hidden variables (classes)
STC - Naive Bayes: pros and cons
Pros:
- simple
- fast
- uses a BOW representation
- interpretable
Cons:
- the feature-independence assumption is too strong
- document structure/semantics are ignored
- requires smoothing to deal with zero probabilities
STC - logistic regression: pros and cons
Pros:
- interpretable
- relatively simple
- no independence assumptions between features
Cons:
- harder to learn
- features are manually designed
- harder to generalize well because of the hand-crafted features
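A toy illustration of the BOW + logistic regression pipeline from the last two cards (data and labels are invented; requires scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great movie", "terrible plot", "loved it", "boring and bad"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # hand-crafted BOW features for d
clf = LogisticRegression().fit(X, labels)    # discriminative model of P(c|d)
print(clf.predict(vec.transform(["great plot"])))
```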
RNNLM text representation
- RNNLM is agnostic to the recurrent function
- it reads input x_i, accumulates state h_i, and predicts output y_i
- for text representation, h_i is a function of x_{0:i} and h_{0:i-1}, meaning it contains information about all of the text up to time-step i
- thus h_n is the representation of the whole input document and can be used as d (X = h_n) in logistic regression or any other classifier
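A minimal sketch of accumulating h_i with a plain tanh recurrence (sizes and weight names are assumptions; the card notes any recurrent function would do):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 128, 256                          # embedding dim, hidden size
W_xh = rng.normal(scale=0.1, size=(D, H))
W_hh = rng.normal(scale=0.1, size=(H, H))

def encode(word_vectors):                # word_vectors: array of shape (n, D)
    h = np.zeros(H)
    for x in word_vectors:
        h = np.tanh(x @ W_xh + h @ W_hh) # h_i depends on x_i and h_{i-1}
    return h                             # h_n: fixed-size document representation
```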
RNNLM + Logistic regression steps
No RNN output layer y is needed!
- take RNN state as input: X = h_n
- compute the class-c weights: f_c = sum_i(beta_ci * X_i)
- apply a nonlinearity: m_c = sigma(f_c)
- compute p(c|d): p(c|d) = softmax(m)_c, i.e. exp(m_c) normalised over all class scores m_{0:C}
- loss function: the cross-entropy between the estimated class distribution p(c|d) and the true distribution
L_i = -sum_c(y_c * log P(c|d_i))
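A hedged sketch of that classifier head on top of h_n (sizes and the name beta are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
H, C = 256, 3                               # hidden size, number of classes
beta = rng.normal(scale=0.1, size=(C, H))

def classify(h_n, y_onehot=None):
    f = beta @ h_n                          # f_c = sum_i beta_ci * X_i
    m = 1.0 / (1.0 + np.exp(-f))            # m_c = sigma(f_c)
    p = np.exp(m - m.max()); p /= p.sum()   # p(c|d) = softmax over class scores
    loss = None if y_onehot is None else -np.sum(y_onehot * np.log(p))  # cross-entropy
    return p, loss
```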
RNN mechanics
It compresses the entire history into a fixed-length vector to capture ‘long-range’ correlations
LSTM gates
- provide a way to optionally let information through
- are composed of a sigmoid neural-net layer and a pointwise multiplication operation
- an LSTM cell uses three such gates:
  - ‘forget gate layer’ (what to discard from the cell state)
  - ‘input gate layer’ (what new information from the input to store in the cell state)
  - ‘output gate layer’ (what part of the cell state to expose as the output/hidden state)
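A rough sketch of one LSTM step showing the three sigmoid gates and the pointwise multiplications (the parameter dicts W, U, b are assumptions, not a specific library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate: drop from cell state
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate: what new info to store
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate: what to expose as h
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])
    c = f * c_prev + i * c_tilde                         # gate = sigmoid layer + pointwise multiply
    h = o * np.tanh(c)
    return h, c
```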
GRU gates vs LSTM
GRU changes:
- merges ‘input’ and ‘forget’ gates into one ‘update gate layer’
- merges the cell state and the hidden state into a single state
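A rough sketch of that merged design under the standard GRU formulation (Cho et al., 2014); weight names are assumptions, and the reset gate r is part of the standard GRU even though the card only lists the changes from the LSTM:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    z = sigmoid(W['z'] @ x + U['z'] @ h_prev + b['z'])        # update gate (replaces input + forget)
    r = sigmoid(W['r'] @ x + U['r'] @ h_prev + b['r'])        # reset gate
    h_tilde = np.tanh(W['h'] @ x + U['h'] @ (r * h_prev) + b['h'])
    return (1 - z) * h_prev + z * h_tilde                     # single state: old and new interpolated
```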