NLP_DL (Oxford) Flashcards

1
Q

Distributional hypothesis

A

words that occur in the same contexts tend to have similar meanings

2
Q

Negative sampling

A
  • from the word2vec paper:
    instead of normalising over the whole vocabulary, the softmax denominator in the objective is approximated using only a small random sample of ‘negative’ contexts (different from the true context in the numerator); see the sketch below
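
A minimal numpy sketch of the skip-gram negative-sampling loss for one (target, context) pair; the embedding size, the number of negatives, and the random vectors are illustrative assumptions, not the paper’s exact setup:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(target_vec, context_vec, negative_vecs):
        # Instead of normalising over the whole vocabulary, the denominator
        # is approximated by k randomly sampled 'negative' context vectors.
        pos = np.log(sigmoid(np.dot(target_vec, context_vec)))
        neg = sum(np.log(sigmoid(-np.dot(target_vec, n))) for n in negative_vecs)
        return -(pos + neg)

    # toy usage: 50-d embeddings and k = 5 negatives
    rng = np.random.default_rng(0)
    t, c = rng.normal(size=50), rng.normal(size=50)
    negs = rng.normal(size=(5, 50))
    print(sgns_loss(t, c, negs))
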
3
Q

Being ‘grounded in the task’

A

= a word’s ‘meaning’ is defined in terms of the task-specific features it is used with (in the context of the course’s tasks)

4
Q

CBoW steps (Mikolov, 2013)

A
  • embed words to vectors
  • sum vectors
  • project the result back to vocabulary size
  • apply softmax (see the sketch below)
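
A minimal numpy sketch of these four CBoW steps; the vocabulary size, embedding dimension, and context word ids are made-up illustrative values:

    import numpy as np

    V, D = 10000, 100                   # assumed vocabulary and embedding sizes
    E = np.random.randn(V, D) * 0.01    # word embedding matrix
    W = np.random.randn(D, V) * 0.01    # projection back to vocabulary size

    def cbow_forward(context_ids):
        h = E[context_ids].sum(axis=0)  # embed the context words and sum the vectors
        scores = h @ W                  # project the result back to vocabulary size
        scores -= scores.max()          # for numerical stability
        return np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

    p = cbow_forward([12, 345, 678, 901])   # a window of 4 context word ids
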
5
Q

n-gram model ~ k-th order Markov model

A
  • only the immediate history counts
  • the history is limited to the previous n-1 words (i.e. a k-th order Markov assumption with k = n-1); see the sketch below
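
A toy sketch of the Markov assumption for a trigram (n = 3) model; the corpus and counts are invented for illustration:

    from collections import Counter

    corpus = "the cat sat on the mat the cat sat".split()
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_trigram(w3, w1, w2):
        # P(w_i | w_1 .. w_{i-1}) is approximated by P(w_i | w_{i-2}, w_{i-1}):
        # only the previous n-1 = 2 words of history are used
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    print(p_trigram("sat", "the", "cat"))   # 1.0 in this toy corpus
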

6
Q

linear interpolation of probability of a 3-gram

A

linear combination of the trigram, bigram, and unigram estimates, with interpolation coefficients that sum to 1; see the sketch below
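
A sketch of the interpolation, assuming the component trigram, bigram, and unigram estimators are already available as callables p3, p2, p1 (hypothetical names):

    def interp_trigram(w1, w2, w3, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
        # Interpolated estimate of P(w3 | w1, w2); the weights must sum to 1.
        l3, l2, l1 = lambdas
        assert abs(sum(lambdas) - 1.0) < 1e-9
        return l3 * p3(w3, w1, w2) + l2 * p2(w3, w2) + l1 * p1(w3)
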

7
Q

training objective for feedforward NN

A
  • cross-entropy of the data given the model:
    F = -(1/N) sum_n cost(w_n, p_n)
    where w_n is the (one-hot) target word, p_n is the predicted distribution, and the cost is the log-probability of the target:
    cost(a, b) = a^T log b
    (see the sketch below)
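
A small numpy sketch of this objective, assuming one-hot targets and predicted distributions for a toy batch of N = 2 words over a 3-word vocabulary:

    import numpy as np

    def cross_entropy(W_onehot, P_pred):
        # cost(w_n, p_n) = w_n^T log p_n, averaged over the data and negated
        return -np.mean(np.sum(W_onehot * np.log(P_pred), axis=1))

    W = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)   # one-hot targets
    P = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])    # model predictions
    print(cross_entropy(W, P))   # -(log 0.7 + log 0.5) / 2
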
8
Q

NNLM comparison with N-Gram LM - good

A
  • good generalization on unseen n-grams, poorer on seen ones; solution: use n-gram features
  • smaller memory footprint
9
Q

NNLM comparison with N-Gram LM - bad

A
  • number of parameters scales with n-gram size
  • doesn’t take into account the frequencies of words
  • the length of the dependencies captured is limited (by the n-gram size)
10
Q

RNNLM comparison with N-Gram LM - good

A
  • can represent unbounded dependencies
  • compresses the history into a fixed-size vector
  • the number of parameters grows only with the information stored in the hidden layer
11
Q

RNNLM comparison with N-Gram LM - bad

A
  • difficult to learn
  • memory and computation complexity increase quadratically with the size of the hidden layer
  • doesn’t take into account the frequencies of words
12
Q

Statistical text classification (STC) - key questions

A

Two-step process to calculate P(c|d):

  • process text for representation: how to represent d
  • classify the document using the text representation: how to calculate P(c|d)
13
Q

STC - generative models

A
  • joint model P(c, d)
  • model the distribution of individual classes
  • place probabilities over both hidden vars (classes) and observed data
14
Q

STC - discriminative models

A
  • conditional model P(c|d)
  • learn boundaries between classes
  • take the data as given and place probabilities over the hidden variables (classes)
15
Q

STC - Naive Bayes: pros and cons

A
Pros:
- simple
- fast
- uses BOW representation
- interpretable
Cons:
- feature independence condition too strong
- document structure/semantics ignored
- requires smoothing to deal with zero probabilities (see the sketch below)
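
A minimal sketch of the smoothing point: add-alpha (Laplace) smoothing of the per-class word counts, so that a single unseen word does not zero out P(d|c); the counts and vocabulary size are toy assumptions:

    import math

    def smoothed_log_likelihood(doc_words, class_counts, vocab_size, alpha=1.0):
        # log P(d | c) under Naive Bayes with add-alpha smoothing;
        # without smoothing, one unseen word would make P(d | c) = 0
        total = sum(class_counts.values())
        return sum(
            math.log((class_counts.get(w, 0) + alpha) / (total + alpha * vocab_size))
            for w in doc_words
        )

    counts_sport = {"match": 4, "goal": 3, "team": 2}   # toy class-conditional counts
    print(smoothed_log_likelihood(["goal", "election"], counts_sport, vocab_size=6))
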
16
Q

STC - logistic regression: pros and cons

A
Pros:
- interpretable
- relatively simple
- no assumption of independence between features
Cons:
- harder to learn
- manually designed features
- more difficult to generalize well because of the hand-crafted features
17
Q

RNNLM text representation

A
  • the RNNLM is agnostic to the choice of recurrent function
  • it reads input x_i to accumulate state h_i and predict output y_i
  • for text representation, h_i is a function of x_{0:i} and h_{0:i-1}, so it contains information about all of the text up to time-step i
  • thus h_n is the representation of the whole input document and can be used as d (the data, X = h_n) in logistic regression or any other classifier; see the sketch below
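
A minimal numpy sketch of using the final state h_n as the document representation; the plain tanh recurrence and random weights are illustrative assumptions, not a specific course model:

    import numpy as np

    def encode_document(word_ids, E, Wh, Wx, b):
        # run a simple tanh RNN over the document and return h_n as its representation
        h = np.zeros(Wh.shape[0])
        for i in word_ids:                  # h_i depends on x_{0:i} via h_{i-1}
            h = np.tanh(Wh @ h + Wx @ E[i] + b)
        return h                            # fixed-size summary of the whole text

    V, D, H = 5000, 64, 128                 # assumed sizes
    rng = np.random.default_rng(0)
    E = rng.normal(0, 0.1, (V, D))
    Wh, Wx, b = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, D)), np.zeros(H)
    h_n = encode_document([5, 17, 256, 42], E, Wh, Wx, b)   # X = h_n for a classifier
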
18
Q

RNNLM + Logistic regression steps

A

No RNN output layer y is needed!
- take RNN state as input: X = h_n
- compute the class-c weights: f_c = sum_i(beta_ci * X_i)
- apply a nonlinearity: m_c = sigma(f_c)
- compute p(c|d) with a softmax over the classes: p(c|d) = exp(m_c) / sum_c'(exp(m_c'))
- loss function: cross-entropy between the estimated class distribution p(c|d) and the true distribution:
L_i = -sum_c(y_c * log P(c|d_i))
(see the sketch below)
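
A sketch of this classification head on X = h_n; the sigmoid-then-softmax ordering follows the card, while the sizes and weights are assumed for illustration:

    import numpy as np

    def classify(h_n, beta):
        f = beta @ h_n                          # f_c = sum_i(beta_ci * X_i)
        m = 1.0 / (1.0 + np.exp(-f))            # m_c = sigma(f_c)
        p = np.exp(m - m.max()); p /= p.sum()   # p(c|d): softmax over the classes
        return p

    def cross_entropy_loss(p, y_onehot):
        return -np.sum(y_onehot * np.log(p))    # L_i = -sum_c(y_c * log P(c|d_i))

    C, H = 3, 128                               # assumed number of classes, state size
    rng = np.random.default_rng(0)
    beta = rng.normal(0, 0.1, (C, H))
    p = classify(rng.normal(size=H), beta)      # h_n would come from the RNN
    print(cross_entropy_loss(p, np.array([0.0, 1.0, 0.0])))
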

19
Q

RNN mechanics

A

It compresses the entire history into a fixed-length vector, which lets it capture ‘long-range’ correlations

20
Q

LSTM gates

A
  • provide a way to optionally let information through
  • are composed of a sigmoid neural-net layer and a pointwise multiplication operation
  • the LSTM cell uses three such gates (see the sketch below):
    - ‘forget gate layer’ (what to discard from the cell state)
    - ‘input gate layer’ (what new input information to write into the cell state)
    - ‘output gate layer’ (what part of the cell state to expose as the output / hidden state)
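
A compact numpy sketch of one LSTM step, with each gate implemented as a sigmoid layer followed by pointwise multiplication; the weight shapes and initialisation are illustrative assumptions:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, b):
        # W maps [h_prev; x] to the four gate/candidate pre-activations
        z = W @ np.concatenate([h_prev, x]) + b
        f, i, o, g = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
        c = f * c_prev + i * np.tanh(g)                # gate the old cell state and the new input
        h = o * np.tanh(c)                             # gate what is exposed as output
        return h, c

    H, D = 8, 4                                        # assumed state and input sizes
    rng = np.random.default_rng(0)
    W, b = rng.normal(0, 0.1, (4 * H, H + D)), np.zeros(4 * H)
    h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
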
21
Q

GRU gates vs LSTM

A

GRU changes:

  • merges the ‘input’ and ‘forget’ gates into a single ‘update gate layer’
  • merges the cell state and the hidden state (see the sketch below)
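
For comparison, a numpy sketch of one GRU step: a single update gate z plays the combined role of the LSTM’s input and forget gates, and there is only one state vector; the shapes are again illustrative assumptions:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x, h_prev, Wz, Wr, Wh):
        hx = np.concatenate([h_prev, x])
        z = sigmoid(Wz @ hx)                       # update gate: merged input/forget
        r = sigmoid(Wr @ hx)                       # reset gate
        h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))
        return (1 - z) * h_prev + z * h_tilde      # single state: hidden = output

    H, D = 8, 4
    rng = np.random.default_rng(0)
    Wz, Wr, Wh = (rng.normal(0, 0.1, (H, H + D)) for _ in range(3))
    h = gru_step(rng.normal(size=D), np.zeros(H), Wz, Wr, Wh)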