NLP_DL (Oxford) Flashcards
Distributional hypothesis
words that occur in the same contexts tend to have similar meanings
Negative sampling
- related to the word2vec paper (Mikolov et al., 2013):
- in the softmax, the denominator (the normalising sum over contexts) is approximated using only a small random sample of ‘negative’ contexts, different from the positive context in the numerator
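A minimal numpy sketch of this sampled-denominator idea (toy sizes and the names W_in / W_out are assumptions; Mikolov et al.'s actual negative-sampling objective is a closely related binary logistic formulation rather than exactly this):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 10_000, 100, 20                       # vocab size, embedding dim, number of negatives
W_in = rng.normal(scale=0.1, size=(V, D))       # target-word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))      # context embeddings

def sampled_logprob(target, context, k=K):
    """Approximate log p(context | target) with a sampled denominator."""
    neg = rng.integers(0, V, size=k)            # random 'negative' contexts
    pos_score = W_out[context] @ W_in[target]   # numerator term (true context)
    neg_scores = W_out[neg] @ W_in[target]      # sampled denominator terms
    denom = np.exp(pos_score) + np.exp(neg_scores).sum()
    return pos_score - np.log(denom)

print(sampled_logprob(target=42, context=7))
```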
Being ‘grounded in the task’
= ‘meaning’ (as a word, in the context of the course’s task-specific features)
CBoW steps (Mikolov, 2013)
- embed words to vectors
- sum vectors
- project the result back to vocabulary size
- apply softmax
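A minimal sketch of that forward pass (toy sizes and weight names are assumptions, not the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 128
E = rng.normal(scale=0.1, size=(V, D))   # word embedding matrix
W = rng.normal(scale=0.1, size=(D, V))   # projection back to vocabulary size

def cbow_forward(context_ids):
    h = E[context_ids].sum(axis=0)       # 1. embed context words  2. sum vectors
    logits = h @ W                       # 3. project to vocabulary size
    p = np.exp(logits - logits.max())    # 4. softmax (numerically stabilised)
    return p / p.sum()

probs = cbow_forward([12, 57, 311, 9])   # distribution over the centre word
```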
n-gram model ~ k-th order Markov model
- only the immediate history counts
- the history is limited to the previous n - 1 words (i.e. k = n - 1)
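A hedged worked form of the assumption (the notation here is mine, not from the slides):

```latex
% Markov assumption behind an n-gram LM (history limited to k = n-1 previous words):
P(w_1,\dots,w_N) \approx \prod_i P(w_i \mid w_{i-k},\dots,w_{i-1})
% e.g. the trigram case (n = 3, k = 2):
P(w_i \mid w_{1:i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})
```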
linear interpolation of probability of a 3-gram
a linear combination of the trigram, bigram, and unigram probabilities, with coefficients that sum to 1
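A small sketch of the interpolation (the lambda values are illustrative; in practice they are estimated on held-out data so that they sum to 1):

```python
def interp_trigram(p_tri, p_bi, p_uni, l3=0.6, l2=0.3, l1=0.1):
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9   # coefficients must sum to 1
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# P(w_i | w_{i-2}, w_{i-1}) stays non-zero even if the trigram was never seen:
p = interp_trigram(p_tri=0.0, p_bi=0.02, p_uni=0.001)
```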
training objective for feedforward NN
- cross-entropy of the data given the model:
-(1/N) * sum_n cost(w_n, p_hat_n)
with the cost being the log-probability assigned to the target (w_n the one-hot target word, p_hat_n the predicted distribution):
cost(a, b) = a^T log b
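A direct sketch of that objective for N predictions over a vocabulary of size V (array names w / p_hat are assumed):

```python
import numpy as np

def cross_entropy(w, p_hat):
    # w, p_hat: arrays of shape (N, V); cost(a, b) = a^T log b per example
    return -np.mean(np.sum(w * np.log(p_hat), axis=1))
```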
NNLM comparison with N-Gram LM - good
- good generalization on unseen n-grams, poorer on seen ones; solution: use n-gram features
- smaller memory footprint
NNLM comparison with N-Gram LM - bad
- number of parameters scales with n-gram size
- doesn’t take into account the frequencies of words
- the length of the dependencies captured is limited to the n-gram size
RNNLM comparison with N-Gram LM - good
- can represent unbounded dependencies
- compresses the history into a fixed-size vector
- the number of parameters grows only with the amount of information stored in the hidden layer
RNNLM comparison with N-Gram LM - bad
- difficult to learn
- memory and computation cost grow quadratically with the size of the hidden layer
- doesn’t take into account the frequencies of words
Statistical text classification (STC) - key questions
a two-step process to calculate P(c|d):
- process text for representation: how to represent d
- classify the document using the text representation: how to calculate P(c|d)
STC - generative models
- joint model P(c, d)
- model the distribution of individual classes
- place probabilities over both hidden vars (classes) and observed data
STC - discriminative models
- conditional model P(c|d)
- learn boundaries between classes
- treat the data as given and place probabilities over the hidden variables (classes)
STC - Naive Bayes: pros and cons
Pros:
- simple
- fast
- uses a BOW representation
- interpretable
Cons:
- the feature-independence assumption is too strong
- document structure/semantics are ignored
- requires smoothing to deal with zero probabilities
STC - logistic regression: pros and cons
Pros:
- interpretable
- relatively simple
- no independence assumptions between features
Cons:
- harder to learn
- features are manually designed
- harder to generalize well because of the hand-crafted features
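A toy illustration of the BOW + logistic regression pipeline from the last two cards (data and labels are invented; requires scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great movie", "terrible plot", "loved it", "boring and bad"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # hand-crafted BOW features for d
clf = LogisticRegression().fit(X, labels)    # discriminative model of P(c|d)
print(clf.predict(vec.transform(["great plot"])))
```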
RNNLM text representation
- RNNLM is agnostic to the recurrent function
- it reads input x_i, accumulates state h_i, and predicts output y_i
- for text representation, h_i is a function of x_{0:i} and h_{0:i-1}, meaning it contains information about all of the text up to time-step i
- thus h_n is the representation of the whole input document and can be used as d (X = h_n) in logistic regression or any other classifier
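A minimal sketch of accumulating h_i with a plain tanh recurrence (sizes and weight names are assumptions; the card notes any recurrent function would do):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 128, 256                          # embedding dim, hidden size
W_xh = rng.normal(scale=0.1, size=(D, H))
W_hh = rng.normal(scale=0.1, size=(H, H))

def encode(word_vectors):                # word_vectors: array of shape (n, D)
    h = np.zeros(H)
    for x in word_vectors:
        h = np.tanh(x @ W_xh + h @ W_hh) # h_i depends on x_i and h_{i-1}
    return h                             # h_n: fixed-size document representation
```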
RNNLM + Logistic regression steps
No RNN output layer y is needed!
- take RNN state as input: X = h_n
- compute the class-c weights: f_c = sum_i(beta_ci * X_i)
- apply a nonlinearity: m_c = sigma(f_c)
- compute p(c|d): p(c|d) = softmax(m)_c, i.e. exp(m_c) normalised over all class scores m_{0:C}
- loss function: the cross-entropy between the estimated class distribution p(c|d) and the true distribution
L_i = -sum_c(y_c * log P(c|d_i))
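A hedged sketch of that classifier head on top of h_n (sizes and the name beta are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
H, C = 256, 3                               # hidden size, number of classes
beta = rng.normal(scale=0.1, size=(C, H))

def classify(h_n, y_onehot=None):
    f = beta @ h_n                          # f_c = sum_i beta_ci * X_i
    m = 1.0 / (1.0 + np.exp(-f))            # m_c = sigma(f_c)
    p = np.exp(m - m.max()); p /= p.sum()   # p(c|d) = softmax over class scores
    loss = None if y_onehot is None else -np.sum(y_onehot * np.log(p))  # cross-entropy
    return p, loss
```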
RNN mechanics
It compresses the entire history into a fixed-length vector to capture ‘long-range’ correlations
LSTM gates
- provide a way to optionally let information through
- are composed of a sigmoid neural-net layer and a pointwise multiplication operation
- an LSTM cell uses three such gates:
  - ‘forget gate layer’ (what to discard from the cell state)
  - ‘input gate layer’ (what new information from the input to store in the cell state)
  - ‘output gate layer’ (what part of the cell state to expose as the output/hidden state)
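A rough sketch of one LSTM step showing the three sigmoid gates and the pointwise multiplications (the parameter dicts W, U, b are assumptions, not a specific library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate: drop from cell state
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate: what new info to store
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate: what to expose as h
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])
    c = f * c_prev + i * c_tilde                         # gate = sigmoid layer + pointwise multiply
    h = o * np.tanh(c)
    return h, c
```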
GRU gates vs LSTM
GRU changes:
- merges ‘input’ and ‘forget’ gates into one ‘update gate layer’
- merges the cell state and the hidden state into a single state
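A rough sketch of that merged design under the standard GRU formulation (Cho et al., 2014); weight names are assumptions, and the reset gate r is part of the standard GRU even though the card only lists the changes from the LSTM:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    z = sigmoid(W['z'] @ x + U['z'] @ h_prev + b['z'])        # update gate (replaces input + forget)
    r = sigmoid(W['r'] @ x + U['r'] @ h_prev + b['r'])        # reset gate
    h_tilde = np.tanh(W['h'] @ x + U['h'] @ (r * h_prev) + b['h'])
    return (1 - z) * h_prev + z * h_tilde                     # single state: old and new interpolated
```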