NLP_DL (Oxford) Flashcards
Distributional hypothesis
words that occur in the same contexts tend to have similar meanings
Negative sampling
- introduced in the word2vec papers (Mikolov et al., 2013)
- instead of normalising the softmax over the whole vocabulary, the denominator is approximated using only a small random sample of 'negative' context words (different from the observed context word in the numerator); see the sketch below
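A minimal numpy sketch of one skip-gram step with negative sampling (the word2vec SGNS form, where the full softmax is replaced by one positive and k sampled negative logistic terms; matrix names, sizes, and the uniform negative sampler are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 100                    # vocabulary and embedding sizes (assumed)
W_in = rng.normal(0, 0.1, (V, D))     # embeddings for target words
W_out = rng.normal(0, 0.1, (V, D))    # embeddings for context words

def sgns_loss(target, context, k=5):
    """Loss for one (target, context) pair: the true context word is scored
    against k randomly drawn 'negative' context words instead of the full
    softmax over the vocabulary."""
    negatives = rng.integers(0, V, size=k)   # word2vec actually samples from a unigram^0.75 distribution
    v_t = W_in[target]
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(W_out[context] @ v_t))              # log sigma(u_c . v_t)
    neg = np.log(sigmoid(-(W_out[negatives] @ v_t))).sum()   # sum_n log sigma(-u_n . v_t)
    return -(pos + neg)

print(sgns_loss(target=42, context=7))
```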
Being ‘grounded in the task’
= the 'meaning' of a word, in this course, is defined by (grounded in) the task: it comes from the task-specific features/representation learned for that task rather than being given a priori
CBoW steps (Mikolov, 2013)
- embed the context words into vectors
- sum vectors
- project the result back to vocabulary size
- apply softmax
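A minimal numpy sketch of these four steps (matrix names and sizes are my own assumptions, not the course code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 100                 # vocabulary and embedding sizes (assumed)
E = rng.normal(0, 0.1, (V, D))     # word embedding matrix
W = rng.normal(0, 0.1, (D, V))     # projection back to vocabulary size

def cbow_forward(context_ids):
    """Embed the context words, sum their vectors, project to vocabulary scores, softmax."""
    h = E[context_ids].sum(axis=0)         # sum of context embeddings
    scores = h @ W                         # project back to vocabulary size
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()                 # P(centre word | context)

p = cbow_forward([12, 345, 678, 9])
print(p.shape, round(float(p.sum()), 6))   # (10000,) 1.0
```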
n-gram model ~ k-th order Markov model
- only the immediate history counts
- the history is limited to the previous n-1 words, i.e. an n-gram model is a Markov model of order k = n-1
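Written out as the standard factorisation (my notation, with n the n-gram size):

```latex
P(w_1,\dots,w_N) \approx \prod_{i=1}^{N} P\left(w_i \mid w_{i-n+1},\dots,w_{i-1}\right)
```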
linear interpolation of probability of a 3-gram
a linear combination of the trigram, bigram, and unigram estimates, with non-negative coefficients that sum to 1 (see the formula below)
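In the standard notation (the lambdas are the interpolation weights):

```latex
\hat{P}(w_3 \mid w_1, w_2)
  = \lambda_3\, P(w_3 \mid w_1, w_2)
  + \lambda_2\, P(w_3 \mid w_2)
  + \lambda_1\, P(w_3),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\; \lambda_i \ge 0
```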
training objective for feedforward NN
- cross-entropy of the data under the model:
  F = -(1/N) sum_n cost(w_n, p_n)
- where w_n is the one-hot vector of the n-th word and p_n the model's predicted distribution; the per-word cost is the log-probability of the true word:
  cost(a, b) = a^T log b
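A minimal numpy check of this objective (one-hot targets; the function name and toy numbers are mine):

```python
import numpy as np

def cross_entropy(targets_onehot, probs):
    """F = -(1/N) * sum_n w_n^T log p_n for one-hot targets w_n (rows of
    `targets_onehot`) and predicted distributions p_n (rows of `probs`)."""
    N = targets_onehot.shape[0]
    return -(targets_onehot * np.log(probs)).sum() / N

# two target words over a toy vocabulary of 3
w = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
p = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.3, 0.5]])
print(cross_entropy(w, p))   # -(log 0.7 + log 0.5) / 2 ~= 0.525
```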
NNLM comparison with N-Gram LM - good
- good generalization on unseen n-grams, poorer on seen ones; solution: use n-gram features
- smaller memory footprint
NNLM comparison with N-Gram LM - bad
- number of parameters scales with n-gram size
- doesn’t take into account the frequencies of words
- the length of the dependencies captured is limited by the n-gram size
RNNLM comparison with N-Gram LM - good
- can represent unbounded dependencies
- compresses the entire history into a fixed-size vector
- the number of parameters grows only with the amount of information stored in the hidden layer, not with the history length
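A minimal sketch of one vanilla RNN LM step (plain numpy; names and sizes are my own assumptions, not the course code). The fixed-size hidden state is what carries the unbounded history, and the H x H recurrent matrix is where the quadratic cost in hidden size comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10_000, 256                    # vocabulary and hidden sizes (assumed)
W_xh = rng.normal(0, 0.01, (V, H))    # input word -> hidden (row lookup = one-hot product)
W_hh = rng.normal(0, 0.01, (H, H))    # hidden -> hidden: the H x H block behind the quadratic cost
W_hy = rng.normal(0, 0.01, (H, V))    # hidden -> vocabulary scores

def rnn_step(word_id, h_prev):
    """One step: the fixed-size hidden state h summarises the whole history so far."""
    h = np.tanh(W_xh[word_id] + h_prev @ W_hh)
    scores = h @ W_hy
    exp = np.exp(scores - scores.max())    # softmax over the vocabulary
    return h, exp / exp.sum()              # new state, P(next word | history)

h = np.zeros(H)
for w in [12, 345, 678]:                   # feed a short sequence of word ids
    h, p_next = rnn_step(w, h)
print(p_next.shape)                        # (10000,)
```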
RNNLM comparison with N-Gram LM - bad
- difficult to learn
- memory and computation complexity increase quadratically with the size of the hidden layer
- doesn’t take into account the frequencies of words
Statistical text classification (STC) - key questions
a two-step process to calculate P(c|d):
- process text for representation: how to represent d
- classify the document using the text representation: how to calculate P(c|d)
STC - generative models
- joint model P(c, d)
- model the distribution of individual classes
- place probabilities over both hidden vars (classes) and observed data
STC - discriminative models
- conditional model P(c|d)
- learn boundaries between classes
- take the data as given and place probabilities only over the hidden variables (classes)
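The usual way of writing the contrast (standard notation, not copied from the slides):

```latex
\text{generative:}\quad \hat{c} = \arg\max_c P(c, d) = \arg\max_c P(c)\,P(d \mid c)
\qquad
\text{discriminative:}\quad \hat{c} = \arg\max_c P(c \mid d) \ \text{(modelled directly)}
```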
STC - Naive Bayes: pros and cons
Pros:
- simple and fast
- uses a BoW representation
- interpretable
Cons:
- the feature-independence assumption is too strong
- document structure and semantics are ignored
- requires smoothing to deal with zero probabilities
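A minimal sketch of multinomial Naive Bayes with BoW features and add-one smoothing (the toy data and variable names are mine, purely for illustration):

```python
import math
from collections import Counter, defaultdict

# Toy training set; documents are whitespace-tokenised bags of words.
train = [("good great fun", "pos"), ("boring bad plot", "neg"),
         ("great acting", "pos"), ("bad ending", "neg")]

class_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)            # word_counts[c][w] = count of w in class c
for doc, c in train:
    word_counts[c].update(doc.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(doc):
    """argmax_c  log P(c) + sum_w log P(w | c), with add-one (Laplace) smoothing."""
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / len(train))    # log prior
        for w in doc.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("great fun"))   # -> "pos" on this toy data
```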