NLP Flashcards
Word2Vec vs GloVe
Word2Vec:
Predictive model: updates weights using skip-gram with negative sampling. Negative sampling handles non-appearing words (only ~20 words outside the context window are updated per step, not the whole vocabulary)
GloVe:
Count-based approach (no predictive model)
Builds word vectors from a word-word co-occurrence matrix
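A minimal sketch (toy corpus and window size are assumptions) of the word-word co-occurrence counts GloVe starts from:

```python
# Build the co-occurrence table GloVe factorizes; counts are weighted by
# 1/distance from the center word, as in the GloVe paper.
from collections import defaultdict

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

cooc = defaultdict(float)
for sentence in corpus:
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[(center, sentence[j])] += 1.0 / abs(i - j)

print(cooc[("the", "cat")])  # co-occurrence weight for the pair ("the", "cat")
```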
Difference between LSTM and GRU.
LSTMs:
Use a gating mechanism to combat vanishing gradients (an improvement on the vanilla RNN)
3 gates (input, forget, output)
2 states (cell, hidden)
More parameters, slower to train, need more data
Better at capturing long-range dependencies
GRUs:
2 gates (reset, update; the update gate combines the forget and input gates into one)
1 state (hidden)
Fewer parameters, faster to train, need less data
Not as good at capturing long-range dependencies
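A minimal PyTorch sketch comparing parameter counts (the layer sizes are arbitrary; the point is the 4-gate vs 3-gate difference):

```python
# Compare parameter counts of an LSTM vs a GRU with identical sizes.
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM params:", count(lstm))  # 4 weight/bias sets (input, forget, cell, output)
print("GRU  params:", count(gru))   # 3 weight/bias sets (reset, update, new): ~25% fewer
```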
What is a CBOW and Skip Gram model?
CBOW:
Predicts a center word from the context words before and after it.
Skip-Gram:
Predicts each context word from the center word, so every (center, context) pair becomes a training sample. This is the objective word2vec typically trains with (see the sketch below).
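A minimal sketch (toy sentence and window size are assumptions) of how the two objectives slice the same window into training samples:

```python
# CBOW: many context words predict one center word.
# Skip-Gram: one center word predicts each context word separately.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

cbow_samples, skipgram_samples = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_samples.append((context, center))                  # (context words, center)
    skipgram_samples.extend((center, c) for c in context)   # one (center, context) pair each

print(cbow_samples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_samples[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```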
How does word2vec train?
1. Define the vocabulary size
2. Initialize the embedding and context matrices, each of size (embedding size x vocab size)
3. Use skip-gram with negative sampling (sketch below):
3a. Take the dot product of the center word vector with each context word vector
3b. Pass the outputs through a sigmoid to get probabilities
3c. Backpropagate to update the center word in the embedding matrix and the context words in the context matrix
3d. Also update ~20 randomly sampled non-context words (negative sampling)
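A minimal NumPy sketch of one such update; the vocab size, embedding size, learning rate, and 5 negatives per step are assumptions (the card uses ~20; the original paper suggests 5-20):

```python
# One skip-gram-with-negative-sampling (SGNS) update step.
import numpy as np

vocab_size, emb_size, lr, k = 1000, 50, 0.025, 5
W = np.random.randn(vocab_size, emb_size) * 0.01   # embedding matrix (center words)
C = np.random.randn(vocab_size, emb_size) * 0.01   # context matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center_id, context_id):
    # the true context word gets label 1, k randomly sampled words get label 0
    negatives = np.random.randint(0, vocab_size, size=k)
    ids = np.concatenate(([context_id], negatives))
    labels = np.concatenate(([1.0], np.zeros(k)))

    scores = sigmoid(C[ids] @ W[center_id])        # dot products -> probabilities
    errors = scores - labels                       # gradient of binary cross-entropy

    grad_center = errors @ C[ids]                  # accumulated over pos + neg words
    C[ids] -= lr * np.outer(errors, W[center_id])  # update context rows
    W[center_id] -= lr * grad_center               # update center row

sgns_step(center_id=42, context_id=7)
```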
Viterbi vs Beam Search vs Greedy
Viterbi:
Exact dynamic programming: keeps the best-scoring path to every possible state at each step, so it recovers the globally optimal sequence
Beam Search:
Keeps only the K best partial candidates (beams) at each step
Greedy:
Takes the single best candidate at each time step (K = 1)
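A minimal sketch contrasting greedy and beam-search decoding over a made-up log-probability table (Viterbi is omitted; it would additionally keep the best path to every state):

```python
# Toy decoder: log P(token | previous token) for a 3-token vocabulary {0, 1, 2}.
import math

log_probs = {
    None: [math.log(0.6), math.log(0.4), math.log(0.0001)],
    0: [math.log(0.3), math.log(0.3), math.log(0.4)],
    1: [math.log(0.9), math.log(0.05), math.log(0.05)],
    2: [math.log(0.4), math.log(0.3), math.log(0.3)],
}

def greedy(steps=2):
    seq, prev, score = [], None, 0.0
    for _ in range(steps):
        tok = max(range(3), key=lambda t: log_probs[prev][t])  # best token *right now*
        score += log_probs[prev][tok]
        seq.append(tok)
        prev = tok
    return seq, score

def beam_search(steps=2, k=2):
    beams = [([], None, 0.0)]   # (sequence, last token, cumulative log-prob)
    for _ in range(steps):
        candidates = [(seq + [t], t, score + log_probs[prev][t])
                      for seq, prev, score in beams for t in range(3)]
        beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:k]  # keep K best
    return beams[0][0], beams[0][2]

print(greedy())       # [0, 2]: locally best each step, total probability 0.24
print(beam_search())  # [1, 0]: higher-probability sequence (0.36) that greedy misses
```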
What are some NLP Libraries in Python?
Gensim, NLTK, spaCy, John Snow Labs (Spark NLP), AllenNLP, Hugging Face, TF Hub
PyTorch, TensorFlow, JAX, ...
What is BERT?
BERT:
- Transformer encoder trained to predict masked words
- Masks 15% of the tokens in each sentence
- Of the selected tokens, most become [MASK], but some are replaced with a random word or left unchanged (80/10/10 split), which adds noise so the model cannot rely on always seeing [MASK]
- Also trained with next sentence prediction: classifies whether sentence B actually follows sentence A (50% true pairs, 50% random sentences)
- Input embeddings are the sum of token, segment, and positional embeddings
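A minimal sketch of the masked-word objective at inference time, using the Hugging Face fill-mask pipeline (assumes the transformers library and the bert-base-uncased checkpoint are available):

```python
# Ask BERT to fill in a masked token and show its top guesses.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The cat sat on the [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # candidate token and its probability
```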
What is ELMo?
ELMo:
- A two-layer bidirectional LSTM language model: a forward LM predicts the next word and a backward LM predicts the previous word
- At inference, each token's embedding is a weighted sum of the hidden states from each biLSTM layer and the raw word vector
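A minimal PyTorch sketch of that weighted sum (all tensors here are dummies; in practice the per-layer weights and the scale are learned for each downstream task):

```python
# ELMo-style mixing: softmax-weighted sum over the raw embedding layer
# and the two biLSTM layers, then a global scale.
import torch

seq_len, dim = 10, 1024
layers = torch.randn(3, seq_len, dim)     # [raw word vectors, biLSTM layer 1, biLSTM layer 2]
s = torch.softmax(torch.randn(3), dim=0)  # task-specific per-layer weights
gamma = torch.tensor(1.0)                 # task-specific global scale

elmo = gamma * (s[:, None, None] * layers).sum(dim=0)  # one contextual vector per token
print(elmo.shape)  # torch.Size([10, 1024])
```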
What are 3 subword embedding strategies?
BPE:
1. Start with a vocabulary of individual characters
2. Repeatedly add the most frequent adjacent symbol pair to the vocab as a new merged symbol
3. Continue until the target vocab size is reached
WordPiece, SentencePiece:
- WordPiece: same merge loop as BPE, but pairs are chosen by the gain in corpus likelihood rather than by raw frequency; words are tokenized separately
- SentencePiece: treats the entire sentence as one string, with spaces replaced by a special underscore character (▁), so no pre-tokenization is needed
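A minimal sketch of the BPE merge loop on a toy corpus (real tokenizers also add an end-of-word marker and work from much larger frequency tables):

```python
# Greedy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])  # join the chosen pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# toy corpus: words pre-split into characters, with their frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):  # in practice: loop until the target vocab size is reached
    best = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merged", best)
```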
How does RoBERTa improve on BERT?
- Dynamic masking: tokens are masked each time a sequence is fed to the model, so every epoch sees a different mask pattern (BERT's masks were fixed at preprocessing time)
- Larger Mini-Batch Size
- Increased BPE vocab size
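A minimal sketch of dynamic masking (simplified: it ignores BERT's 80/10/10 replacement rule and real subword tokenization):

```python
# Re-sample the masked positions every time a sequence is fed to the model.
import random

MASK = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15):
    return [MASK if random.random() < mask_prob else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(3):
    print(f"epoch {epoch}:", dynamic_mask(tokens))  # different positions each epoch
```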