NLP Flashcards
Word2Vec vs GloVe
Word2Vec:
Predictive model: updates weights using skip-gram with negative sampling. Negative sampling handles non-appearing words (only ~20 words outside the context window are updated per step, not the whole vocabulary)
GloVe:
Count-based approach (no predictive model)
Builds word vectors from a word-word co-occurrence matrix
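A minimal sketch (toy corpus and window size are assumptions) of the word-word co-occurrence counts GloVe starts from:

```python
# Build the co-occurrence table GloVe factorizes; counts are weighted by
# 1/distance from the center word, as in the GloVe paper.
from collections import defaultdict

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

cooc = defaultdict(float)
for sentence in corpus:
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[(center, sentence[j])] += 1.0 / abs(i - j)

print(cooc[("the", "cat")])  # co-occurrence weight for the pair ("the", "cat")
```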
Difference between LSTM and GRU.
LSTMs:
Use a gating mechanism to combat vanishing gradients (an improvement on the vanilla RNN)
3 gates (input, forget, output)
2 states (cell, hidden)
More parameters, slower to train, need more data
Better at capturing long-range dependencies
GRUs:
2 gates (reset, update; the update gate combines the forget and input gates into one)
1 state (hidden)
Fewer parameters, faster to train, need less data
Not as good at capturing long-range dependencies
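A minimal PyTorch sketch comparing parameter counts (the layer sizes are arbitrary; the point is the 4-gate vs 3-gate difference):

```python
# Compare parameter counts of an LSTM vs a GRU with identical sizes.
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM params:", count(lstm))  # 4 weight/bias sets (input, forget, cell, output)
print("GRU  params:", count(gru))   # 3 weight/bias sets (reset, update, new): ~25% fewer
```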
What is a CBOW and Skip Gram model?
CBOW:
Predicts a center word from the context words before and after it.
Skip-Gram:
Predicts each context word from the center word, so every (center, context) pair becomes a training sample. This is the objective word2vec typically trains with (see the sketch below).
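A minimal sketch (toy sentence and window size are assumptions) of how the two objectives slice the same window into training samples:

```python
# CBOW: many context words predict one center word.
# Skip-Gram: one center word predicts each context word separately.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

cbow_samples, skipgram_samples = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_samples.append((context, center))                  # (context words, center)
    skipgram_samples.extend((center, c) for c in context)   # one (center, context) pair each

print(cbow_samples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_samples[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```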
How does word2vec train?
1. Define the vocabulary size
2. Initialize the embedding and context matrices, each of size (embedding size x vocab size)
3. Use skip-gram with negative sampling (sketch below):
3a. Take the dot product of the center word vector with each context word vector
3b. Pass the outputs through a sigmoid to get probabilities
3c. Backpropagate to update the center word in the embedding matrix and the context words in the context matrix
3d. Also update ~20 randomly sampled non-context words (negative sampling)
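A minimal NumPy sketch of one such update; the vocab size, embedding size, learning rate, and 5 negatives per step are assumptions (the card uses ~20; the original paper suggests 5-20):

```python
# One skip-gram-with-negative-sampling (SGNS) update step.
import numpy as np

vocab_size, emb_size, lr, k = 1000, 50, 0.025, 5
W = np.random.randn(vocab_size, emb_size) * 0.01   # embedding matrix (center words)
C = np.random.randn(vocab_size, emb_size) * 0.01   # context matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center_id, context_id):
    # the true context word gets label 1, k randomly sampled words get label 0
    negatives = np.random.randint(0, vocab_size, size=k)
    ids = np.concatenate(([context_id], negatives))
    labels = np.concatenate(([1.0], np.zeros(k)))

    scores = sigmoid(C[ids] @ W[center_id])        # dot products -> probabilities
    errors = scores - labels                       # gradient of binary cross-entropy

    grad_center = errors @ C[ids]                  # accumulated over pos + neg words
    C[ids] -= lr * np.outer(errors, W[center_id])  # update context rows
    W[center_id] -= lr * grad_center               # update center row

sgns_step(center_id=42, context_id=7)
```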
Viterbi vs Beam Search vs Greedy
Viterbi:
Exact dynamic programming: keeps the best-scoring path to every possible state at each step, so it recovers the globally optimal sequence
Beam Search:
Keeps only the K best partial candidates (beams) at each step
Greedy:
Takes the single best candidate at each time step (K = 1)
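A minimal sketch contrasting greedy and beam-search decoding over a made-up log-probability table (Viterbi is omitted; it would additionally keep the best path to every state):

```python
# Toy decoder: log P(token | previous token) for a 3-token vocabulary {0, 1, 2}.
import math

log_probs = {
    None: [math.log(0.6), math.log(0.4), math.log(0.0001)],
    0: [math.log(0.3), math.log(0.3), math.log(0.4)],
    1: [math.log(0.9), math.log(0.05), math.log(0.05)],
    2: [math.log(0.4), math.log(0.3), math.log(0.3)],
}

def greedy(steps=2):
    seq, prev, score = [], None, 0.0
    for _ in range(steps):
        tok = max(range(3), key=lambda t: log_probs[prev][t])  # best token *right now*
        score += log_probs[prev][tok]
        seq.append(tok)
        prev = tok
    return seq, score

def beam_search(steps=2, k=2):
    beams = [([], None, 0.0)]   # (sequence, last token, cumulative log-prob)
    for _ in range(steps):
        candidates = [(seq + [t], t, score + log_probs[prev][t])
                      for seq, prev, score in beams for t in range(3)]
        beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:k]  # keep K best
    return beams[0][0], beams[0][2]

print(greedy())       # [0, 2]: locally best each step, total probability 0.24
print(beam_search())  # [1, 0]: higher-probability sequence (0.36) that greedy misses
```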
What are some NLP Libraries in Python?
Gensim, NLTK, spaCy, John Snow Labs (Spark NLP), AllenNLP, Hugging Face, TF Hub
PyTorch, TensorFlow, JAX, ...
What is BERT?
BERT:
- Transformer encoder trained to predict masked words
- Masks 15% of the tokens in each sentence
- Of the selected tokens, most become [MASK], but some are replaced with a random word or left unchanged (80/10/10 split), which adds noise so the model cannot rely on always seeing [MASK]
- Also trained with next sentence prediction: classifies whether sentence B actually follows sentence A (50% true pairs, 50% random sentences)
- Input embeddings are the sum of token, segment, and positional embeddings
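A minimal sketch of the masked-word objective at inference time, using the Hugging Face fill-mask pipeline (assumes the transformers library and the bert-base-uncased checkpoint are available):

```python
# Ask BERT to fill in a masked token and show its top guesses.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The cat sat on the [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # candidate token and its probability
```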
What is ELMo?
ELMo:
- A two-layer bidirectional LSTM language model: a forward LM predicts the next word and a backward LM predicts the previous word
- At inference, each token's embedding is a weighted sum of the hidden states from each biLSTM layer and the raw word vector
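A minimal PyTorch sketch of that weighted sum (all tensors here are dummies; in practice the per-layer weights and the scale are learned for each downstream task):

```python
# ELMo-style mixing: softmax-weighted sum over the raw embedding layer
# and the two biLSTM layers, then a global scale.
import torch

seq_len, dim = 10, 1024
layers = torch.randn(3, seq_len, dim)     # [raw word vectors, biLSTM layer 1, biLSTM layer 2]
s = torch.softmax(torch.randn(3), dim=0)  # task-specific per-layer weights
gamma = torch.tensor(1.0)                 # task-specific global scale

elmo = gamma * (s[:, None, None] * layers).sum(dim=0)  # one contextual vector per token
print(elmo.shape)  # torch.Size([10, 1024])
```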
What are 3 subword embedding strategies?
BPE:
1. Start with a vocabulary of individual characters
2. Repeatedly add the most frequent adjacent symbol pair to the vocab as a new merged symbol
3. Continue until the target vocab size is reached
WordPiece, SentencePiece:
- WordPiece: same merge loop as BPE, but pairs are chosen by the gain in corpus likelihood rather than by raw frequency; words are tokenized separately
- SentencePiece: treats the entire sentence as one string, with spaces replaced by a special underscore character (▁), so no pre-tokenization is needed
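A minimal sketch of the BPE merge loop on a toy corpus (real tokenizers also add an end-of-word marker and work from much larger frequency tables):

```python
# Greedy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])  # join the chosen pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# toy corpus: words pre-split into characters, with their frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):  # in practice: loop until the target vocab size is reached
    best = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merged", best)
```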
How does RoBERTa improve on BERT?
- Dynamic masking: tokens are masked each time a sequence is fed to the model, so every epoch sees a different mask pattern (BERT's masks were fixed at preprocessing time)
- Larger Mini-Batch Size
- Increased BPE vocab size
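A minimal sketch of dynamic masking (simplified: it ignores BERT's 80/10/10 replacement rule and real subword tokenization):

```python
# Re-sample the masked positions every time a sequence is fed to the model.
import random

MASK = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15):
    return [MASK if random.random() < mask_prob else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(3):
    print(f"epoch {epoch}:", dynamic_mask(tokens))  # different positions each epoch
```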