NLP Flashcards

1
Q

Word2Vec vs GloVe

A

Word2Vec:
Predictive model; learns vectors by training skip-gram (or CBOW) with negative sampling. Negative sampling updates only a small sample of words (e.g., ~5-20) that do not appear in the context window, instead of the whole vocabulary.

GloVe:
Count-based model (no predictive training task)
Factorizes a global word-word co-occurrence matrix to create vectors

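Below is a minimal Python sketch (not from any library) of the count-based starting point: the distance-weighted co-occurrence counts that GloVe then factorizes into word vectors. The toy corpus and window_size are made up for illustration.

```python
# Toy sketch of the statistics GloVe starts from: distance-weighted
# word-word co-occurrence counts (corpus and window_size are illustrative).
from collections import defaultdict

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window_size = 2

cooc = defaultdict(float)
for sentence in corpus:
    for i, center in enumerate(sentence):
        lo = max(0, i - window_size)
        hi = min(len(sentence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(center, sentence[j])] += 1.0 / abs(j - i)  # GloVe weights co-occurrences by 1/distance

print(cooc[("cat", "sat")])  # weighted co-occurrence count for one word pair
```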
2
Q

Difference between LSTM and GRU.

A

LSTMs:
Use a gating mechanism to combat vanishing gradients (improvement on vanilla RNNs)
3 gates (input, forget, output)
2 states (cell, hidden)
More parameters, slower to train, need more data
Better at capturing long-range dependencies

GRUs:
2 gates (reset, update; the update gate combines the forget/input gates into one)
1 state (hidden)
Fewer parameters, faster to train, need less data
Not as good at capturing long-range dependencies

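A quick sketch, assuming PyTorch is available, that makes the parameter difference concrete: with the same hidden size, the GRU's three weight blocks come out to roughly 3/4 of the LSTM's four (3 gates plus the cell candidate). Sizes are illustrative.

```python
# Compare parameter counts of an LSTM and a GRU with identical sizes.
import torch.nn as nn

input_size, hidden_size = 128, 256  # illustrative sizes
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print("LSTM parameters:", num_params(lstm))  # roughly 4/3 of the GRU's count
print("GRU parameters: ", num_params(gru))
```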
3
Q

What are the CBOW and Skip-Gram models?

A

CBOW:
Predicts the center word from the context words before and after it.

Skip-Gram:
Predicts each context word from the center word, so every (center, context) pair becomes a training sample. This is the variant word2vec commonly trains with (usually paired with negative sampling).

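A minimal sketch (toy sentence and window size chosen for illustration) of the training examples each variant generates:

```python
# Generate CBOW (context -> center) and skip-gram (center -> context) examples.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j] for j in range(max(0, i - window),
                                          min(len(sentence), i + window + 1)) if j != i]
    cbow_pairs.append((context, center))                  # predict center from all context words
    skipgram_pairs.extend((center, c) for c in context)   # predict each context word from center

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```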
4
Q

How does word2vec train?

A
  1. Define the vocab size
  2. Initialize the embedding and context matrices, each of size (vocab size x embedding size)
  3. Train with skip-gram plus negative sampling (sketched below)
    3a. Take the dot product of the center word vector with each context word vector
    3b. Squash each dot product into a probability with a sigmoid
    3c. Backpropagate to update the center word's row in the embedding matrix and the context words' rows in the context matrix
    3d. Also update a small number (e.g., ~5-20) of randomly sampled non-context words (negative sampling)
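A minimal NumPy sketch of one skip-gram-with-negative-sampling update, following steps 3a-3d. The sizes, learning rate, and uniform negative sampling are simplifications; real word2vec draws negatives from a smoothed (unigram^0.75) distribution.

```python
# One SGNS update: a positive (center, context) pair plus a few negatives.
import numpy as np

vocab_size, dim, lr = 1000, 50, 0.025  # illustrative sizes and learning rate
rng = np.random.default_rng(0)
W_emb = rng.normal(scale=0.1, size=(vocab_size, dim))  # embedding matrix
W_ctx = rng.normal(scale=0.1, size=(vocab_size, dim))  # context matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, n_neg=5):
    negatives = rng.integers(0, vocab_size, size=n_neg)  # simplification: uniform sampling
    v = W_emb[center].copy()
    grad_v = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        u = W_ctx[word]
        score = sigmoid(v @ u)        # 3a-3b: dot product squashed by a sigmoid
        g = score - label             # gradient of the logistic loss wrt the score
        grad_v += g * u
        W_ctx[word] -= lr * g * v     # 3c/3d: update context (and negative) rows
    W_emb[center] -= lr * grad_v      # 3c: update the center word's embedding

sgns_step(center=42, context=7)
```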
5
Q

Viterbi vs Beam Search vs Greedy

A

Viterbi:
Exact dynamic programming: considers all possible candidates at each step, keeping the best path to every state, so it is guaranteed to find the highest-scoring sequence

Beam Search:
Keeps only the K best partial candidates at each step (approximate, cheaper than exhaustive search)

Greedy:
Takes the single best candidate at each time step (equivalent to beam search with K = 1)

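A toy Python sketch contrasting the two approximate strategies. The "model" here is just a fixed table of per-step log-probabilities (it ignores history); with a real conditional model, greedy and beam search can return different sequences.

```python
# Greedy picks the single best token per step; beam search keeps the K
# highest-scoring partial sequences.
import numpy as np

rng = np.random.default_rng(0)
vocab, steps = 5, 4
log_probs = np.log(rng.dirichlet(np.ones(vocab), size=steps))  # toy per-step distributions

def greedy():
    return [int(np.argmax(log_probs[t])) for t in range(steps)]

def beam_search(k=3):
    beams = [([], 0.0)]  # (sequence, cumulative log-prob)
    for t in range(steps):
        candidates = [(seq + [w], score + log_probs[t, w])
                      for seq, score in beams for w in range(vocab)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]

print("greedy:", greedy())
print("beam  :", beam_search(k=3))
```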
6
Q

What are some NLP Libraries in Python?

A

Gensim, NLTK, spaCy, John Snow Labs (Spark NLP), AllenNLP, Hugging Face (Transformers), TensorFlow Hub

General deep learning frameworks: PyTorch, TensorFlow, JAX, ...

7
Q

What is BERT?

A

BERT:
- Transformer encoder pre-trained with masked language modeling (predicting masked words)
- Masks 15% of the tokens in a sentence; of those, 80% are replaced with [MASK], 10% with a random word, and 10% left unchanged (adds noise so the model cannot rely on always seeing [MASK])
- Also pre-trained on next sentence prediction: predicts whether sentence B follows sentence A (50% true next sentence, 50% random sentence)
- Input embeddings: sum of token, segment, and positional embeddings

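A minimal sketch of the masked-word objective in action, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

```python
# Use the fill-mask pipeline to see BERT predict a masked token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # top predicted words and scores
```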
8
Q

What is ELMo?

A

ELMo:
- A two-layer bidirectional LSTM language model: the forward LM is trained to predict the next word and the backward LM the previous word
- At inference, the contextual embedding is a learned, task-specific weighted sum of the hidden states from each biLSTM layer plus the raw token representation

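A minimal NumPy sketch (shapes and random values are stand-ins for real ELMo outputs) of the inference-time combination: a softmax-normalized, task-specific weighted sum over the layer outputs, scaled by a learned gamma.

```python
# Combine ELMo-style layer outputs into one contextual representation.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, n_layers = 10, 1024, 3                    # token layer + 2 biLSTM layers
layers = rng.standard_normal((n_layers, seq_len, dim))  # stand-in for the layer outputs

s = rng.standard_normal(n_layers)       # task-specific layer weights (learned)
s = np.exp(s) / np.exp(s).sum()         # softmax-normalize the weights
gamma = 1.0                             # task-specific scale (learned)

elmo_repr = gamma * np.tensordot(s, layers, axes=1)  # weighted sum -> (seq_len, dim)
print(elmo_repr.shape)
```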
9
Q

What are 3 subword embedding strategies?

A

BPE:
1. Start with a vocab of individual characters
2. Add the most frequent adjacent symbol pair to the vocab as a new merged symbol
3. Repeat until the target vocab size is reached

WordPiece, SentencePiece:
- Like BPE, except candidate pairs are chosen to maximize the likelihood of the training data rather than raw frequency (the WordPiece criterion)
- WordPiece treats words separately (pre-tokenizes on whitespace)
- SentencePiece treats the entire sentence as one string, with _ replacing spaces

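A minimal sketch of the BPE merge loop on a toy word-frequency table (the counts and the number of merges are illustrative; "</w>" marks a word boundary):

```python
# Repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

vocab = Counter({("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
                 ("n", "e", "w", "e", "s", "t", "</w>"): 6, ("w", "i", "d", "e", "s", "t", "</w>"): 3})

def merge_step(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)            # most frequent adjacent pair
    merged = Counter()
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1]); i += 2   # apply the merge
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(5):                              # 5 merges, illustrative
    vocab, best = merge_step(vocab)
    print("merged:", best)
```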
10
Q

How does RoBERTa improve on BERT?

A
  • Dynamic masking: the mask is re-sampled every time a sequence is fed to the model, so each epoch sees a different mask pattern (BERT's original masking is fixed once during preprocessing); see the sketch below
  • Larger mini-batch sizes
  • Increased (byte-level) BPE vocab size
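A toy sketch of the dynamic-masking idea: re-sample the masked positions every time a sentence is fed to the model, instead of fixing them once during preprocessing. The 15% rate matches BERT; the helper name is made up.

```python
# Dynamic masking: different masked positions on every pass over the data.
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    return [mask_token if random.random() < mask_prob else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(3):
    print(f"epoch {epoch}:", dynamic_mask(tokens))  # mask pattern changes each epoch
```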