BERT and GPT Flashcards
Transformers
Self-Attention
- Standard (“cross”) attention operates between two different sequences; self-attention operates within a single sequence
- Each position attends to all other positions in the sequence
- Each word forms a “query” that computes attention over every other word (see the sketch below)
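A minimal NumPy sketch of scaled dot-product self-attention, assuming made-up weights and a tiny toy sequence (not any real model’s code):
```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence X (seq_len x d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each position forms a query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every position scores every other position
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True) # softmax over positions
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                   # 5 tokens, d-dim embeddings (illustrative)
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (5, 8)
```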
Transformers
Multi-Head Self-Attention
- Used in the Transformer
- Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
- Each “head” focuses on different parts of the input / different dependencies, giving the model a richer understanding of the input (see the sketch below)
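A quick sketch using PyTorch’s built-in multi-head attention module; the head count and dimensions are arbitrary choices for illustration:
```python
import torch
import torch.nn as nn

# 4 heads, each attending to the same 16-dim token embeddings from a different subspace
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.randn(2, 5, 16)        # batch of 2 sequences, 5 tokens each
out, weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, weights.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```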
Transformers
Transformer
- Based entirely on attention (multi-head self-attention); no recurrence
- Uses an encoder and a decoder, each a stack of identical layers
- Can perform sequence-to-sequence tasks like language translation (see the toy example below)
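A toy instantiation with PyTorch’s `nn.Transformer`; the layer counts and dimensions here are deliberately tiny (the original paper uses 6 layers and d_model=512):
```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer (sizes are illustrative, not the paper's defaults)
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(1, 10, 32)   # source sequence (e.g. sentence to translate)
tgt = torch.randn(1, 7, 32)    # target sequence generated so far
out = model(src, tgt)
print(out.shape)               # torch.Size([1, 7, 32])
```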
Transformers
What are the cons with LSTMs?
- Slow to process because computation is sequential (one token at a time)
- Not deeply bidirectional: a pass only reads left-to-right or right-to-left (a BiLSTM just combines the two separate passes)
Transformers
What are the pros with transformers?
- Faster processing: not sequential, all positions are processed simultaneously
- Deeply bidirectional because of multi-head self-attention
BERT
BERT
- Bidirectional Encoder Representations from Transformers
- Uses multi-head self-attention
- Only uses the encoder; no recurrence
- Pretrained with Masked Language Modeling (MLM) and Next Sentence Prediction
- Pretraining and fine-tuning
BERT
Using BERT
To use BERT, use the pretrained model as the first “layer” of the final model, add a task-specific head on top, and then train on the desired task (see the sketch below)
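A minimal sketch with the Hugging Face Transformers library; the model name and the 2-class head are assumptions for illustration, and in practice both BERT and the head are trained on the target task:
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")   # pretrained "first layer"
classifier = nn.Linear(bert.config.hidden_size, 2)      # task-specific head (e.g. 2 classes)

inputs = tokenizer("BERT is just the encoder.", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state            # (1, seq_len, 768)
logits = classifier(hidden[:, 0])                        # use the [CLS] token representation
print(logits.shape)                                      # torch.Size([1, 2])
```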
BERT
Masked Word Prediction
15% of all word tokens in each sentence are selected at random. Of that 15% (sketched below):
* 80%: substitute input word with MASK
* 10%: substitute input word with random word
* 10%: no change
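A toy sketch of the 80/10/10 masking rule; the stand-in vocabulary and example sentence are made up:
```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "dog", "ran", "fast", "cat")):
    """Apply BERT-style 80/10/10 masking; `vocab` is a tiny stand-in vocabulary."""
    out = []
    for tok in tokens:
        if random.random() < mask_prob:          # token selected for prediction
            r = random.random()
            if r < 0.8:
                out.append("[MASK]")             # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(random.choice(vocab)) # 10%: replace with a random word
            else:
                out.append(tok)                  # 10%: keep unchanged
        else:
            out.append(tok)
    return out

print(mask_tokens("my dog is hairy and it ran".split()))
```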
BERT
Pretraining and Fine Tuning
Pretraining: the model learns language and context using MLM and next sentence prediction
Fine Tuning: adjusting the model’s parameters to fit a specific task by training on that task starting from the pretrained representations
GPT
GPT
- Generative Pretrained Transformer
- Uses unidirectional language modeling as pre-training objective
- Only uses decoder portion of Transformer
- Pretraining + fine-tuning
- Uses masked self-attention
GPT
Masked Self-Attention
Only attends to words that occur before the current word.
Uses this to generate the next word based on the previous words (see the sketch below).
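The same NumPy self-attention sketch as before, with a causal mask added; again illustrative, not GPT’s actual code:
```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention with a causal mask: position i can only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                                  # block attention to future words
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))
print(masked_self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3))).shape)  # (4, 8)
```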
GPT
GPT-2
- Scale-up of GPT-1: roughly 10x more parameters, trained on roughly 10x more data
- Zero-shot learning
GPT
Zero-shot
- Used by GPT-2
- Model only given natural language description of task
- No gradient updates (fine-tuning) are performed
- i.e. only use pre-trained checkpoint
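A hypothetical zero-shot prompt (the translation task and wording are just for illustration):
```python
# Zero-shot: the prompt contains only a natural-language task description
prompt = (
    "Translate English to French:\n"
    "cheese =>"
)
# The pretrained checkpoint completes the prompt directly; no fine-tuning is performed.
```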
GPT
GPT-3
- GPT-2 but even larger: 1.5B -> 175B parameters
- Uses few-shot learning
GPT
One-shot
- Model given task description and single example of the task
- No gradient updates (fine-tuning) are performed
GPT
Few-shot
- Used by GPT-3
- Model given task description and a few examples of the task
- No gradient updates (fine-tuning) are performed
- Only works well with the very largest models (prompt format sketched below)
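A hypothetical few-shot prompt; the examples are illustrative, not from the cards:
```python
# Few-shot: task description plus a few worked examples, all inside the prompt
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
# The model continues the pattern; its weights are never updated.
```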
BERT
How are BERT’s input embeddings created?
Pretrained token embeddings + segment embeddings + position embeddings
Segment embeddings: which sentence?
Position embeddings: what position in the sentence? (see the sketch below)
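A sketch of how the three embeddings are summed element-wise; the sizes and token ids are made up (BERT-base-like):
```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768   # illustrative BERT-base-like sizes
tok_emb = nn.Embedding(vocab_size, hidden)      # which word?
seg_emb = nn.Embedding(2, hidden)               # which sentence (A or B)?
pos_emb = nn.Embedding(max_len, hidden)         # which position?

token_ids   = torch.tensor([[101, 2023, 2003, 102]])     # made-up token ids
segment_ids = torch.zeros_like(token_ids)                # all from sentence A
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)  # element-wise sum
print(x.shape)  # torch.Size([1, 4, 768])
```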
BERT
What are BERTs cons?
Needs a lot of pretraining data
GPT
Task Specific Pretraining and Fine-Tuning
Task-Specific Pretraining: first randomly initialize the embeddings, then train on some task that isn’t the target task
Fine-Tuning: use transfer learning and adjust the parameters on the target task (see the sketch below)
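A transfer-learning / fine-tuning sketch; the GPT-2 checkpoint, the 3-class head, and the learning rate are all assumptions for illustration:
```python
import torch
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("gpt2")   # pretrained on generic language modeling
head = nn.Linear(backbone.config.n_embd, 3)    # new head for the (hypothetical) target task

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=2e-5
)
# ...then train both the new head and the pretrained backbone on the target task's data
```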
GPT
Transfer Learning
Transfer knowledge from one task to another
GPT
GPT3: Issues with Fine-Tuning
- Still needs a lot of data
- Easy to overfit because fine-tuning pushes the model to fit the target task very closely
- Not how humans learn; humans only need a few examples
- The resulting model is not flexible at broad language understanding beyond its target task