BERT and GPT Flashcards
Transformers
Self-Attention
- Ordinary (cross-)attention relates two different sequences; self-attention operates within a single sequence
- Each position attends to all other positions in the sequence
- Each word forms a “query” that computes attention weights over every other word (see the sketch below)
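A minimal NumPy sketch of scaled dot-product self-attention; the projection matrices, sizes, and example data are illustrative assumptions, not part of the card:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a single sequence.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # each position forms a query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # every position scores every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # attention-weighted sum of values

# Toy usage: 4 tokens, d_model = 8, d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (4, 4)
```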
Transformers
Multi-Head Self-Attention
- Used in the Transformer
- Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
- Each “head” focuses on different parts of the input/different dependencies, giving the model a richer understanding of the input (sketched below)
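A sketch of the multi-head idea, reusing the `self_attention` function from the previous card; a full implementation would also apply a learned output projection, which is omitted here:

```python
import numpy as np

def multi_head_self_attention(X, heads):
    """Run several attention 'heads' in parallel and concatenate their outputs.
    heads: list of (Wq, Wk, Wv) tuples, one per head."""
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)           # (seq_len, num_heads * d_k)
```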
Transformers
Transformer
- Based entirely on attention (multi-head self-attention), with no recurrence
- Uses an encoder and a decoder, each a stack of attention layers
- Can perform sequence-to-sequence tasks like language translation (see the sketch below)
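A minimal PyTorch sketch of the encoder-decoder Transformer; the hyperparameters and random tensors are only illustrative:

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer built entirely from (multi-head) attention layers.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)   # (source_len, batch, d_model), e.g. source sentence
tgt = torch.rand(9, 32, 512)    # (target_len, batch, d_model), e.g. shifted target sentence
out = model(src, tgt)           # decoder output: (target_len, batch, d_model)
print(out.shape)
```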
Transformers
What are the cons with LSTMs?
- Slow to process because computation is sequential (one token at a time)
- Not deeply bidirectional: each pass reads only left-to-right or right-to-left, and a BiLSTM just shallowly combines the two passes
Transformers
What are the pros with transformers?
- Faster processing: not sequential, all positions are processed simultaneously
- Deeply bidirectional because multi-head self-attention lets every position see the whole sequence
BERT
BERT
- Bidirectional Encoder Representations from Transformers
- Uses multi-head self-attention
- Only uses the encoder of the Transformer; no recurrence
- Pretrained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Pretraining and fine-tuning
BERT
Using BERT
To use BERT, use the pretrained model as the first “layer” of the final model and then train on the desired task (a sketch follows).
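A sketch of that usage with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint (checkpoint name and example sentence are illustrative):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes this sentence.", return_tensors="pt")
outputs = model(**inputs)

# Contextual embeddings from the BERT encoder, used as the first "layer"
# of the final model (feed these into a task-specific head and train).
token_embeddings = outputs.last_hidden_state   # (batch, seq_len, hidden_size)
print(token_embeddings.shape)
```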
BERT
Masked Word Prediction
15% of the word tokens in each sentence are selected at random. Of that 15%:
- 80%: substitute the input word with the [MASK] token
- 10%: substitute the input word with a random word
- 10%: leave the word unchanged
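A sketch of that selection rule on whitespace-split tokens; the vocabulary and example sentence are illustrative, only the 15% / 80-10-10 scheme comes from the card:

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
    """BERT-style masking: select ~15% of tokens, then apply the 80/10/10 rule."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:            # token selected for prediction
            labels[i] = tok                          # model must recover the original word
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: leave the token unchanged
    return corrupted, labels

print(mask_for_mlm("my dog is hairy".split(), vocab=["apple", "river", "blue"]))
```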
BERT
Pretraining and Fine Tuning
Pretraining: the model learns language and context using MLM and next sentence prediction
Fine-tuning: the pretrained model's parameters are adjusted for a specific task by training on that task (sketched below)
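A fine-tuning sketch with the Hugging Face library; the checkpoint, label, and learning rate are illustrative assumptions:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("great movie!", return_tensors="pt")
labels = torch.tensor([1])                      # e.g. 1 = positive sentiment
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**inputs, labels=labels)        # classification head on top of pretrained BERT
outputs.loss.backward()                         # fine-tuning = gradient updates on the task
optimizer.step()
```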
GPT
GPT
- Generative Pre-trained Transformer
- Uses unidirectional language modeling as its pretraining objective
- Only uses the decoder portion of the Transformer
- Pretraining + fine-tuning
- Uses masked self-attention
GPT
Masked Self-Attention
Each position only attends to the words that occur before it in the sequence.
The model uses this to generate the next word from the previous words (see the sketch below).
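A NumPy sketch of the causal (masked) variant of self-attention; the projection matrices are illustrative, as in the earlier self-attention sketch:

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention where position i may only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -1e9, scores)                  # block attention to later words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                       # next-word prediction uses only the past
```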
GPT
GPT-2
- Scaled-up GPT-1: ~10x more parameters, trained on ~10x more data
- Zero-shot learning
GPT
Zero-shot
- Used by GPT-2
- The model is only given a natural-language description of the task
- No gradient updates (fine-tuning) are performed
- i.e. only the pretrained checkpoint is used (see the sketch below)
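A zero-shot prompting sketch using the Hugging Face gpt2 checkpoint; the prompt and generation settings are illustrative:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Only a natural-language task description; no examples, no fine-tuning.
prompt = "Translate English to French: cheese =>"
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```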
GPT
GPT-3
- GPT-2 but even larger: scaled from 1.5B to 175B parameters
- Uses few-shot learning
GPT
One-shot
- Model given the task description and a single worked example of the task (see the prompt sketch below)
- No gradient updates (fine-tuning) are performed
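A sketch of the one-shot prompt format; the task and example strings are illustrative, and the prompt would be given to a pretrained model as in the zero-shot sketch, with no gradient updates:

```python
# Task description + exactly one worked example + the new input to complete.
one_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)
```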