BERT and GPT Flashcards

1
Q

Transformers

Self-Attention

A
  • “Normal” attention attends between two different sequences (e.g., encoder-decoder attention); self-attention operates within a single sequence
  • Each position attends to all other positions in the sequence
  • Each word forms a “query” that computes attention over every other word (see the sketch below)
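
A minimal NumPy sketch of single-head scaled dot-product self-attention. The shapes, projection matrices, and random toy inputs are illustrative assumptions, not anything specified on the card.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # every position forms its own query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # each query scores every position in the sequence
    weights = softmax(scores, axis=-1)        # attention distribution per position
    return weights @ V                        # weighted sum of the values

# Toy example: 4 tokens, d_model = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)           # shape (4, 8): one contextualized vector per token
```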
2
Q

Transformers

Multi-Head Self-Attention

A
  • Used in the Transformer
  • Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
  • Each “head” focuses on different parts of the input / different dependencies, giving the model a richer understanding of the input (see the sketch below)
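
A toy sketch of multi-head self-attention under the same scaled dot-product assumptions as the previous snippet; the number of heads, head size, and output projection Wo are made-up illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One attention head: scaled dot-product self-attention with its own projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_self_attention(X, heads, Wo):
    """Run the heads in parallel, concatenate their outputs, then project back to d_model."""
    concat = np.concatenate([attention_head(X, *h) for h in heads], axis=-1)
    return concat @ Wo

# Toy sizes: 4 tokens, d_model = 8, 2 heads of size 4
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]  # (Wq, Wk, Wv) per head
Wo = rng.normal(size=(8, 8))
out = multi_head_self_attention(X, heads, Wo)   # shape (4, 8); each head can learn a different relationship
```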
3
Q

Transformers

Transformer

A
  • Based entirely on attention (multi-head self-attention); no recurrence
  • Uses an encoder and a decoder (each a stack of layers)
  • Can perform tasks like language translation
4
Q

Transformers

What are the cons with LSTMs?

A
  • Slow at processing because computation is sequential (one timestep at a time)
  • Not deeply bidirectional; they read the sequence in only one direction, left-to-right or right-to-left
5
Q

Transformers

What are the pros with transformers?

A
  • Faster processing: not sequential, all positions are processed simultaneously
  • Deeply bidirectional because of multi-head self-attention
6
Q

BERT

BERT

A
  • Bidirectional Encoder Representations from Transformers
  • Uses multi-head self-attention
  • Only uses the encoder, no recurrence
  • Pretrained with Masked Language Modeling (MLM) and Next Sentence Prediction
  • Pretraining and fine-tuning
7
Q

BERT

Using BERT

A

To use BERT, use the pretrained model as the first “layer” of the final model and then train on the desired task (see the sketch below).
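
A minimal sketch of that pattern, assuming PyTorch and the Hugging Face transformers library (neither is named on the card): the pretrained BERT encoder is the first “layer”, and a small classification head on top is trained for the target task.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertClassifier(nn.Module):
    """Pretrained BERT as the first 'layer', followed by a task-specific classification head."""
    def __init__(self, num_labels=2, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)                 # pretrained encoder
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)   # trained on the desired task

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # representation of the [CLS] token
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
model = BertClassifier()
logits = model(batch["input_ids"], batch["attention_mask"])
# Fine-tuning then proceeds with an ordinary training loop; BERT's weights can be updated or frozen.
```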

8
Q

BERT

Masked Word Prediction

A

15% of all word tokens in each sentence are selected at random. Of that 15%:
* 80%: substitute the input word with the [MASK] token
* 10%: substitute the input word with a random word
* 10%: leave the word unchanged
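
A rough plain-Python sketch of that selection rule; the toy vocabulary and the literal "[MASK]" string stand in for BERT's real wordpiece vocabulary and mask token.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary, illustrative only
MASK = "[MASK]"

def mask_for_mlm(tokens, select_prob=0.15):
    """BERT-style corruption: of the ~15% selected tokens, 80% -> [MASK], 10% -> random word, 10% -> unchanged."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:               # token selected as a prediction target
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)                  # 80%: replace with the mask token
            elif r < 0.9:
                corrupted.append(random.choice(VOCAB))  # 10%: replace with a random word
            else:
                corrupted.append(tok)                   # 10%: keep the original word
        else:
            corrupted.append(tok)
            targets.append(None)                        # not a prediction target
    return corrupted, targets

print(mask_for_mlm("the cat sat on the mat".split()))
```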

9
Q

BERT

Pretraining and Fine Tuning

A

Pretraining: having the model learn language and context using MLM and next-sentence prediction
Fine-tuning: adjusting the model parameters to fit a specific task by training on that task on top of the pretrained embeddings

10
Q

GPT

GPT

A
  • Generative Pre-trained Transformer
  • Uses unidirectional language modeling as pre-training objective
  • Only uses decoder portion of Transformer
  • Pretraining + fine-tuning
  • Uses masked self-attention
11
Q

GPT

Masked Self-Attention

A

Only attends to words that occur before the current word.
Uses this to generate the next word based on the previous words (see the sketch below).
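
A minimal NumPy sketch of masked (causal) self-attention, under the same scaled dot-product assumptions as before: positions after the current word are set to -inf before the softmax, so they receive zero attention weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention: each position attends only to itself and earlier positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future positions
    scores[future] = -np.inf                             # future words get zero weight after softmax
    return softmax(scores, axis=-1) @ V

# Toy example: 4 tokens, d_model = 8; row i of the output depends only on tokens 0..i
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)
```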

12
Q

GPT

GPT-2

A
  • Scale-up of GPT-1: 10x more parameters, trained on 10x more data
  • Zero-shot learning
13
Q

GPT

Zero-shot

A
  • Used by GPT-2
  • The model is only given a natural-language description of the task
  • No gradient updates (fine-tuning) are performed
  • i.e. only the pre-trained checkpoint is used
14
Q

GPT

GPT-3

A
  • GPT-2 but even larger: 1.5B -> 175B parameters
  • Uses few-shot learning
15
Q

GPT

One-shot

A
  • Model is given a task description and a single example of the task
  • No gradient updates (fine-tuning) are performed
16
Q

GPT

Few-shot

A
  • Used by GPT-3
  • Model is given a task description and a few examples of the task
  • No gradient updates (fine-tuning) are performed
  • Works well only with the very largest models (see the prompt sketch below)
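
A sketch of how zero-, one-, and few-shot prompts differ in practice; the translation task and examples follow the illustration in the GPT-3 paper, and no model call is shown because only the prompt format is the point.

```python
task = "Translate English to French."

zero_shot = f"{task}\ncheese =>"                              # task description only

one_shot = f"{task}\nsea otter => loutre de mer\ncheese =>"   # one worked example

few_shot = (
    f"{task}\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
# In every case the model's weights stay frozen: no gradient updates, only conditioning on the prompt.
```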
17
Q

BERT

How are output/embeddings created?

A

Pretrained token embeddings + segment embeddings + position embeddings, summed element-wise
Segment embeddings: which sentence (A or B)?
Position embeddings: what position in the sequence?
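
A toy NumPy sketch of that sum; the vocabulary size, embedding dimension, and example token/segment ids are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, max_len, n_segments, d = 100, 16, 2, 8      # toy sizes

token_emb    = rng.normal(size=(vocab_size, d))         # pretrained/learned token embeddings
segment_emb  = rng.normal(size=(n_segments, d))         # which sentence (A or B)?
position_emb = rng.normal(size=(max_len, d))            # what position in the sequence?

token_ids   = np.array([1, 7, 42, 3, 9, 5])             # a 6-token toy input
segment_ids = np.array([0, 0, 0, 0, 1, 1])              # first four tokens in sentence A, rest in B
positions   = np.arange(len(token_ids))

# BERT's input representation is the element-wise sum of the three embeddings
X = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]   # shape (6, 8)
```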

18
Q

BERT

What are BERT's cons?

A

Needs a lot of pretraining data

19
Q

GPT

Task Specific Pretraining and Fine-Tuning

A

Task-specific pretraining: first randomly initialize the embeddings, then train on some task that isn't the target task
Fine-tuning: use transfer learning and adjust the parameters on the target task

20
Q

GPT

Transfer Learning

A

Transfer knowledge from one task to another

21
Q

GPT

GPT3: Issues with Fine-Tuning

A
  • Still needs a lot of labeled data
  • Overfitting is easy, because fine-tuning fits the model very closely to the target task
  • Not how humans learn; humans only need a few examples
  • Not fluid for broad language understanding; the fine-tuned model becomes narrowly specialized