BERT and GPT Flashcards
Transformers
Self-Attention
- Ordinary (cross-)attention relates two different sequences; self-attention operates within a single sequence
- Each position attends to all other positions in the sequence
- Each word forms a “query” that computes attention weights over every other word (see the sketch below)
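A minimal NumPy sketch of scaled dot-product self-attention; the projection matrices, sizes, and example data are illustrative assumptions, not part of the card:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a single sequence.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # each position forms a query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # every position scores every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # attention-weighted sum of values

# Toy usage: 4 tokens, d_model = 8, d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (4, 4)
```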
Transformers
Multi-Head Self-Attention
- Used in the Transformer
- Captures different relationships between tokens by performing multiple attention operations (called “heads”) in parallel
- Each “head” focuses on different parts of the input/different dependencies, giving the model a richer understanding of the input (sketched below)
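A sketch of the multi-head idea, reusing the `self_attention` function from the previous card; a full implementation would also apply a learned output projection, which is omitted here:

```python
import numpy as np

def multi_head_self_attention(X, heads):
    """Run several attention 'heads' in parallel and concatenate their outputs.
    heads: list of (Wq, Wk, Wv) tuples, one per head."""
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)           # (seq_len, num_heads * d_k)
```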
Transformers
Transformer
- Based entirely on attention (multi-head self-attention), with no recurrence
- Uses an encoder and a decoder, each a stack of attention layers
- Can perform sequence-to-sequence tasks like language translation (see the sketch below)
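A minimal PyTorch sketch of the encoder-decoder Transformer; the hyperparameters and random tensors are only illustrative:

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer built entirely from (multi-head) attention layers.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)   # (source_len, batch, d_model), e.g. source sentence
tgt = torch.rand(9, 32, 512)    # (target_len, batch, d_model), e.g. shifted target sentence
out = model(src, tgt)           # decoder output: (target_len, batch, d_model)
print(out.shape)
```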
Transformers
What are the cons with LSTMs?
- Slow to process because computation is sequential (one token at a time)
- Not deeply bidirectional: each pass reads only left-to-right or right-to-left, and a BiLSTM just shallowly combines the two passes
Transformers
What are the pros with transformers?
- Faster processing: not sequential, all positions are processed simultaneously
- Deeply bidirectional because multi-head self-attention lets every position see the whole sequence
BERT
BERT
- Bidirectional Encoder Representations from Transformers
- Uses multi-head self-attention
- Only uses the encoder of the Transformer; no recurrence
- Pretrained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Pretraining and fine-tuning
BERT
Using BERT
To use BERT, use the pretrained model as the first “layer” of the final model and then train on the desired task (a sketch follows).
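A sketch of that usage with the Hugging Face transformers library, assuming the bert-base-uncased checkpoint (checkpoint name and example sentence are illustrative):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes this sentence.", return_tensors="pt")
outputs = model(**inputs)

# Contextual embeddings from the BERT encoder, used as the first "layer"
# of the final model (feed these into a task-specific head and train).
token_embeddings = outputs.last_hidden_state   # (batch, seq_len, hidden_size)
print(token_embeddings.shape)
```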
BERT
Masked Word Prediction
15% of the word tokens in each sentence are selected at random. Of that 15%:
- 80%: substitute the input word with the [MASK] token
- 10%: substitute the input word with a random word
- 10%: leave the word unchanged
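A sketch of that selection rule on whitespace-split tokens; the vocabulary and example sentence are illustrative, only the 15% / 80-10-10 scheme comes from the card:

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
    """BERT-style masking: select ~15% of tokens, then apply the 80/10/10 rule."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:            # token selected for prediction
            labels[i] = tok                          # model must recover the original word
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: leave the token unchanged
    return corrupted, labels

print(mask_for_mlm("my dog is hairy".split(), vocab=["apple", "river", "blue"]))
```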
BERT
Pretraining and Fine Tuning
Pretraining: the model learns language and context using MLM and next sentence prediction
Fine-tuning: the pretrained model's parameters are adjusted for a specific task by training on that task (sketched below)
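A fine-tuning sketch with the Hugging Face library; the checkpoint, label, and learning rate are illustrative assumptions:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("great movie!", return_tensors="pt")
labels = torch.tensor([1])                      # e.g. 1 = positive sentiment
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

outputs = model(**inputs, labels=labels)        # classification head on top of pretrained BERT
outputs.loss.backward()                         # fine-tuning = gradient updates on the task
optimizer.step()
```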
GPT
GPT
- Generative Pre-trained Transformer
- Uses unidirectional language modeling as its pretraining objective
- Only uses the decoder portion of the Transformer
- Pretraining + fine-tuning
- Uses masked self-attention
GPT
Masked Self-Attention
Each position only attends to the words that occur before it in the sequence.
The model uses this to generate the next word from the previous words (see the sketch below).
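A NumPy sketch of the causal (masked) variant of self-attention; the projection matrices are illustrative, as in the earlier self-attention sketch:

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention where position i may only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -1e9, scores)                  # block attention to later words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                       # next-word prediction uses only the past
```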
GPT
GPT-2
- Scaled-up GPT-1: ~10x more parameters, trained on ~10x more data
- Zero-shot learning
GPT
Zero-shot
- Used by GPT-2
- The model is only given a natural-language description of the task
- No gradient updates (fine-tuning) are performed
- i.e. only the pretrained checkpoint is used (see the sketch below)
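A zero-shot prompting sketch using the Hugging Face gpt2 checkpoint; the prompt and generation settings are illustrative:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Only a natural-language task description; no examples, no fine-tuning.
prompt = "Translate English to French: cheese =>"
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```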
GPT
GPT-3
- GPT-2 but even larger: scaled from 1.5B to 175B parameters
- Uses few-shot learning
GPT
One-shot
- Model given the task description and a single worked example of the task (see the prompt sketch below)
- No gradient updates (fine-tuning) are performed
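A sketch of the one-shot prompt format; the task and example strings are illustrative, and the prompt would be given to a pretrained model as in the zero-shot sketch, with no gradient updates:

```python
# Task description + exactly one worked example + the new input to complete.
one_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)
```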