Lecture 11: pretrained language models Flashcards
1
Q
ELMo architecture
A
- 2-layer, bidirectional LSTM
- word vector: weighted sum of the hidden states
- F(H0, H1, H2) = a·H0 + b·H1 + c·H2 (sketch below)
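A minimal sketch of this weighted sum, assuming three per-token hidden states H0, H1, H2 (the embedding layer plus two biLSTM layers) and scalar mixing weights a, b, c; the function name and the softmax normalisation of the weights follow the ELMo idea but are illustrative, not the original implementation.

```python
import numpy as np

def elmo_word_vector(H0, H1, H2, a, b, c):
    """Weighted sum of the three per-token hidden states (ELMo-style)."""
    weights = np.array([a, b, c], dtype=float)
    weights = np.exp(weights) / np.exp(weights).sum()  # normalise the mixing weights
    return weights[0] * H0 + weights[1] * H1 + weights[2] * H2

# toy example: 4-dimensional hidden states for a single token
H0, H1, H2 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
print(elmo_word_vector(H0, H1, H2, a=0.1, b=0.3, c=0.6))
```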
2
Q
GPT (generative pre-trained transformers) architecture
A
- Transformer architecture, unidirectional (“decoder”); causal-mask sketch below
- Each token gets a vector for token-level tasks
- The whole sequence also gets a vector
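A minimal sketch of what “unidirectional” means in practice: a causal attention mask under which each position can attend only to itself and earlier positions. The function is illustrative, not GPT's actual code.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions <= i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```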
3
Q
BERT
A
- Bidirectional
- Has for each token a bidirectionally contextualized representation at each layer
- Pretrained with two tasks: masked language modelling (replacing 15% of the tokens by [MASK]; sketch below) and Next Sentence Prediction
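A minimal sketch of the masking step, assuming a plain Python token list; real BERT preprocessing also leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Replace roughly 15% of the tokens by [MASK]; return the masked sequence and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # the model has to predict the original token at this position
        else:
            masked.append(tok)
    return masked, targets

print(mask_tokens("the cat sat on the mat".split()))
```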
4
Q
NSP (Next Sentence Prediction)
A
- input: [CLS] sentence A [SEP] sentence B; [SEP] separates the sentences, and the final vector C of the [CLS] token is fed to the NSP classifier, which predicts whether the second sentence would be a logical next sentence or not (sketch below)
- C of [CLS] -> classify sequences
- sequence vector useful for: sentiment analysis, NLI, paraphrasing
- token vectors useful for: POS tagging, NER, WSD
- different layers are useful in different tasks
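A minimal sketch of running this classifier, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the lecture itself does not name a toolkit); the two example sentences are illustrative.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# The tokenizer builds [CLS] sentence A [SEP] sentence B [SEP] automatically
inputs = tokenizer("He went to the bakery.", "He bought a loaf of bread.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # NSP head on top of the [CLS] vector C
print(logits.softmax(dim=-1))        # index 0 = "is the next sentence", index 1 = "is not"
```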
5
Q
Ways of using pretrained models
A
- Freeze: use the embeddings as they are and train only a classifier on top
- Fine-tuning: also adjust the model’s parameters while training the classifier (sketch below)
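A minimal PyTorch sketch of the two options, assuming the Hugging Face transformers library, a bert-base-uncased encoder, and a binary classification head; the model name and learning rates are illustrative.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # e.g. binary sentiment head

# Freeze: keep the pretrained parameters fixed, train only the classifier
for param in model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

# Fine-tuning: unfreeze and train encoder and classifier together
# for param in model.parameters():
#     param.requires_grad = True
# optimizer = torch.optim.AdamW(list(model.parameters()) + list(classifier.parameters()), lr=2e-5)
```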
6
Q
Transformer
A
MultiHeadAttention -> residual addition -> LayerNorm -> FFN -> residual addition (of the intermediate vector, not the original input data) -> LayerNorm (sketch below)
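A minimal PyTorch sketch of one post-norm transformer layer as listed above; the dimensions (d_model, number of heads, FFN size) are illustrative defaults, not prescribed by the lecture.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm transformer layer: attention and FFN, each followed by residual addition + LayerNorm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # MultiHeadAttention
        x = self.norm1(x + attn_out)       # residual addition -> LayerNorm
        x = self.norm2(x + self.ffn(x))    # FFN -> residual addition -> LayerNorm
        return x

block = TransformerBlock()
print(block(torch.randn(1, 10, 512)).shape)  # torch.Size([1, 10, 512])
```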
7
Q
Generation as sampling
A
- Greedy decoding
- Beam search
- Random sampling from y = SoftMax(u): top-k sampling, top-p (nucleus) sampling, temperature sampling (sketch below)
- Scaling laws, but not only size matters
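A minimal sketch of random sampling from y = SoftMax(u) with temperature and optional top-k filtering; top-p works analogously by keeping the smallest set of tokens whose probabilities sum to p. The function name and the toy logits are illustrative.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from y = softmax(u / temperature), optionally restricted to the top-k tokens."""
    u = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(u)[-top_k]            # k-th largest score
        u = np.where(u >= cutoff, u, -np.inf)  # drop everything below it
    y = np.exp(u - u.max())                    # numerically stable softmax
    y /= y.sum()
    return np.random.choice(len(y), p=y)

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2))
```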
8
Q
Winograd schema resolution
A
to decide which person “he” refers to, replace “he” by each candidate’s name in turn, let a language model compute the probability P of each resulting sentence, and pick the candidate with the higher P (sketch below)
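A minimal sketch of this scoring idea, assuming the Hugging Face transformers library and GPT-2 as the scoring language model (the lecture does not prescribe one); the schema sentence and candidates are illustrative, and the log-probability is approximated from the mean token loss.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence):
    """Approximate log P(sentence): negative mean token loss times the number of tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * ids.shape[1]

schema = "Jim yelled at Kevin because {} was angry."
for candidate in ["Jim", "Kevin"]:
    print(candidate, sentence_log_prob(schema.format(candidate)))
```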