Lecture 11: pretrained language models Flashcards
1
Q
ELMo architecture
A
- 2-layer, bidirectional LSTM
- word vector: weighted sum of the hidden states
- F(H0, H1, H2) = a·H0 + b·H1 + c·H2 (sketch below)
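A minimal sketch of this weighted sum, assuming three per-token hidden states H0, H1, H2 (the embedding layer plus two biLSTM layers) and scalar mixing weights a, b, c; the function name and the softmax normalisation of the weights follow the ELMo idea but are illustrative, not the original implementation.

```python
import numpy as np

def elmo_word_vector(H0, H1, H2, a, b, c):
    """Weighted sum of the three per-token hidden states (ELMo-style)."""
    weights = np.array([a, b, c], dtype=float)
    weights = np.exp(weights) / np.exp(weights).sum()  # normalise the mixing weights
    return weights[0] * H0 + weights[1] * H1 + weights[2] * H2

# toy example: 4-dimensional hidden states for a single token
H0, H1, H2 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
print(elmo_word_vector(H0, H1, H2, a=0.1, b=0.3, c=0.6))
```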
2
Q
GPT (generative pre-trained transformers) architecture
A
- Transformer architecture, unidirectional (“decoder”); causal-mask sketch below
- Each token gets a vector for token-level tasks
- The whole sequence also gets a vector
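A minimal sketch of what “unidirectional” means in practice: a causal attention mask under which each position can attend only to itself and earlier positions. The function is illustrative, not GPT's actual code.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions <= i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```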
3
Q
BERT
A
- Bidirectional
- Has for each token a bidirectionally contextualized representation at each layer
- Pretrained with two tasks: masked language modelling (replacing 15% of the tokens by [MASK]; sketch below) and Next Sentence Prediction
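A minimal sketch of the masking step, assuming a plain Python token list; real BERT preprocessing also leaves some selected tokens unchanged or swaps them for random tokens, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Replace roughly 15% of the tokens by [MASK]; return the masked sequence and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # the model has to predict the original token at this position
        else:
            masked.append(tok)
    return masked, targets

print(mask_tokens("the cat sat on the mat".split()))
```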
4
Q
NSP (Next Sentence Prediction)
A
- input: [CLS] sentence A [SEP] sentence B; [SEP] separates the sentences, and the final vector C of the [CLS] token is fed to the NSP classifier, which predicts whether the second sentence would be a logical next sentence or not (sketch below)
- C of [CLS] -> classify sequences
- sequence vector useful for: sentiment analysis, NLI, paraphrasing
- token vectors useful for: POS tagging, NER, WSD
- different layers are useful in different tasks
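A minimal sketch of running this classifier, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the lecture itself does not name a toolkit); the two example sentences are illustrative.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# The tokenizer builds [CLS] sentence A [SEP] sentence B [SEP] automatically
inputs = tokenizer("He went to the bakery.", "He bought a loaf of bread.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # NSP head on top of the [CLS] vector C
print(logits.softmax(dim=-1))        # index 0 = "is the next sentence", index 1 = "is not"
```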
5
Q
Ways of using pretrained models
A
- Freeze: use the embeddings as they are and train only a classifier on top
- Fine-tuning: also adjust the model’s parameters while training the classifier (sketch below)
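A minimal PyTorch sketch of the two options, assuming the Hugging Face transformers library, a bert-base-uncased encoder, and a binary classification head; the model name and learning rates are illustrative.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(model.config.hidden_size, 2)  # e.g. binary sentiment head

# Freeze: keep the pretrained parameters fixed, train only the classifier
for param in model.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

# Fine-tuning: unfreeze and train encoder and classifier together
# for param in model.parameters():
#     param.requires_grad = True
# optimizer = torch.optim.AdamW(list(model.parameters()) + list(classifier.parameters()), lr=2e-5)
```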
6
Q
Transformer
A
MultiHeadAttention -> residual addition -> LayerNorm -> FFN -> residual addition (of the intermediate vector, not the original input data) -> LayerNorm (sketch below)
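A minimal PyTorch sketch of one post-norm transformer layer as listed above; the dimensions (d_model, number of heads, FFN size) are illustrative defaults, not prescribed by the lecture.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm transformer layer: attention and FFN, each followed by residual addition + LayerNorm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # MultiHeadAttention
        x = self.norm1(x + attn_out)       # residual addition -> LayerNorm
        x = self.norm2(x + self.ffn(x))    # FFN -> residual addition -> LayerNorm
        return x

block = TransformerBlock()
print(block(torch.randn(1, 10, 512)).shape)  # torch.Size([1, 10, 512])
```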
7
Q
Generation as sampling
A
- Greedy decoding
- Beam search
- Random sampling from y = SoftMax(u): top-k sampling, top-p (nucleus) sampling, temperature sampling (sketch below)
- Scaling laws, but not only size matters
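A minimal sketch of random sampling from y = SoftMax(u) with temperature and optional top-k filtering; top-p works analogously by keeping the smallest set of tokens whose probabilities sum to p. The function name and the toy logits are illustrative.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from y = softmax(u / temperature), optionally restricted to the top-k tokens."""
    u = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(u)[-top_k]            # k-th largest score
        u = np.where(u >= cutoff, u, -np.inf)  # drop everything below it
    y = np.exp(u - u.max())                    # numerically stable softmax
    y /= y.sum()
    return np.random.choice(len(y), p=y)

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2))
```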
8
Q
Winograd schema resolution
A
to decide which person “he” refers to, replace “he” by each candidate’s name in turn, let a language model compute the probability P of each resulting sentence, and pick the candidate with the higher P (sketch below)
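A minimal sketch of this scoring idea, assuming the Hugging Face transformers library and GPT-2 as the scoring language model (the lecture does not prescribe one); the schema sentence and candidates are illustrative, and the log-probability is approximated from the mean token loss.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence):
    """Approximate log P(sentence): negative mean token loss times the number of tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * ids.shape[1]

schema = "Jim yelled at Kevin because {} was angry."
for candidate in ["Jim", "Kevin"]:
    print(candidate, sentence_log_prob(schema.format(candidate)))
```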