Session 3 Flashcards

1
Q

Advantages of Transformer over LSTM & RNN

A
  • LSTM & RNN: passing information through an extended series of recurrent connections leads to information loss and difficulties in training
  • LSTM & RNN: sequential nature makes parallel computation difficult
2
Q

Transformer Architecture

A
  • stacks of transformer blocks
  • blocks = multilayer networks combining a self-attention layer, layer normalization, a feed-forward (FF) layer & residual connections
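
-> a minimal sketch of one block's forward pass (NumPy; post-norm "add & norm" variant as in the original Transformer; the two sublayers are passed in as placeholder functions here, purely for illustration):

  import numpy as np

  def layer_norm(x, eps=1e-5):
      # normalize each vector to zero mean and unit standard deviation
      mu = x.mean(axis=-1, keepdims=True)
      sigma = x.std(axis=-1, keepdims=True)
      return (x - mu) / (sigma + eps)

  def transformer_block(x, self_attention, feed_forward):
      # each sublayer is wrapped in a residual connection followed by layer norm
      x = layer_norm(x + self_attention(x))
      x = layer_norm(x + feed_forward(x))
      return x

  # toy usage with placeholder sublayers (identity attention, ReLU feed-forward)
  x = np.random.default_rng(0).normal(size=(5, 16))
  out = transformer_block(x, lambda h: h, lambda h: np.maximum(0.0, h))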
3
Q

Self-attention layers

A
  • directly extract & use info from arbitrarily large contexts without the need to pass it through intermediate recurrent connections (RNN)
  • goal: What info from input to consider?
  • comparison of an item of interest to a collection of other items (e.g. via dot products that measure the similarity of vectors)
  • Access to all inputs up to & including current input -> use for autoregressive generation
  • Each calculation independent of others -> parallelization possible
4
Q

Self-attention layer: functions to learn

A
  • Query = learn what the current word should focus on
  • Key = how important the word is as context for others
  • Value = representation of the subword (used as output)
5
Q

Self-attention layer steps

A
  1. Generate a Query, Key & Value vector for each word by multiplying its embedding by three matrices learned during training
  2. Score each word of the input sequence against the current word via the dot product of query & key vectors -> how much focus to place on other parts of the input sentence (e.g. q1 · k1 = importance of “thinking” for the word “thinking”; q1 · k2 = importance of the word “machines” for the word “thinking”)
  3. Divide the scores by 8 (= the square root of the key/query dimension of 64) for more stable gradients
  4. Pass them through a softmax operation (-> scores all positive & add up to 1)
  5. Multiply each value vector by its softmax score -> keep relevant values & drown out irrelevant ones (e.g. by multiplying them by tiny weights like 0.0001)
  6. Sum the weighted value vectors (z1 = weighted v1 + weighted v2 + … -> the vector that is sent to the FFN)

-> for faster processing the calculation is done with matrices (instead of single word embeddings, pack them into a matrix); see the sketch below
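
-> a minimal NumPy sketch of steps 1-6 (matrix sizes and names are illustrative assumptions, not the exact reference implementation):

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  def self_attention(X, W_Q, W_K, W_V):
      # 1. project each input embedding into query, key and value vectors
      Q, K, V = X @ W_Q, X @ W_K, X @ W_V
      d_k = K.shape[-1]
      # 2./3. score every query against every key, scale by sqrt(d_k) (= 8 for d_k = 64)
      scores = Q @ K.T / np.sqrt(d_k)
      # 4. softmax turns each row of scores into positive weights that sum to 1
      weights = softmax(scores, axis=-1)
      # 5./6. weight the value vectors and sum them -> one output vector z_i per position
      return weights @ V

  # toy usage: 4 tokens, model dimension 512, key/query/value dimension 64 (assumed sizes)
  rng = np.random.default_rng(0)
  X = rng.normal(size=(4, 512))
  W_Q, W_K, W_V = (rng.normal(size=(512, 64)) for _ in range(3))
  Z = self_attention(X, W_Q, W_K, W_V)   # shape (4, 64)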

6
Q

Residual Connections

A

= connections that pass information from a lower layer to a higher layer without going through the intermediate layer. Allowing information from the activation going forward and the gradient going backward to skip a layer improves learning and gives higher-level layers direct access to information from lower layers.
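
-> as a formula (notation assumed; F stands for the skipped sublayer, e.g. self-attention or the FF layer):

  y = x + F(x)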

7
Q

Layer Normalization

A

vector components are normalized by subtracting the mean from each and dividing by the standard deviation
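
-> as a formula (notation assumed; \mu and \sigma are the mean and standard deviation over the components of the vector; in practice a learned gain \gamma and offset \beta usually follow):

  \hat{x} = \frac{x - \mu}{\sigma}, \qquad \mathrm{LayerNorm}(x) = \gamma \odot \hat{x} + \beta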

8
Q

Multihead Attention

A
  • Issue: no single transformer block can capture all the different kinds of parallel relations among its inputs (e.g. syntactic and semantic relationships between words)
  • Solution: sets of self-attention layers, called heads, that reside in parallel layers at the same depth in a model, each with its own set of parameters -> Given these distinct sets of parameters, each head can learn different aspects of the relationships that exist among inputs at the same level of abstraction
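
-> a minimal NumPy sketch of multi-head attention (head count, dimensions and the output projection W_O are assumed, illustrative values):

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def multi_head_attention(X, heads, W_O):
      # heads: list of (W_Q, W_K, W_V) triples, one independent parameter set per head
      outputs = []
      for W_Q, W_K, W_V in heads:
          Q, K, V = X @ W_Q, X @ W_K, X @ W_V
          weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
          outputs.append(weights @ V)            # each head attends with its own parameters
      # concatenate the per-head outputs and project back to the model dimension
      return np.concatenate(outputs, axis=-1) @ W_O

  # toy usage: 8 heads of dimension 64 on a model dimension of 512
  rng = np.random.default_rng(0)
  X = rng.normal(size=(4, 512))
  heads = [tuple(rng.normal(size=(512, 64)) for _ in range(3)) for _ in range(8)]
  W_O = rng.normal(size=(8 * 64, 512))
  Z = multi_head_attention(X, heads, W_O)        # shape (4, 512)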
9
Q

Positional Embeddings

A
  • Issue: the model has no info about the position of tokens in the input (<-> in an RNN position is built into the structure, since tokens are fed in sequentially)
  • Solution: to modify the input embeddings by combining them with positional embeddings specific to each position in an input sequence
10
Q

Transformer Training

A
  • final transformer layer produces an output distribution over the entire vocabulary
  • During training, the probability assigned to the correct word is used to calculate the cross-entropy loss for each item in the sequence
  • loss for a training sequence is the average cross-entropy loss over the entire sequence (= RNN)
  • each training item can be processed in parallel since the output for each element in the sequence is computed separately (<-> RNN)
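
-> a small NumPy sketch of the training loss (function name and shapes are assumed for illustration):

  import numpy as np

  def sequence_loss(logits, targets):
      # logits: (T, |V|) unnormalized scores over the vocabulary at each position
      # targets: (T,) index of the correct next word at each position
      shifted = logits - logits.max(axis=-1, keepdims=True)
      log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
      # per-position cross-entropy = -log probability of the correct word;
      # the loss for the sequence is the average over all positions
      return -log_probs[np.arange(len(targets)), targets].mean()

  # toy usage: 3 positions, vocabulary of 5 words
  rng = np.random.default_rng(0)
  loss = sequence_loss(rng.normal(size=(3, 5)), np.array([2, 0, 4]))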
11
Q

Cross-entropy loss

A

Distance between gold distribution & prediction
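
-> written out (y = one-hot gold distribution, \hat{y} = predicted distribution; the sum collapses to a single term because y is one-hot):

  L_{CE} = -\sum_{w \in V} y_w \log \hat{y}_w = -\log \hat{y}[w_{\text{correct}}]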

12
Q

Perplexity

A

Inverse probability of the test set, normalized by number of words
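
-> as a formula (equivalently the exponential of the average cross-entropy loss; notation assumed):

  PP(W) = P(w_1 \dots w_N)^{-1/N} = \exp\Big( \frac{1}{N} \sum_{i=1}^{N} -\log P(w_i \mid w_{<i}) \Big)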

13
Q

BERT definition

A
  • Bidirectional Encoder Representations from Transformers
  • designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers (the respective output yi depends on inputs before and after xi)
  • pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks e.g. QA and language inference, without substantial task-specific architecture modifications
14
Q

BERT steps

A
  • tokenization
  • input embeddings generation
  • pre-training
  • finetuning
15
Q

BERT - tokenization

A
  • NFD normalization = Normalization Form Canonical Decomposition; be careful when using the fast tokenizer version
  • Split punctuation & convert whitespaces
  • Subword segmentation (combines the advantages of characters & words)
  • Add special tokens ([CLS] & [SEP])
16
Q

BERT - subword segmentation

A
  • goal = something close to a morpheme = smallest meaningful constituent of a linguistic expression <-> syllable = unit of pronunciation (e.g. tokenizer -> tok-en-iz-er vs. token-izer)
  • start with a vocabulary of all characters as subwords, prefix non-initial characters with ## -> rank all possible merges with a score & choose the highest; repeat until the desired vocabulary size is reached
  • Greedily match longest possible subwords from left
  • Use [UNK] for unseen character sequence
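
-> a small Python sketch of the greedy longest-match-first segmentation (the vocabulary here is a toy assumption):

  def wordpiece_tokenize(word, vocab, unk="[UNK]"):
      # greedily match the longest subword in vocab from the left;
      # non-initial pieces carry the ## prefix
      tokens, start = [], 0
      while start < len(word):
          end, piece = len(word), None
          while start < end:
              candidate = word[start:end] if start == 0 else "##" + word[start:end]
              if candidate in vocab:
                  piece = candidate
                  break
              end -= 1                      # shrink from the right: longest match first
          if piece is None:
              return [unk]                  # no subword covers this span -> unknown token
          tokens.append(piece)
          start = end
      return tokens

  # toy vocabulary: "tokenizer" -> ['token', '##izer']
  vocab = {"token", "##izer", "##s", "play", "##ing"}
  print(wordpiece_tokenize("tokenizer", vocab))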
17
Q

BERT - Input embeddings

A
  • All embeddings are randomly initialized and learned during training (<-> the original Transformer uses sinusoidal functions to generate positional embeddings)
  • Input = sum of 3 embeddings: token, segment & position embeddings
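
-> a minimal sketch of the sum (sizes follow BERT-base but are assumed here; ids are toy values):

  import numpy as np

  rng = np.random.default_rng(0)
  token_emb    = rng.normal(size=(30522, 768))   # one vector per subword id
  segment_emb  = rng.normal(size=(2, 768))       # sentence A vs. sentence B
  position_emb = rng.normal(size=(512, 768))     # one learned vector per position

  def bert_input(token_ids, segment_ids):
      positions = np.arange(len(token_ids))
      # the input to the first transformer layer is the element-wise sum
      return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

  x = bert_input(np.array([1, 42, 2]), np.array([0, 0, 0]))   # shape (3, 768)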
18
Q

BERT - segment embeddings

A

shows which sentence (segment) a token belongs to -> segment embeddings are not really used anymore because they proved not very effective (the token embeddings with [SEP] etc. are probably sufficient)

19
Q

BERT - position embeddings

A

position of the word in the sentence encoded in vectors
-> segment & position embeddings preserve ordering (since all embeddings are fed in simultaneously)

20
Q

BERT - pre-training

A

the model is trained on unlabeled data over different pre-training tasks; it extracts generalizations from large amounts of text (learns the language); two tasks are trained simultaneously: Masked Language Model (MLM) & Next Sentence Prediction (NSP)

21
Q

BERT - pre-training MLM

A
  • predict the original vocabulary id of the masked word based only on its context
  • randomly masks some of the tokens from the input
  • output: only the predictions for the selected words are used for training
  • 15% of the token positions are chosen at random for prediction; of these, 80% are replaced with the [MASK] token, 10% with a random token, 10% left unchanged (see the sketch below)
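
-> a small Python sketch of the 15% / 80-10-10 selection (names and the not-scored marker are assumptions for illustration):

  import random

  def mask_for_mlm(token_ids, vocab_size, mask_id, seed=0):
      # select 15% of positions; of those, 80% -> [MASK], 10% -> random token,
      # 10% left unchanged; only the selected positions contribute to the loss
      rng = random.Random(seed)
      inputs, labels = list(token_ids), [None] * len(token_ids)   # None = not scored
      for i, tok in enumerate(token_ids):
          if rng.random() < 0.15:
              labels[i] = tok                              # remember the original id
              r = rng.random()
              if r < 0.8:
                  inputs[i] = mask_id                      # 80%: replace with [MASK]
              elif r < 0.9:
                  inputs[i] = rng.randrange(vocab_size)    # 10%: random token
              # else: 10% keep the token unchanged
      return inputs, labels

  # toy usage: ids are made up, mask_id stands in for the id of [MASK]
  corrupted, labels = mask_for_mlm([5, 17, 42, 8, 23, 9, 31, 4], vocab_size=100, mask_id=99)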
22
Q

BERT - pre-training NSP

A
  • Task: predict whether each pair of sentences consists of actual adjacent sentences from the training corpus or of unrelated sentences
  • 2 new tokens are added to the input representation:
    1. [CLS]: prepended to the input sentence pair
    2. [SEP]: placed between the sentences & after the final token of the second sentence
  • 50% of training pairs = positive pairs; for the rest, the second sentence is randomly selected from elsewhere in the corpus (see the sketch below)
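
-> a small Python sketch of building one NSP training example (function name and corpus format are assumptions):

  import random

  def make_nsp_example(sentences, i, seed=0):
      # sentences: list of already-tokenized sentences from the corpus;
      # 50% of the time use the true next sentence (positive pair, label 1),
      # otherwise a randomly chosen sentence from the corpus (negative pair, label 0)
      rng = random.Random(seed)
      first = sentences[i]
      if rng.random() < 0.5 and i + 1 < len(sentences):
          second, label = sentences[i + 1], 1
      else:
          second, label = rng.choice(sentences), 0   # sketch: could rarely pick the true next sentence
      tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
      return tokens, label

  # toy usage
  corpus = [["the", "cat", "sat"], ["on", "the", "mat"], ["dogs", "bark"]]
  tokens, label = make_nsp_example(corpus, 0)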
23
Q

BERT - fine-tuning

A
  • create applications on top of the pre-trained model, e.g. by adding a lightweight classifier layer on top of its outputs (learn a specific task); see the sketch below
  • use labeled data from the application to train additional application-specific parameters for the downstream task (each downstream task gets its own separately fine-tuned model)
  • initialized with the pre-trained parameters (which stay untouched or receive only minimal adjustments)
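
-> a minimal sketch of such a lightweight head (a single linear layer + softmax over the [CLS] output; sizes and names are assumed):

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def classify(cls_vector, W, b):
      # task-specific head applied to the pre-trained model's [CLS] output;
      # during fine-tuning only W and b (and optionally the pre-trained
      # parameters) are updated
      return softmax(cls_vector @ W + b)

  # toy usage: hidden size 768, 3 output classes
  rng = np.random.default_rng(0)
  cls_vector = rng.normal(size=768)          # stands in for BERT's [CLS] output vector
  W, b = rng.normal(size=(768, 3)), np.zeros(3)
  probs = classify(cls_vector, W, b)         # probability distribution over the 3 classes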
24
Q

BERT results evaluation

A
  • GLUE = General Language Understanding Evaluation = collection of diverse natural language understanding tasks including
  • Linguistic acceptability
  • Sentiment analysis (binary)
  • Paraphrase detection
  • Textual similarity (regression)
  • Recognizing textual entailment (entailment, contradiction, neutral)
25
Q

mBERT

A
  • Trained on data of > 100 languages
  • Remarkable cross-lingual performance when fine-tuning on only one language
26
Q

RoBERTa

A

No NSP, larger mini-batches & learning rates, more data -> better performance