Session 3 Flashcards
Advantages of the Transformer over LSTM & RNN
- LSTM & RNN: passing information through an extended series of recurrent connections leads to information loss and difficulties in training
- LSTM & RNN: sequential nature makes parallel computation difficult
Transformer Architecture
- stacks of transformer blocks
- blocks = multilayer networks combining a self-attention layer, a normalization layer, a feedforward (FF) layer, and residual connections
Self-attention layers
- directly extract & use info from arbitrarily large contexts without the need to pass it through intermediate recurrent connections (RNN)
- goal: What info from input to consider?
- comparison of an item of interest to a collection of other items (e.g. dot products measure the similarity of vectors)
- Access to all inputs up to & including current input -> use for autoregressive generation
- Each calculation independent of others -> parallelization possible
Self-attention layer: functions to learn
- Query = what to focus on from the current word
- Key = how important the word is as context
- Value = representation of the subword (used for the output)
Self-attention layer steps
- Generate a Query, Key & Value vector for each word by multiplying its embedding by three weight matrices learned during training
- Score each word of the input sequence against a given word via the dot product of query & key vectors -> how much focus to place on other parts of the input sentence (e.g. q1 * k1 = importance of "thinking" for the word "thinking"; q1 * k2 = importance of the word "machines" for the word "thinking")
- Divide the scores by √d_k (8 in the running example, since the key dimension is 64) for more stable gradients
- pass through softmax operation (-> scores all positive & add up to 1)
- Multiply each value vector by softmax score -> keep relevant values & drown-out irrelevant (e.g. by multiplying by 0.0001)
- Sum the weighted value vectors (z1 = weighted v1 + weighted v2 -> vector that is sent to the FFN)
-> for faster processing, the calculation is done in matrix form (stack the word embeddings into a matrix instead of handling them one by one); see the sketch below
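A minimal NumPy sketch of these steps (dimensions, variable names and the random toy inputs are illustrative assumptions; a causal mask for autoregressive generation is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) matrix of input word embeddings."""
    Q = X @ W_q                      # queries: what each word looks for
    K = X @ W_k                      # keys: what each word offers as context
    V = X @ W_v                      # values: the content that gets mixed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores)        # each row is positive and sums to 1
    return weights @ V               # weighted sum of value vectors (z1, z2, ...)

# toy run: 2 "words", d_model = 4, d_k = d_v = 3 (dimensions are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)  # Z[0] corresponds to z1, Z[1] to z2
```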
Residual Connections
= connections that pass information from a lower layer to a higher layer without going through the intermediate layer: Allowing information from the activation going forward and the gradient going backwards to skip a layer improves learning and gives higher level layers direct access to information from lower layers
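A minimal sketch of the idea (the function name is illustrative, not part of any library):

```python
def with_residual(sublayer, x):
    # the sublayer's output is added to its own unchanged input,
    # so both forward activations and backward gradients can bypass the sublayer
    return x + sublayer(x)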
Layer Normalization
vector components are normalized by subtracting the mean from each and dividing by the standard deviation
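A minimal sketch, assuming a single vector and omitting the learned gain and bias parameters of full layer normalization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the components of one vector x by its mean and standard deviation."""
    return (x - x.mean()) / (x.std() + eps)
```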
Multihead Attention
- Issue: no single transformer block can capture all different kinds of parallel relations among its inputs (e.g. syntactic, semantic, relationship between words)
- Solution: sets of self-attention layers, called heads, that reside in parallel layers at the same depth in a model, each with its own set of parameters -> Given these distinct sets of parameters, each head can learn different aspects of the relationships that exist among inputs at the same level of abstraction
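A rough sketch of how heads are combined, reusing the self_attention function from the sketch above (the list of per-head parameter tuples and the output projection W_o are illustrative assumptions):

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one parameter set per head."""
    # each head runs the same self-attention computation with its own parameters,
    # so each head can learn to attend to a different kind of relation
    head_outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    # concatenate the per-head outputs and project back to the model dimension
    return np.concatenate(head_outputs, axis=-1) @ W_o
```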
Positional Embeddings
- Issue: the model has no info about the position of tokens in the input (<-> in RNNs position is built into the structure, since inputs are fed in sequentially)
- Solution: to modify the input embeddings by combining them with positional embeddings specific to each position in an input sequence
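A minimal sketch of one common way to combine them, element-wise addition of a positional embedding matrix (function and variable names are illustrative):

```python
def add_positional_embeddings(X, P):
    """X: (seq_len, d_model) token embeddings,
    P: (max_len, d_model) positional embeddings (learned or sinusoidal).
    Position i simply gets P[i] added to its token embedding."""
    seq_len = X.shape[0]
    return X + P[:seq_len]
```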
Transformer Training
- final transformer layer produces an output distribution over the entire vocabulary
- During training, the probability assigned to the correct word is used to calculate the cross-entropy loss for each item in the sequence
- loss for a training sequence is the average cross-entropy loss over the entire sequence (= RNN)
- each training item can be processed in parallel since the output for each element in the sequence is computed separately (<-> RNN)
Cross-entropy loss
Distance between gold distribution & prediction
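In formula form (standard definition, not from the slides): L_CE(ŷ, y) = −Σ_w y_w · log ŷ_w, which for a one-hot gold distribution reduces to −log ŷ_(correct word), i.e. the negative log probability the model assigns to the correct word.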
Perplexity
Inverse probability of the test set, normalized by number of words
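In formula form (standard definition, not from the slides): PP(W) = P(w_1 … w_N)^(−1/N), i.e. the N-th root of 1 / P(w_1 … w_N); lower perplexity = better model.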
BERT definition
- Bidirectional Encoder Representations from Transformers
- designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers (the respective output y_i depends on inputs x_j both before and after x_i)
- a pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, e.g. QA and language inference, without substantial task-specific architecture modifications
BERT steps
- tokenization
- input embeddings generation
- pre-training
- finetuning
BERT - tokenization
- NFD normalization = Normalization Form Canonical Decomposition; be careful when using the fast tokenizer version
- Split punctuation & convert whitespaces
- Subword segmentation (combines advantages of characters & words)
- Add special tokens (CLS & SEP)
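A hedged illustration of these steps with the Hugging Face transformers tokenizer (the checkpoint name is an illustrative choice, and the exact subword splits depend on its WordPiece vocabulary):

```python
from transformers import BertTokenizer

# WordPiece tokenizer of the (illustrative) bert-base-uncased checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# subword segmentation: rare words are split into pieces marked with "##"
print(tokenizer.tokenize("Transformers use subword segmentation"))

# full encoding also adds the special [CLS] and [SEP] tokens
encoded = tokenizer("Transformers use subword segmentation")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```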