Session 3 Flashcards
Advantages of the Transformer over LSTM & RNN
- LSTM & RNN: passing information through an extended series of recurrent connections leads to information loss and difficulties in training
- LSTM & RNN: sequential nature makes parallel computation difficult
Transformer Architecture
- stacks of transformer blocks
- blocks = multilayer networks combining a self-attention layer, a normalization layer, a feedforward (FF) layer, and residual connections
Self-attention layers
- directly extract & use info from arbitrarily large contexts without the need to pass it through intermediate recurrent connections (RNN)
- goal: What info from input to consider?
- comparison of an item of interest to a collection of other items (e.g. dot products measure the similarity of vectors)
- Access to all inputs up to & including current input -> use for autoregressive generation
- Each calculation independent of others -> parallelization possible
Self-attention layer: functions to learn
- Query = what to focus on from the current word
- Key = how important the word is as context
- Value = representation of the subword (used for the output)
Self-attention layer steps
- Generate a Query, Key & Value vector for each word by multiplying its embedding by three weight matrices learned during training
- Score each word of the input sequence against a given word via the dot product of query & key vectors -> how much focus to place on other parts of the input sentence (e.g. q1 * k1 = importance of "thinking" for the word "thinking"; q1 * k2 = importance of the word "machines" for the word "thinking")
- Divide the scores by √d_k (8 in the running example, since the key dimension is 64) for more stable gradients
- pass through softmax operation (-> scores all positive & add up to 1)
- Multiply each value vector by softmax score -> keep relevant values & drown-out irrelevant (e.g. by multiplying by 0.0001)
- Sum the weighted value vectors (z1 = weighted v1 + weighted v2 -> vector that is sent to the FFN)
-> for faster processing, the calculation is done in matrix form (stack the word embeddings into a matrix instead of handling them one by one); see the sketch below
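A minimal NumPy sketch of these steps (dimensions, variable names and the random toy inputs are illustrative assumptions; a causal mask for autoregressive generation is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) matrix of input word embeddings."""
    Q = X @ W_q                      # queries: what each word looks for
    K = X @ W_k                      # keys: what each word offers as context
    V = X @ W_v                      # values: the content that gets mixed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores)        # each row is positive and sums to 1
    return weights @ V               # weighted sum of value vectors (z1, z2, ...)

# toy run: 2 "words", d_model = 4, d_k = d_v = 3 (dimensions are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)  # Z[0] corresponds to z1, Z[1] to z2
```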
Residual Connections
= connections that pass information from a lower layer to a higher layer without going through the intermediate layer: Allowing information from the activation going forward and the gradient going backwards to skip a layer improves learning and gives higher level layers direct access to information from lower layers
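A minimal sketch of the idea (the function name is illustrative, not part of any library):

```python
def with_residual(sublayer, x):
    # the sublayer's output is added to its own unchanged input,
    # so both forward activations and backward gradients can bypass the sublayer
    return x + sublayer(x)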
Layer Normalization
vector components are normalized by subtracting the mean from each and dividing by the standard deviation
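A minimal sketch, assuming a single vector and omitting the learned gain and bias parameters of full layer normalization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the components of one vector x by its mean and standard deviation."""
    return (x - x.mean()) / (x.std() + eps)
```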
Multihead Attention
- Issue: no single transformer block can capture all different kinds of parallel relations among its inputs (e.g. syntactic, semantic, relationship between words)
- Solution: sets of self-attention layers, called heads, that reside in parallel layers at the same depth in a model, each with its own set of parameters -> Given these distinct sets of parameters, each head can learn different aspects of the relationships that exist among inputs at the same level of abstraction
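A rough sketch of how heads are combined, reusing the self_attention function from the sketch above (the list of per-head parameter tuples and the output projection W_o are illustrative assumptions):

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one parameter set per head."""
    # each head runs the same self-attention computation with its own parameters,
    # so each head can learn to attend to a different kind of relation
    head_outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    # concatenate the per-head outputs and project back to the model dimension
    return np.concatenate(head_outputs, axis=-1) @ W_o
```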
Positional Embeddings
- Issue: the model has no info about the position of tokens in the input (<-> in RNNs position is built into the structure, since inputs are fed in sequentially)
- Solution: to modify the input embeddings by combining them with positional embeddings specific to each position in an input sequence
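A minimal sketch of one common way to combine them, element-wise addition of a positional embedding matrix (function and variable names are illustrative):

```python
def add_positional_embeddings(X, P):
    """X: (seq_len, d_model) token embeddings,
    P: (max_len, d_model) positional embeddings (learned or sinusoidal).
    Position i simply gets P[i] added to its token embedding."""
    seq_len = X.shape[0]
    return X + P[:seq_len]
```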
Transformer Training
- final transformer layer produces an output distribution over the entire vocabulary
- During training, the probability assigned to the correct word is used to calculate the cross-entropy loss for each item in the sequence
- loss for a training sequence is the average cross-entropy loss over the entire sequence (= RNN)
- each training item can be processed in parallel since the output for each element in the sequence is computed separately (<-> RNN)
Cross-entropy loss
Distance between gold distribution & prediction
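In formula form (standard definition, not from the slides): L_CE(ŷ, y) = −Σ_w y_w · log ŷ_w, which for a one-hot gold distribution reduces to −log ŷ_(correct word), i.e. the negative log probability the model assigns to the correct word.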
Perplexity
Inverse probability of the test set, normalized by number of words
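In formula form (standard definition, not from the slides): PP(W) = P(w_1 … w_N)^(−1/N), i.e. the N-th root of 1 / P(w_1 … w_N); lower perplexity = better model.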
BERT definition
- Bidirectional Encoder Representations from Transformers
- designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers (the respective output y_i depends on inputs x_j both before and after x_i)
- a pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, e.g. QA and language inference, without substantial task-specific architecture modifications
BERT steps
- tokenization
- input embeddings generation
- pre-training
- finetuning
BERT - tokenization
- NFD normalization = Normalization Form Canonical Decomposition; be careful when using the fast tokenizer version
- Split punctuation & convert whitespaces
- Subword segmentation (combines advantages of characters & words)
- Add special tokens (CLS & SEP)
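A hedged illustration of these steps with the Hugging Face transformers tokenizer (the checkpoint name is an illustrative choice, and the exact subword splits depend on its WordPiece vocabulary):

```python
from transformers import BertTokenizer

# WordPiece tokenizer of the (illustrative) bert-base-uncased checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# subword segmentation: rare words are split into pieces marked with "##"
print(tokenizer.tokenize("Transformers use subword segmentation"))

# full encoding also adds the special [CLS] and [SEP] tokens
encoded = tokenizer("Transformers use subword segmentation")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```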