LARGE LANGUAGE MODELS (LLMs) Flashcards
3 types of LLM Architecture: Transformer encoder
for sequence-to-label tasks
e.g. used by BERT
(no decoder)
Allows any two states (positions) to connect to each other
Used for POS tagging
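A minimal sketch (not BERT itself) of an encoder used for sequence-to-label tasks such as POS tagging, built with PyTorch's nn.TransformerEncoder; all sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, NUM_TAGS = 1000, 64, 17  # assumed toy values

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
encoder_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
tag_head = nn.Linear(D_MODEL, NUM_TAGS)        # one label per token

tokens = torch.randint(0, VOCAB_SIZE, (1, 8))  # (batch, sequence length)
states = encoder(embed(tokens))                # every position can attend to every other
tag_logits = tag_head(states)                  # (1, 8, NUM_TAGS): sequence -> labels
print(tag_logits.shape)
```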
3 types of LLM Architecture: Transformer decoder
for sequence generation
e.g. used by ChatGPT
The prediction from previous states becomes the input to the current state
Autoregressive: it produces one token at a time, taking into account all previous states (slower than parallel processing)
No connections to subsequent (future) states
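A minimal sketch of autoregressive decoding: a decoder-style stack (here built from encoder layers plus a causal mask, so no position attends to future positions) generates one token per step, feeding each prediction back as input. The sizes and greedy loop are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 1000, 64
embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

tokens = torch.tensor([[1]])                      # assumed start-of-sequence token id
for _ in range(5):                                # generate 5 tokens, one per step
    L = tokens.size(1)
    causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    states = decoder(embed(tokens), mask=causal_mask)   # no attention to future states
    next_token = lm_head(states[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)     # prediction becomes next input
print(tokens)
```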
3 types of LLM Architecture: Transformer encoder-decoder
Hybrid
sequence to sequence
used by the Text-to-Text Transfer Transformer (T5)
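A sketch of sequence-to-sequence use of an encoder-decoder model, assuming the Hugging Face `transformers` library and the public "t5-small" checkpoint are available.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 casts every task as text-to-text: input sequence in, output sequence out.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```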
What are Model Parameters
An LLM contains a large number of parameters, such as input vectors for tokens (token embeddings), neural-network weights, projection matrices for attention, etc.
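A quick way to count the parameters of any PyTorch module; the module below is just an illustrative stand-in, not a full LLM.

```python
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=64, nhead=4)   # illustrative module
num_params = sum(p.numel() for p in model.parameters())   # attention projections, feed-forward weights, biases, ...
print(f"{num_params:,} parameters")
```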
What is the Objective of training
To find the best (or sufficiently good) values for the model parameters using training instances
(optimisation)
What is the Objective Function in training
What we want to optimise (maximise or minimise)
e.g. a “performance function” (to maximise) or a “loss function” (to minimise)
A performance measure evaluated over a collection of tasks using training instances
Eg:
whether the model can correctly predict the ground-truth output given the input instances
What is Training
Finding good model parameters by optimizing the objective function O(Θ), e.g., maximizing the log-likelihood and/or minimizing the cross-entropy loss, computed over the training instances
What is cross entropy loss
A loss function that approximates the classification error
It is computed from predicted probabilities (values between 0 and 1); a perfect prediction gives a loss of 0
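For a single instance, cross-entropy is L = -Σ_i y_i log p_i, which reduces to -log p(correct class) for one-hot ground truth. A small sketch with assumed toy values, comparing a manual computation against PyTorch's built-in loss.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # assumed model scores for 3 classes
target = torch.tensor([0])                  # ground-truth class index

probs = F.softmax(logits, dim=-1)           # probabilities in (0, 1)
manual_loss = -torch.log(probs[0, target])  # -log p(correct class)
builtin_loss = F.cross_entropy(logits, target)
print(manual_loss.item(), builtin_loss.item())  # the two values match
```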
What is Iterative optimisation
Given an objective function O(Θ), one starts from a random initial guess of Θ and iteratively updates the guess to increase (or decrease) O(Θ)
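A minimal sketch of iterative optimisation: minimise O(θ) = (θ - 3)² starting from a random guess and repeatedly stepping against the gradient. The objective, step size, and iteration count are illustrative assumptions.

```python
import random

theta = random.uniform(-10, 10)            # random initial guess of theta
step_size = 0.1
for _ in range(100):
    gradient = 2 * (theta - 3)             # dO/dtheta for O(theta) = (theta - 3)^2
    theta = theta - step_size * gradient   # change the guess to decrease O(theta)
print(theta)                               # converges towards 3
```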
What is Mini-batch Gradient Descent for Training LLM
We update the parameters based on an estimated gradient of the objective function, computed over a small subset of training instances (called a mini-batch) instead of the whole dataset
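A sketch of one mini-batch update: the gradient is estimated on a small batch of instances rather than the full dataset. The data, model, and sizes are toy assumptions; a real LLM update has the same shape but vastly more parameters.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                          # stand-in for an LLM
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_x = torch.randn(32, 10)                     # one mini-batch of 32 instances
batch_y = torch.randint(0, 2, (32,))

loss = nn.functional.cross_entropy(model(batch_x), batch_y)
optimizer.zero_grad()
loss.backward()                                   # gradient estimated from this batch only
optimizer.step()                                  # one parameter update
```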
What are hyperparameters
They are training settings that must be chosen before training starts
They are not learned from the data but are choices made by the practitioner to configure the learning process; choosing them well is crucial for a good model
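Typical hyperparameters fixed by the practitioner before training; the values below are common illustrative choices, not universal defaults.

```python
hyperparameters = {
    "learning_rate": 3e-4,   # step size of the optimiser
    "batch_size": 32,        # instances per mini-batch
    "num_epochs": 3,         # passes over the training data
    "num_layers": 12,        # model depth is also fixed before training
}
```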
What is the main two-step approach in LLM training
- Unsupervised pretraining using unlabelled text.
- Supervised fine-tuning based on downstream NLP tasks
The success of an LLM largely relies on its training design, such as:
- how to construct training tasks
- how to prepare training data
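A sketch of step 2 (supervised fine-tuning) on top of step 1 (a pretrained checkpoint), assuming the Hugging Face `transformers` library; the example texts, labels, and single-step loop are placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")       # pretrained on unlabelled text
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)                               # new head for the downstream task

batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])                                        # assumed task labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss                            # supervised fine-tuning loss
loss.backward()
optimizer.step()
```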
What is an Epoch
A complete pass through the entire training dataset
During an epoch, the model sees and processes every example in the training dataset exactly once
The number of epochs is a hyperparameter: how many times the algorithm will work through the entire training dataset
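A sketch of how epochs relate to mini-batches: the outer loop counts epochs, and each epoch iterates over every batch once. The dataset here is random placeholder data and the sizes are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(320, 10), torch.randint(0, 2, (320,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 3                        # hyperparameter: passes over the dataset
for epoch in range(num_epochs):
    for batch_x, batch_y in loader:   # every training example seen exactly once per epoch
        loss = nn.functional.cross_entropy(model(batch_x), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```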
What is BERT training
Bidirectional Encoder Representations from Transformers (BERT)
Uses a Transformer Encoder structure, taking either a single sentence or a sentence pair as an input
sentence pair: the combined sequence of tokens from two sentences,
starting with [CLS] and with [SEP] between the two sentences
e.g. “[CLS] my dog is cute [SEP] he likes playing”
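How this input sequence can be built in practice, assuming the Hugging Face `transformers` library and the "bert-base-uncased" tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is cute", "he likes playing")
# Tokens include [CLS] at the start and [SEP] separating (and ending) the pair.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```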
What is BERT Pretraining
It is pretrained on two text collections:
- BookCorpus
- English Wikipedia
It is pretrained on two learning tasks:
- masked language model (MLM) pretraining
- next sentence prediction (NSP) pretraining
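A sketch of the MLM pretraining signal: randomly mask a fraction of input tokens and train the model to predict the originals. The ~15% masking rate follows the BERT paper; the token ids and mask id below are toy assumptions.

```python
import torch

MASK_ID = 103                                  # assumed [MASK] token id
token_ids = torch.randint(1000, 2000, (1, 12)) # toy input sequence
labels = token_ids.clone()                     # ground truth to predict

mask = torch.rand(token_ids.shape) < 0.15      # choose ~15% of positions
token_ids[mask] = MASK_ID                      # corrupt the input
labels[~mask] = -100                           # ignore unmasked positions in the loss
# `token_ids` is fed to the encoder; cross-entropy is computed only at masked positions.
```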