LARGE LANGUAGE MODELS (LLMs) Flashcards
3 types of LLM Architecture: Transformer encoder
for sequence-to-label tasks
e.g. used by BERT
(no decoder)
Allows any two states to connect
Used for POS tagging
3 types of LLM Architecture: Transformer decoder
for sequence generation
e.g. used by ChatGPT
Prediction from previous states is the input of the current state
autoregressive: it produces a token taking into account all previous states (slower than parallel processing)
No connection to the subsequent states
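A minimal sketch of the autoregressive loop described above, assuming a hypothetical decoder-only `model` callable that maps a 1 x T tensor of token ids to 1 x T x vocab_size logits; the predicted token is appended to the input and the loop repeats.

```python
import torch

def greedy_decode(model, input_ids, max_new_tokens=20, eos_id=None):
    """Autoregressive (greedy) decoding: each step feeds all previous
    tokens back in and appends the single most likely next token."""
    tokens = list(input_ids)
    for _ in range(max_new_tokens):
        # model is assumed to return logits over the vocabulary
        # for every position, using causal (no look-ahead) attention
        logits = model(torch.tensor([tokens]))
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens
```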
3 types of LLM Architecture: Transformer encoder-decoder
Hybrid
sequence to sequence
e.g. used by the Text-to-Text Transfer Transformer (T5)
What are Model Parameters
An LLM contains a large number of parameters, such as input vectors for tokens (token embeddings), neural network weights, and projection matrices for attention
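As a rough illustration, the parameter groups listed above can be counted from their shapes; the sizes below are made-up example values, not taken from any specific model.

```python
vocab_size, d_model = 30_000, 768                # illustrative values only

token_embeddings = vocab_size * d_model          # input vectors for tokens
attention_projections = 4 * d_model * d_model    # Q, K, V and output projections (one layer)
feed_forward = 2 * d_model * (4 * d_model)       # two weight matrices of one feed-forward block

print(token_embeddings + attention_projections + feed_forward)
```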
What is the Objective of training
To find the best (or sufficiently good) values for the model parameters using training instances
(optimisation)
What is the Objective Function in training
What we want to optimise (maximise or minimise)
e.g. a “performance function” or “loss function”
A performance measure computed over a collection of tasks using training instances
Eg:
Whether the model can correctly predict the ground-truth output given the input instances
What is Training
Finding good model parameters by optimizing the objective function O(Θ), e.g., maximizing the log-likelihood and/or minimizing the cross-entropy loss, computed over the training instances
What is cross entropy loss
A loss function used to approximate classification error
computed from the model's predicted probabilities, which lie between 0 and 1
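A minimal sketch of cross-entropy loss for a single prediction, assuming the model outputs a probability distribution over classes (each probability between 0 and 1).

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Negative log probability assigned to the ground-truth class.
    The loss is 0 when the model puts probability 1 on the true class
    and grows as that probability shrinks."""
    return -math.log(predicted_probs[true_class])

# Example: 3-way classification, ground truth is class 0
print(cross_entropy([0.7, 0.2, 0.1], 0))   # ~0.36
print(cross_entropy([0.1, 0.2, 0.7], 0))   # ~2.30 (confidently wrong -> large loss)
```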
What is Iterative optimisation
Given an objective function O(Θ), one starts from a random initial guess of Θ, and iteratively applies a change to the guess to increase (or decrease) O(Θ)
What is Mini-batch Gradient Descent for Training LLM
We update the parameters based on an estimated gradient of the objective function, computed over a small set (called a batch) of training instances - instead of the whole dataset
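A minimal sketch of mini-batch gradient descent on a toy least-squares objective; the data, learning rate, and batch size are illustrative, not from any particular LLM recipe.

```python
import random

random.seed(0)
# toy data: y = 2x plus noise; we fit a single parameter theta in y ≈ theta * x
data = [(x, 2 * x + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(200))]

theta, lr, batch_size = 0.0, 0.1, 8
for epoch in range(20):                      # one epoch = one full pass over the data
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradient of the mean squared error, estimated on this batch only
        grad = sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)
        theta -= lr * grad                   # step against the gradient
print(theta)                                 # converges close to 2
```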
What are hyperparameters
Training settings that must be chosen before training starts
not learned from the data, but choices made by the practitioner to configure the learning process; crucial for a good model
What is the main two-step approach in LLM training
- Unsupervised pretraining using unlabelled text.
- Supervised fine-tuning based on downstream NLP tasks
The success of an LLM largely relies on its training design, such as:
- how to construct training tasks
- how to prepare training data
What is an Epoch
A complete pass through the entire training dataset
During an epoch, the model sees and processes every example in the training dataset exactly once
The number of epochs is a hyperparameter: how many times the algorithm works through the entire training dataset
What is BERT training
Bidirectional Encoder Representations from Transformers (BERT)
Uses a Transformer Encoder structure, taking either a single sentence or a sentence pair as an input
sentence pair: combined sequence of tokens from two sentences,
starts with [CLS], with [SEP] between the two sentences
e.g. “[CLS] my dog is cute [SEP] he likes playing”
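A sketch of this sentence-pair input format using the Hugging Face transformers tokenizer, assuming that library and the bert-base-uncased checkpoint are available; the tokenizer inserts the special tokens automatically.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is cute", "he likes playing")
# prints the token sequence: it starts with [CLS], with [SEP]
# separating (and ending) the two sentences
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```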
What is the BERT Pretraining
It is pretrained on two text collections:
- BookCorpus
- English Wikipedia
It is pretrained on two learning tasks:
- masked language model (MLM) pretraining
- next sentence prediction (NSP) pretraining
BERT Training: MLM
Text is preprocessed into tokens
Randomly selects 15% of tokens; within that selection:
80% - replace with [MASK]
10% - replace with a random token
10% - keep unchanged
Task: predict the original tokens at the selected positions
If the model predicts them well, pretraining has succeeded (see the sketch below)
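A minimal sketch of the 15% / 80-10-10 masking rule described above; the token list, vocabulary, and mask string are illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
    """Return corrupted input tokens and the prediction targets.
    Of the selected ~15% of positions: 80% become [MASK], 10% become a
    random token, 10% stay unchanged; the model must predict the
    original token at every selected position."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:
            targets.append(tok)                      # ground truth to predict
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep unchanged
        else:
            targets.append(None)                     # not selected: no loss here
            inputs.append(tok)
    return inputs, targets
```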
BERT Training: NSP
Extract sentence pairs from the corpus
For each sentence:
- 50%: pair it with the sentence that truly follows it and label it ‘yes’
- 50%: pair it with a random sentence and label it ‘no’
Binary classification task: classify each input sentence pair as yes or no (see the sketch below)
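A minimal sketch of building NSP training pairs from an ordered list of sentences, following the 50/50 rule above; the function name is illustrative.

```python
import random

def make_nsp_pairs(sentences):
    """For each sentence, pair it 50% of the time with the sentence that
    truly follows it (label 'yes') and 50% of the time with a random
    sentence from the corpus (label 'no')."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "yes"))
        else:
            # simplification: the random pick may occasionally be the true next sentence
            pairs.append((sentences[i], random.choice(sentences), "no"))
    return pairs
```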
What is BERT Fine-tuning
BERT is fine-tuned on a collection of existing NLP tasks
The tasks are converted to classification tasks
What is GLUE BENCHMARK
The General Language Understanding Evaluation (GLUE) benchmark: a collection of NLP tasks that can be used to fine-tune BERT
What are Generative Pre-trained Transformers (GPT)
A family of LLMs introduced by OpenAI
uses a transformer decoder structure, supported by generative pretraining and supervised fine-tuning
What is GPT training
generative pretraining predicts each token from its previous k tokens (using a conditional modelling method)
The autoregressive decoder structure enables adding the predicted token to the input tokens
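Stated as a formula (following the conditional-modelling description above, with tokens u_1..u_n, context size k, and parameters Θ), generative pretraining maximises:

```latex
L(\Theta) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
```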
What is GPT Fine tuning
Also fine-tuned on tasks such as the GLUE benchmark
After unsupervised pretraining,
supervised fine-tuning involves simply concatenating correct input and output sequence pairs with a delimiter token, e.g. “$”
The pairs are passed to the decoder to form a representation vector which can then be used in the prediction layer for specific classification tasks
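A minimal sketch of the concatenation step, assuming “$” as the delimiter token as in the card above; the function name and example strings are illustrative.

```python
def make_finetune_sequence(input_text, output_text, delimiter=" $ "):
    """Concatenate an input/output pair into one sequence for the decoder;
    the representation at the final position feeds the prediction layer."""
    return input_text + delimiter + output_text

print(make_finetune_sequence("The movie was great.", "positive"))
# "The movie was great. $ positive"
```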
What is Instruction Fine-tuning
an effective way to prepare input-output pairs for training
describes all NLP tasks using natural language instructions
and fine tunes an LLM to understand and process these instructions
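An illustrative, made-up instruction-formatted training pair, showing how a task is described in natural language:

```python
example = {
    "instruction": "Translate the following sentence into French.",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
}
```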
What is Chain of Thought Annotation
Improves LLM for unseen reasoning tasks
Prepares better training data by asking a human expert to annotate the reasoning behind each answer
e.g. add an annotation to the answer explaining how it was derived (see the example below)
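An illustrative example of a chain-of-thought annotation attached to an answer (the question and reasoning text are illustrative):

```python
example = {
    "question": ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
                 "How many balls does he have now?"),
    "reasoning": "Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
    "answer": "11",
}
```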
What is Learning from Human Feedback
Using reinforcement learning from human feedback (RLHF)
For example, over 50 experts were asked to give recommendations, which were fed into mitigations and improvements for the model
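A minimal sketch of the pairwise preference loss commonly used to train a reward model from human feedback; this is an assumption about the general RLHF recipe, not a statement of any specific model's implementation.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss for a reward model: pushes the reward of the response
    humans preferred above the reward of the rejected response
    (negative log-sigmoid of the reward difference)."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(2.0, 0.5))   # small loss: preferred response already scores higher
print(preference_loss(0.5, 2.0))   # large loss: reward model disagrees with the human
```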