LARGE LANGUAGE MODELS (LLMs) Flashcards

1
Q

3 types of LLM Architecture: Transformer encoder

A

for sequence-to-label tasks
e.g. used by BERT
(no decoder)
allows any two states (positions) to connect to each other (bidirectional)
used for tasks such as POS tagging

2
Q

3 types of LLM Architecture: Transformer decoder

A

for sequence generation
e.g. used by ChatGPT
the prediction from previous states becomes the input to the current state
autoregressive: it produces one token at a time, taking into account all previous states (slower than parallel processing)
no connection to subsequent states (no lookahead)

3
Q

3 types of LLM Architecture: Transformer encoder-decoder

A

Hybrid
sequence to sequence
e.g. used by T5 (Text-to-Text Transfer Transformer)

4
Q

What are Model Parameters

A

An LLM contains a large number of parameters, such as input vectors for tokens (token embeddings), neural-network weights, and projection matrices for attention

5
Q

What is the Objective of training

A

To find the best (or sufficiently good) values for the model parameters using training instances
(optimisation)

6
Q

What is the Objective Function in training

A

What we want to optimise (maximise or minimise)
e.g. a “performance function” (maximised) or a “loss function” (minimised)
A performance measure evaluated over a collection of tasks using training instances
E.g.:
whether the model can correctly predict the ground-truth output given the input instances

7
Q

What is Training

A

Finding good model parameters by optimizing the objective function O(Θ), e.g., maximizing the log-likelihood and/or minimizing the cross-entropy loss, computed over the training instances

8
Q

What is cross entropy loss

A

A loss function used to approximate classification error
It operates on predicted probabilities (each between 0 and 1); the lower the loss, the better the predictions match the ground truth
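A minimal sketch of the idea for a single classification example (toy code; natural-log convention assumed):

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Cross-entropy loss for one classification example.

    predicted_probs: predicted class probabilities (each in [0, 1], summing to 1).
    true_index: index of the ground-truth class.
    Lower loss means the model assigned higher probability to the truth.
    """
    return -math.log(predicted_probs[true_index])

# A confident, correct prediction gives a small loss...
low = cross_entropy([0.9, 0.05, 0.05], 0)
# ...while a confident, wrong prediction gives a large one.
high = cross_entropy([0.05, 0.9, 0.05], 0)
print(low < high)  # True
```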

9
Q

What is Iterative optimisation

A

Given an objective function O(Θ), one starts from a random initial guess of Θ, and iteratively applies a change to the guess to increase (or decrease) O(Θ)
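The idea can be sketched with a toy one-parameter objective (the quadratic below is illustrative, not from the course material):

```python
def minimise(grad, theta=5.0, lr=0.1, steps=100):
    """Iterative optimisation: start from an initial guess of theta and
    repeatedly step against the gradient to decrease the objective O(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimise O(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
best = minimise(lambda t: 2 * (t - 3))
print(round(best, 3))  # close to the true minimiser, 3
```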

10
Q

What is Mini-batch Gradient Descent for Training LLM

A

We update the parameters based on an estimated gradient of the objective function, computed over a small set (called a batch) of training instances - instead of the whole dataset
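A toy sketch of the procedure, fitting a one-weight linear model on synthetic data (all names and numbers are illustrative):

```python
import random

def minibatch_sgd(data, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch gradient descent on squared error:
    each update uses a gradient estimated on a small batch of training
    instances instead of the whole dataset."""
    rng = random.Random(seed)
    data = list(data)  # avoid mutating the caller's list
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of mean (w*x - y)^2 over the batch, w.r.t. w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# Synthetic data generated from the true weight 2.0
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]
w = minibatch_sgd(data)
print(round(w, 2))  # close to 2.0
```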

11
Q

What are hyperparameters

A

Training settings that must be determined before training starts (e.g. learning rate, batch size, number of epochs)
Not learned from the data, but rather choices made by the practitioner to configure the learning process: crucial for a good model

12
Q

What is the main two-step approach in LLM training

A
  • Unsupervised pretraining using unlabelled text.
  • Supervised fine-tuning based on downstream NLP tasks

The success of an LLM largely relies on its training design, such as:
- how to construct training tasks
- how to prepare training data

13
Q

What is an Epoch

A

A complete pass through the entire training dataset
During an epoch, the model sees and processes every example in the training dataset exactly once
The epoch hyperparameter sets how many times the algorithm works through the entire training dataset
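A toy loop showing how epochs and batches nest (illustrative only):

```python
def run_epochs(dataset, num_epochs, batch_size):
    """One epoch = one complete pass over the dataset.  The number of
    epochs is a hyperparameter chosen before training starts."""
    seen = 0
    for epoch in range(num_epochs):
        for i in range(0, len(dataset), batch_size):
            batch = dataset[i:i + batch_size]
            seen += len(batch)  # every example is processed once per epoch
    return seen

dataset = list(range(10))
# 10 examples x 3 epochs: each example is seen exactly 3 times
print(run_epochs(dataset, num_epochs=3, batch_size=4))  # 30
```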

14
Q

What is BERT training

A

Bidirectional Encoder Representations from Transformers (BERT)
Uses a Transformer encoder structure, taking either a single sentence or a sentence pair as input
sentence pair: a combined sequence of tokens from two sentences
starts with [CLS], with [SEP] between the two sentences
e.g. “[CLS] my dog is cute [SEP] he likes playing”
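A toy sketch of building the input sequence (whitespace tokenisation stands in for BERT's actual WordPiece; note that real BERT also appends a final [SEP]):

```python
def bert_input(sentence_a, sentence_b=None):
    """Build a BERT-style input: [CLS] first, [SEP] after each sentence.
    Toy whitespace tokenisation, not BERT's WordPiece."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b.split() + ["[SEP]"]
    return tokens

print(" ".join(bert_input("my dog is cute", "he likes playing")))
# [CLS] my dog is cute [SEP] he likes playing [SEP]
```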

15
Q

What is the BERT Pretraining

A

It is pretrained on two text collections:
- Book corpus
- English Wikipedia

It is pretrained on two learning tasks:
- masked language model (MLM) pretraining
- next sentence prediction (NSP) pretraining

16
Q

BERT Training: MLM

A

Text is preprocessed into tokens
Randomly select 15% of the tokens; within those:
80% - replace with [MASK]
10% - replace with a random token
10% - keep unchanged

Task - predict the original tokens at the selected positions
If the model can recover them, it has been trained successfully
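The masking recipe can be sketched as toy code (illustrative, not BERT's actual implementation; the random-token branch draws from the sentence itself instead of a real vocabulary):

```python
import random

def mask_tokens(tokens, seed=1):
    """BERT-style MLM corruption: select ~15% of tokens; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns the corrupted sequence and the positions to predict."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must recover
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(tokens)  # toy stand-in for a vocabulary
            # else: token kept unchanged (but still predicted)
    return corrupted, targets

tokens = "my dog is cute and he likes playing in the park".split()
corrupted, targets = mask_tokens(tokens)
```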

17
Q

BERT Training: NSP

A

Extract sentence pairs from the corpus
For each sentence:
- 50% - select the sentence that truly follows it and label the pair ‘yes’
- 50% - select a random sentence and label the pair ‘no’

Binary classification task - classify an input sentence pair as yes or no
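A toy sketch of constructing NSP training pairs (the corpus and sentences are illustrative):

```python
import random

def nsp_pairs(sentences, seed=0):
    """Build next-sentence-prediction pairs: each sentence is paired with
    its true successor (label 'yes') or a random sentence (label 'no'),
    with 50/50 probability."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "yes"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "no"))
    return pairs

corpus = ["A man went out.", "He bought milk.", "It rained.", "He went home."]
pairs = nsp_pairs(corpus)
```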

18
Q

What is BERT Fine-tuning

A

BERT is fine-tuned on a collection of existing NLP tasks
The tasks are converted into classification tasks

19
Q

What is GLUE BENCHMARK

A

The General Language Understanding Evaluation benchmark: a collection of NLP tasks that can be used to fine-tune (and evaluate) models such as BERT

20
Q

What are Generative Pre-trained Transformers (GPT)

A

A family of LLMs introduced by OpenAI
Uses a Transformer decoder structure, supported by generative pretraining and supervised fine-tuning

21
Q

What is GPT training

A

Generative pretraining predicts each token from its previous k tokens (a conditional language-modelling method)
The autoregressive decoder structure allows each predicted token to be appended to the input tokens
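The autoregressive loop can be sketched with a stand-in "model" function (illustrative; a real decoder would predict vocabulary tokens):

```python
def generate(next_token, prompt, steps):
    """Autoregressive decoding: each predicted token is appended to the
    input, and prediction only conditions on previous tokens (no lookahead).
    `next_token` stands in for a trained decoder."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(next_token(tokens))  # one token per step, not parallel
    return tokens

# Toy "model": predicts the number of tokens seen so far.
print(generate(lambda toks: len(toks), [0], 4))  # [0, 1, 2, 3, 4]
```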

22
Q

What is GPT Fine tuning

A

After unsupervised pretraining, GPT is also fine-tuned on benchmarks such as GLUE
Supervised fine-tuning involves simply concatenating correct input and output sequence pairs, with a delimiter token, e.g., “$”

The pairs are passed to the decoder to form a representation vector, which can then be used in the prediction layer for specific classification tasks
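A toy sketch of the concatenation step (illustrative; real fine-tuning operates on token IDs, and the example task is made up):

```python
def finetune_example(input_tokens, output_tokens, delim="$"):
    """GPT-style supervised fine-tuning instance: the correct input and
    output sequences are concatenated with a delimiter token between them."""
    return input_tokens + [delim] + output_tokens

print(finetune_example(["the", "film", "was", "great"], ["positive"]))
# ['the', 'film', 'was', 'great', '$', 'positive']
```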

23
Q

What is Instruction Fine-tuning

A

An effective way to prepare input-output pairs for training
It describes all NLP tasks using natural-language instructions
and fine-tunes an LLM to understand and process these instructions
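A hypothetical example of instruction-formatted data (the field names and example tasks are assumptions for illustration, not from a specific dataset):

```python
# Hypothetical instruction-formatted pairs: each task is described in
# natural language, so one model can be fine-tuned on many tasks at once.
examples = [
    {"instruction": "Translate the following sentence to French.",
     "input": "The cat sleeps.",
     "output": "Le chat dort."},
    {"instruction": "Classify the sentiment as positive or negative.",
     "input": "I loved this film.",
     "output": "positive"},
]

def to_prompt(ex):
    """Flatten one instruction example into a single training string."""
    return f"{ex['instruction']}\n{ex['input']}\n{ex['output']}"

print(to_prompt(examples[0]))
```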

24
Q

What is Chain of Thought Annotation

A

Improves LLM performance on unseen reasoning tasks
Prepares the data better by asking a human expert to annotate the reasoning behind each answer
E.g. add an annotation to the answer explaining how it was reached
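A hypothetical example of such an annotation (the question and field names are made up for illustration):

```python
# Hypothetical chain-of-thought annotation: alongside the answer, a human
# expert writes out the reasoning that leads to it.
plain = {"question": "Roger has 2 balls and buys 3 more. How many now?",
         "answer": "5"}
annotated = {**plain,
             "reasoning": "Roger starts with 2 balls; buying 3 more gives 2 + 3 = 5."}
print(annotated["reasoning"])
```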

25
Q

What is Learning from Human Feedback

A

Using reinforcement learning from human feedback (RLHF)
Over 50 experts were asked to give recommendations, which were fed into mitigations and improvements for the model