LARGE LANGUAGE MODELS (LLMs) Flashcards

1
Q

3 types of LLM Architecture: Transformer encoder

A

For sequence-to-label tasks
e.g. used by BERT
(no decoder)
Allows any two positions in the sequence to attend to each other (bidirectional)
Used for e.g. POS tagging

2
Q

3 types of LLM Architecture: Transformer decoder

A

For sequence generation
e.g. used by ChatGPT
The prediction from previous steps becomes the input of the current step
Autoregressive: it produces each token taking into account all previous tokens (slower than parallel processing)
No connections to subsequent positions
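A minimal sketch (not from the course material) of how "no connection to subsequent positions" is enforced in a decoder, as a causal attention mask over a toy sequence:

```python
# Sketch: a causal (autoregressive) attention mask for an n-token sequence.
# mask[i][j] == 1 means position i may attend to position j.
# Position i can only see positions j <= i: no connection to later positions.
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

mask = causal_mask(4)
# Row 0 sees only itself; row 3 (the last token) sees every earlier position.
```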

3
Q

3 types of LLM Architecture: Transformer encoder-decoder

A

Hybrid of encoder and decoder
For sequence-to-sequence tasks
e.g. used by T5 (Text-to-Text Transfer Transformer)

4
Q

What are Model Parameters

A

An LLM contains a large number of parameters, such as input vectors for tokens (token embeddings), neural-network weights, and the projection matrices used in attention

5
Q

What is the Objective of training

A

To find the best (or sufficiently good) values for the model parameters using training instances
(optimisation)

6
Q

What is the Objective Function in training

A

What we want to optimise (maximise or minimise)
e.g. a “performance function” or a “loss function”
Performance is measured over a collection of tasks using training instances
E.g.:
whether the model can correctly predict the ground-truth output given the input instances

7
Q

What is Training

A

Finding good model parameters by optimizing the objective function O(Θ), e.g., maximizing the log-likelihood and/or minimizing the cross-entropy loss, computed over the training instances

8
Q

What is cross entropy loss

A

A loss function that approximates classification error
It compares the predicted class probabilities (values between 0 and 1) against the ground-truth label
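A minimal sketch (illustrative, not from the course material) of cross-entropy for a single classification instance: the loss is minus the log of the probability assigned to the true class.

```python
import math

# Cross-entropy for one training instance: -log(probability assigned
# to the ground-truth class). Probabilities lie between 0 and 1, so the
# loss is 0 when the true class gets probability 1 and grows otherwise.
def cross_entropy(probs, true_class):
    return -math.log(probs[true_class])

loss_good = cross_entropy([0.1, 0.8, 0.1], true_class=1)  # confident & correct
loss_bad = cross_entropy([0.7, 0.2, 0.1], true_class=1)   # mostly wrong
```

A confident correct prediction gives a much smaller loss than a wrong one.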

9
Q

What is Iterative optimisation

A

Given an objective function O(Θ), one starts from a random initial guess of Θ, and iteratively applies a change to the guess to increase (or decrease) O(Θ)
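A toy sketch (my own example, not from the course material) of iterative optimisation: minimise O(θ) = (θ − 3)² starting from a random guess, repeatedly changing the guess to decrease O.

```python
import random

# Sketch of iterative optimisation: minimise O(theta) = (theta - 3)^2.
# Start from a random initial guess and repeatedly step against the gradient.
def optimise(steps=100, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = rng.uniform(-10, 10)   # random initial guess
    for _ in range(steps):
        grad = 2 * (theta - 3)     # dO/dtheta
        theta -= lr * grad         # apply a change that decreases O(theta)
    return theta

theta = optimise()  # converges towards the minimiser theta = 3
```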

10
Q

What is Mini-batch Gradient Descent for Training LLM

A

We update the parameters based on an estimated gradient of the objective function, computed over a small set (called a batch) of training instances - instead of the whole dataset
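A toy sketch (assumed example, not from the course material) of mini-batch gradient descent: fitting y = w·x where each update uses the gradient estimated from a small random batch rather than the whole dataset.

```python
import random

# Sketch of mini-batch gradient descent: fit y = w * x on toy data,
# estimating the gradient from a small random batch instead of all data.
def train(data, batch_size=4, lr=0.001, epochs=500, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)   # a small set of instances
        # gradient of the mean squared error over the batch only
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w

data = [(x, 2.0 * x) for x in range(1, 11)]    # true parameter: w = 2
w = train(data)
```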

11
Q

What are hyperparameters

A

Training settings that must be chosen before training starts (e.g. learning rate, batch size, number of epochs)
They are not learned from the data but are choices made by the practitioner to configure the learning process: crucial for a good model

12
Q

What is the main two-step approach in LLM training

A
  • Unsupervised pretraining using unlabelled text.
  • Supervised fine-tuning based on downstream NLP tasks

The success of an LLM largely relies on its training design, such as:
- how to construct training tasks
- how to prepare training data

13
Q

What is an Epoch

A

A complete pass through the entire training dataset
During an epoch, the model sees and processes every example in the training dataset exactly once
The epoch hyperparameter sets how many times the algorithm works through the entire training dataset
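A trivial sketch (my own illustration) of the relationship between epochs and examples seen:

```python
# Sketch: one epoch = one complete pass over the training dataset.
# With the epoch hyperparameter set to 3, every example is seen 3 times.
def count_examples_seen(dataset, epochs):
    seen = 0
    for _ in range(epochs):          # epoch hyperparameter
        for _example in dataset:     # every example, exactly once per epoch
            seen += 1
    return seen

seen = count_examples_seen(dataset=list(range(1000)), epochs=3)  # 3000
```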

14
Q

What is BERT training

A

Bidirectional Encoder Representations from Transformers (BERT)
Uses a Transformer encoder structure, taking either a single sentence or a sentence pair as input
Sentence pair: the combined token sequence of two sentences,
starting with [CLS] and with [SEP] between the two sentences
e.g. “[CLS] my dog is cute [SEP] he likes playing”
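A minimal sketch of packing a sentence pair into this input format. Real BERT uses WordPiece subword tokenisation; the whitespace split here is a stand-in for illustration.

```python
# Sketch: packing a sentence pair into a BERT-style input sequence,
# with [CLS] at the start and [SEP] between the two sentences.
def pack_pair(sent_a, sent_b):
    # (Full BERT inputs also end with a final [SEP]; omitted here to
    # match the flashcard's example.)
    return ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split()

tokens = pack_pair("my dog is cute", "he likes playing")
# tokens == ["[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "playing"]
```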

15
Q

What is the BERT Pretraining

A

It is pretrained on two text collections:
- BookCorpus
- English Wikipedia

It is pretrained on two learning tasks:
- masked language model (MLM) pretraining
- next sentence prediction (NSP) pretraining

16
Q

BERT Training: MLM

A

Text is preprocessed into tokens
15% of tokens are randomly selected; within that 15%:
80% - replaced with [MASK]
10% - replaced with a random token
10% - kept unchanged

Task - predict the original tokens at the selected positions
If the model can do this, it has been trained successfully
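A sketch (my own illustration, not the official BERT preprocessing code) of the 15% / 80-10-10 corruption scheme:

```python
import random

# Sketch of MLM corruption: select 15% of tokens; of those, replace 80%
# with [MASK], 10% with a random vocabulary token, and keep 10% unchanged.
# Returns the corrupted sequence and the positions the model must predict.
def corrupt(tokens, vocab, seed=0):
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:            # select 15% of positions
            targets[i] = tok               # model must predict the original
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"          # 80%: mask
            elif r < 0.9:
                out[i] = rng.choice(vocab) # 10%: random token
            # else: 10% kept unchanged
    return out, targets
```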

17
Q

BERT Training: NSP

A

Sentence pairs are extracted from the corpus
For each sentence:
- 50% - select the sentence that truly follows it and label the pair ‘yes’
- 50% - select a random sentence and label the pair ‘no’

Binary classification task - classify an input sentence pair as yes or no
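A sketch (illustrative, not the official BERT data pipeline) of constructing NSP training pairs:

```python
import random

# Sketch of NSP data construction: for each sentence, 50% of the time
# pair it with its true successor ("yes"), otherwise a random sentence ("no").
def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "yes"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "no"))
    return pairs
```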

18
Q

What is BERT Fine-tuning

A

BERT is fine-tuned on a collection of existing NLP tasks
The tasks are converted into classification tasks

19
Q

What is GLUE BENCHMARK

A

GLUE (General Language Understanding Evaluation): a collection of NLP tasks that can be used to fine-tune and evaluate BERT

20
Q

What are Generative Pre-trained Transformers (GPT)

A

A family of LLMs introduced by OpenAI
Uses a Transformer decoder structure, supported by generative pretraining and supervised fine-tuning

21
Q

What is GPT training

A

Generative pretraining predicts each token from its previous k tokens (a conditional language-modelling objective)
The autoregressive decoder structure enables appending each predicted token to the input tokens
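A sketch (my own illustration) of the training pairs this objective produces: each token becomes a target predicted from the k tokens before it.

```python
# Sketch of the generative pretraining objective: each token is predicted
# from its previous k tokens (a sliding context window).
def context_target_pairs(tokens, k):
    return [(tokens[i - k:i], tokens[i]) for i in range(k, len(tokens))]

pairs = context_target_pairs(["the", "cat", "sat", "on", "the", "mat"], k=2)
# e.g. (["the", "cat"], "sat"), (["cat", "sat"], "on"), ...
```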

22
Q

What is GPT Fine tuning

A

Also fine-tuned on the GLUE benchmark
After unsupervised pretraining,
supervised fine-tuning simply concatenates correct input and output sequence pairs, separated by a delimiter token, e.g. “$”

The concatenated pair is passed through the decoder to form a representation vector, which is then used by a prediction layer for the specific classification task
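A minimal sketch of the concatenation step; the spam-classification example is made up for illustration.

```python
# Sketch: supervised fine-tuning input built by concatenating the input
# and output sequences with a delimiter token, e.g. "$".
def build_finetune_sequence(input_tokens, output_tokens, delim="$"):
    return input_tokens + [delim] + output_tokens

seq = build_finetune_sequence(["is", "this", "spam", "?"], ["yes"])
# seq == ["is", "this", "spam", "?", "$", "yes"]
```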

23
Q

What is Instruction Fine-tuning

A

An effective way to prepare input-output pairs for training
It describes all NLP tasks using natural-language instructions
and fine-tunes an LLM to understand and follow these instructions
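A sketch of turning tasks into natural-language instructions; the task names and templates below are made-up examples, not from any real instruction-tuning dataset.

```python
# Sketch: describing NLP tasks as natural-language instructions, so a
# single model can be fine-tuned to handle many tasks via one interface.
# The task names and templates are hypothetical illustrations.
def to_instruction(task, text):
    templates = {
        "sentiment": "Classify the sentiment of the following review: ",
        "translate": "Translate the following sentence into French: ",
    }
    return templates[task] + text

prompt = to_instruction("sentiment", "The film was wonderful.")
```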

24
Q

What is Chain of Thought Annotation

A

Improves an LLM on unseen reasoning tasks
Prepares better training data by asking a human expert to annotate the reasoning behind each answer
e.g. add an annotation to the answer explaining how it was reached

25
Q

What is Learning from Human Feedback

A

Using reinforcement learning from human feedback (RLHF)
Over 50 experts were asked to give recommendations, which were fed into mitigations and improvements for the model