LARGE LANGUAGE MODELS (LLMs) Flashcards
3 types of LLM Architecture: Transformer encoder
for sequence-to-label tasks
e.g. used by BERT
(no decoder)
Allows any two states (positions) to connect to each other
Used for POS tagging
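A minimal sketch (not BERT itself) of an encoder used for sequence-to-label tasks such as POS tagging, built with PyTorch's nn.TransformerEncoder; all sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, NUM_TAGS = 1000, 64, 17  # assumed toy values

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
encoder_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
tag_head = nn.Linear(D_MODEL, NUM_TAGS)        # one label per token

tokens = torch.randint(0, VOCAB_SIZE, (1, 8))  # (batch, sequence length)
states = encoder(embed(tokens))                # every position can attend to every other
tag_logits = tag_head(states)                  # (1, 8, NUM_TAGS): sequence -> labels
print(tag_logits.shape)
```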
3 types of LLM Architecture: Transformer decoder
for sequence generation
e.g. used by ChatGPT
The prediction from previous states becomes the input to the current state
Autoregressive: it produces one token at a time, taking into account all previous states (slower than parallel processing)
No connections to subsequent (future) states
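A minimal sketch of autoregressive decoding: a decoder-style stack (here built from encoder layers plus a causal mask, so no position attends to future positions) generates one token per step, feeding each prediction back as input. The sizes and greedy loop are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 1000, 64
embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

tokens = torch.tensor([[1]])                      # assumed start-of-sequence token id
for _ in range(5):                                # generate 5 tokens, one per step
    L = tokens.size(1)
    causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    states = decoder(embed(tokens), mask=causal_mask)   # no attention to future states
    next_token = lm_head(states[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)     # prediction becomes next input
print(tokens)
```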
3 types of LLM Architecture: Transformer encoder-decoder
Hybrid
sequence to sequence
used by the Text-to-Text Transfer Transformer (T5)
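A sketch of sequence-to-sequence use of an encoder-decoder model, assuming the Hugging Face `transformers` library and the public "t5-small" checkpoint are available.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 casts every task as text-to-text: input sequence in, output sequence out.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```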
What are Model Parameters
An LLM contains a large number of parameters, such as input vectors for tokens (token embeddings), neural-network weights, projection matrices for attention, etc.
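A quick way to count the parameters of any PyTorch module; the module below is just an illustrative stand-in, not a full LLM.

```python
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=64, nhead=4)   # illustrative module
num_params = sum(p.numel() for p in model.parameters())   # attention projections, feed-forward weights, biases, ...
print(f"{num_params:,} parameters")
```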
What is the Objective of training
To find the best (or sufficiently good) values for the model parameters using training instances
(optimisation)
What is the Objective Function in training
What we want to optimise (maximise or minimise)
e.g. a “performance function” (to maximise) or a “loss function” (to minimise)
A performance measure evaluated over a collection of tasks using training instances
Eg:
whether the model can correctly predict the ground-truth output given the input instances
What is Training
Finding good model parameters by optimizing the objective function O(Θ), e.g., maximizing the log-likelihood and/or minimizing the cross-entropy loss, computed over the training instances
What is cross entropy loss
A loss function that approximates the classification error
It is computed from predicted probabilities (values between 0 and 1); a perfect prediction gives a loss of 0
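For a single instance, cross-entropy is L = -Σ_i y_i log p_i, which reduces to -log p(correct class) for one-hot ground truth. A small sketch with assumed toy values, comparing a manual computation against PyTorch's built-in loss.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # assumed model scores for 3 classes
target = torch.tensor([0])                  # ground-truth class index

probs = F.softmax(logits, dim=-1)           # probabilities in (0, 1)
manual_loss = -torch.log(probs[0, target])  # -log p(correct class)
builtin_loss = F.cross_entropy(logits, target)
print(manual_loss.item(), builtin_loss.item())  # the two values match
```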
What is Iterative optimisation
Given an objective function O(Θ), one starts from a random initial guess of Θ and iteratively updates the guess to increase (or decrease) O(Θ)
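A minimal sketch of iterative optimisation: minimise O(θ) = (θ - 3)² starting from a random guess and repeatedly stepping against the gradient. The objective, step size, and iteration count are illustrative assumptions.

```python
import random

theta = random.uniform(-10, 10)            # random initial guess of theta
step_size = 0.1
for _ in range(100):
    gradient = 2 * (theta - 3)             # dO/dtheta for O(theta) = (theta - 3)^2
    theta = theta - step_size * gradient   # change the guess to decrease O(theta)
print(theta)                               # converges towards 3
```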
What is Mini-batch Gradient Descent for Training LLM
We update the parameters based on an estimated gradient of the objective function, computed over a small subset of training instances (called a mini-batch) instead of the whole dataset
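A sketch of one mini-batch update: the gradient is estimated on a small batch of instances rather than the full dataset. The data, model, and sizes are toy assumptions; a real LLM update has the same shape but vastly more parameters.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                          # stand-in for an LLM
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_x = torch.randn(32, 10)                     # one mini-batch of 32 instances
batch_y = torch.randint(0, 2, (32,))

loss = nn.functional.cross_entropy(model(batch_x), batch_y)
optimizer.zero_grad()
loss.backward()                                   # gradient estimated from this batch only
optimizer.step()                                  # one parameter update
```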
What are hyperparameters
They are training settings that must be chosen before training starts
They are not learned from the data but are choices made by the practitioner to configure the learning process; choosing them well is crucial for a good model
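Typical hyperparameters fixed by the practitioner before training; the values below are common illustrative choices, not universal defaults.

```python
hyperparameters = {
    "learning_rate": 3e-4,   # step size of the optimiser
    "batch_size": 32,        # instances per mini-batch
    "num_epochs": 3,         # passes over the training data
    "num_layers": 12,        # model depth is also fixed before training
}
```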
What is the main two-step approach in LLM training
- Unsupervised pretraining using unlabelled text.
- Supervised fine-tuning based on downstream NLP tasks
The success of an LLM largely relies on its training design, such as:
- how to construct training tasks
- how to prepare training data
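A sketch of step 2 (supervised fine-tuning) on top of step 1 (a pretrained checkpoint), assuming the Hugging Face `transformers` library; the example texts, labels, and single-step loop are placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")       # pretrained on unlabelled text
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)                               # new head for the downstream task

batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])                                        # assumed task labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss                            # supervised fine-tuning loss
loss.backward()
optimizer.step()
```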
What is an Epoch
A complete pass through the entire training dataset
During an epoch, the model sees and processes every example in the training dataset exactly once
The number of epochs is a hyperparameter: how many times the algorithm will work through the entire training dataset
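A sketch of how epochs relate to mini-batches: the outer loop counts epochs, and each epoch iterates over every batch once. The dataset here is random placeholder data and the sizes are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(320, 10), torch.randint(0, 2, (320,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 3                        # hyperparameter: passes over the dataset
for epoch in range(num_epochs):
    for batch_x, batch_y in loader:   # every training example seen exactly once per epoch
        loss = nn.functional.cross_entropy(model(batch_x), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```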
What is BERT training
Bidirectional Encoder Representations from Transformers (BERT)
Uses a Transformer Encoder structure, taking either a single sentence or a sentence pair as an input
sentence pair: the combined sequence of tokens from two sentences,
starting with [CLS] and with [SEP] between the two sentences
e.g. “[CLS] my dog is cute [SEP] he likes playing”
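How this input sequence can be built in practice, assuming the Hugging Face `transformers` library and the "bert-base-uncased" tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is cute", "he likes playing")
# Tokens include [CLS] at the start and [SEP] separating (and ending) the pair.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```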
What is BERT Pretraining
It is pretrained on two text collections:
- BookCorpus
- English Wikipedia
It is pretrained on two learning tasks:
- masked language model (MLM) pretraining
- next sentence prediction (NSP) pretraining
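A sketch of the MLM pretraining signal: randomly mask a fraction of input tokens and train the model to predict the originals. The ~15% masking rate follows the BERT paper; the token ids and mask id below are toy assumptions.

```python
import torch

MASK_ID = 103                                  # assumed [MASK] token id
token_ids = torch.randint(1000, 2000, (1, 12)) # toy input sequence
labels = token_ids.clone()                     # ground truth to predict

mask = torch.rand(token_ids.shape) < 0.15      # choose ~15% of positions
token_ids[mask] = MASK_ID                      # corrupt the input
labels[~mask] = -100                           # ignore unmasked positions in the loss
# `token_ids` is fed to the encoder; cross-entropy is computed only at masked positions.
```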