LLM Modeling Flashcards
What model did we use in class?
Flan-T5
RNN
recurrent neural networks (previous generation) –> each prediction can only look at the words before it, processed sequentially
LLM
large language models = every word attends to every other word, with learned attention weights capturing how much each word influences the others
Tokenize
Convert each word (or sub-word) into a numeric ID, which is then looked up as a vector (embedding)
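A minimal tokenization sketch, assuming the Hugging Face transformers package and the google/flan-t5-base checkpoint used in class (any tokenizer works the same way):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

    text = "Summarize the following conversation."
    ids = tokenizer(text).input_ids                  # words/sub-words -> integer IDs
    print(ids)                                       # list of token IDs
    print(tokenizer.convert_ids_to_tokens(ids))      # IDs back to sub-word strings
    print(tokenizer.decode(ids))                     # IDs back to text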
Self Attention
computes the relationships between all the tokens in the sequence, weighting how strongly each token should attend to every other token
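A toy scaled dot-product self-attention sketch in NumPy (illustrative only; the weight matrices here are random stand-ins for learned projections):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv               # project tokens to queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])        # similarity of every token to every other token
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # attention weights
        return weights @ V                             # each token becomes a weighted mix of all tokens

    X = np.random.randn(4, 8)                          # 4 tokens, 8-dim embeddings
    Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)         # (4, 8): one context vector per token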
Encoder
takes the input prompt, builds a contextual understanding of it, and outputs a vector representation for each token
Decoder
accepts input tokens and generates output tokens one at a time
Sequence to Sequence
encoder-to-decoder model; translation, text summarization, and question answering are sequence-to-sequence tasks (T5, BART)
Decoder-only model
good at generating text (GPT)
Zero Shot Inference
pass no labeled examples (e.g., review + sentiment) in the prompt (a type of in-context learning (ICL))
One shot Inference
pass one labeled example in the prompt (a type of in-context learning (ICL))
Few shot inference
pass a few labeled examples in the prompt (a type of in-context learning (ICL)); see the prompt sketch below
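A sketch of how the prompt changes across zero-/one-/few-shot inference (the review text is made up; only the prompt changes, no model weights):

    zero_shot = """Classify this review: I loved this movie!
    Sentiment:"""

    one_shot = """Classify this review: The plot was dull and predictable.
    Sentiment: Negative

    Classify this review: I loved this movie!
    Sentiment:"""

    # Few-shot: same pattern, with several labeled examples before the new input.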
Greedy
always take the most probable next word, so the same prompt produces the same output over and over again
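Greedy vs. sampled decoding sketch with Hugging Face generate (assumes the flan-t5-base checkpoint from class):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    inputs = tokenizer("Translate to German: Good morning", return_tensors="pt")

    greedy = model.generate(**inputs, do_sample=False)    # always the top token -> same output every run
    sampled = model.generate(**inputs, do_sample=True,    # sample from the distribution -> varies run to run
                             top_k=50, temperature=0.8)

    print(tokenizer.decode(greedy[0], skip_special_tokens=True))
    print(tokenizer.decode(sampled[0], skip_special_tokens=True))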
Big data
When the LLM is too big to train on a single GPU
DDP
Distributed Data Parallel –> each GPU holds a full copy of the model and processes a different slice of the data
Fully Sharded Data Parallel (FSDP)
BIGGER SCALE: reduces memory by sharding the model parameters (plus gradients and optimizer state) across GPUs; see the sketch below
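A rough PyTorch sketch of the two wrappers (assumes torch.distributed.init_process_group has already been called with one process per GPU; the Linear layer is just a stand-in for a real LLM):

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    model = nn.Linear(4096, 4096).cuda()   # stand-in for a real model

    # DDP: every GPU keeps a full copy of the model; the data batches are split across GPUs.
    # model = DDP(model)

    # FSDP: parameters, gradients, and optimizer state are sharded across GPUs,
    # so much larger models fit in memory.
    model = FSDP(model)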
Three main variables for scale
1) Constraints (GPUs, time, cost) 2) Dataset size (number of tokens) 3) Model size (number of parameters)
Chinchilla
very large models may be over-parameterized and under-trained –> better to use fewer parameters and feed the model more data than to keep making it bigger and bigger
Fine Tuning an existing model
the output is a new model
Specific Use Case Training
~500 examples (prompt-completion pairs) –> updates all parameters –> could lead to catastrophic forgetting (which may not matter for a single-use-case implementation) –> very compute intensive
Multiple Use Case Training
1000's of examples across multiple tasks –> updates all parameters –> less likely to have catastrophic forgetting since it's trained across multiple tasks
PEFT (Parameter Efficient Fine Tuning)
only a small number of trainable parameters; the rest are frozen –> MUCH MORE EFFICIENT
Reparameterize model weights (LoRA)
Freezes the original model weights
Injects small low-rank matrices alongside them; only those injected matrices are trained
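A LoRA sketch using the Hugging Face peft library (the target_modules names below match T5-style attention layers; adjust for other architectures):

    from transformers import AutoModelForSeq2SeqLM
    from peft import LoraConfig, get_peft_model, TaskType

    base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    lora_config = LoraConfig(
        r=8,                        # rank of the injected low-rank matrices
        lora_alpha=32,              # scaling factor
        target_modules=["q", "v"],  # which weight matrices get LoRA adapters
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_2_SEQ_LM,
    )

    peft_model = get_peft_model(base_model, lora_config)
    peft_model.print_trainable_parameters()   # only a tiny fraction of weights is trainable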
Additive
add trainable layers or parameters to the model –> KEEPS THE ENTIRE EXISTING MODEL (frozen)
PROMPT ENGINEERING
crafting the prompt itself (zero/one/few-shot inference, etc.); no weights are changed
PROMPT TUNING
Prompt tuning trains a small set of "soft prompt" vectors (virtual tokens) that are prepended to the input; the LLM's own weights stay frozen (an additive PEFT technique). Unlike prompt engineering, the prompt is learned rather than hand-written; unlike fine-tuning, the model weights themselves are not updated.
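A prompt-tuning sketch with the peft library; only the soft-prompt (virtual token) embeddings are trained:

    from transformers import AutoModelForSeq2SeqLM
    from peft import PromptTuningConfig, TaskType, get_peft_model

    base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    config = PromptTuningConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        num_virtual_tokens=20,      # length of the learned soft prompt
    )

    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()   # just the virtual-token embeddings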
FLAN
Fine-tuned Language Net - a specific set of instructions/datasets used to perform instruction fine-tuning
FLAN-T5 and FLAN-PaLM are the instruction-tuned versions of the T5 and PaLM models
ROUGE
used for text summarization; compares the output to one or more reference summaries
Recall
how much of the reference appears in the output; a very long response can have a recall of 100% but be too wordy
Precision
how much of the output matches the reference (extra words in the output lower precision)
F1
the harmonic mean of recall and precision
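A ROUGE sketch using the rouge_score package (the reference/candidate strings are made up):

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    reference = "It is cold outside."
    candidate = "It is very cold outside today."

    scores = scorer.score(reference, candidate)
    print(scores["rouge1"].recall)     # matching the whole reference raises recall
    print(scores["rouge1"].precision)  # extra words in the candidate lower precision
    print(scores["rouge1"].fmeasure)   # harmonic mean of the two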
BLEU
used for text translation; compares the output to human-generated reference translations
RLHF
Reinforcement Learning from Human Feedback - tuning a model to be helpful, honest, and harmless (the three H's: HHH)
Have humans rank how 'good' responses are by comparing completions (e.g., 3 options) against a criterion (how helpful? how harmless? how honest?)
PPO
Proximal Policy Optimization - a popular algorithm for solving reinforcement learning problems. It updates the LLM within a very small (proximal) region over many iterations to better handle HHH
Reward model
a model trained with supervised learning on the human-ranked prompt/response pairs; it then scores new completions (e.g., comparing class probabilities such as hate vs. not-hate), and that score becomes the reward
Reward Hacking
where the model tries to optimize its score in unintended ways, e.g., by making answers that are long and wordy
Avoid reward hacking by comparing the tuned model to a frozen reference model via a KL-divergence shift penalty (see the sketch below)
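A toy illustration of the idea (not the actual PPO loss): penalize the reward when the tuned policy's next-token distribution drifts far from the frozen reference model's (the probabilities below are made-up numbers):

    import numpy as np

    def kl_divergence(p, q):
        return float(np.sum(p * np.log(p / q)))

    reference = np.array([0.50, 0.30, 0.20])   # frozen reference model's next-token probs
    policy    = np.array([0.20, 0.30, 0.50])   # tuned policy drifting toward reward-hacked outputs

    reward = 0.9                               # raw score from the reward model
    beta = 0.2                                 # penalty strength
    penalized_reward = reward - beta * kl_divergence(policy, reference)
    print(penalized_reward)                    # drifting too far from the reference lowers the reward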
Constitutional AI
allows you to scale Reinforcement Learning without human intervention. Constitutional AI (CAI) is similar to RLHF except instead of human feedback, it learns through AI feedback.
LLM Optimization Techniques
Distillation - train a smaller student model from a larger teacher model
Post-Training Quantization (PTQ) - reduce the precision of model weights (e.g., from 32-bit to 8-bit); see the sketch after this card
Pruning - remove model weights with values at or near zero (makes sense in theory, but in practice there may not be many weights that are zero or close to zero)
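A post-training dynamic quantization sketch in PyTorch (a tiny stand-in network; Linear layers go from 32-bit floats to 8-bit integers with no retraining):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    print(quantized)   # Linear layers replaced by dynamically quantized versions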
3 types of issues with models
1) Out of date
2) Bad at math (can't actually do calculations)
3) Hallucinations - guessing answers it doesn't know
How to mitigate issues with models
RAG (retrieval-augmented generation): get the details directly from a DB/API, then pass them to the model as extra context (see the sketch after this card)
Chain of Thought –> Provide hints of how to break the problem into smaller parts [good for simple problems]
Program-Aided Language Models (PAL) - have the model write Python code for the math, run the code, and feed the result back to the model
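A minimal RAG sketch; retrieve_docs and the prompt template are hypothetical stand-ins (a real system would query a vector DB or API here):

    def retrieve_docs(question: str) -> list[str]:
        # hypothetical retriever, e.g. a vector-store similarity search
        return ["Policy doc: refunds are allowed within 30 days of purchase."]

    def build_rag_prompt(question: str) -> str:
        context = "\n".join(retrieve_docs(question))
        return (
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {question}\n"
            "Answer:"
        )

    # The assembled prompt (retrieved context + question) is what gets passed to the LLM.
    print(build_rag_prompt("How long do customers have to request a refund?"))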
Responsible AI - how to mitigate:
Toxicity [curate training data, train guardrail models, use a diverse group of human annotators], Hallucination [educate users/add disclaimers]
Intellectual property [not easy; machine 'unlearning', filtering/blocking]
Existing metrics to measure hallucination
1) ROUGE –> compare to expected results
2) Ask ChatGPT to grade the output
3) Probability checks
A new way to score LLM hallucinations (Galileo)
ChainPoll
Pass the results to a chain-of-thought model, which returns both a score and the reasoning path behind it
What does GPT stand for?
Generative Pre-trained Transformers, commonly known as GPT, are a family of neural network models that use the transformer architecture; they are a key advancement in artificial intelligence (AI), powering generative AI applications such as ChatGPT.