LLM Modeling Flashcards

1
Q

What model did we use in the class?

A

Flan-T5

2
Q

RNN

A

recurrent neural networks (PREVIOUS GEN) → each word relates only to the words before it

3
Q

LLM

A

large language models = all words relate to each other, with a weight of attention/influence between the words

4
Q

Tokenize

A

Convert each word into numbers (which are stored in a vector)
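
A minimal sketch of what tokenizing looks like in practice, using the Hugging Face transformers library with the Flan-T5 checkpoint from card 1 (library choice is mine, not necessarily the course's):

    from transformers import AutoTokenizer

    # load the tokenizer that ships with Flan-T5
    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

    ids = tokenizer("I love machine learning").input_ids
    print(ids)                                   # a list of integer token IDs
    print(tokenizer.convert_ids_to_tokens(ids))  # the sub-word pieces behind each ID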

5
Q

Self Attention

A

analyzes the relationships between the tokens
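
A toy sketch of scaled dot-product self-attention, assuming small random projection matrices purely for illustration:

    import math
    import torch

    def self_attention(x, Wq, Wk, Wv):
        # project each token into query, key, and value vectors
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        # pairwise scores: how strongly each token relates to every other token
        scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
        weights = torch.softmax(scores, dim=-1)  # attention/influence weights
        return weights @ V                       # each token becomes a weighted mix

    x = torch.randn(6, 16)                       # 6 tokens, 16-dim embeddings
    Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
    out = self_attention(x, Wq, Wk, Wv)          # shape (6, 16)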

6
Q

Encoder

A

takes in the prompt, builds contextual understanding, and outputs a vector

7
Q

Decoder

A

accepts input tokens and generates new tokens

8
Q

Sequence to Sequence

A

encoder-to-decoder model; translation, text summarization, and question answering are sequence-to-sequence tasks (T5, BART)
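
A hedged sketch of running a sequence-to-sequence task with the Flan-T5 model from card 1 (prompt wording is illustrative):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    # the encoder reads the input sequence; the decoder generates the output sequence
    ids = tokenizer("translate English to German: The house is wonderful.",
                    return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))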

9
Q

Decoder-only model

A

good at generating text (GPT)

10
Q

Zero Shot Inference

A

pass no labeled examples (e.g. of sentiment grading) in the prompt (type of In-context learning (ICL))

11
Q

One shot Inference

A

pass one labeled example (e.g. of sentiment grading) in the prompt (type of In-context learning (ICL))

12
Q

Few shot inference

A

pass a few labeled examples (e.g. of sentiment grading) in the prompt (type of In-context learning (ICL))
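
A sketch of how the three flavors differ, using made-up sentiment-grading prompts (the review texts are invented for illustration):

    # zero-shot: no labeled examples, just the task
    zero_shot = "Classify the sentiment: 'I loved this movie.'\nSentiment:"

    # one-shot: a single labeled example before the real question
    one_shot = (
        "Review: 'Terrible plot.'\nSentiment: negative\n\n"
        "Review: 'I loved this movie.'\nSentiment:"
    )

    # few-shot: several labeled examples before the real question
    few_shot = (
        "Review: 'Terrible plot.'\nSentiment: negative\n\n"
        "Review: 'Great acting!'\nSentiment: positive\n\n"
        "Review: 'I loved this movie.'\nSentiment:"
    )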

13
Q

Greedy

A

always take the most probable word, so the outputs will be the same over and over again
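
In the Hugging Face generate API, greedy decoding corresponds to do_sample=False; a sketch reusing the tokenizer, model, and ids from the sequence-to-sequence example above:

    # greedy: argmax at every step, so the output is identical on every run
    greedy_out = model.generate(ids, do_sample=False, max_new_tokens=20)

    # sampling: draw from the probability distribution, so outputs vary
    sampled_out = model.generate(ids, do_sample=True, temperature=0.9, top_k=50)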

14
Q

Big data

A

When the LLM is too big for a single GPU

15
Q

DDP

A

Distributed Data Parallel: a full copy of the model on each GPU, with the data batches split across them

16
Q

Fully Sharded Data Parallel (FSDP)

A

BIGGER SCALE: reduces memory by distributing/sharding model parameters across GPUs
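
A minimal sketch of wrapping a model either way in PyTorch (assumes torch.distributed.init_process_group has already been called; the tiny Linear layer stands in for a real LLM):

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    model = nn.Linear(1024, 1024).cuda()

    # DDP: a full copy of the weights lives on every GPU
    # model = DDP(model)

    # FSDP: the weights themselves are sharded across GPUs to save memory
    model = FSDP(model)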

17
Q

Three main variables for scale

A

1) Constraints (GPU, time, cost) 2) Dataset size (number of tokens) 3) Model size (number of parameters)

18
Q

Chinchilla

A

very large models may be over-parameterized and under-trained → thus use fewer parameters but feed the model more data, versus making it bigger and bigger
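
A back-of-the-envelope check using the roughly 20-tokens-per-parameter rule of thumb associated with the Chinchilla paper (treat the constant as approximate):

    params = 70e9           # a 70B-parameter model, Chinchilla's size
    tokens = 20 * params    # compute-optimal training data, ~20 tokens per parameter
    print(f"{tokens:.1e}")  # ~1.4e+12, i.e. about 1.4 trillion tokens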

19
Q

Fine Tuning an existing model

A

the output is a new model

20
Q

Specific Use Case Training

A

500 examples (via prompt-completion pairs) → updates all parameters → could lead to catastrophic forgetting (but that may not matter for a single-use-case implementation) → very compute-intensive

21
Q

Multiple Use Case Training

A

1000s of examples across multiple tasks → updates all parameters → less likely to cause catastrophic forgetting since it is trained across multiple tasks

22
Q

PEFT (Parameter Efficient Fine Tuning)

A

small number of trainable parameters; the rest are frozen → MUCH MORE EFFICIENT

23
Q

Reparameterize model weights (LoRA)

A

Freezes most of the original weights; injects a pair of low-rank matrices to update the model, and trains only those non-frozen matrices
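
A toy sketch of the LoRA idea around a single linear layer (the rank and scaling values are illustrative, not prescribed):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # freeze the original weights
            # injected low-rank matrices: only these get trained
            self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
            self.B = nn.Parameter(torch.zeros(r, base.out_features))
            self.scale = alpha / r

        def forward(self, x):
            # original output plus the low-rank update x @ A @ B
            return self.base(x) + (x @ self.A @ self.B) * self.scale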

24
Q

Additive

A

add trainable layers or parameters to the model → KEEPS THE ENTIRE EXISTING MODEL

25
Q

PROMPT ENGINEERING

A

one-shot inference, few-shot inference, etc.

26
Q

PROMPT TUNING

A

Prompt tuning: fine-tunes the LLM with structured data, consisting of fields like "instruction" and "response".
Fine-tuning: fine-tunes the LLM with unstructured data like raw text.

27
Q

FLAN

A

Fine-tuned LAnguage Net: the specific instructions used to perform instruction fine-tuning

Flan-T5 and FLAN-PaLM are the instruction-tuned versions of the T5 and PaLM models

28
Q

ROUGE

A

used for text summarization; compares the output to 1 or more reference summaries

29
Q

Recall

A

A very long response can have a recall of 100% but be too wordy (recall = overlapping words ÷ words in the reference)

30
Q

Precision

A

how many extra words are there in the output? (precision = overlapping words ÷ words in the generated output)

31
Q

F1

A

The harmonic mean of recall and precision
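
A small worked example of unigram (ROUGE-1 style) recall, precision, and F1, with made-up reference and output sentences:

    from collections import Counter

    ref = "the cat sat on the mat".split()
    out = "the cat sat quietly on the soft mat today".split()

    # overlapping word counts, clipped so repeats are not over-counted
    overlap = sum((Counter(ref) & Counter(out)).values())

    recall = overlap / len(ref)     # 6/6 = 1.0: every reference word appears
    precision = overlap / len(out)  # 6/9 ≈ 0.67: the extra words cost us
    f1 = 2 * precision * recall / (precision + recall)  # 0.8
    print(recall, precision, f1)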

32
Q

BLEU

A

used for text translation; compares the output to human-generated reference translations

33
Q

RLHF

A

Reinforcement Learning from Human Feedback - tuning a model to be helpful, honest, harmless (the three H's: HHH)

Have humans tag how 'good' a response is by comparing 3 options (how helpful? or how harmful? or how honest?)

34
Q

PPO

A

Proximal Policy Optimization - a popular algorithm that helps solve reinforcement learning problems. Makes updates within a very small region (proximal) to the LLM over many iterations to better handle HHH

35
Q

Reward model

A

supervised learning that takes your human-tagged prompt/response pairs and 'rewards' the responses humans preferred (comparing class probabilities, e.g. hate vs. not-hate)

36
Q

Reward Hacking

A
Where a model tries to optimize its scores by producing answers that are long and wordy

Avoid reward hacking by comparing against a frozen Reference Model via a KL Divergence Shift Penalty
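
A hedged sketch of the penalty idea: per-token log-probability differences between the tuned policy and the frozen reference approximate the KL shift, and get subtracted from the reward (beta is an assumed hyperparameter):

    import torch

    def penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
        # how far the tuned model's token probabilities have drifted
        # from the frozen reference model's
        kl_shift = (logp_policy - logp_ref).sum()
        # subtracting the drift discourages gaming the reward with long, wordy text
        return reward - beta * kl_shift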

37
Q

Constitutional AI

A

allows you to scale Reinforcement Learning without human intervention. Constitutional AI (CAI) is similar to RLHF, except that instead of human feedback it learns through AI feedback.

38
Q

LLM Optimization Techniques

A

Distillation - train a smaller student model from a larger teacher model
Post-Training Quantization (PTQ) - reduce the precision of model weights (e.g. from 32 bits to 8 bits; see the sketch below)
Pruning - remove model weights with values close to or equal to 0 (in theory this reduces size, but in practice there may not be many weights at or near zero)
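
A minimal sketch of one PTQ flavor, PyTorch's dynamic quantization (the API choice is mine; the tiny model stands in for a real LLM):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # convert the Linear weights from 32-bit floats to 8-bit integers after training
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )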

39
Q

3 types of issues of models

A

1) Out of date
2) Bad at math (can't do calculations)
3) Hallucinations: guessing answers it doesn't know

40
Q

How to mitigate issues with models

A

RAG (Retrieval-Augmented Generation): get the details directly from a DB/API, then pass them to the model

Chain of Thought → provide hints on how to break the problem into smaller parts [good for simple problems]

Program-Aided Language Models (PAL): integrate with Python so the model writes the code and Python does the math (see the sketch below)
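
A hypothetical PAL-style flow; llm() stands in for any text-generation call (it is not a real library function), and exec is used purely for illustration:

    question = "A bakery sells 12 boxes of 6 muffins each. How many muffins total?"
    # ask the model to write code instead of doing the arithmetic itself
    code = llm(f"Write Python that sets a variable `answer`.\nQ: {question}")

    scope = {}
    exec(code, scope)        # Python, not the LLM, does the math
    print(scope["answer"])   # e.g. 72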

41
Q

Responsible AI - how do mitigate:

A

Toxicity [curating training data, training guardrail models, using a diverse group of human annotators]; Hallucination [educate users / add disclaimers]; Intellectual Property [not easy: machine 'unlearning', filtering/blocking]

42
Q

Existing metrics to measure hallucination

A

1) ROUGE → compare to an expected result
2) Ask ChatGPT to grade
3) Probability checks

43
Q

A new way to score LLM hallucinations (Galileo)

A

ChainPoll

Pass results to a chain-of-thought model, which returns a score and a logic path

45
Q

What does GPT stand for?

A

Generative Pre-trained Transformers, commonly known as GPT, are a family of neural network models that use the transformer architecture; they are a key advancement in artificial intelligence (AI), powering generative AI applications such as ChatGPT.