LLM Modeling Flashcards
What model did we use in the class?
Flan-T5
RNN
recurrent neural networks (previous generation) → each word only attends to the words before it
LLM
large language models = every word attends to every other word, with attention weights capturing the influence between words
Tokenize
Convert each word (or sub-word) into a number; the token IDs are stored in a vector
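A minimal tokenization sketch, assuming the Hugging Face transformers library and the google/flan-t5-base checkpoint (assumed here because Flan-T5 was the class model):

```python
# Minimal sketch: text -> token IDs, assuming transformers and google/flan-t5-base.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

text = "Translate English to German: How are you?"
token_ids = tokenizer(text).input_ids                  # words/sub-words -> integer IDs (a vector)
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids))      # inspect the sub-word pieces
```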
Self Attention
analyzes the relationships between all tokens in the input, weighting how much each token influences the others
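A minimal sketch of scaled dot-product self-attention in NumPy; the toy shapes and random inputs are illustrative, not from the course:

```python
# Minimal sketch of scaled dot-product self-attention (toy example).
import numpy as np

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                        # weighted mix of value vectors

x = np.random.rand(3, 4)          # 3 tokens, embedding dimension 4
out = self_attention(x, x, x)     # in self-attention Q, K, V all come from the same tokens
print(out.shape)                  # (3, 4)
```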
Encoder
takes the input prompt, builds a contextual understanding of it, and outputs a vector representation for each token
Decoder
accepts input tokens and generates output tokens one at a time
Sequence to Sequence
encoder-to-decoder model; translation, text summarization, and question answering are sequence-to-sequence tasks (T5, BART)
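A minimal sequence-to-sequence (encoder-decoder) sketch, assuming transformers and the google/flan-t5-base checkpoint:

```python
# Minimal sketch: encoder reads the prompt, decoder generates the answer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "Summarize: The encoder reads the prompt; the decoder generates tokens."
inputs = tokenizer(prompt, return_tensors="pt")            # encoder input
output_ids = model.generate(**inputs, max_new_tokens=30)   # decoder generates token by token
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```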
Decoder-only model
good at generating text (GPT)
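A minimal decoder-only sketch, assuming transformers and the gpt2 checkpoint (used here as a stand-in for "GPT"); it simply continues the prompt:

```python
# Minimal sketch: a decoder-only model continues the prompt text.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_new_tokens=20)[0]["generated_text"])
```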
Zero Shot Inference
pass no labeled examples (e.g., sentiment gradings) in the prompt (a type of in-context learning (ICL))
One shot Inference
pass one labeled example in the prompt (a type of in-context learning (ICL))
Few shot inference
pass a few labeled examples in the prompt (a type of in-context learning (ICL)); the three inference styles are contrasted in the sketch below
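A minimal sketch contrasting zero-, one-, and few-shot prompts for sentiment classification; the review texts are made-up illustrations:

```python
# Same frozen model, same task; only the number of in-context examples changes.
zero_shot = "Review: 'The battery died in a day.'\nSentiment:"

one_shot = (
    "Review: 'I loved every minute of it.'\nSentiment: Positive\n\n"
    "Review: 'The battery died in a day.'\nSentiment:"
)

few_shot = (
    "Review: 'I loved every minute of it.'\nSentiment: Positive\n\n"
    "Review: 'Terrible customer service.'\nSentiment: Negative\n\n"
    "Review: 'It was okay, nothing special.'\nSentiment: Neutral\n\n"
    "Review: 'The battery died in a day.'\nSentiment:"
)
# No weights are updated in any case; the examples only live in the prompt.
```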
Greedy
always take the most probable next word, so the output is the same every time the prompt is run
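A minimal sketch of greedy selection versus sampling, using a toy next-token distribution (the words and probabilities are made up for illustration):

```python
# Greedy decoding picks the argmax every step, so it never varies between runs;
# sampling draws from the distribution, so it does.
import random

probs = {"blue": 0.62, "clear": 0.21, "grey": 0.12, "green": 0.05}  # toy next-token distribution

greedy = max(probs, key=probs.get)                                   # always "blue"
sampled = random.choices(list(probs), weights=probs.values(), k=1)[0]  # varies run to run
print(greedy, sampled)
```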
Big data
When the LLM is too big to train on a single GPU
DDP
Distributed Data Parallel
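A minimal PyTorch DDP sketch, assuming it is launched with `torchrun --nproc_per_node=<num_gpus> train.py`; the Linear layer stands in for a real model. Each GPU holds a full copy of the model and sees a different slice of the data, and gradients are averaged across GPUs each step:

```python
# Minimal sketch of Distributed Data Parallel (one process per GPU via torchrun).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)   # stand-in for a real model
ddp_model = DDP(model, device_ids=[local_rank])      # full replica per GPU, gradients synced

optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
x = torch.randn(8, 512, device=local_rank)           # each rank gets a different data shard
loss = ddp_model(x).sum()
loss.backward()                                      # gradient all-reduce happens here
optimizer.step()
dist.destroy_process_group()
```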
Fully Sharded Data Parallel (FSDP)
For bigger scale: reduces memory by distributing/sharding model parameters across GPUs
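A minimal PyTorch FSDP sketch, assuming the same `torchrun` launch as the DDP sketch above; the Sequential model is a stand-in for a large model. Instead of replicating the full model on every GPU, FSDP shards parameters, gradients, and optimizer state across GPUs and gathers them only when needed:

```python
# Minimal sketch of Fully Sharded Data Parallel (parameters sharded across ranks).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(                       # stand-in for a large model
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda(local_rank)

fsdp_model = FSDP(model)                           # each GPU stores only a shard of the parameters
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```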
Three main variables for scale
1) Compute constraints (GPUs, time, cost) 2) Dataset size (number of tokens) 3) Model size (number of parameters)
Chinchilla
very large models may be over-parameterized and under-trained → prefer fewer parameters trained on more data rather than ever-bigger models
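A minimal sketch of the Chinchilla rule of thumb (roughly 20 training tokens per parameter); the model sizes below are illustrative, not from the paper:

```python
# Compute-optimal training tokens under the ~20 tokens-per-parameter heuristic.
TOKENS_PER_PARAM = 20

for params in [1e9, 7e9, 70e9]:
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:.0f}B params -> train on ~{optimal_tokens / 1e12:.2f}T tokens")
```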