[NLP] Lecture 2: Large Language Models (Anna Rogers) Flashcards
What kinds of LMs do we have?
Autoregressive Language Models:
Predict the next token based on previous tokens
Masked Language Models:
- Predict masked (hidden) tokens within a sequence
- Can use both left and right context to make predictions
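The two objectives can be sketched with toy bigram counts. This is a minimal illustration (the corpus and scoring rule are made up, not how real LMs work): the autoregressive predictor uses only left context, while the masked predictor scores candidates by how well they fit both the left and the right neighbor.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus for illustration only.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams: how often word w2 follows word w1.
follows = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follows[w1][w2] += 1

def autoregressive_predict(left_word):
    """Predict the next token from left context only."""
    return follows[left_word].most_common(1)[0][0]

def masked_predict(left_word, right_word):
    """Predict a masked token using both left AND right context."""
    candidates = Counter()
    for w, count in follows[left_word].items():
        # Score each candidate by (left -> w) counts times (w -> right) counts.
        candidates[w] = count * follows[w][right_word]
    return candidates.most_common(1)[0][0]

print(autoregressive_predict("the"))   # next-token prediction, left context only
print(masked_predict("the", "sat"))    # fill-in-the-blank, both contexts
```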
Difference between a corpus model and a language model?
It would help to call it a corpus model, so it is more obvious that the model is based on a specific corpus and therefore not unbiased. Calling it a "language model" makes us "forget" that it is not trained on language in general, but on a specific corpus.
Explain the difference between pre-training and fine-tuning
Pre-training: done on unlabelled data, with autoregressive or masked objectives (e.g., BERT); this produces the base model
Fine-tuning: to make the model do something other than predicting tokens, we fine-tune it for a task
The biggest difference is the size of the available data
What happens during fine-tuning?
The final layers change the most
What is pre-fine-tuning?
An intermediate stage between pre and fine-tuning.
What is instruction tuning?
The model is trained on ~20 different text tasks before fine-tuning (e.g., the T5 model)
What is few shot learning?
Give worked examples of the task in the prompt; the model infers the task from them without any weight updates
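A few-shot prompt is just string construction. A minimal sketch (the sentiment task and examples below are made up for illustration):

```python
# Few-shot prompting: put worked examples in the prompt itself so the
# model can infer the task format, with no weight updates involved.
examples = [
    ("The movie was wonderful", "positive"),
    ("I hated every minute", "negative"),
]

def build_few_shot_prompt(examples, query):
    lines = ["Classify the sentiment of each review."]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The query repeats the format but leaves the label blank for the model.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(examples, "A delightful surprise")
print(prompt)
```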
What is instruction tuning and RLHF?
Instruction tuning focuses on teaching the model to follow instructions
RLHF uses human feedback to refine the model’s understanding of what constitutes a good response
Instruction tuning is about capability, RLHF is about aligning the model with human values and expectations
Explain basics about ChatGPT
- Dialogue version of InstructGPT
- New OpenAI in-house data (humans both writing and rating model responses)
- New ranking data for RLHF
- Keeps changing under the hood
- We don't know anything else about the models
What is RAG?
Retrieval-Augmented Generation: the model retrieves relevant documents and conditions its answer on them, which also lets us see where it got the information. Bing does it; it provides sources
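The retrieve-then-generate idea can be sketched in a few lines. This is a toy illustration (the documents and the word-overlap retriever are made up; real systems use dense embeddings): retrieve the best-matching document, then build a prompt that includes it as a citable source.

```python
# Hypothetical document store for illustration.
documents = {
    "doc1": "Copenhagen is the capital of Denmark.",
    "doc2": "The transformer architecture was introduced in 2017.",
}

def retrieve(question):
    """Return the id of the document with the most word overlap."""
    q_words = set(question.lower().split())
    def overlap(doc_id):
        return len(q_words & set(documents[doc_id].lower().split()))
    return max(documents, key=overlap)

def build_rag_prompt(question):
    """Prepend the retrieved document so the model can answer from it and cite it."""
    doc_id = retrieve(question)
    return (f"Source [{doc_id}]: {documents[doc_id]}\n"
            f"Answer the question using the source above and cite it.\n"
            f"Question: {question}")

print(build_rag_prompt("What is the capital of Denmark?"))
```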
Why do LMs get better the bigger they are?
As long as you keep adding more weights and more training data, they keep improving ("neural scaling laws")
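The scaling-law claim is usually stated as a power law: loss falls smoothly as parameter count grows, with diminishing returns. A sketch with illustrative constants (the values below are hypothetical, not fitted numbers from the lecture):

```python
# Neural scaling laws (illustrative): predicted loss falls as a power law
# in parameter count N:  L(N) = (N_c / N) ** alpha + L_inf
N_C = 8.8e13     # hypothetical "critical scale" constant
ALPHA = 0.076    # hypothetical scaling exponent
L_INF = 1.69     # hypothetical irreducible loss floor

def scaling_loss(n_params):
    """Predicted test loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA + L_INF

# Bigger models -> lower predicted loss, but with diminishing returns.
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {scaling_loss(n):.3f}")
```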
What is data contamination?
When the model has already seen the test data (or something very similar to it) in its training data
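One common (rough) way to detect this is n-gram overlap between test examples and the training corpus. A toy sketch with made-up data (real contamination checks use much larger n-grams and corpora):

```python
# Toy data-contamination check: flag a test example if any of its
# word n-grams also appears somewhere in the training corpus.
def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_example, training_corpus, n=3):
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc, n)
    return bool(ngrams(test_example, n) & train_ngrams)

train = ["the quick brown fox jumps over the lazy dog"]
print(is_contaminated("a quick brown fox appeared", train))          # shares "quick brown fox"
print(is_contaminated("completely unrelated sentence here", train))  # no shared 3-gram
```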
What are emergent properties?
When a model can do something it was not trained on. This is difficult to say for sure, because we often don't know exactly what data the model has seen, so we can't tell whether it was in fact trained on it
What is the Eliza effect?
The tendency to attribute human-like understanding to a program that only manipulates text (named after the 1966 ELIZA chatbot)
Caveat 2: Fine-tuning vs few-shot performance
Since GPT-3, most big models have been presented with few-shot evaluations only