LLM - Long Contexts, RAG Flashcards
What are the limitations of instruction tuning?
- Difficult to collect diverse labeled data
- Rote learning (token by token)
  ▪ limited creativity
- Agnostic to the model's knowledge
  ▪ may encourage hallucinations
What is reinforcement learning?
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards or penalties to maximize cumulative rewards over time.
How does reinforcement learning help language models?
For a given query, the LM produces two candidate responses and asks the user which one is better (optionally also by how much, e.g., on a 1-5 scale). This human preference feedback becomes the learning signal; see the example record below.
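A minimal sketch of what a single piece of such preference feedback might look like; all field names here are hypothetical, not from any specific dataset.

```python
# Hypothetical example of one human-preference record collected for RL from
# human feedback. The field names are illustrative only.
preference_record = {
    "prompt": "Explain what a reward model is.",
    "response_a": "A reward model assigns a score to a candidate response...",
    "response_b": "Reward models are models that models reward...",
    "preferred": "a",   # which response the annotator chose
    "margin": 4,        # optional: how much better, e.g., on a 1-5 scale
}
```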
How can we estimate the reward for RL?
We want to estimate the reward with a model because asking humans for feedback on every output is very costly. Instead, we build a model that mimics the user preference.
- Collecting user-annotated data:
  - Approach 1: have humans provide absolute scores for each output. The challenge is that human judgments on different instances and by different people can be noisy and mis-calibrated.
  - Approach 2: ask for pairwise comparisons (is A or B better?).
- Using this data, we train a reward model
  ▪ The reward model returns a scalar reward that numerically represents the human preference (see the training sketch below).
- Using this reward model, we train the LM against the rewards it returns.
- Periodically retrain the reward model with more samples and human feedback.
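A minimal sketch of training such a reward model on pairwise comparisons with a Bradley-Terry-style loss. The tiny MLP and random features stand in for a real Transformer that scores (prompt, response) pairs; they are purely illustrative.

```python
# Sketch: train a reward model so that reward(chosen) > reward(rejected).
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per example

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy features for the preferred (chosen) and dispreferred (rejected) responses.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

# Pairwise (Bradley-Terry) loss: maximize the margin between the two rewards.
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
opt.step()
```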
What would be a problem with training a reward model and training LM using it? How to solve it?
The LM will learn to produce outputs that get a high reward but might be gibberish or irrelevant to the prompt (reward hacking).
Solution: add a penalty term that penalizes large deviations from the distribution of the pre-trained LM (e.g., a KL-divergence penalty). This prevents the policy model from drifting too far from the pretrained model; a sketch of the penalized reward follows.
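A sketch of one common way to combine the reward-model score with such a penalty, assuming a KL-style term with strength `beta`; the exact formulation varies across implementations.

```python
# Sketch: reward-model score minus a KL-style penalty against the pretrained LM.
import torch

def penalized_reward(reward_model_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_pretrained: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Approximate per-sequence KL term: how far the policy has drifted from the
    # pretrained (reference) model on the sampled tokens.
    kl = (logprobs_policy - logprobs_pretrained).sum(dim=-1)
    return reward_model_score - beta * kl

# Toy usage: batch of 4 responses, 10 generated tokens each.
scores = torch.randn(4)
lp_policy = torch.randn(4, 10)
lp_ref = torch.randn(4, 10)
print(penalized_reward(scores, lp_policy, lp_ref))
```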
What are some considerations when working with Transformer LMs and long inputs?
- Length generalization: do Transformers work accurately on inputs longer than those seen during training?
- Efficiency: how efficient are LMs with long inputs?
Why is scaling up LMs to longer context sizes not feasible?
Memory usage and the number of operations in self-attention increase quadratically with the input length (see the illustration below).
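A back-of-the-envelope illustration of the quadratic growth, assuming fp16 attention scores (2 bytes per entry) and counting a single attention head.

```python
# The attention-score matrix alone has seq_len^2 entries per head.
for seq_len in (1_000, 10_000, 100_000):
    entries = seq_len ** 2            # one score per (query, key) pair
    mib = entries * 2 / 2**20         # fp16 bytes -> MiB, per head
    print(f"{seq_len:>7} tokens -> {entries:>15,} scores, ~{mib:,.0f} MiB per head")
```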
What is Sparse Attention Pattern?
It is one of the proposed solutions to the efficiency problem of LMs with long inputs.
The idea is to make the attention operation sparse by limiting which tokens can attend to which other tokens. For example, each token may attend only to nearby tokens (a local window); other, more random patterns have also been explored. A sketch of a local-window mask follows.
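A minimal sketch of a local (sliding-window) pattern expressed as a boolean mask; the window size and the mask convention are illustrative choices.

```python
# Sketch: each token may attend only to tokens within `window` positions of it.
# The mask could then be used to set disallowed attention scores to -inf.
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    idx = np.arange(seq_len)
    # True where |i - j| <= window, i.e. query i may attend to key j.
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=8, window=2)
print(mask.astype(int))  # banded matrix: only nearby tokens attend to each other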
What are some Sparsity Patterns?
Examples include local (sliding-window) patterns, where each token attends only to nearby tokens, and more random patterns; practical models often combine several such patterns.
What are Retrieval-based Language Models?
They are LMs that retrieve information from an external datastore (at least during inference time)
How can we solve the problem of LMs not having up-to-date information?
By using retrieval-based LMs.
Why use Retrieval-based LMs?
- LLMs can’t memorize all (long-tail) knowledge in their parameters
- LLMs’ knowledge is easily outdated and hard to update
- LLMs’ output is challenging to interpret and verify (retrieval lets us link to the source of the information)
What is the process of RAG?
- An information-retrieval (IR) component, e.g., a neural retriever, retrieves documents based on the query: using a nearest-neighbour search over embeddings, we retrieve the top-k chunks relevant to answering the query (see the retrieval sketch below).
- Include the retrieved chunks either in the input context of the LM, or inject them somewhere in the middle of the model, or towards the end (a decision made by the engineers).
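A minimal sketch of the retrieval step, assuming a hypothetical `embed()` function standing in for a real sentence encoder, and brute-force cosine similarity instead of a proper ANN index (e.g., FAISS).

```python
# Sketch: embed the query, find the top-k nearest chunks, build the prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    sims = np.array([embed(c) @ q for c in chunks])  # cosine similarity
    top = np.argsort(-sims)[:k]                       # indices of the top-k chunks
    return [chunks[i] for i in top]

chunks = ["Paris is the capital of France.",
          "The Transformer was introduced in 2017.",
          "RAG combines retrieval with generation."]
context = retrieve("When was the Transformer introduced?", chunks)
prompt = ("Context:\n" + "\n".join(context) +
          "\n\nQuestion: When was the Transformer introduced?")
print(prompt)  # the prompt then goes into the LM's input context
```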
Give two variants of what a retrieval-augmented LM can look like
- Retrieve chunks of text (passages) once, based on the user query, include them in the input layer, and process the rest normally.
- Split the query into multiple parts, retrieve chunks for each part, and include the embeddings of those chunks in the decoder, between the self-attention and the FFN sub-layers (where exactly is a decision made by the engineers).
How can RAG be trained?
- End-to-end: train both the retriever and the LM together
- Freeze some parts (e.g., the retriever or the LM) and train only the others