LLM - Adaptation & Prompting Flashcards
What is the main building block of the encoder-decoder architecture for language models?
Self-attention blocks/layers. They are usually stacked one on top of another. Each block contains a (multi-head) self-attention module together with some normalization and feed-forward layers.
What is encoder self-attention?
It is the attention in the encoder part of the encoder-decoder architecture, where each token attends to every other token (on both sides, i.e., to both past and future tokens).
What is encoder-decoder attention?
At any step, the decoder attends to the encoder's previously computed outputs.
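A minimal sketch of this cross-attention computation, assuming PyTorch; the tensor names and sizes are illustrative, not taken from any specific model:

```python
import torch
import torch.nn.functional as F

# Sketch of encoder-decoder (cross-)attention: queries come from the decoder,
# while keys and values come from the encoder's already-computed outputs.
d_model = 8
encoder_out = torch.randn(6, d_model)    # 6 source tokens, computed once by the encoder
decoder_state = torch.randn(3, d_model)  # 3 target positions produced so far

scores = decoder_state @ encoder_out.T / d_model ** 0.5
weights = F.softmax(scores, dim=-1)      # each decoder position attends over all encoder outputs
context = weights @ encoder_out          # (3, d_model) context vectors fed back into the decoder
```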
What is decoder self-attention?
The decoder should not see future tokens, since its goal is to predict the next token. Because of that, attention to future tokens has to be masked: at any step, the decoder attends only to its own previous generations.
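A minimal sketch of this causal masking, again assuming PyTorch; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

# Sketch of masked (causal) decoder self-attention scores.
seq_len, d_model = 4, 8
q = torch.randn(seq_len, d_model)  # queries from the decoder states
k = torch.randn(seq_len, d_model)  # keys from the decoder states
v = torch.randn(seq_len, d_model)  # values from the decoder states

scores = q @ k.T / d_model ** 0.5
# True above the diagonal = positions in the future, which must be hidden.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = F.softmax(scores, dim=-1)  # each token attends only to itself and the past
output = weights @ v
```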
Why and how would you adapt language models?
You have a language model that was pre-trained on massive amounts of data. On its own it does not necessarily do anything useful; it only completes sentences.
Now how do you "adapt" it for your use case?
▪ Tuning: adapting (modifying) model parameters
▪ Prompting: adapting model inputs (language statements)
What are some possible use cases for language models?
How do you fine-tune a pre-trained language model?
- Whole-model tuning: Run an optimization defined on your task data that updates all model parameters
- Head-tuning: Run an optimization defined on your task data that updates only the parameters of the model "head" (a sketch follows below)
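A minimal head-tuning sketch, assuming PyTorch; the backbone, head, sizes, and learning rate are placeholders rather than a concrete recipe. Whole-model tuning would simply pass all parameters to the optimizer and skip the freezing step.

```python
import torch
from torch import nn

# Head-tuning sketch: freeze a pre-trained backbone and train only a small task head.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)  # stands in for a pre-trained language model
head = nn.Linear(64, 3)  # e.g. a 3-class classification head

for p in backbone.parameters():
    p.requires_grad = False  # head-tuning: the backbone stays frozen

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.randn(2, 10, 64)        # (batch, seq_len, d_model) dummy task inputs
logits = head(backbone(x)[:, 0])  # pool by taking the first token's representation
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2]))
loss.backward()
optimizer.step()
```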
What is parameter-efficient fine-tuning?
In contrast to fine-tuning all model parameters, it updates only a particular part of the model, or adds a few new layers and trains only those.
What are adapters in LM fine-tuning? Name 3 adapter-based fine-tuning techniques.
Augmenting the existing pre-trained model with extra parameters or layers and training only the new parameters.
▪ Idea: train small sub-networks and only tune those. No need to store a full model for each task, only the adapter params (see the sketch below).
Sparse adapters, Parallel adapters, AdaMix, S4
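A minimal bottleneck-adapter sketch in PyTorch; the class name, bottleneck size, and insertion point are illustrative assumptions, not a specific published architecture:

```python
import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Small sub-network inserted into a frozen pre-trained model; only these
    parameters are trained, so each task only needs to store the adapter."""

    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the original (frozen) representation intact.
        return hidden + self.up(self.act(self.down(hidden)))

adapter = BottleneckAdapter(d_model=64)
hidden_states = torch.randn(2, 10, 64)  # output of some frozen transformer sublayer
adapted = adapter(hidden_states)        # same shape, slightly adjusted features
```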
Is parameter-efficient tuning (1) more computationally efficient and (2) more memory-efficient than whole-model tuning?
Answer to (1): It is not faster! You still need to do the entire forward and backward pass.
Answer to (2): It is more memory-efficient. You only need to keep optimizer state for the parameters you are fine-tuning, not for all of the parameters.
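A small sketch of where the memory saving comes from, assuming PyTorch: the optimizer only allocates state (e.g. Adam moments) for the parameters it is given, while the frozen backbone contributes none.

```python
import torch
from torch import nn

frozen_backbone = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
adapter = nn.Linear(512, 512)  # stands in for the small set of tuned parameters

for p in frozen_backbone.parameters():
    p.requires_grad = False  # no gradients, no optimizer state for these

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

n_tuned = sum(p.numel() for p in adapter.parameters())
n_total = n_tuned + sum(p.numel() for p in frozen_backbone.parameters())
print(f"optimizer state covers {n_tuned} of {n_total} parameters")
```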
What are selective methods in LM fine-tuning? Name 3 techniques.
Selective methods fine-tune a subset of the existing parameters of the model. It could be a layer depth-based selection, layer type-based selection, or even individual parameter selection.
BitFit, Attention Tuning, LN Tuning, S4, Sparse Adapter
Explain the BitFit technique for selective LM fine-tuning.
BitFit tunes only the bias terms in the self-attention and MLP layers. This updates only about 0.05% of the model parameters.
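A minimal BitFit-style sketch in PyTorch; the model is just a stand-in, and this simple name filter also catches LayerNorm biases:

```python
import torch
from torch import nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)  # stands in for a pre-trained transformer

# Freeze everything except parameters whose name ends in "bias".
bias_params = []
for name, param in model.named_parameters():
    if name.endswith("bias"):
        bias_params.append(param)
    else:
        param.requires_grad = False

optimizer = torch.optim.AdamW(bias_params, lr=1e-3)

n_bias = sum(p.numel() for p in bias_params)
n_all = sum(p.numel() for p in model.parameters())
print(f"tuning {n_bias} of {n_all} parameters ({100 * n_bias / n_all:.2f}%)")
```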
What are some limitations of fine-tuning?
Often you need a large amount of labeled data
▪ Though more pre-training can reduce the need for labeled data