LLM - Adaptation & Prompting Flashcards
What is the main building block of the encoder-decoder architecture for language models?
Self-attention blocks/layers. They are usually stacked one on top of another. Each block contains a (multi-head) self-attention module together with some normalization and feed-forward layers.
What is encoder self-attention?
It is the attention in the encoder part of the encoder-decoder architecture, where each token attends to every other token (on both sides, i.e., to both past and future tokens).
What is encoder-decoder attention?
At any step, the decoder attends to the encoder's previously computed outputs.
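A minimal sketch of this cross-attention computation, assuming PyTorch; the tensor names and sizes are illustrative, not taken from any specific model:

```python
import torch
import torch.nn.functional as F

# Sketch of encoder-decoder (cross-)attention: queries come from the decoder,
# while keys and values come from the encoder's already-computed outputs.
d_model = 8
encoder_out = torch.randn(6, d_model)    # 6 source tokens, computed once by the encoder
decoder_state = torch.randn(3, d_model)  # 3 target positions produced so far

scores = decoder_state @ encoder_out.T / d_model ** 0.5
weights = F.softmax(scores, dim=-1)      # each decoder position attends over all encoder outputs
context = weights @ encoder_out          # (3, d_model) context vectors fed back into the decoder
```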
What is decoder self-attention?
The decoder should not see future tokens, since its goal is to predict the next token. Because of that, attention to future tokens has to be masked: at any step, the decoder attends only to its own previous generations.
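A minimal sketch of this causal masking, again assuming PyTorch; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

# Sketch of masked (causal) decoder self-attention scores.
seq_len, d_model = 4, 8
q = torch.randn(seq_len, d_model)  # queries from the decoder states
k = torch.randn(seq_len, d_model)  # keys from the decoder states
v = torch.randn(seq_len, d_model)  # values from the decoder states

scores = q @ k.T / d_model ** 0.5
# True above the diagonal = positions in the future, which must be hidden.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = F.softmax(scores, dim=-1)  # each token attends only to itself and the past
output = weights @ v
```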
Why and how would you adapt language models?
You have a language model that was pre-trained on massive amounts of data. On its own it does not necessarily do anything useful; it only completes sentences.
Now how do you "adapt" it for your use case?
▪ Tuning: adapting (modifying) model parameters
▪ Prompting: adapting model inputs (language statements)
What are some possible use cases for language models?
How do you fine-tune a pre-trained language model?
- Whole-model tuning: Run an optimization defined on your task data that updates all model parameters
- Head-tuning: Run an optimization defined on your task data that updates only the parameters of the model "head" (a sketch follows below)
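A minimal head-tuning sketch, assuming PyTorch; the backbone, head, sizes, and learning rate are placeholders rather than a concrete recipe. Whole-model tuning would simply pass all parameters to the optimizer and skip the freezing step.

```python
import torch
from torch import nn

# Head-tuning sketch: freeze a pre-trained backbone and train only a small task head.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)  # stands in for a pre-trained language model
head = nn.Linear(64, 3)  # e.g. a 3-class classification head

for p in backbone.parameters():
    p.requires_grad = False  # head-tuning: the backbone stays frozen

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.randn(2, 10, 64)        # (batch, seq_len, d_model) dummy task inputs
logits = head(backbone(x)[:, 0])  # pool by taking the first token's representation
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2]))
loss.backward()
optimizer.step()
```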
What is parameter-efficient fine-tuning?
In contrast to fine-tuning all model parameters, it updates only a particular part of the model, or adds a few new layers and trains only those.
What are adapters in LM fine-tuning? Name 3 adapter-based fine-tuning techniques.
Augmenting the existing pre-trained model with extra parameters or layers and training only the new parameters.
▪ Idea: train small sub-networks and only tune those. No need to store a full model for each task, only the adapter params (see the sketch below).
Sparse adapters, Parallel adapters, AdaMix, S4
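A minimal bottleneck-adapter sketch in PyTorch; the class name, bottleneck size, and insertion point are illustrative assumptions, not a specific published architecture:

```python
import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Small sub-network inserted into a frozen pre-trained model; only these
    parameters are trained, so each task only needs to store the adapter."""

    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the original (frozen) representation intact.
        return hidden + self.up(self.act(self.down(hidden)))

adapter = BottleneckAdapter(d_model=64)
hidden_states = torch.randn(2, 10, 64)  # output of some frozen transformer sublayer
adapted = adapter(hidden_states)        # same shape, slightly adjusted features
```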
Is parameter-efficient tuning (1) more computationally efficient and (2) more memory-efficient than whole-model tuning?
Answer to (1): It is not faster! You still need to do the entire forward and backward pass.
Answer to (2): It is more memory-efficient. You only need to keep optimizer state for the parameters you are fine-tuning, not for all of the parameters.
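A small sketch of where the memory saving comes from, assuming PyTorch: the optimizer only allocates state (e.g. Adam moments) for the parameters it is given, while the frozen backbone contributes none.

```python
import torch
from torch import nn

frozen_backbone = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
adapter = nn.Linear(512, 512)  # stands in for the small set of tuned parameters

for p in frozen_backbone.parameters():
    p.requires_grad = False  # no gradients, no optimizer state for these

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

n_tuned = sum(p.numel() for p in adapter.parameters())
n_total = n_tuned + sum(p.numel() for p in frozen_backbone.parameters())
print(f"optimizer state covers {n_tuned} of {n_total} parameters")
```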
What are selective methods in LM fine-tuning? Name 3 techniques.
Selective methods fine-tune a subset of the existing parameters of the model. It could be a layer depth-based selection, layer type-based selection, or even individual parameter selection.
BitFit, Attention Tuning, LN Tuning, S4, Sparse Adapter
Explain the BitFit technique for selective LM fine-tuning.
BitFit tunes only the bias terms in the self-attention and MLP layers. This updates only about 0.05% of the model parameters.
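A minimal BitFit-style sketch in PyTorch; the model is just a stand-in, and this simple name filter also catches LayerNorm biases:

```python
import torch
from torch import nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)  # stands in for a pre-trained transformer

# Freeze everything except parameters whose name ends in "bias".
bias_params = []
for name, param in model.named_parameters():
    if name.endswith("bias"):
        bias_params.append(param)
    else:
        param.requires_grad = False

optimizer = torch.optim.AdamW(bias_params, lr=1e-3)

n_bias = sum(p.numel() for p in bias_params)
n_all = sum(p.numel() for p in model.parameters())
print(f"tuning {n_bias} of {n_all} parameters ({100 * n_bias / n_all:.2f}%)")
```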
What are some limitations of fine-tuning?
Often you need a large amount of labeled data
▪ Though more pre-training can reduce the need for labeled data