LLM - Adaptation & Prompting Flashcards

1
Q

What is the main part of the encoder-decoder architecture for language models?

A

Self-attention blocks. They are usually stacked one on top of another; each block contains a (multi-head) self-attention module together with normalization and feed-forward layers.
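A minimal PyTorch sketch of one such block, stacked to form a model (sizes and layer choices are illustrative assumptions, not from the deck):

  import torch
  import torch.nn as nn

  class TransformerBlock(nn.Module):
      def __init__(self, d_model=512, n_heads=8, d_ff=2048):
          super().__init__()
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.norm1 = nn.LayerNorm(d_model)
          self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
          self.norm2 = nn.LayerNorm(d_model)

      def forward(self, x):
          attn_out, _ = self.attn(x, x, x)      # multi-head self-attention
          x = self.norm1(x + attn_out)          # residual + normalization
          return self.norm2(x + self.ffn(x))    # feed-forward + residual + normalization

  # Blocks are stacked one on top of another:
  stack = nn.Sequential(*[TransformerBlock() for _ in range(6)])
  out = stack(torch.randn(2, 16, 512))          # (batch, tokens, d_model)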

2
Q

What is encoder self-attention?

A

It is the attention in the encoder part of the encoder-decoder architecture, where each token attends to every other token (bidirectional: both past and future tokens).
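A minimal sketch of this bidirectional self-attention (toy sizes, single head, my own illustrative choices): the softmax is taken over all positions, with no mask.

  import torch
  import torch.nn.functional as F

  x = torch.randn(1, 5, 64)                     # 5 tokens, hidden size 64
  scores = x @ x.transpose(-2, -1) / 64 ** 0.5  # (1, 5, 5): every token scores every other token
  weights = F.softmax(scores, dim=-1)           # no mask: past and future positions both contribute
  out = weights @ x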

3
Q

What is encoder-decoder attention?

A

At any decoding step, the decoder attends to the encoder's previously computed outputs.
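A minimal PyTorch sketch (dimensions are illustrative assumptions): the decoder states act as queries, while the encoder's already-computed outputs supply the keys and values.

  import torch
  import torch.nn as nn

  cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
  encoder_out = torch.randn(2, 20, 512)    # computed once by the encoder
  decoder_state = torch.randn(2, 7, 512)   # decoder hidden states at the current steps
  out, _ = cross_attn(query=decoder_state, key=encoder_out, value=encoder_out)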

4
Q

What is decoder self-attention?

A

The decoder should not know about future tokens, since its goal is to predict the next token. Because of that, attention to future tokens has to be masked. At any decoding step, the decoder attends only to its previously generated tokens.
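A minimal sketch of this causal masking (toy sizes, my own illustrative choices): positions above the diagonal, i.e. future tokens, are set to -inf before the softmax.

  import torch
  import torch.nn.functional as F

  T, d = 5, 64
  x = torch.randn(1, T, d)
  scores = x @ x.transpose(-2, -1) / d ** 0.5
  mask = torch.triu(torch.ones(T, T), diagonal=1).bool()   # True above the diagonal = future tokens
  scores = scores.masked_fill(mask, float("-inf"))          # masked positions get zero attention weight
  weights = F.softmax(scores, dim=-1)                       # each token attends only to itself and the past
  out = weights @ x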

5
Q

Why and how would you adapt language models?

A

You have a language model that has been pre-trained on massive amounts of data. On its own it does not necessarily do useful things; it only completes sentences.

Now how do you "adapt" it to your use case?
▪ Tuning: adapting (modifying) model parameters
▪ Prompting: adapting model inputs (language statements)
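A toy sketch of the two routes (a stand-in linear model, purely illustrative): prompting changes only the input, while tuning runs an optimization step that changes the parameters.

  import torch
  import torch.nn as nn

  model = nn.Linear(8, 8)                 # stand-in for a pre-trained language model

  # Prompting: parameters stay frozen; we only adapt the input we feed in.
  prompt_input = torch.randn(1, 8)        # think: an instruction prepended to the model input
  with torch.no_grad():
      _ = model(prompt_input)

  # Tuning: an optimization on task data modifies the parameters.
  optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
  x, y = torch.randn(4, 8), torch.randn(4, 8)
  loss = nn.functional.mse_loss(model(x), y)
  loss.backward()
  optimizer.step()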

6
Q

What are some possible use cases for language models?

A
7
Q

How do you fine tune the pre-trained language models?

A
  1. Whole-model tuning: run an optimization defined on your task data that updates all model parameters.
  2. Head-tuning: run an optimization defined on your task data that updates only the parameters of the model “head”.
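A minimal PyTorch sketch of the two options (the backbone/head split and sizes are illustrative assumptions):

  import torch
  import torch.nn as nn

  body = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))  # stand-in for the pre-trained LM
  head = nn.Linear(512, 2)                                                   # task-specific head

  # 1. Whole-model tuning: optimize every parameter.
  opt_full = torch.optim.AdamW(list(body.parameters()) + list(head.parameters()), lr=1e-5)

  # 2. Head-tuning: freeze the backbone and optimize only the head's parameters.
  for p in body.parameters():
      p.requires_grad = False
  opt_head = torch.optim.AdamW(head.parameters(), lr=1e-3)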
8
Q

What is parameter-efficient fine-tuning?

A

In contrast to fine-tuning all model parameters, parameter-efficient fine-tuning updates only a small subset of them, e.g. one particular part of the model, or a few new layers added to it.

9
Q

What are adapters in LM fine-tuning? Name 3 adapter-based fine-tuning techniques.

A

Augmenting the existing pre-trained model with extra parameters or layers and training only the new parameters.

▪ Idea: train small sub-networks and only tune those. No need to store a full model for each task, only the adapter params.

Sparse adapters, Parallel adapters, AdaMix, S4
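A minimal sketch of a generic bottleneck adapter (sizes and placement are my own illustrative assumptions): a small down-/up-projection with a residual connection, inserted after a frozen pre-trained layer.

  import torch
  import torch.nn as nn

  class Adapter(nn.Module):
      def __init__(self, d_model=512, bottleneck=32):
          super().__init__()
          self.down = nn.Linear(d_model, bottleneck)   # project down to a small dimension
          self.up = nn.Linear(bottleneck, d_model)     # project back up

      def forward(self, x):
          return x + self.up(torch.relu(self.down(x)))  # residual around the adapter

  frozen_layer = nn.Linear(512, 512).requires_grad_(False)  # pre-trained weights, kept fixed
  adapter = Adapter()                                        # only these few parameters are trained
  out = adapter(frozen_layer(torch.randn(2, 16, 512)))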

10
Q

▪ Is parameter-efficient tuning more
(1) computationally efficient
(2) memory-efficient
than whole-model tuning?

A

Answer to (1): It is not faster!
You still need to do the entire forward and backward pass.
Answer to (2): It is more memory-efficient.
You only need to keep optimizer state for the parameters you are fine-tuning, not for all of them.
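A minimal sketch of the memory point (toy stand-in model): the optimizer only keeps state (e.g., Adam moments) for the parameters it is given, while the full forward and backward pass still runs over the whole model.

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
  for p in model.parameters():
      p.requires_grad = False
  model[-1].bias.requires_grad = True            # pretend only this small piece is being tuned

  trainable = [p for p in model.parameters() if p.requires_grad]
  opt = torch.optim.AdamW(trainable, lr=1e-3)    # optimizer state is stored only for `trainable`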

11
Q

What are selective methods in LM fine-tuning? Name 3 techniques.

A

Selective methods fine-tune a subset of the existing parameters of the model. The selection can be based on layer depth, on layer type, or even on individual parameters.

BitFit, Attention Tuning, LN Tuning, S4, Sparse Adapter
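A minimal sketch of a depth-based selection (illustrative stand-in layers): freeze everything, then unfreeze only the top of the stack.

  import torch.nn as nn

  layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(12)])  # stand-in for 12 transformer layers

  for layer in layers:
      for p in layer.parameters():
          p.requires_grad = False
  for layer in layers[-2:]:           # depth-based selection: tune only the last two layers
      for p in layer.parameters():
          p.requires_grad = True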

12
Q

Explain BitFit technique for selective LM fine-tuning.

A

BitFit only tunes the bias terms in self-attention and MLP layers.

It only updates about 0.05% of the model parameters.
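A minimal BitFit-style sketch (stand-in model, illustrative sizes): freeze everything, then mark only the bias terms as trainable.

  import torch.nn as nn

  model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))  # stand-in for an LM

  for name, p in model.named_parameters():
      p.requires_grad = name.endswith("bias")    # True only for bias parameters

  trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
  total = sum(p.numel() for p in model.parameters())
  print(f"tuning {trainable / total:.4%} of parameters")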

13
Q

What are some limitations of fine-tuning?

A

Often you need a large amount of labeled data.
▪ Though more pre-training can reduce the need for labeled data
