Lesson 8 - Transfer Learning Flashcards

1
Q

There are three ways to do transfer learning. Name them.

A

1) Pre-trained models as feature extractors
2) Adapting a pre-trained model (aka fine-tuning)
3) Extracting relevant information (aka distillation)

2
Q

How do you use a pre-trained model as a feature extractor for transfer learning?

A
  • take a pre-trained model
  • extract activations from given parts of the network
  • train a standard ML method on the extracted activations
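A minimal PyTorch sketch of this recipe (the ResNet-18 backbone, a recent torchvision, and the logistic-regression classifier on top are illustrative assumptions, not part of the lesson):

    # Use a pre-trained CNN as a fixed feature extractor.
    import torch
    import torchvision.models as models
    from sklearn.linear_model import LogisticRegression

    backbone = models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()    # drop the classification head
    backbone.eval()                      # the backbone itself is never trained

    @torch.no_grad()
    def extract_features(images):        # images: (N, 3, 224, 224) tensor
        return backbone(images).numpy()  # (N, 512) penultimate activations

    # X_train, y_train: the new task's data (assumed given)
    clf = LogisticRegression(max_iter=1000).fit(extract_features(X_train), y_train)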
3
Q

What are some problems or limitations of using pre-trained models as feature extractors?

A

We assume that:
1) the model encodes/knows features that are useful for the new problem,
OR
2) combinations of these features can be used to represent the data of the new problem.

If neither holds –> no good performance

4
Q

How do you adapt a pre-trained model (aka fine-tuning)?

A
  • adjust the final layer
  • update the weights of some layers to adapt to the new task
  • freeze some weights, retrain others
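A minimal fine-tuning sketch in PyTorch (which layers to freeze and the number of classes are illustrative assumptions):

    import torch
    import torchvision.models as models

    model = models.resnet18(weights="IMAGENET1K_V1")

    # adjust the final layer to the new task (10 classes is an assumption)
    model.fc = torch.nn.Linear(model.fc.in_features, 10)

    # freeze the early layers; retrain only the last block and the new head
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(("layer4", "fc"))

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)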
5
Q

When adapting a pre-trained model (aka fine-tuning), why don’t we freeze the fully connected layers (tail)?

A

Freezing the tail gives you fixed weights you cannot change, so all the weights in the convolutional layers would have to be updated to become compatible with the frozen fully connected part.

–> a much more complex problem
–> some classes could become unreachable

6
Q

When extracting relevant information (aka distillation), what are the 4 directions you could take?

A

1) Parameter pruning and sharing (network quantization)
2) Identifying redundant parameters (low-rank factorization; see the sketch after this list)
3) Compression of convolutional filters
4) Knowledge distillation
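For direction 2, a minimal sketch of low-rank factorization: one large Linear layer is approximated by two smaller ones via truncated SVD (the layer sizes and the rank are illustrative assumptions):

    import torch

    def factorize_linear(layer, rank):
        # weight W is (out, in); approximate W ≈ (U·S) @ Vh with rank terms
        U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
        first = torch.nn.Linear(layer.in_features, rank, bias=False)
        second = torch.nn.Linear(rank, layer.out_features,
                                 bias=layer.bias is not None)
        first.weight.data = Vh[:rank]                 # (rank, in)
        second.weight.data = U[:, :rank] * S[:rank]   # (out, rank)
        if layer.bias is not None:
            second.bias.data = layer.bias.data
        return torch.nn.Sequential(first, second)

    big = torch.nn.Linear(1024, 1024)        # ~1.05M weights
    small = factorize_linear(big, rank=64)   # ~131k weights in two layers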

7
Q

What is the goal of extracting relevant information (aka distillation)?

A

Move the relevant information into a smaller model, so you need fewer resources and inference is faster

8
Q

What is an important architecture used for extracting relevant information (aka distillation)?

A

The Teacher-Student architecture

9
Q

What is network quantization and what are some benefits of this technique?

A
  • use different (smaller) data types to represent your weights
  • you lose precision but save memory
  • smaller data types often come with more highly optimized operations/functionality
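A minimal sketch, assuming PyTorch's built-in post-training dynamic quantization (the toy model is an illustrative assumption):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))

    # store Linear weights as int8 instead of float32 (~4x less memory);
    # activations are quantized on the fly at inference time
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)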
10
Q

There are three variants of extracting relevant information (aka distillation). Which three?

A

1) response-based knowledge extraction
–> mimics the output

2) feature-based knowledge extraction
–> mimics the output and the intermediate states

3) relation-based knowledge extraction
–> exploits relationships between feature maps or data samples

11
Q

In the setting of feature-based knowledge extraction, why do we need to pre-process the features before they can be compared by the loss function?

A

Pre-processing maps the features of the teacher and the student into a common space where they are comparable
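A minimal sketch of such a mapping, here a learned 1x1-convolution adapter in PyTorch (all shapes and channel counts are illustrative assumptions):

    import torch

    teacher_feat = torch.randn(8, 256, 14, 14)   # stand-in teacher activations
    student_feat = torch.randn(8, 128, 14, 14)   # stand-in student activations

    # learned 1x1 conv maps 128 student channels into the teacher's 256-channel space
    adapter = torch.nn.Conv2d(128, 256, kernel_size=1)

    feature_loss = torch.nn.functional.mse_loss(adapter(student_feat), teacher_feat)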

12
Q

In the setting of response-based knowledge extraction, why is it preferable to look at the logits instead of the class labels/predictions?

A

In the logits you have a distribution over all the classes, showing what the model believes about the different possibilities
==> if we only focus on the maximum, we lose that context
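A minimal sketch of a response-based distillation loss on softened logits (the temperature value is an illustrative assumption):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=4.0):
        # soften both outputs with temperature T, then match the distributions
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        # T*T keeps gradient magnitudes comparable to the unsoftened loss
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T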

13
Q

What is the lottery ticket hypothesis?

A

Training succeeds for a given network if one of its subnetworks has been randomly initialized such that it can be trained in isolation to high accuracy in at most the number of iterations necessary to train the original network

14
Q

What is the lottery ticket hypothesis, briefly?

A
  • given the large number of possible network initializations
  • the network trains successfully because one of its sub-networks happened to be initialized properly
  • this sub-network can reach performance similar to that of the complete model
15
Q

What is the algorithm for finding the winning ticket?

A

1) randomly initialize a neural network
2) train the network until it converges
3) prune a fraction of the network
4) reset the weights of the remaining portion of the network to their values from 1)
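A sketch of these steps in PyTorch, with steps 2)-4) iterated as is common in practice (the pruning fraction, the number of rounds, and the train_fn callback are illustrative assumptions):

    import copy
    import torch

    def find_winning_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
        init_state = copy.deepcopy(model.state_dict())   # 1) remember the random init
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train_fn(model, masks)                       # 2) train until convergence
            for name, param in model.named_parameters():
                # 3) prune the smallest-magnitude surviving weights per layer
                alive = param.data[masks[name].bool()].abs()
                masks[name] *= (param.data.abs() > alive.quantile(prune_fraction)).float()
            model.load_state_dict(init_state)            # 4) reset survivors to their initial values
            for name, param in model.named_parameters():
                param.data *= masks[name]
        return model, masks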
