Lesson 8 - Transfer Learning Flashcards
There are three ways to do transfer learning. Name them.
1) Pre-trained models as feature extractors
2) Adapting a pre-trained model (aka fine-tuning)
3) Extracting Relevant Information (aka distillation)
How do you use a pre-trained model as a feature extractor to transfer learning?
- take pre-trained model
- extract activations from given parts of the network
- train a standard ML method (e.g., an SVM or logistic regression) on the extracted activations (see the sketch below)
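A minimal sketch of the feature-extractor approach, assuming PyTorch/torchvision and scikit-learn; `train_images` and `train_labels` are hypothetical placeholders for an already-preprocessed dataset.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Load a pre-trained backbone and drop its classification head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep the 512-d pooled features
backbone.eval()

with torch.no_grad():
    # train_images: hypothetical float tensor of shape (N, 3, 224, 224), already normalized
    features = backbone(train_images).numpy()

# Train a standard ML classifier on the extracted activations.
clf = LogisticRegression(max_iter=1000).fit(features, train_labels)
```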
What are some problems or limitations with the technique to use pre-trained models as feature extractors?
We assume:
1) the model encodes/knows useful features for the new problem
OR
2) combinations of these features can be used to represent new data for new problem
If neither holds –> poor performance on the new problem
How do you use the technique to adapt a pre-trained model (aka fine-tuning)?
- adjust/replace the final layer so it matches the new task
- update the weights of some layers to adapt to the new task
- freeze some weights, retrain others (see the sketch below)
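A minimal fine-tuning sketch along these lines, assuming torchvision; `num_classes` is a placeholder for the number of classes in the new task.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained weights...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the final layer so it matches the new task;
# the new layer's weights are trainable by default.
num_classes = 10
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Only the unfrozen parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```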
When adapting a pre-trained model (aka fine-tuning), why don’t we freeze the fully connected layers (tail)?
Freezing them would fix weights that cannot change; the convolutional layers would then have to update all of their weights to be compatible with the fixed fully connected part
–> a much harder optimization problem
–> some classes could become unreachable
When doing the extraction of relevant information technique (aka distillation), what are the 4 directions you could go?
1) Parameter pruning and sharing (e.g., network quantization)
2) Identifying redundant parameters (e.g., low-rank factorization)
3) Compression of convolutional filters
4) Knowledge distillation
What is the goal of extracting relevant information (aka distillation)?
Move the relevant information into a smaller model so that it needs fewer resources and runs faster
What is an important architecture used for extracting relevant information (aka distillation)?
The Teacher-Student architecture
What is network quantization and what are some benefits of this technique?
- use smaller/different data types to represent your weights
- you lose precision but save memory
- smaller data types often come with highly optimized operations/hardware support (see the sketch below)
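A minimal quantization sketch, assuming PyTorch's post-training dynamic quantization utilities (the exact module path can vary across PyTorch versions); the toy model here stands in for a real pre-trained network.

```python
import torch

float_model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Store and execute the Linear layers in int8 instead of float32:
# a little precision is lost, but the memory footprint shrinks.
quantized_model = torch.ao.quantization.quantize_dynamic(
    float_model, {torch.nn.Linear}, dtype=torch.qint8
)
```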
There are three variants when extracting relevant information (aka distillation). Which three?
1) response-based knowledge extraction
–> mimics the output (logits/predictions) of the teacher
2) feature-based knowledge extraction
–> mimics the output and intermediate states (feature maps)
3) relation-based knowledge extraction
–> exploits relationships between feature maps or data samples
In the setting of extracting relevant information with feature-based knowledge, why do we need to pre-process these features before they can be compared by the loss function?
Teacher and student features generally have different shapes/dimensions, so they must first be mapped to a common space where they are comparable (see the sketch below)
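A minimal sketch of such a mapping, assuming PyTorch; the feature dimensions, the linear projection, and the MSE comparison are illustrative choices, not a fixed recipe.

```python
import torch

teacher_dim, student_dim = 512, 128

# Learned projection that maps student features into the teacher's space.
projection = torch.nn.Linear(student_dim, teacher_dim)

def feature_distillation_loss(student_feat, teacher_feat):
    # Project the student features, then compare them with the teacher's.
    return torch.nn.functional.mse_loss(projection(student_feat), teacher_feat)
```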
In the setting of extracting relevant information with response-based knowledge, why is it preferable to look at the logits instead of the class labels/predictions?
The logits give a full distribution over the classes, showing what the model believes about all the possibilities
==> if we only keep the maximum (the predicted label), we lose that information (see the sketch below)
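A minimal sketch of a response-based distillation loss on softened logits, assuming PyTorch; the temperature `T` and the loss weighting are hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with a temperature so the student also
    # learns from the teacher's beliefs about non-maximal classes.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)
```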
What is the lottery ticket hypothesis?
Training succeeds for a given network if one of its subnetworks has been randomly initialized such that it can be trained in isolation to high accuracy in at most the number of iterations necessary to train the original network
What is the lottery ticket hypothesis, briefly?
- a large network contains a huge number of possible sub-network initializations
- the network trains successfully because (at least) one sub-network happened to be initialized well
- this sub-network, trained in isolation, can reach performance similar to that of the complete model
What is the algorithm for finding the winning ticket?
1) randomly initialize a neural network
2) train the network until it converges
3) prune a fraction of the weights (e.g., those with the smallest magnitude)
4) reset the weights of the remaining portion of the network to their values from 1) (see the sketch below)
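A rough sketch of one round of this procedure on a toy model, assuming PyTorch and simple magnitude pruning; the training step is elided.

```python
import copy
import torch

# 1) randomly initialized network (toy single-layer example)
model = torch.nn.Linear(784, 10)
initial_state = copy.deepcopy(model.state_dict())

# 2) ... train the model to convergence here (omitted) ...

# 3) prune a fraction of the weights: the smallest magnitudes are masked out
prune_fraction = 0.2
weights = model.weight.data
threshold = weights.abs().flatten().kthvalue(
    int(prune_fraction * weights.numel())
).values
mask = (weights.abs() > threshold).float()

# 4) reset the surviving weights to their initial values and re-apply the mask
model.load_state_dict(initial_state)
model.weight.data *= mask
```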