Lesson 8 - Transfer Learning Flashcards
There are three ways to do transfer learning. Name them.
1) Pre-trained models as feature extractors
2) Adapting a pre-trained model (aka fine-tuning)
3) Extracting Relevant Information (aka distillation)
How do you use a pre-trained model as a feature extractor to transfer learning?
- take pre-trained model
- extract activations from given parts of the network
- train a standard ML method (e.g., an SVM or logistic regression) on the extracted activations (see the sketch below)
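A minimal sketch of the feature-extractor approach, assuming PyTorch/torchvision and scikit-learn; `train_images` and `train_labels` are hypothetical placeholders for an already-preprocessed dataset.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Load a pre-trained backbone and drop its classification head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep the 512-d pooled features
backbone.eval()

with torch.no_grad():
    # train_images: hypothetical float tensor of shape (N, 3, 224, 224), already normalized
    features = backbone(train_images).numpy()

# Train a standard ML classifier on the extracted activations.
clf = LogisticRegression(max_iter=1000).fit(features, train_labels)
```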
What are some problems or limitations with the technique to use pre-trained models as feature extractors?
We assume:
1) the model encodes/knows useful features for the new problem
OR
2) combinations of these features can be used to represent new data for new problem
If neither holds –> poor performance on the new problem
How do you use the technique to adapt a pre-trained model (aka fine-tuning)?
- adjust/replace the final layer so it matches the new task
- update the weights of some layers to adapt to the new task
- freeze some weights, retrain others (see the sketch below)
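A minimal fine-tuning sketch along these lines, assuming torchvision; `num_classes` is a placeholder for the number of classes in the new task.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained weights...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the final layer so it matches the new task;
# the new layer's weights are trainable by default.
num_classes = 10
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Only the unfrozen parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```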
When adapting a pre-trained model (aka fine-tuning), why don’t we freeze the fully connected layers (tail)?
Freezing them would fix weights that cannot change; the convolutional layers would then have to update all of their weights to be compatible with the fixed fully connected part
–> a much harder optimization problem
–> some classes could become unreachable
When doing the extraction of relevant information technique (aka distillation), what are the 4 directions you could go?
1) Parameter pruning and sharing (e.g., network quantization)
2) Identifying redundant parameters (e.g., low-rank factorization)
3) Compression of convolutional filters
4) Knowledge distillation
What is the goal of extracting relevant information (aka distillation)?
Move the relevant information into a smaller model so that it needs fewer resources and runs faster
What is an important architecture used for extracting relevant information (aka distillation)?
The Teacher-Student architecture
What is network quantization and what are some benefits of this technique?
- use smaller/different data types to represent your weights
- you lose precision but save memory
- smaller data types often come with highly optimized operations/hardware support (see the sketch below)
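A minimal quantization sketch, assuming PyTorch's post-training dynamic quantization utilities (the exact module path can vary across PyTorch versions); the toy model here stands in for a real pre-trained network.

```python
import torch

float_model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Store and execute the Linear layers in int8 instead of float32:
# a little precision is lost, but the memory footprint shrinks.
quantized_model = torch.ao.quantization.quantize_dynamic(
    float_model, {torch.nn.Linear}, dtype=torch.qint8
)
```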
There are three variants when extracting relevant information (aka distillation). Which three?
1) response-based knowledge extraction
–> mimics the output (logits/predictions) of the teacher
2) feature-based knowledge extraction
–> mimics the output and intermediate states (feature maps)
3) relation-based knowledge extraction
–> exploits relationships between feature maps or data samples
In the setting of extracting relevant information with feature-based knowledge, why do we need to pre-process these features before they can be compared by the loss function?
Teacher and student features generally have different shapes/dimensions, so they must first be mapped to a common space where they are comparable (see the sketch below)
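A minimal sketch of such a mapping, assuming PyTorch; the feature dimensions, the linear projection, and the MSE comparison are illustrative choices, not a fixed recipe.

```python
import torch

teacher_dim, student_dim = 512, 128

# Learned projection that maps student features into the teacher's space.
projection = torch.nn.Linear(student_dim, teacher_dim)

def feature_distillation_loss(student_feat, teacher_feat):
    # Project the student features, then compare them with the teacher's.
    return torch.nn.functional.mse_loss(projection(student_feat), teacher_feat)
```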
In the setting of extracting relevant information with response-based knowledge, why is it preferable to look at the logits instead of the class labels/predictions?
The logits give a full distribution over the classes, showing what the model believes about all the possibilities
==> if we only keep the maximum (the predicted label), we lose that information (see the sketch below)
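A minimal sketch of a response-based distillation loss on softened logits, assuming PyTorch; the temperature `T` and the loss weighting are hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with a temperature so the student also
    # learns from the teacher's beliefs about non-maximal classes.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)
```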
What is the lottery ticket hypothesis?
Training succeeds for a given network if one of its subnetworks has been randomly initialized such that it can be trained in isolation to high accuracy in at most the number of iterations necessary to train the original network
What is the lottery ticket hypothesis, briefly?
- a large network contains a huge number of possible sub-network initializations
- the network trains successfully because (at least) one sub-network happened to be initialized well
- this sub-network, trained in isolation, can reach performance similar to that of the complete model
What is the algorithm for finding the winning ticket?
1) randomly initialize a neural network
2) train the network until it converges
3) prune a fraction of the weights (e.g., those with the smallest magnitude)
4) reset the weights of the remaining portion of the network to their values from 1) (see the sketch below)
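A rough sketch of one round of this procedure on a toy model, assuming PyTorch and simple magnitude pruning; the training step is elided.

```python
import copy
import torch

# 1) randomly initialized network (toy single-layer example)
model = torch.nn.Linear(784, 10)
initial_state = copy.deepcopy(model.state_dict())

# 2) ... train the model to convergence here (omitted) ...

# 3) prune a fraction of the weights: the smallest magnitudes are masked out
prune_fraction = 0.2
weights = model.weight.data
threshold = weights.abs().flatten().kthvalue(
    int(prune_fraction * weights.numel())
).values
mask = (weights.abs() > threshold).float()

# 4) reset the surviving weights to their initial values and re-apply the mask
model.load_state_dict(initial_state)
model.weight.data *= mask
```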