Knowledge distillation Flashcards

1
Q

What aspects of a model's knowledge can be transferred to another model?

A

Vanilla knowledge distillation uses the logits of a large deep model as the teacher knowledge.

The activations, neurons, or features of intermediate layers can also be used as the knowledge to guide the learning of the student model.

The relationships between different activations, neurons, or pairs of samples contain rich information learned by the teacher model.

Furthermore, the parameters of the teacher model (or the connections between layers) encode yet another form of knowledge.

3 categories

These different forms of knowledge are discussed in the following categories: response-based knowledge, feature-based knowledge, and relation-based knowledge.

2
Q

What are the 3 categories of knowledge that can be transferred among models?

A

Response-based knowledge,

feature-based knowledge, and

relation-based knowledge.

3
Q

Response-Based Knowledge

A

Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model. The response-based knowledge distillation is simple yet effective for model compression, and has been widely used in different tasks and applications. Given a vector of logits z as the outputs of the last fully connected layer of a deep model, the distillation loss for response-based knowledge can be formulated as

L_ResD(z_t, z_s) = L_R(z_t, z_s), where L_R(·) denotes a divergence loss between the teacher logits z_t and the student logits z_s.

Usually z_t and z_s are first passed through a (temperature-scaled) softmax, and L_R employs the Kullback-Leibler divergence between the resulting soft targets.
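A minimal sketch of this loss in PyTorch (the tensor names and the temperature value are illustrative assumptions, not from the source):

import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both output distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between soft targets; the t^2 factor keeps gradient
    # magnitudes comparable across temperatures (Hinton et al., 2015).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Usage: combine with the ordinary supervised loss on hard labels, e.g.
# loss = F.cross_entropy(student_logits, labels) + response_kd_loss(student_logits, teacher_logits)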

4
Q

What is Feature-Based Knowledge?

A

Deep neural networks are good at learning multiple levels of feature representation with increasing abstraction.

Both the output of the last layer and the outputs of intermediate layers, i.e., feature maps, can be used as the knowledge to supervise the training of the student model.

The main idea is to directly match the feature activations of the teacher and the student. This can be done in several ways, e.g., learning which teacher layer each student layer should match, sharing parameters, and more.

If the teacher and student features do not have the same shape, a transformation (e.g., a learned projection matrix) has to be applied to align them.

Typical losses: l2/l1-norm distance, cross-entropy loss, and maximum mean discrepancy loss.
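A minimal sketch of feature-based distillation in PyTorch (the 1x1-convolution projection and the channel counts are illustrative assumptions):

import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    # Matches a student feature map to a teacher feature map ("hint learning").

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # A 1x1 convolution plays the role of the transformation matrix
        # when the channel dimensions differ.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        projected = self.proj(student_feat)
        # Resize if the spatial dimensions differ as well.
        if projected.shape[-2:] != teacher_feat.shape[-2:]:
            projected = F.interpolate(projected, size=teacher_feat.shape[-2:])
        # l2-norm distance between activations; the teacher side is frozen.
        return F.mse_loss(projected, teacher_feat.detach())

# Usage with hypothetical intermediate feature maps:
# loss_feat = FeatureDistillationLoss(64, 256)(student_feat, teacher_feat)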

5
Q

What types of distillation and losses are used to distill BERT (e.g., in TinyBERT)?

A

Transformer-layer distillation: includes attention-based distillation and hidden-states-based distillation, both trained with an MSE loss.

Embedding-layer distillation: similar to the hidden-states-based distillation, it is also performed on the embedding layer.

Prediction-layer distillation: a soft cross-entropy loss is applied between the student network's logits and the teacher's logits: L_pred = CE(z^T / t, z^S / t), where z^S and z^T are the logits predicted by the student and the teacher respectively, CE is the cross-entropy loss, and t is the temperature. In the experiments, t = 1 performs well.
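A minimal PyTorch sketch of these losses (the projection for mismatched hidden sizes and all tensor shapes are illustrative assumptions):

import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection from the student hidden size to the teacher hidden size,
# used when the two differ (e.g. 312 -> 768).
proj = nn.Linear(312, 768)

def transformer_layer_loss(student_attn, teacher_attn, student_hidden, teacher_hidden):
    # Attention-based distillation: MSE between attention matrices.
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    # Hidden-states-based distillation: MSE between (projected) hidden states.
    hidden_loss = F.mse_loss(proj(student_hidden), teacher_hidden)
    return attn_loss + hidden_loss

def prediction_layer_loss(student_logits, teacher_logits, t=1.0):
    # Soft cross-entropy between temperature-scaled logits (t = 1 in the paper).
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()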

6
Q

Relation-Based Knowledge

A

Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model. Relation-based knowledge further explores the relationships between different layers or data samples.

For example, the FSP (flow of solution procedure) matrix summarizes the relations between pairs of feature maps. It is calculated using the inner products between features from two layers.

Typical losses for relation-based knowledge include the Earth Mover's distance (L_EM), the Huber loss (L_H), the angle-wise loss (L_AW), and the Frobenius norm (‖·‖_F). Although several types of relation-based knowledge have been proposed recently, how to model the relational information from feature maps or data samples as knowledge still deserves further study.
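A minimal PyTorch sketch of the FSP matrix computation (the feature shapes are illustrative assumptions; both maps are assumed to share the same spatial size):

import torch

def fsp_matrix(feat_a, feat_b):
    # FSP matrix: inner products between the channels of two feature maps.
    # feat_a: (batch, C1, H, W), feat_b: (batch, C2, H, W) -> (batch, C1, C2)
    b, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(b, c1, h * w)
    bt = feat_b.reshape(b, c2, h * w).transpose(1, 2)
    return torch.bmm(a, bt) / (h * w)

# Relation-based loss: match the teacher's and student's FSP matrices, e.g.
# loss = F.mse_loss(fsp_matrix(s_f1, s_f2), fsp_matrix(t_f1, t_f2).detach())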

7
Q

What are the possible distillation schemes? When is the distillation performed?

A
  • Offline Distillation

Most previous knowledge distillation methods work offline.

The whole training process has two stages: 1) the large teacher model is first trained on a set of training samples before distillation; and 2) the teacher model is used to extract the knowledge in the form of logits or intermediate features, which then guides the training of the student model during distillation.

The main advantage of offline methods is that they are simple and easy to implement.

  • Online Distillation

To overcome the limitations of offline distillation, online distillation has been proposed to further improve the performance of the student model, especially when a large-capacity, high-performance teacher model is not available.

In online distillation, both the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable.

One example: multiple neural networks work in a collaborative way; any one network can be the student model, and the other models serve as teachers during training (see the sketch at the end of this answer).

  • Self-Distillation

In self-distillation, the same network is used as both the teacher and the student model.

This can be regarded as a special case of online distillation.

For example, one proposed self-distillation method distills knowledge from the deeper sections of the network into its shallow sections.

Besides, offline, online and self-distillation can also be understood intuitively from the perspective of human teacher-student learning. Offline distillation means a knowledgeable teacher teaches a student; online distillation means teacher and student study together; self-distillation means the student learns by themselves. Moreover, just like in human learning, these three kinds of distillation can be combined to complement each other thanks to their respective advantages.
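A minimal PyTorch sketch of online distillation in the collaborative (mutual-learning) style mentioned above; the two peer models, their optimizers, and the temperature are illustrative assumptions:

import torch.nn.functional as F

def mutual_distillation_step(model_a, model_b, opt_a, opt_b, x, y, t=2.0):
    # One online-distillation step: each network learns from the labels
    # and from its peer's softened predictions.
    logits_a, logits_b = model_a(x), model_b(x)

    def kd(own_logits, peer_logits):
        return F.kl_div(F.log_softmax(own_logits / t, dim=-1),
                        F.softmax(peer_logits.detach() / t, dim=-1),
                        reduction="batchmean") * t ** 2

    loss_a = F.cross_entropy(logits_a, y) + kd(logits_a, logits_b)
    loss_b = F.cross_entropy(logits_b, y) + kd(logits_b, logits_a)

    opt_a.zero_grad(); opt_b.zero_grad()
    (loss_a + loss_b).backward()
    opt_a.step(); opt_b.step()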

8
Q

What are possible student-teacher architectures?

A

Typically the student network is: a simplified version of the teacher with fewer layers and fewer channels per layer; a quantized version of the teacher that preserves its structure; a small network with efficient basic operations; a small network with an optimized global structure; or the same network as the teacher (self-distillation).
9
Q

Knowledge Distillation in Speech Recognition

A

To satisfy these requirements, knowledge distillation is widely studied and applied in many speech recognition tasks. There are many knowledge distillation systems for designing lightweight deep acoustic models for speech recognition

Most existing knowledge distillation methods for speech recognition use teacher-student architectures to improve the efficiency and recognition accuracy of acoustic models.

Some applications:

Using a recurrent neural network (RNN) for holding the temporal information from speech sequences, the knowledge from the teacher RNN acoustic model is transferred into a small student DNN model (Chan et al., 2015).

Better speech recognition accuracy is obtained by combining multiple acoustic models. Ensembles of different RNNs, each trained with its own criterion, are designed to train a student model through knowledge transfer.

Many applications transfer knowledge from models that work well on long sequences to models that work well on short sequences.

Alternatively, multimodal learned representations can be transferred into an audio-only architecture.

10
Q

Observations on knowledge distillation for speech recognition

A
  • The lightweight student model can satisfy the practical requirements of speech recognition, such as real-time responses, use of limited resources and high recognition accuracy.
  • Many teacher-student architectures are built on RNN models because of the temporal property of speech sequences. In general, the RNN models are chosen as the teacher, which can well preserve and transfer the temporal knowledge from real acoustic data to a student model.
  • Sequence-level knowledge distillation applies well to sequence models and achieves good performance. Frame-level KD typically uses response-based knowledge, whereas sequence-level KD usually transfers feature-based knowledge from the hint layers of teacher models.
  • Knowledge distillation using teacher-student knowledge transfer can readily address cross-domain or cross-modal speech recognition in applications such as multi-accent and multilingual speech recognition.
11
Q

What are some similarities of response-based knowledge with other techniques?

A

For example, the response-based knowledge has a similar motivation to label smoothing and the model regularization (Kim and Kim, 2017; Muller et al., 2019; Ding et al., 2019);

12
Q

What is so-called quantized distillation?

A

Network quantization reduces the computation complexity of neural networks by converting high-precision networks (e.g., 32-bit floating point) into low-precision networks (e.g., 2-bit and 8-bit). Meanwhile, knowledge distillation aims to train a small model to yield a performance comparable to that of a complex model. Some KD methods have been proposed that use the quantization process inside the teacher-student framework (Polino et al., 2018; Mishra and Marr, 2018; Wei et al., 2018; Shin et al., 2019; Kim et al., 2019a). Specifically, Polino et al. (2018) proposed a quantized distillation method to transfer the knowledge to a weight-quantized student network.
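A minimal PyTorch sketch of the general idea (not Polino et al.'s exact method): the student's weights are fake-quantized in the forward pass while the usual distillation loss is computed against the full-precision teacher; the bit width, layer, and temperature are illustrative assumptions:

import torch
import torch.nn.functional as F

def fake_quantize(w, bits=8):
    # Uniform weight quantization with a straight-through estimator:
    # the forward pass sees quantized weights, the backward pass sees identity.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()

def quantized_student_logits(x, weight, bias, bits=8):
    # Hypothetical one-layer student: a linear classifier with quantized weights.
    return F.linear(x, fake_quantize(weight, bits), bias)

# Training step against a frozen full-precision teacher:
# s_logits = quantized_student_logits(x, w, b)
# kd = F.kl_div(F.log_softmax(s_logits / 4, -1),
#               F.softmax(teacher_logits / 4, -1), reduction="batchmean") * 16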

13
Q

What is lifelong distillation?

A

Lifelong learning, including continual learning, continuous learning and meta-learning, aims to learn in a way similar to humans. It accumulates previously learned knowledge and also transfers the learned knowledge into future learning (Chen and Liu, 2018). Knowledge distillation provides an effective way to preserve and transfer learned knowledge without catastrophic forgetting.

14
Q

Cross-Modal Distillation

A

The data or labels for some modalities might not be available during training or testing. For this reason, it is important to transfer knowledge between different modalities. Several typical scenarios using cross-modal knowledge transfer are reviewed as follows

15
Q

Multi-Teacher Distillation

A

Different teacher architectures can provide their own useful knowledge for a student network. Multiple teacher networks can be used individually or jointly for distillation during the training of a student network. In a typical teacher-student framework, the teacher is usually a large model or an ensemble of large models. To transfer knowledge from multiple teachers, the simplest way is to use the averaged response of all teachers as the supervision signal.
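A minimal PyTorch sketch of this simplest strategy, averaging the teachers' softened responses (the temperature and tensor names are illustrative assumptions):

import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, t=4.0):
    # Average the teachers' softened predictions and use the result as the target.
    with torch.no_grad():
        avg_teacher = torch.stack(
            [F.softmax(z / t, dim=-1) for z in teacher_logits_list]
        ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, avg_teacher, reduction="batchmean") * t ** 2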

16
Q

Adversarial Distillation

A

Are there other ways of training the student model to mimic the teacher model? Recently, adversarial learning has received a great deal of attention due to its success in generative networks.

Specifically, the discriminator in a GAN estimates the probability that a sample comes from the training data distribution, while the generator tries to fool the discriminator with generated samples. Inspired by this, many adversarial knowledge distillation methods have been proposed to enable the teacher and student networks to better capture the true data distribution.

You can use an adversarial generator to create hard examples to train on.

Or you can use a discriminator loss on the outputs produced by the student and the teacher, which also allows the use of unlabeled data (see the sketch after the recap below).

Recap: GAN is an effective tool to enhance student learning via teacher knowledge transfer; jointly using GAN and KD can generate valuable data for improving KD performance and overcoming the limitations of unusable or inaccessible data; and KD can be used to compress GANs.
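A minimal PyTorch sketch of the discriminator-based variant (the MLP discriminator on logits, its size, and all names are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitDiscriminator(nn.Module):
    # Predicts whether a logit vector came from the teacher (label 1) or the student (label 0).

    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_classes, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, logits):
        return self.net(logits)

def adversarial_kd_losses(disc, student_logits, teacher_logits):
    # Discriminator loss: tell teacher outputs apart from student outputs.
    real = disc(teacher_logits.detach())
    fake = disc(student_logits.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
              F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    # Student ("generator") loss: make its outputs indistinguishable from the teacher's.
    # No ground-truth labels are needed, so unlabeled data can be used here.
    g_score = disc(student_logits)
    g_loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score))
    return d_loss, g_loss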