New ASR Flashcards
Whisper model details
Whisper is a 1.5 billion parameter sequence-to-sequence transformer model that has been pre-trained on 680,000 hours of weakly supervised speech recognition data. It shows a strong ability to generalize to many different datasets and domains. However, the size of such pre-trained ASR models can pose challenges in low-latency settings or resource-constrained hardware.
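As a quick sanity check on those numbers, the sketch below loads a Whisper checkpoint with the Hugging Face transformers library and counts its parameters. The checkpoint name (openai/whisper-large-v2) is an assumption for illustration; the 1.5-billion figure corresponds to the largest Whisper variant.

```python
# Minimal sketch (assumes the `transformers` library and the openai/whisper-large-v2
# checkpoint): load Whisper and confirm its parameter count.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Whisper is a sequence-to-sequence model: an audio encoder plus a text decoder.
n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params / 1e9:.2f}B")  # roughly 1.5B for large-v2
```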
How can pseudo-labelling be used for Whisper distillation?
Pseudo-labelling is a semi-supervised learning technique where a model is trained on a combination of labeled and unlabeled data. The model makes predictions on the unlabeled data, which are then treated as “pseudo-labels” and used in subsequent training iterations. The purpose of pseudo-labelling is to leverage the large amounts of unlabeled data to improve the performance of the model, especially when the amount of labeled data is limited. In the context of the Whisper model, pseudo-labelling was used to ensure consistent transcription formatting across the dataset and provide a sequence-level distillation signal.
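As a concrete illustration, the sketch below uses a large Whisper checkpoint as the teacher to pseudo-label unlabelled audio via the Hugging Face transformers pipeline. The checkpoint name and file paths are placeholders, not the exact setup of any particular paper.

```python
# Sketch: generate Whisper pseudo-labels for unlabelled audio.
# The checkpoint name and audio file paths below are placeholders.
from transformers import pipeline

teacher = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",  # teacher model (illustrative choice)
    device=0,                         # GPU index; use -1 for CPU
)

unlabelled_audio = ["clip_0001.wav", "clip_0002.wav"]  # hypothetical files

# The teacher's transcriptions become the pseudo-labels used to train the student;
# using a single teacher also enforces a consistent transcription format across the corpus.
pseudo_labels = [teacher(path)["text"] for path in unlabelled_audio]
for path, text in zip(unlabelled_audio, pseudo_labels):
    print(path, "->", text)
```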
How can pseudo-labelling be used to distill large language models?
Pseudo labelling, also known as self-training, is a semi-supervised learning technique that is often used to improve the performance of a model when labeled data is scarce. It can be used to distill large language models in the following way:
Training the Teacher Model: Initially, a large language model (also known as the teacher model) is trained on a large corpus of text data. This model is usually very complex and computationally intensive.
Generating Pseudo Labels: Once the teacher model is trained, it is used to make predictions on an unlabeled dataset. These predictions are treated as “pseudo labels” for the unlabeled data.
Training the Student Model: A smaller, less complex model (also known as the student model) is then trained on the newly labeled data. The aim is for the student model to mimic the behavior of the teacher model, thus distilling the knowledge of the large model into the smaller one.
Iterative Process: This process can be repeated iteratively, each time using the student model to generate new pseudo labels and then re-training it. This allows the student model to gradually improve its performance.
The main advantage of this approach is that it allows the knowledge of a large, complex model to be distilled into a smaller, more efficient model. This can be particularly useful when deploying models in resource-constrained environments, where the computational complexity of the large model would be prohibitive.
It’s also worth mentioning that the success of this method depends heavily on the quality of the pseudo labels. If the teacher model makes a lot of incorrect predictions, these errors will be propagated to the student model. Hence, the initial teacher model should be as accurate as possible.
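The sketch below shows the shape of this teacher-student loop in PyTorch, assuming encoder-decoder (seq2seq) teacher and student models from the transformers library. All names (teacher, student, tokenizer, unlabeled_texts) are placeholders; a real setup would add batching, pseudo-label filtering, and evaluation.

```python
# Sketch: one round of pseudo-label ("self-training") distillation.
# Assumes seq2seq teacher/student models with a shared tokenizer; all names are placeholders.
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, tokenizer, unlabeled_texts, device="cuda"):
    """Step 2: the frozen teacher labels the unlabeled data."""
    teacher.eval()
    labels = []
    for text in unlabeled_texts:
        inputs = tokenizer(text, return_tensors="pt").to(device)
        output_ids = teacher.generate(**inputs, max_new_tokens=64)
        labels.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return labels

def train_student(student, tokenizer, unlabeled_texts, pseudo_labels, optimizer, device="cuda"):
    """Step 3: the student is trained to reproduce the teacher's pseudo-labels."""
    student.train()
    for text, label in zip(unlabeled_texts, pseudo_labels):
        batch = tokenizer(text, text_target=label, return_tensors="pt").to(device)
        loss = student(**batch).loss  # cross-entropy against the pseudo-labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Step 4: repeat - regenerate pseudo-labels (optionally treating the improved student
# as the new teacher) and train again, monitoring label quality to limit error propagation.
```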
What are the pros and cons of pseudo-labelling for distillation use cases? How does it compare with other techniques?
Pseudo labeling has several advantages and disadvantages when it comes to distillation use cases:
Pros:
Efficient Use of Unlabeled Data: Pseudo labeling makes the most out of unlabeled data by generating pseudo labels. This can be particularly beneficial when labeled data is scarce or expensive to obtain.
Improved Model Performance: By leveraging the predictions of a more complex model (the teacher), pseudo labeling can help improve the performance of a simpler model (the student).
Resource Optimization: Pseudo labeling can help distill the knowledge of a large, complex model into a smaller, more efficient model, making it more feasible to deploy in resource-constrained environments.
Cons:
Error Propagation: If the teacher model makes incorrect predictions, these errors will be propagated to the student model through the pseudo labels, potentially degrading the performance of the student model.
Computationally Intensive: The process of training the teacher model, generating pseudo labels, and re-training the student model can be computationally intensive and time-consuming.
Dependence on Teacher Model: The success of pseudo labeling depends heavily on the quality of the teacher model. If the teacher model is not well-trained, the pseudo labels it generates may not be reliable.
Comparison with Other Techniques:
Other distillation techniques include methods like soft targets, attention transfer, and FitNets. Compared to these techniques, pseudo labeling is generally more straightforward to implement and does not require access to the internals of the teacher model (like gradients or intermediate layer activations).
However, these other techniques may provide more detailed guidance to the student model, potentially leading to better performance. For instance, soft targets can provide richer information by transferring the teacher’s entire output distribution, not just its predictions. Similarly, attention transfer and FitNets methods aim to make the student’s intermediate layer activations or attention maps similar to those of the teacher, providing a form of “internal guidance”.
In the end, the best technique to use can depend on factors like the specific task, the available computational resources, and the amount and quality of the available data.
Soft targets == knowledge distillation.
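For contrast with pseudo-labelling, the sketch below shows the classic soft-target (knowledge-distillation) loss in PyTorch: the student is trained to match the teacher's full, temperature-softened output distribution, blended with ordinary cross-entropy on the hard labels. The temperature and weighting values are illustrative.

```python
# Sketch: soft-target knowledge-distillation loss (Hinton-style).
# T (temperature) and alpha (blend weight) are illustrative hyperparameters.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable as T changes
    # Hard targets: standard cross-entropy against the ground-truth (or pseudo) labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1.0 - alpha) * hard
```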
Is knowledge distillation being applied to ASR?
Knowledge Distillation has been applied to the ASR task with a focus on encoder-only models, such as Wav2Vec 2.0 and HuBERT, leading to substantial model compression and speed increase. However, this often comes at the cost of increased Word Error Rate (WER). Distillation of Seq2Seq ASR models, like LAS, has also been attempted, but with challenges in maintaining WER performance.
KD has also been applied to the ASR task, albeit with a focus on encoder-only models. Peng et al. (2021) apply KD to the Wav2Vec 2.0 model (Baevski et al., 2020), achieving 79% model compression and a 59% increase in speed. However, these gains come at the expense of a 6.9% increase in WER on the LibriSpeech corpus (Panayotov et al., 2015). Chang et al. (2021) apply a similar method to the HuBERT model (Hsu et al., 2021) and likewise report a 7.0% WER increase. Pang et al. (2018) attempt to distill LAS (Chan et al., 2016), an early Seq2Seq ASR model, but find their best distilled model performs 2.2% WER worse than its larger counterpart. The Distil-Whisper paper, by contrast, focuses on KD of Seq2Seq models, achieving substantial model compression while preserving WER performance on OOD test data.
How has the Whisper model been distilled in previous studies?
Previous studies have focused on reducing the Whisper model’s size and memory footprint. Techniques such as Knowledge Distillation and Quantisation Aware Training (QAT) have been used to achieve significant parameter reduction with only marginal performance decrement. However, these approaches did not consider optimising the model for latency or robustness to different acoustic conditions.
Previous studies on distilling the Whisper model have predominantly centered on reducing model size and memory footprint. Shao et al. (2023) applied KD in combination with Quantisation Aware Training (QAT) (Jacob et al., 2017), demonstrating that significant parameter reduction is possible with only a marginal performance decrement. However, the student model is trained and tested on a small corpus of ID data, giving no measure of its ability to generalise to OOD data, and thus of its robustness to different acoustic conditions (Geirhos et al., 2020; Radford et al., 2022).
What is the limitation of the Whisper model in processing audio inputs? How does it deal with longer segments?
Whisper models have a fixed receptive field corresponding to 30 seconds of input audio and cannot process longer audio inputs at once. While this isn't a problem for most academic datasets, which comprise short utterances of less than 30 seconds, it can be a challenge for real-world applications such as meeting transcription, which require transcribing long audio files spanning many minutes or hours.
The original Whisper paper presents a long-form transcription algorithm that sequentially transcribes 30-second segments of audio and shifts the sliding window according to the timestamps predicted by the model. This auto-regressive algorithm requires both beam-search and temperature fallback to ensure accurate long-form transcription.
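As a usage sketch, the reference openai-whisper package exposes this sequential sliding-window algorithm through its transcribe function; the model name, file path, and decoding options below are illustrative defaults rather than a prescribed configuration.

```python
# Sketch: long-form transcription with the openai-whisper package, which implements
# the sequential 30-second sliding-window algorithm described above.
import whisper

model = whisper.load_model("large-v2")           # checkpoint name is illustrative

result = model.transcribe(
    "meeting_recording.mp3",                     # hypothetical long audio file
    beam_size=5,                                 # beam search within each 30-s window
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # temperature fallback on decoding failure
)

print(result["text"])           # full transcript
for seg in result["segments"]:  # predicted timestamps used to shift the window
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```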
What is Speculative Decoding (SD) and how does it accelerate the inference of auto-regressive transformer models?
Speculative Decoding is a method for accelerating the inference of auto-regressive transformer models by employing a faster assistant model. The assistant model generates a sequence of candidate tokens, all of which are verified by the main model in a single forward pass. The decoding process is sped up significantly by generating with the faster assistant model and only performing validation forward passes with the main model. The generated output exactly matches the sequence of tokens that would be generated by the main model, making it a natural replacement for existing inference pipelines that use the main model.
Speculative decoding is a technique to speed up the inference of large language models (and other auto-regressive transformers) by pairing a large target model with a smaller, faster draft model.
The outputs of the draft model are passed to the target model as candidate tokens, along with the original inputs. The target model then scores all of these candidates with its own parameters in a single forward pass and compares them with the tokens it would have generated itself. Rather than generating every token from scratch, the target model uses the draft as a shortcut: for example, if the draft model predicts that the next word in a sentence is “the”, the target model only needs to confirm that prediction. At the first position where it disagrees, it discards the remaining draft tokens and substitutes its own prediction, so it can still override the draft whenever a different word is more appropriate. This way, the target model produces exactly the same outputs as if it were running alone, but faster.
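The Hugging Face transformers library implements this idea as “assisted generation”: a smaller assistant (draft) model is passed to generate, and the main model only runs verification forward passes over the drafted tokens. The GPT-2 pair below is purely illustrative; any main/assistant pair sharing a tokenizer (e.g. Whisper with a distilled Whisper assistant) would work the same way.

```python
# Sketch: speculative decoding via transformers "assisted generation".
# Model names are illustrative; the assistant must use the same tokenizer as the main model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
main_model = AutoModelForCausalLM.from_pretrained("gpt2-xl")  # target model
assistant = AutoModelForCausalLM.from_pretrained("gpt2")      # draft model

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")

# The assistant drafts candidate tokens; the main model verifies them in a single
# forward pass and keeps the longest matching prefix, so the final output is
# identical to decoding with the main model alone, just faster.
output_ids = main_model.generate(
    **inputs,
    assistant_model=assistant,
    max_new_tokens=40,
    do_sample=False,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```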
What is Low rank adaptation of large language models?
Low-rank adaptation of large language models, or LoRA, is a technique that reduces the number of trainable parameters and the memory required to adapt pre-trained models to specific tasks or domains. It does this by freezing the original model weights and adding a pair of rank-decomposition matrices to each targeted layer; these low-rank matrices are the only parameters that are trained. LoRA is often used for natural language processing tasks such as text classification, natural language generation, and natural language understanding. It can achieve performance comparable to or better than full fine-tuning, which retrains all model parameters, while using far fewer resources, and it is typically faster and more efficient than other adaptation methods such as adapters, prefix-tuning, and PET. Some alternatives to LoRA are listed below, followed by a short code sketch of LoRA itself:
Adapters: These are small modules inserted between the layers of a pre-trained model and trained while the original weights are kept frozen. Adapters reduce the number of trainable parameters, but they add extra inference latency and complexity to the model architecture.
Prefix-tuning (we also have suffix-tuning): This method keeps the pre-trained weights frozen and instead trains a small set of continuous “prefix” vectors that are prepended to the model’s input embeddings (or to the keys and values at each layer). Prefix-tuning can produce diverse and controllable outputs, but optimisation can be unstable and performance is sensitive to the prefix length.
PET: This stands for Pattern-Exploiting Training, a technique that converts natural language understanding tasks into cloze-style questions that a pre-trained language model can answer. PET can improve the performance of pre-trained models, but it relies on the availability of high-quality patterns and verbalizers for each task.
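As mentioned above, here is a minimal LoRA sketch using the Hugging Face PEFT library; the base checkpoint, rank, and target module names are illustrative choices that would be tuned per model and task.

```python
# Sketch: wrap a pre-trained model with LoRA adapters using the PEFT library.
# The checkpoint, rank r, and target_modules are illustrative choices.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # base weights stay frozen

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the decomposition matrices A and B
    lora_alpha=16,              # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers in GPT-2
)

# Each targeted weight W is left frozen; only the low-rank pair (A, B) is trained,
# and the effective weight becomes W + (lora_alpha / r) * B @ A.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```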