05 - Deep Learning for Sequence Processing Flashcards
What is an encoder and a decoder?
An encoder processes the input into an internal representation, whereas a decoder generates the model's output from that representation.
What is meant by neural attention?
Neural attention refers to neural networks that automatically weight the relevance of each input element when making predictions - they 'attend' to the inputs. This is advantageous because it yields performance gains.
What is a ‘query vector’, q?
It is a vector representing the element we are currently computing attention for; it is compared against the keys to score the relevance of each input.
What are ‘keys’?
Keys are the elements that are being matched or compared against the query vector, q.
What are ‘values’? (Think keys, values, queries)
Each key has an associated value vector. The values represent the context or meaning of the input elements, and the attention output is a weighted combination of them.
How do we normally transform affinity scores into probabilities?
We use the softmax, or the soft argmax, transformation.
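A minimal sketch of the softmax step (PyTorch assumed; the scores are illustrative):

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.1])  # raw query-key affinity scores
probs = torch.softmax(scores, dim=-1)   # exponentiate and normalize
print(probs, probs.sum())               # probabilities that sum to 1
```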
How do we compare a query, q and a key vector, h?
We take their dot product (dot-product attention). The queries and keys need to have the same dimensionality.
Keys and values need to be the same vectors. True or false?
False! They can correspond to different linear projections of the same vectors, but they do not need to be the same vector.
What is it called when we have multiple neural attention operations that are combined?
Multi-head attention.
What is the process of dot-product self-attention?
First we compute the query vectors.
Then we compute the key vectors.
Then we compute the values vectors.
Then we compute the query-key affinity scores with dot products.
We convert these scores to probabilities through softmax.
Finally we output the weighted average of the values.
How do we handle increasing complexity due to an increase in dimensionality of queries and keys?
We scale the dot products by the square root of the dimensionality of the query/key vectors, sqrt(d_k), which keeps the scores from growing with the dimension. Remember: queries and keys have the same dimension.
Attention(Q,K,V) = ?
In the above, Q = queries, K = keys, V = values
softmax( QK^T / sqrt(d_k) ) V
where d_k is the dimensionality of the query and key vectors (see the sketch below).
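Putting the steps and the formula together, a minimal self-attention sketch (PyTorch assumed; all sizes and weight matrices are illustrative stand-ins for learned parameters). Note that the keys and values are different linear projections of the same input vectors:

```python
import torch

T, d_model, d_k = 5, 16, 8             # sequence length, model dim, query/key dim
x = torch.randn(T, d_model)            # input token vectors
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q = x @ W_q                            # 1. query vectors
K = x @ W_k                            # 2. key vectors (same dimensionality as queries)
V = x @ W_v                            # 3. value vectors
scores = Q @ K.T / d_k ** 0.5          # 4. dot-product affinities, scaled by sqrt(d_k)
probs = torch.softmax(scores, dim=-1)  # 5. scores -> probabilities
out = probs @ V                        # 6. weighted average of the values
print(out.shape)                       # torch.Size([5, 8])
```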
What is masked self-attention?
It is self-attention in which part of the input is masked so it cannot be attended to. For example, when predicting the next word we mask that word (and everything after it), so the model cannot peek at what it is supposed to predict.
Why is masked self-attention useful?
It forces the model to capture meaning and context from the visible words alone, which increases its predictive power: it must focus on the important words and their relationships within the sentence rather than looking ahead at the answer.
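A sketch of the causal mask itself (PyTorch assumed): position i may only attend to positions up to i, so future tokens cannot leak into the prediction.

```python
import torch

T = 4
scores = torch.randn(T, T)                          # raw query-key affinity scores
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))  # block future positions
probs = torch.softmax(scores, dim=-1)               # upper triangle becomes 0
```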
Briefly explain how multi-head self-attention works.
In multi-head self-attention we project the input into several lower-dimensional subspaces (one per head) and perform self-attention in each head in parallel. To get the final output we concatenate the heads' outputs (typically followed by a linear projection).
If we have 6 attention heads (multi-head self-attention) and a token embedding size of 600, how might we project tokens to each attention head?
By projecting the 600-dimensional tokens down to 100 dimensions within each attention head (600 / 6 = 100), as sketched below.
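A sketch of that head split (PyTorch assumed; the projection is an illustrative stand-in for a learned weight matrix):

```python
import torch

T, d_model, n_heads = 10, 600, 6
d_head = d_model // n_heads                # 600 / 6 = 100 dims per head
x = torch.randn(T, d_model)
W_q = torch.randn(d_model, d_model)        # one big projection, reshaped per head
Q = (x @ W_q).reshape(T, n_heads, d_head)  # (10, 6, 100): a 100-dim query per head
```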
Why are transformers called transformers and what mechanism do they use to ‘transform’?
Because they utilize the transformer architecture, which uses self-attention to ‘transform’ input representations very effectively.
Sequence-to-sequence (seq2seq) transformers are based on self-attention. True or false?
True
Transformers: Deep models with several layers scale quadratically with respect to sequence length. True or false?
True
Encoder layers do not use self-attention and feed-forward transformations. True or false?
False. They use both.
Decoder layers use masked self-attention for auto-regressive masking. True or false?
True
Decoder layers use cross-attention to consider context from the encoder states. True or false?
True
What are encoder-only models good for? An example is BERT.
They are really effective at solving classification and span prediction tasks (e.g., question answering, speech recognition); see the example below.
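A hedged example of span prediction with an encoder-only model via the Hugging Face pipeline API (the checkpoint name is an assumption; any BERT-family QA model works the same way):

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")  # assumed checkpoint
result = qa(question="What is BERT good at?",
            context="Encoder-only models such as BERT are effective at "
                    "classification and span prediction tasks.")
print(result["answer"])  # a span copied out of the context
```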
What is transfer learning?
In machine learning, transfer learning is when a model learns from one task and then uses that learning to help with another related task. It’s like using what you already know to make learning new things faster and easier.
What do we use encoder-decoder models for?
Generative and classification tasks. They are great for transfer learning.
What do we use decoder-only models for?
This is the technology behind the large generative language models, such as ChatGPT.
What is ‘vector quantization’?
Vector quantization is a method of representing data by grouping similar vectors together and using a smaller set of representative vectors to approximate the original data.
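A minimal vector quantization sketch (PyTorch assumed; the codebook here is random, whereas in practice it is learned):

```python
import torch

codebook = torch.randn(32, 8)     # 32 representative 8-dim codewords
x = torch.randn(100, 8)           # 100 input vectors
dists = torch.cdist(x, codebook)  # pairwise distances, shape (100, 32)
ids = dists.argmin(dim=1)         # index of the nearest codeword per vector
x_q = codebook[ids]               # quantized approximation of x
```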
What do we use the Audio Spectrogram Transformer (AST) for?
Recognizing speech commands, language identification, speaker identification, speech emotion recognition, environmental audio classification.
What is the ‘Mockingjay’ model?
It is a model inspired by BERT that is pre-trained by masking a random subset of input frames; the model then tries to reconstruct the original frames.
What is the DiscreteBERT?
It is a model that uses discrete units as input rather than continuous audio features. This avoids the local smoothness of continuous features!
What is a ‘conformer’?
A conformer combines convolution layers to exploit local features with self-attention to model global context. It replaces the standard transformer blocks with ‘conformer blocks’.
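A heavily simplified conformer-block sketch (PyTorch assumed; the real block also uses half-step feed-forward residuals, GLU activations, and relative positional attention, omitted here to keep just the conv-plus-attention idea):

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    def __init__(self, d=144, heads=4, kernel=31):
        super().__init__()
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d) for _ in range(3))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        # Depthwise convolution captures local features.
        self.conv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))

    def forward(self, x):                 # x: (batch, time, d)
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]     # global context via self-attention
        c = self.conv(self.ln2(x).transpose(1, 2)).transpose(1, 2)
        x = x + c                         # local features via convolution
        return x + self.ff(self.ln3(x))

print(ConformerBlockSketch()(torch.randn(2, 50, 144)).shape)  # (2, 50, 144)
```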
Transformers for audio input are very much different from transformers with text input. True or false?
False. They are similar in many ways.
In what way is the original ‘wav2vec’ model trained?
It is trained with contrastive predictive coding (CPC), a self-supervised learning approach.
What is the essence of ‘Contrastive Predictive Coding’ (CPC)?
CPC learns the current context representation from speech data. It does so by predicting future audio segments from past samples.
The loss implicitly models the mutual information between the context representations and the future encoder feature vectors.
For these reasons, it is good for speaker recognition.
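A minimal InfoNCE-style contrastive loss sketch (PyTorch assumed; real CPC derives c from past audio with an autoregressive network and z from an encoder, both random here):

```python
import torch
import torch.nn.functional as F

B, d = 16, 128
c = torch.randn(B, d)                   # context representations (from the past)
z = torch.randn(B, d)                   # encoder features of the true future steps
logits = c @ z.T                        # score every (context, candidate) pair
target = torch.arange(B)                # the matching pair is the positive
loss = F.cross_entropy(logits, target)  # other batch items act as negatives
```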
How does the ‘VQ-Wav2vec’ differ from the original ‘wav2vec’ model?
As the name implies, VQ stands for Vector Quantization, which the model uses to transform acoustic features into discrete units.
What special trick does the quantization step in the ‘VQ-Wav2vec’ make use of?
It uses the Gumbel-softmax trick for straight-through estimation. This keeps the discrete codeword selection differentiable, so gradients can flow through the quantization step.
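PyTorch ships this trick directly; a minimal sketch of a differentiable codeword lookup:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 32, requires_grad=True)         # scores over 32 codewords
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # hard forward, soft backward
codebook = torch.randn(32, 8)
quantized = one_hot @ codebook                          # gradients reach the logits
```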
Why is the HuBERT useful?
It combines the benefits of unsupervised and supervised learning: offline clustering of large-scale unlabeled audio provides pseudo-labels (‘hidden units’), which the model learns to predict for masked frames, BERT-style. This yields state-of-the-art performance in speech recognition and related tasks.
What is the OpenSMILE toolbox used for?
It is an open-source audio feature extractor. It stands for Speech & Music Interpretation by Large-space Extraction. It is mostly used in the area of paralinguistics.
What is the ‘W2v-Conformer’ and why is it useful?
“W2v-Conformer” is useful because it combines the power of the Conformer architecture with the advantages of wav2vec-style self-supervised pretraining, resulting in improved performance in automatic speech recognition and other speech tasks.
Why is the ‘BigSSL’ useful?
(Big Self-Supervised Learning) is useful because it leverages large-scale self-supervised pretraining on unlabeled audio, enabling the model to learn meaningful representations and achieve better performance across downstream speech tasks such as automatic speech recognition.
What does the ‘W2v-BERT’ combine?
The contrastive loss of wav2vec with the masked language modelling of BERT, without alternating objectives (as HuBERT does).
What does the SpeechT5 stand for? (the 5 T’s)
Text-To-Text Transfer Transformer.
What is a ‘vocoder’?
It is short for voice encoder/decoder, and it is used to manipulate and synthesize speech and other audio.
The SpeechT5 uses cross-modal mapping through vector quantization. True or false?
True
What is the OpenAI Whisper model based on? Or: what is the main task of the model?
Speech-to-text
What type of model is the OpenAI Whisper model and what is the primary use?
It is a sequence-to-sequence transformer model trained on many different speech tasks. The primary uses are all based on speech-to-text.
What is HuggingFace used for?
It is an open-source provider of NLP/ML technologies. It is also a library of transformer models to use, for example, on top of PyTorch.
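A hedged example tying this to the Whisper cards above (the checkpoint name and audio path are assumptions):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("example.wav")["text"])  # transcribed speech from an assumed audio file
```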
When we have multi-lingual speech data we might use VALL-E and VALL-E X. True or false?
True. The X stands for Cross-lingual.
The VALL-E and VALL-E X uses EnCodec. What is this?
It is a convolutional auto-encoder. VALL-E creates audio from discrete tokens using this method.
What kind of transformers does VALL-E combine?
The main model combines autoregressive/non-autoregressive Transformers.