Terminology Flashcards
What is an epoch?
One full pass through all training samples.
What are a batch and the batch size?
The training samples are grouped into smaller batches. The batch size is the number of samples in one batch.
What does the number of steps/iterations mean?
The number of batches it takes to go through one epoch.
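A minimal arithmetic sketch of how epoch, batch size, and steps relate; the dataset size, batch size, and epoch count below are made-up numbers:

```python
import math

num_samples = 10_000   # size of the training set (hypothetical)
batch_size = 32        # samples per batch (hypothetical)
num_epochs = 3         # full passes over the training set (hypothetical)

# Steps (iterations) per epoch = number of batches needed for one full pass.
steps_per_epoch = math.ceil(num_samples / batch_size)   # 313
total_steps = steps_per_epoch * num_epochs              # 939

print(steps_per_epoch, total_steps)
```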
What is a checkpoint?
Different versions of the model's weights saved at various points throughout training.
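A minimal PyTorch sketch of saving and restoring a checkpoint; the toy model, optimizer, epoch number, and file name are placeholders:

```python
import torch
from torch import nn

# Toy model and optimizer, just so there is something to checkpoint.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# A checkpoint bundles everything needed to resume training later.
torch.save(
    {
        "epoch": 5,  # hypothetical epoch number
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint_epoch_5.pt",
)

# Restoring the checkpoint into a fresh model.
state = torch.load("checkpoint_epoch_5.pt")
model.load_state_dict(state["model_state_dict"])
optimizer.load_state_dict(state["optimizer_state_dict"])
```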
What is BERT?
Bidirectional Encoder Representations from Transformers.
BERT is an encoder-only model.
A pre-trained encoder stack that creates dynamic or contextualised word embeddings which can be fine-tuned for transfer learning.
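A minimal sketch of getting contextualised embeddings out of BERT, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" gets a different embedding in each sentence, because the
# representations are contextualised rather than static.
sentences = ["I sat by the river bank.", "I deposited money at the bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```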
What is GPT2?
Generative Pre-trained Transformer 2. GPT-2 is a decoder-only model.
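A minimal sketch of autoregressive generation with GPT-2, assuming the Hugging Face transformers library and the public gpt2 checkpoint; the prompt and generation length are arbitrary:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The decoder-only model predicts the next token given all previous tokens.
inputs = tokenizer("The Transformer architecture", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```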
What is warm-starting?
Composing an encoder-decoder model from pre-trained, stand-alone model checkpoints is defined as warm-starting the encoder-decoder model.
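A minimal warm-starting sketch, assuming the Hugging Face transformers library: the encoder is initialized from a BERT checkpoint and the decoder from a GPT-2 checkpoint, while the cross-attention weights connecting them are new and randomly initialized:

```python
from transformers import EncoderDecoderModel

# Compose a seq2seq model from two stand-alone pre-trained checkpoints.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # warm-starts the encoder
    "gpt2",               # warm-starts the decoder
)
```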
What is Fine-tuning?
The task-specific training of a model that has been initialized with the weights of a pre-trained language model
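A minimal fine-tuning sketch, assuming the Hugging Face transformers library and PyTorch: a pre-trained BERT body is combined with a freshly initialized classification head and updated on one tiny, made-up labeled batch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The encoder weights come from pre-training; the 2-label head on top is new.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One task-specific training step on made-up labeled data.
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```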
What defines a Pre-trained model?
It has been trained on unlabeled text data, i.e. in a task-agnostic, unsupervised fashion, and it processes a sequence of input words into a context-dependent embedding.
What is self-attention?
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.
Overall, the self-attention mechanism allows the inputs to interact with each other (“self”). A self-attention module works by comparing every word in the sentence to every other word in the sentence, including itself, and reweighting the word embedding of each word to include contextual relevance.
It adds contextual information to the words in the sentence, e.g. to provide a better representation of words that have different meanings in different contexts.
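A minimal single-head self-attention sketch in NumPy; the projection matrices are random stand-ins for learned weights and the "sentence" is just random vectors:

```python
import numpy as np

def self_attention(X, rng):
    """Single-head self-attention over a sequence of embeddings X of shape (seq_len, d)."""
    d = X.shape[-1]
    # Stand-ins for the learned query/key/value projection matrices.
    W_q = rng.standard_normal((d, d))
    W_k = rng.standard_normal((d, d))
    W_v = rng.standard_normal((d, d))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Compare every position with every other position (including itself)...
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

    # ...then reweight: each output embedding is a mixture of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # a toy "sentence": 4 words, 8-dim embeddings
print(self_attention(X, rng).shape)  # (4, 8)
```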
What is an RNN?
RNNs are specialized for processing sequential data where the order is meaningful, e.g. language.
Recurrent neural networks are unidirectional, meaning they only work sequentially, and they introduce a hidden state that is passed along the chain. E.g. saying the alphabet is easy going forward but hard backwards, because it is learned as a sequence.
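A minimal NumPy sketch of the recurrent update: a hidden state is carried from one time step to the next, so the sequence can only be processed in order; the sizes and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 8, 16, 5
W_xh = 0.1 * rng.standard_normal((input_size, hidden_size))
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))

h = np.zeros(hidden_size)                  # hidden state: the network's memory of what came before
for x_t in rng.standard_normal((seq_len, input_size)):
    h = np.tanh(x_t @ W_xh + h @ W_hh)     # updated strictly left to right, one step at a time
print(h.shape)                             # (16,)
```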
What is a CNN?
CNNs are especially good for image data.
A convolutional neural network can process its input in parallel but has finite memory (a limited receptive field).
What does T5 stand for?
T5 stands for “Text-to-Text Transfer Transformer”
What is the difference/similarity between BERT and mT5?
Both are based on the Transformer architecture. BERT by itself is only an encoder, so unlike (m)T5 it is not a text-to-text model by default.
What is a Transformer architecture?
An attention-only sequence-to-sequence architecture introduced by Vaswani et al. Advantages: the input sequence can be processed in parallel. Multiple attention heads, each with its own set of weight matrices and processed in parallel, allow the model to jointly attend to information from different representational aspects in each sub-layer. Positional encodings preserve word-order information, since the architecture itself has no notion of sequence order.
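A minimal NumPy sketch of the sinusoidal positional encodings used in the original Transformer; the sequence length and model dimension are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as in Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# Added to the word embeddings so the model knows each word's position.
print(positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```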