Terminology Flashcards
What is an epoch?
One full pass through all training samples.
What are a batch and the batch size?
The training samples are grouped into smaller batches. The batch size is the number of samples in one batch.
What does the number of steps/iterations mean?
The number of batches it takes to go through one epoch.
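A minimal arithmetic sketch of how epoch, batch size, and steps relate; the dataset size, batch size, and epoch count below are made-up numbers:

```python
import math

num_samples = 10_000   # size of the training set (hypothetical)
batch_size = 32        # samples per batch (hypothetical)
num_epochs = 3         # full passes over the training set (hypothetical)

# Steps (iterations) per epoch = number of batches needed for one full pass.
steps_per_epoch = math.ceil(num_samples / batch_size)   # 313
total_steps = steps_per_epoch * num_epochs              # 939

print(steps_per_epoch, total_steps)
```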
What is a checkpoint?
Different versions of the model's weights saved at various points throughout training.
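A minimal PyTorch sketch of saving and restoring a checkpoint; the toy model, optimizer, epoch number, and file name are placeholders:

```python
import torch
from torch import nn

# Toy model and optimizer, just so there is something to checkpoint.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# A checkpoint bundles everything needed to resume training later.
torch.save(
    {
        "epoch": 5,  # hypothetical epoch number
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint_epoch_5.pt",
)

# Restoring the checkpoint into a fresh model.
state = torch.load("checkpoint_epoch_5.pt")
model.load_state_dict(state["model_state_dict"])
optimizer.load_state_dict(state["optimizer_state_dict"])
```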
What is BERT?
Bidirectional Encoder Representations from Transformers.
BERT is an encoder-only model.
A pre-trained encoder stack that creates dynamic or contextualised word embeddings which can be fine-tuned for transfer learning.
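A minimal sketch of getting contextualised embeddings out of BERT, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" gets a different embedding in each sentence, because the
# representations are contextualised rather than static.
sentences = ["I sat by the river bank.", "I deposited money at the bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```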
What is GPT2?
Generative Pre-trained Transformer 2. GPT-2 is a decoder-only model.
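A minimal sketch of autoregressive generation with GPT-2, assuming the Hugging Face transformers library and the public gpt2 checkpoint; the prompt and generation length are arbitrary:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The decoder-only model predicts the next token given all previous tokens.
inputs = tokenizer("The Transformer architecture", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```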
What is warm-starting?
Composing an encoder-decoder model from pre-trained, stand-alone model checkpoints is defined as warm-starting the encoder-decoder model.
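A minimal warm-starting sketch, assuming the Hugging Face transformers library: the encoder is initialized from a BERT checkpoint and the decoder from a GPT-2 checkpoint, while the cross-attention weights connecting them are new and randomly initialized:

```python
from transformers import EncoderDecoderModel

# Compose a seq2seq model from two stand-alone pre-trained checkpoints.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # warm-starts the encoder
    "gpt2",               # warm-starts the decoder
)
```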
What is Fine-tuning?
The task-specific training of a model that has been initialized with the weights of a pre-trained language model
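A minimal fine-tuning sketch, assuming the Hugging Face transformers library and PyTorch: a pre-trained BERT body is combined with a freshly initialized classification head and updated on one tiny, made-up labeled batch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The encoder weights come from pre-training; the 2-label head on top is new.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One task-specific training step on made-up labeled data.
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```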
What defines a Pre-trained model?
It has been trained on unlabeled text data, i.e. in a task-agnostic, unsupervised fashion, and it processes a sequence of input words into a context-dependent embedding.
What is self-attention?
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.
Overall, the self-attention mechanism allows the inputs to interact with each other (“self”). A self-attention module works by comparing every word in the sentence to every other word in the sentence, including itself, and reweighting the word embedding of each word to include contextual relevance.
It adds contextual information to the words in the sentence, e.g. to provide a better representation of words that have different meanings in different contexts.
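A minimal single-head self-attention sketch in NumPy; the projection matrices are random stand-ins for learned weights and the "sentence" is just random vectors:

```python
import numpy as np

def self_attention(X, rng):
    """Single-head self-attention over a sequence of embeddings X of shape (seq_len, d)."""
    d = X.shape[-1]
    # Stand-ins for the learned query/key/value projection matrices.
    W_q = rng.standard_normal((d, d))
    W_k = rng.standard_normal((d, d))
    W_v = rng.standard_normal((d, d))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Compare every position with every other position (including itself)...
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

    # ...then reweight: each output embedding is a mixture of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # a toy "sentence": 4 words, 8-dim embeddings
print(self_attention(X, rng).shape)  # (4, 8)
```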
What is an RNN?
RNNs are specialized for processing sequential data where the order is meaningful, e.g. language.
Recurrent neural networks are unidirectional, meaning they only work sequentially, and they introduce a hidden state that is passed along the chain. E.g. saying the alphabet is easy going forward but hard backwards, because it is learned as a sequence.
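A minimal NumPy sketch of the recurrent update: a hidden state is carried from one time step to the next, so the sequence can only be processed in order; the sizes and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 8, 16, 5
W_xh = 0.1 * rng.standard_normal((input_size, hidden_size))
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))

h = np.zeros(hidden_size)                  # hidden state: the network's memory of what came before
for x_t in rng.standard_normal((seq_len, input_size)):
    h = np.tanh(x_t @ W_xh + h @ W_hh)     # updated strictly left to right, one step at a time
print(h.shape)                             # (16,)
```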
What is a CNN?
CNNs are especially good for image data.
A convolutional neural network can process its input in parallel but has finite memory (a limited receptive field).
What does T5 stand for?
T5 stands for “Text-to-Text Transfer Transformer”
What is the difference/similarity between BERT and mT5?
Both are based on the Transformer architecture. BERT by itself is only an encoder, so unlike (m)T5 it is not a text-to-text model by default.
What is a Transformer architecture?
An attention-only sequence-to-sequence architecture introduced by Vaswani et al. Advantages: the input sequence can be processed in parallel. Multiple attention heads, each with its own set of weight matrices and processed in parallel, allow the model to jointly attend to information from different representational aspects in each sub-layer. Positional encodings preserve word-order information, since the architecture itself has no notion of sequence order.
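A minimal NumPy sketch of the sinusoidal positional encodings used in the original Transformer; the sequence length and model dimension are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as in Vaswani et al. (2017)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# Added to the word embeddings so the model knows each word's position.
print(positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```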