Transformer Neural Networks (TNNs) Flashcards
Brief introduction to TNNs
Introduced in the 2017 paper "Attention Is All You Need" by researchers at Google, the Transformer is a neural network architecture that achieves better performance through parallelization (processing all the tokens of a sequence at once rather than one by one, as recurrent networks do), made possible by its self-attention mechanism.
Positional encodings
Values generated by fixed mathematical functions (sine and cosine waves of different frequencies in the original design) and added to each word's embedding to indicate that word's position in the sentence; see the sketch below.
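A minimal NumPy sketch of the sinusoidal positional encoding from the original paper; the sequence length and embedding size below are illustrative choices, not values from these notes.

```python
# Minimal sketch of sinusoidal positional encoding; shapes are illustrative.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of position values."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    # Each pair of dimensions uses a sine/cosine wave of a different frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions
    return pe

# Usage: add the encoding to the word embeddings so order information is kept.
embeddings = np.random.randn(10, 16)       # 10 words, 16-dimensional embeddings
embeddings_with_position = embeddings + positional_encoding(10, 16)
```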
Self-attention mechanism
The TNN assigns each word in a sentence a "weight" (a mathematical multiplier) reflecting how important every other word in the sentence is to it. These varying attention weights let each word's representation be updated using the context that matters most.
Multi-head attention
The TNN runs several self-attention operations ("heads") in parallel on the same sentence, each with its own learned projections. Each head can capture a different kind of word relationship or interaction, and the heads' outputs are combined (concatenated and projected) for improved accuracy. A sketch follows below.
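A minimal NumPy sketch of multi-head attention, assuming random matrices as stand-ins for the learned query/key/value projections; the number of heads and dimensions are illustrative.

```python
# Minimal sketch of multi-head attention; projection matrices are random
# stand-ins for learned weights.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    outputs = []
    for _ in range(num_heads):
        # One set of query/key/value projections per head (normally learned).
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        # Each head computes its own attention pattern over the sentence.
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)
    # Concatenate the per-head results back to d_model and mix them.
    Wo = rng.standard_normal((d_model, d_model))
    return np.concatenate(outputs, axis=-1) @ Wo

x = np.random.randn(5, 16)                  # 5 words, 16-dimensional embeddings
out = multi_head_attention(x, num_heads=4)  # (5, 16)
```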
TNN architecture (Example of translating a sentence from English to French word-by-word)
- Input embeddings - Each word in a sentence is turned into a vector that captures its meaning
- Positional encoding - Information about the position of each word is added to the same vector (retaining the order of tokens)
- Encoder layers (typically 6) - Each layer contains a self-attention mechanism, a feed-forward layer, and layer normalization with residual connections; this stage focuses on building an accurate representation of the input text (see the sketch after this list)
- Decoder layers (typically 6) - Each layer again contains self-attention (masked so it cannot look ahead), attention over the encoder's output, a feed-forward layer, and layer normalization; this stage focuses on producing the output one token at a time
- Output layer - A linear layer with softmax that generates the next word in the output sentence
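A minimal NumPy sketch of a single encoder layer (self-attention and feed-forward sub-layers, each with a residual connection and layer normalization), assuming random matrices in place of learned parameters.

```python
# Minimal sketch of one encoder layer; weights are random stand-ins.
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

def encoder_layer(x, params):
    # Sub-layer 1: self-attention, then residual connection + normalization.
    attn_out = self_attention(x, params["Wq"], params["Wk"], params["Wv"])
    x = layer_norm(x + attn_out)
    # Sub-layer 2: position-wise feed-forward, then residual + normalization.
    ff_out = np.maximum(0, x @ params["W1"]) @ params["W2"]   # ReLU MLP
    return layer_norm(x + ff_out)

d_model, d_ff, seq_len = 16, 64, 5
rng = np.random.default_rng(0)
params = {
    "Wq": rng.standard_normal((d_model, d_model)),
    "Wk": rng.standard_normal((d_model, d_model)),
    "Wv": rng.standard_normal((d_model, d_model)),
    "W1": rng.standard_normal((d_model, d_ff)),
    "W2": rng.standard_normal((d_ff, d_model)),
}
x = rng.standard_normal((seq_len, d_model))   # 5 words, 16-dim embeddings
out = encoder_layer(x, params)                # same shape as the input
```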
How are attention weights obtained in self-attention mechanism?
We calculate attention weights based on how strongly each word relates to every other word in the sentence (grammatically, contextually, etc.).
For the actual calculation, each word's embedding is projected into query, key, and value vectors using learned weight matrices; the dot product of one word's query with another word's key measures how relevant the second word is to the first.
These scores are scaled and passed through a softmax, producing the attention weights (usually laid out as an attention weight table), and each word's vector is then modified: it becomes the weighted sum of the value vectors. A sketch is given below.
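A minimal NumPy sketch of the calculation described above, assuming random matrices Wq, Wk, Wv in place of learned projections; the four-word sentence is a made-up example.

```python
# Minimal sketch of how attention weights are obtained; Wq, Wk, Wv stand in
# for learned parameters.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model = 8
words = ["the", "cat", "sat", "down"]
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((len(words), d_model))

# Each word's embedding is projected into query, key, and value vectors.
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Q, K, V = embeddings @ Wq, embeddings @ Wk, embeddings @ Wv

# Dot products of queries and keys give relevance scores; scaling and softmax
# turn them into attention weights. Each row of this "table" sums to 1 and
# says how much one word attends to every other word.
scores = Q @ K.T / np.sqrt(d_model)
attention_weights = softmax(scores)            # shape (4, 4): the weight table

# Each word's vector is then modified: it becomes the weighted sum of the
# value vectors of all words.
new_vectors = attention_weights @ V
print(np.round(attention_weights, 2))
```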
Residual Connections (Skip Connections)
A residual (skip) connection adds a sub-layer's input directly to its output (output = x + Sublayer(x)). During backpropagation the gradient can flow through this identity path and bypass the sub-layer's transformation, which mitigates the vanishing gradient problem. In the Transformer, every self-attention and feed-forward sub-layer is wrapped in a residual connection followed by layer normalization.
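A minimal numerical sketch (illustrative only, not a Transformer component) of why the identity path helps: the gradient through a deep stack of sigmoid layers shrinks toward zero, while adding a skip connection around each layer keeps it alive.

```python
# Compare the gradient through 50 plain layers vs. 50 layers with skips.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_through_stack(depth, use_skip):
    x, grad = 0.5, 1.0
    for _ in range(depth):
        s = sigmoid(x)
        local = s * (1.0 - s)                  # derivative of the layer itself
        # With a skip the layer computes x + f(x), so its local derivative
        # is 1 + f'(x) instead of just f'(x).
        grad *= (1.0 + local) if use_skip else local
        x = (x + s) if use_skip else s
    return grad

print(gradient_through_stack(50, use_skip=False))  # effectively zero: vanishing gradient
print(gradient_through_stack(50, use_skip=True))   # stays on the order of 1
```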
Parallelization and Long-term dependencies (TNN vs. RNN) (needed for case study)
Parallelization:
RNNs process the input data sequentially, where each step depends on the output of the previous step, making it impossible to parallelize across time steps. In contrast, TNNs use a self-attention mechanism that allows all parts of an input sequence to be compared to each other simultaneously, which permits much faster computation than sequential models like an RNN (see the sketch below).
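A minimal NumPy sketch contrasting the two processing styles, assuming random weights; the RNN loop must run step by step, while the self-attention comparison of all token pairs is a single matrix product (here the raw embeddings serve as queries, keys, and values for simplicity).

```python
# Sequential RNN loop vs. one-shot self-attention over the whole sequence.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 100, 32
x = rng.standard_normal((seq_len, d))

# RNN: each hidden state depends on the previous one, so the loop cannot be
# parallelized across time steps.
Wx, Wh = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] @ Wx + h @ Wh)

# Self-attention: all pairwise token comparisons happen in one matrix product,
# which can be computed at once (and split across hardware).
scores = x @ x.T / np.sqrt(d)                     # (seq_len, seq_len) comparisons
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
out = weights @ x                                 # every token updated in parallel
```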
Long term dependencies:
TNNs use a self-attention mechanism that computes attention weights between every pair of tokens in the input sequence. This means that every word can be directly compared to every other word, regardless of their distance in the sequence, allowing the model to capture global dependencies. On the other hand, RNNs process tokens one at a time, passing hidden states from one step to the next, which can lead to the vanishing gradient problem and makes it hard to preserve information over long distances. These global dependencies allow for enhanced language understanding in TNNs, since the model can grasp the overall meaning and sentiment of a text by considering all contextual clues, not just local information.
Reduced training times and scalability (TNN vs. RNN) (needed for case study)
Reducing Training Times:
Due to parallelization, TNNs can process multiple tokens simultaneously, reducing the time needed for training. In contrast, RNNs' sequential nature inherently limits their training speed. This efficiency is crucial for training large models on large datasets.
Scalability:
TNNs can be scaled much more effectively because of their stackable architecture and the efficient processing style of the self-attention mechanism. Furthermore, the computation of attention scores is independent across token pairs, meaning distributed computing resources can be leveraged. On the other hand, RNNs are harder to scale because they process tokens sequentially, which is slower, and because stacking layers makes them even more sequential and memory-intensive, leading to gradient instability and higher computational cost. This scalability enables TNNs to tackle large datasets and complex tasks more effectively than RNNs.