P3 - Transformers Flashcards
What is the key difference between Transformers and RNNs/LSTMs?
Transformers eliminate sequential processing by handling all tokens simultaneously using self-attention mechanisms.
How do Transformers achieve parallel processing?
They represent the entire input sequence as vectors and apply attention mechanisms across all tokens at once, enabling high parallelization.
What role does positional encoding play in Transformer models?
Positional encodings add numerical representations to input embeddings, allowing the model to capture the order of tokens despite processing them in parallel.
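As a concrete illustration, here is a minimal sketch of the sinusoidal positional encoding scheme from the original Transformer paper, where even dimensions use sine and odd dimensions use cosine of position-dependent angles (function name and pure-Python representation are choices for this sketch):

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of positional encodings.

    Even dimensions get sin(pos / 10000^(i/d_model)),
    odd dimensions get the matching cos, so every position
    receives a distinct, smoothly varying pattern the model
    can add to its token embeddings.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(4, 8)
# position 0 is all sin(0) = 0 in even dims and cos(0) = 1 in odd dims
```

These encodings are simply added element-wise to the input embeddings before the first attention layer, which is how order information survives parallel processing.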
How do Transformers maintain global context throughout the network?
Instead of relying on a single hidden state, Transformers maintain a representation of the entire sequence, capturing global relationships across all tokens.
What is the self-attention mechanism and why is it important?
Self-attention compares each token with every other token to determine relationships (based on semantic similarity and contextual relevance), enabling the model to capture both local and global dependencies.
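The comparison described above can be sketched as scaled dot-product attention. This toy version uses the input vectors directly as queries, keys, and values (real models apply learned projection matrices W_Q, W_K, W_V, which are omitted here for clarity):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention on a list of token vectors.

    Simplification for illustration: Q = K = V = X
    (learned projections are omitted).
    """
    d = len(X[0])
    # score each token pair by dot product, scaled by sqrt(d)
    scores = [[sum(q * k for q, k in zip(X[i], X[j])) / math.sqrt(d)
               for j in range(len(X))] for i in range(len(X))]
    # softmax turns each row of scores into attention weights summing to 1
    weights = [softmax(row) for row in scores]
    # each output is the attention-weighted average of the value vectors
    return [[sum(weights[i][j] * X[j][k] for j in range(len(X)))
             for k in range(d)] for i in range(len(X))]

# tokens 0 and 2 are identical, so token 0 attends more to them than to token 1
out = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Because every token's scores against all other tokens are independent matrix operations, this is exactly the computation that parallelizes so well on GPUs.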
How does self-attention decide which tokens to focus on in a sentence?
It evaluates each token’s relevance by considering factors like semantic similarity and grammatical context, assigning higher attention scores to tokens that better match the context (e.g., linking “it” to “cat” rather than “mat”).