Lecture 10 Flashcards
What major AI breakthrough occurred in November 2022?
The introduction of ChatGPT.
What is the fundamental technology behind ChatGPT?
The Transformer model.
When was the Transformer model introduced?
In 2017, in the paper "Attention Is All You Need".
What makes Transformer models powerful?
They use self-attention and scale effectively with large datasets.
What are sequence models used for?
Processing sequential data like language, time series, and speech.
What are two common sequence models before Transformers?
Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
What is a limitation of RNNs?
They require sequential processing, making them slow.
What is a limitation of CNNs for sequences?
They have a limited receptive field, so they struggle to capture long-range dependencies without stacking many layers.
What advantage does self-attention provide?
It allows parallel processing and captures long-range dependencies.
What is the basic function of self-attention?
Each output is computed as a weighted sum of all input values.
How are self-attention weights determined?
They are computed dynamically from the input itself.
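A minimal sketch of this idea in NumPy (not the full Transformer formulation, which adds learned Q/K/V projections): the weights come from dot products between the inputs themselves, and each output is a weighted sum of all inputs.

```python
import numpy as np

# Toy input: 4 tokens, each a 3-dimensional vector.
x = np.random.randn(4, 3)

# Raw similarity scores between every pair of tokens (dot products).
scores = x @ x.T                      # shape (4, 4)

# Normalize each row into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each output is a weighted sum of all input vectors.
output = weights @ x                  # shape (4, 3)
```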
What is the primary benefit of self-attention?
It captures dependencies between all elements in a sequence efficiently.
What does the term ‘transformer’ refer to in deep learning?
A model architecture that relies on self-attention and feedforward layers.
What is a key benefit of Transformers over RNNs?
Transformers allow parallel computation, reducing training time.
What operation is central to self-attention?
Computing similarity between input elements to determine their importance.
What mathematical operation is used in self-attention?
Dot-product attention.
What mechanism normalizes attention weights?
The softmax function.
What does the softmax function do in self-attention?
It converts raw scores into probabilities that sum to 1.
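A quick sketch of this behavior (illustrative, not library code):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)         # approx. [0.659, 0.242, 0.099]
print(probs.sum())   # 1.0
```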
What are the three main components of self-attention?
Query, Key, and Value matrices.
What does the Query (Q) matrix represent?
It represents what each position is looking for; queries are matched against keys to decide where to attend.
What does the Key (K) matrix represent?
It represents what each position offers; keys are matched against queries to determine how much attention that position receives.
What does the Value (V) matrix represent?
It holds the actual information that will be aggregated.
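A sketch of how Q, K, and V are typically produced: each is a separate learned linear projection of the same input (the weight matrices below are random stand-ins for learned parameters).

```python
import numpy as np

seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)     # input sequence

# Stand-ins for learned projection matrices.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = x @ W_q   # what each position is looking for
K = x @ W_k   # what each position offers for matching
V = x @ W_v   # the content that gets aggregated
```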
What is scaled dot-product attention?
Dot-product attention in which the scores are divided by the square root of the key dimension, so large dot products do not push the softmax into regions with tiny gradients.
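A sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V, continuing the Q/K/V example above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # Similarity scores, scaled down by sqrt(d_k) for stability.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values.
    return weights @ V
```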
What is the benefit of multi-head attention?
It allows the model to focus on different aspects of the sequence simultaneously.
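A compact sketch of the idea behind multi-head attention: the model dimension is split across several heads, each runs attention independently, and the results are concatenated (projections are random stand-ins; a real implementation adds a final output projection).

```python
import numpy as np

def multi_head_attention(x, num_heads=2):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head has its own (stand-in) Q/K/V projections into a smaller space.
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)
    # Concatenate head outputs back to the model dimension.
    return np.concatenate(heads, axis=-1)

out = multi_head_attention(np.random.randn(4, 8))   # shape (4, 8)
```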
What is positional encoding in Transformers?
A technique to introduce order information into the input sequence.
Why is positional encoding necessary?
Because self-attention does not inherently preserve word order.
What type of functions are used for positional encoding?
Sine and cosine functions with different frequencies.
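A sketch of the sinusoidal positional encoding from the original Transformer paper: even dimensions use sine and odd dimensions use cosine, with frequencies that decrease across dimensions.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions
    return pe

# Added to the token embeddings so the model can tell positions apart.
pe = positional_encoding(seq_len=10, d_model=8)
```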
What is the role of feedforward layers in Transformers?
They apply transformations to each position independently after self-attention.
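A sketch of a position-wise feedforward layer: the same two-layer network (with, for example, a ReLU in between) is applied to every position independently; the weights below are random stand-ins for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, d_ff=32):
    seq_len, d_model = x.shape
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
    # Same weights applied to every position; no mixing across positions.
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2

out = position_wise_ffn(np.random.randn(4, 8))   # shape (4, 8)
```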
What is the key takeaway from Transformers?
They revolutionized sequence processing by enabling efficient parallelization and modeling of long-range dependencies.