Lecture 10 - LLMs Flashcards
What is a language model?
A model (DNN or otherwise) that computes the probability P(w|context), where:
- “Context” typically refers to the previous words.
- A more general definition is P(symbol|context).
What does the probability P(w|context) represent?
It represents the probability of a string, given the previous or surrounding strings.
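For reference, this is what lets a language model score whole sequences: by the chain rule, the probability of a sentence factors into next-word probabilities (a standard identity, not specific to any particular model):

```latex
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
```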
What are neural language models?
They compute P(w|context) using a neural network to predict words based on the input context.
Why train a language model?
- (sometimes) for word or string prediction
- (usually) as a pre-training task: training to predict words lets the model learn general patterns and the structure of language (transfer learning).
What did Ada Lovelace recognize about machines and prediction?
Ada Lovelace, considered the first computer programmer, recognized that machines could go beyond calculations to generalized problem solving.
What are the benefits of learning from prediction?
- Prediction is challenging and invites learning at many levels.
- Prediction enables training on near limitless amounts of data.
What are embeddings in GPT language models?
Embeddings are numerical representations of words, symbols, or other data: vectors that capture meaning and relationships between words.
Why are embeddings important in language models?
Embeddings:
- Contain a rich internal structure, despite their fuzziness, that enables the model's success.
- Use distributed representation: information is spread across many dimensions.
- Can have arbitrary dimensionality (typically hundreds or thousands of dimensions).
- Exhibit graded relationships: words with similar meanings have similar embeddings.
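A minimal sketch of the "graded relationships" idea, using made-up 3-dimensional toy vectors (real embeddings have hundreds or thousands of dimensions, and these particular numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d embeddings (invented for illustration only).
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.1, 0.9, 0.3]),
}

print(cosine_similarity(emb["cat"], emb["dog"]))  # high: similar meanings
print(cosine_similarity(emb["cat"], emb["car"]))  # lower: different meanings
```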
What are the six steps in the general setup of ANN language modeling?
- Input: The model receives an input sentence (e.g., “The students opened their MacBooks”).
- Tokens: Words are represented as numerical IDs.
- Embedding: Each token is converted into an embedding vector.
- Model: The DNN learns patterns in language to predict the most likely next word.
- Output: The model generates probabilities over the vocabulary for each token position.
- Target: The model compares predictions with the correct token IDs and minimizes error.
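A minimal end-to-end sketch of these six steps in PyTorch (toy sizes and untrained weights; names like vocab_size and d_model, and the single linear layer standing in for the DNN, are illustrative assumptions, not the lecture's model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 32           # toy sizes (assumed)

# Input -> Tokens: words represented as numerical IDs.
token_ids = torch.tensor([[5, 42, 7, 19]])

# Embedding: each token ID becomes a d_model-dimensional vector.
embed = nn.Embedding(vocab_size, d_model)
x = embed(token_ids)                     # shape: (1, 4, d_model)

# Model: stand-in for the DNN (a single linear layer here, for brevity).
model = nn.Linear(d_model, vocab_size)
logits = model(x)                        # shape: (1, 4, vocab_size)

# Output: probabilities over the vocabulary at each token position.
probs = F.softmax(logits, dim=-1)

# Target: correct next-token IDs; training minimizes the prediction error.
targets = torch.tensor([[42, 7, 19, 88]])
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```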
What role do embeddings play in deep neural networks (DNNs)?
Embeddings are the input representations that allow DNNs to learn patterns and relationships in language, enabling the model to predict the next word or perform other language tasks.
What is the general setup for ANN language modeling?
A sequence of words is mapped to a sequence of probability distributions p(w|context), one per token position.
How does ANN language modeling generate predictions and calculate loss?
- Input: The context (e.g., “the students opened their”).
- Prediction vector p(w∣context): The model generates a vector of probabilities over the vocabulary.
- Target vector: a one-hot vector over the vocabulary, where a 1 marks the correct word's position.
- Loss: negative log-likelihood, loss = −log p(w|context).
- If the model assigns the correct word high probability, the loss is small; if it assigns low probability, the loss is large.
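A quick worked instance of this loss (the two probabilities are invented for illustration):

```python
import math

# If the model assigns the correct word probability 0.9, the loss is small:
print(-math.log(0.9))   # ~0.105

# If it assigns the correct word probability 0.01, the loss is large:
print(-math.log(0.01))  # ~4.605
```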
What is negative log-likelihood loss in ANN models?
- A loss function used to quantify prediction error.
- loss = −log p(w|context)
- It penalizes low probabilities assigned to the correct word.
What is the Transformer architecture, and what does it replace?
The Transformer is a neural network architecture introduced to replace traditional models like RNNs and LSTMs for sequence tasks. It relies heavily on self-attention mechanisms.
What are the two main components of the Transformer architecture?
- encoder
- decoder
How do modern Transformers (e.g., GPT) simplify the original structure?
Modern Transformers simplify the structure by focusing on either the encoder or the decoder (e.g., GPT uses primarily the decoder).
What is self-attention in neural networks?
Self-attention computes interactions between all inputs and produces outputs as a weighted sum of the inputs.
In its basic form, this is deterministic (no learnable parameters).
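A minimal NumPy sketch of this parameter-free form, where the attention weights come straight from dot products between the inputs (toy shapes, assumed for illustration):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax: turns scores into weights summing to 1.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.randn(4, 8)   # 4 input vectors, dimension 8

scores = X @ X.T            # interactions between all pairs of inputs
weights = softmax(scores)   # one weight distribution per input
output = weights @ X        # each output is a weighted sum of the inputs
```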
What are the learnable parameters in self-attention?
The three linear transformations (weight matrices) that produce queries, keys, and values. Each input vector x is used in three ways:
- Query (Q): to compute attention weights for its own output.
- Key (K): to compute attention weights for the other vectors' outputs.
- Value (V): as input to the weighted sum.
How are Queries, Keys, and Values (Q, K, V) derived in self-attention?
Q, K, and V are obtained by applying three different learned linear transformations to the input vectors x; these transformations are the learnable parameters (see the sketch after the next card).
What is the matrix format of self-attention?
- Input X is transformed into Queries (Q), Keys (K), and Values (V).
- The attention mechanism computes a weighted sum using the softmax of KᵀQ, scaled by √d_k.
- Final output: Attention(Q, K, V) = V · softmax(KᵀQ / √d_k).
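A minimal NumPy sketch of this computation (toy sizes; written in the row-vector convention, where the same formula reads softmax(QKᵀ / √d_k) · V):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k = 8, 8
X = np.random.randn(4, d_model)             # 4 input vectors (rows)

# The learnable parameters: three projection matrices.
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention
output = weights @ V                        # weighted sum of the values
```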
Why is self-attention highly parallelizable?
All words (vectors) are processed in one sweep, enabling parallel computations.
What are the key properties of self-attention?
- Fully parallel: Matrix operations allow all computations at once.
- No problem looking far back: captures long-range relationships anywhere in the input.
- Relies on input embeddings: Outputs depend heavily on input vectors.
What is the problem with self-attention for language models (LMs)?
- Insensitive to order: Self-attention is permutation invariant, so input order doesn’t matter.
- All vectors see all others: without constraints, inputs can attend to future words, which is problematic for causal tasks.
How can self-attention be made position sensitive?
By adding position embeddings to input vectors.
At the first layer, x becomes the combination (sum) of two vectors:
- Embedding of word (content).
- Embedding of position (position in the sequence).
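A minimal sketch of this combination as element-wise addition, the common choice in GPT-style models (sizes and token IDs are toy assumptions):

```python
import numpy as np

vocab_size, max_len, d_model = 1000, 64, 8       # toy sizes (assumed)

word_emb = np.random.randn(vocab_size, d_model)  # learned content embeddings
pos_emb = np.random.randn(max_len, d_model)      # learned position embeddings

token_ids = [5, 42, 7, 19]                       # e.g. "the students opened their"
# Input to the first layer: word embedding + position embedding, per token.
x = np.stack([word_emb[t] + pos_emb[i] for i, t in enumerate(token_ids)])
```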
How is self-attention made uni-directional?
- By using masked self-attention:
- A mask blocks attention to future words: attention scores for those positions are set to −∞ before the softmax, so their weights become zero.
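A minimal NumPy sketch of causal masking applied to the attention scores (toy sizes; random scores stand in for KᵀQ):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.randn(n, n)    # raw attention scores (toy values)

# Causal mask: -inf above the diagonal blocks attention to future positions.
mask = np.triu(np.full((n, n), -np.inf), k=1)
weights = softmax(scores + mask)  # masked positions get weight 0

print(np.round(weights, 2))       # lower-triangular weight matrix
```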
Why is masked self-attention important for causal tasks?
It ensures that each word can only see words in the past, preserving the sequential nature of language.
How is self-attention made expressive?
Multi-headed self-attention makes self-attention more expressive by using different QKV matrices for each head.
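A minimal NumPy sketch of multi-head attention (toy sizes; the head outputs are concatenated, and the original Transformer additionally applies a final output projection, omitted here for brevity):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # Scaled dot-product attention for one head (row-vector convention).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = np.random.randn(n, d_model)

# One set of QKV projection matrices per head.
heads = []
for _ in range(n_heads):
    W_q = np.random.randn(d_model, d_head)
    W_k = np.random.randn(d_model, d_head)
    W_v = np.random.randn(d_model, d_head)
    heads.append(attention(X, W_q, W_k, W_v))

# Concatenate head outputs back to d_model dimensions.
output = np.concatenate(heads, axis=-1)  # shape: (n, d_model)
```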