Lecture 10 - LLMs Flashcards

1
Q

What is a language model?

A

A model (DNN or otherwise) that computes the probability P(w|context), where:

  • “Context” typically refers to the previous words.
  • A more general definition is P(symbol | context).
2
Q

What does the probability P(w|context) represent?

A

It represents the probability of a string, given the previous or surrounding strings.

3
Q

What are neural language models?

A

They compute P(w|context) using a neural network that predicts words based on the input context.

4
Q

Why train a language model?

A
  1. (sometimes) for word or string prediction.
  2. (usually) as a pre-training task: Training to predict words allows the model to learn general patterns and the structure of language (transfer of learning).
5
Q

What did Ada Lovelace recognize about machines and prediction?

A

Ada Lovelace, considered the first computer programmer, recognized that machines could go beyond calculations to generalized problem solving.

6
Q

What are the benefits of learning from prediction?

A
  1. Prediction is challenging and invites learning at many levels.
  2. Prediction enables training on near limitless amounts of data.
7
Q

What are embeddings in GPT language models?

A

Embeddings are numerical representations of words, symbols, or other data represented as vectors that capture meaning or relationships between words.

8
Q

Why are embeddings important in language models?

A

Embeddings:

  1. Contain a rich internal structure that, despite their fuzziness, enables their success.
  2. Use distributed representation: Information is spread across many dimensions.
  3. Can have an arbitrary number of dimensions (hundreds or thousands).
  4. Exhibit graded relationships: Words with similar meanings have similar embeddings.
9
Q

What are the six steps in the general setup of ANN language modeling?

A
  1. Input: The model receives an input sentence (e.g., “The students opened their MacBooks”).
  2. Tokens: Words are represented as numerical IDs.
  3. Embedding: Each token is converted into an embedding vector.
  4. Model: The DNN learns patterns in language to predict the most likely next word.
  5. Output: The model generates probabilities over the vocabulary for each token position.
  6. Target: The model compares predictions with the correct token IDs and minimizes the error (see the code sketch below).
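
These six steps can be made concrete with a small PyTorch sketch. Everything here is illustrative: the token IDs, the sizes, and the single TransformerEncoderLayer standing in for the full DNN are made-up placeholders; only the overall flow (tokens → embeddings → model → output probabilities → loss against targets) mirrors the list above.

```python
import torch
import torch.nn as nn

# Made-up sizes: a 10,000-word vocabulary and 64-dimensional embeddings.
vocab_size, d_model = 10_000, 64

# 1-2. Input sentence as token IDs (the numbers are arbitrary stand-ins).
token_ids = torch.tensor([[17, 942, 3051, 88]])            # "The students opened their"

# 3. Embedding: each token ID becomes a d_model-dimensional vector.
embed = nn.Embedding(vocab_size, d_model)
x = embed(token_ids)                                        # (batch, seq_len, d_model)

# 4. Model: a stand-in Transformer layer learns patterns over the sequence.
body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
h = body(x)

# 5. Output: probabilities over the vocabulary at each position.
to_vocab = nn.Linear(d_model, vocab_size)
logits = to_vocab(h)
probs = logits.softmax(dim=-1)                              # (batch, seq_len, vocab_size)

# 6. Target: the correct next-token IDs; cross-entropy measures the error.
targets = torch.tensor([[942, 3051, 88, 4095]])
loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), targets.flatten())
```
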
10
Q

What role do embeddings play in deep neural networks (DNNs)?

A

Embeddings are the input representations that allow DNNs to learn patterns and relationships in language, enabling the model to predict the next word or perform other language tasks.

11
Q

What is the general setup for ANN language modeling?

A

From a sequence of words to a sequence of probability distributions p(w | context).

12
Q

How does ANN language modeling generate predictions and calculate loss?

A
  1. Input: The context (e.g., “the students opened their”).
  2. Prediction vector p(w | context): The model generates a vector of probabilities over the vocabulary.
  3. Target vector: A one-hot vector over the vocabulary, with a 1 at the position of the correct word.
  4. Loss: Negative log-likelihood, loss = −log p(w | context).
  • If the model assigns a high probability to the correct word, the loss is small.
  • If it assigns a low probability, the loss is large (see the worked example below).
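
A toy worked example of the loss, with an invented 5-word vocabulary and made-up probabilities:

```python
import math

# Made-up probability vector over a tiny 5-word vocabulary for the context
# "the students opened their".
vocab  = ["books", "laptops", "minds", "doors", "exams"]
p      = [0.55,     0.25,      0.10,    0.07,    0.03]
target = "books"                                 # one-hot target: 1 at the position of "books"

loss = -math.log(p[vocab.index(target)])
print(round(loss, 3))             # 0.598 -> high probability, small loss
print(round(-math.log(0.03), 3))  # 3.507 -> low probability, large loss
```
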
13
Q

What is negative log-likelihood loss in ANN models?

A
  • A loss function used to quantify prediction error
  • loss = −log p(w | context)
  • It penalizes low probabilities assigned to the correct word.
14
Q

What is the Transformer architecture, and what does it replace?

A

The Transformer is a neural network architecture introduced to replace traditional models like RNNs and LSTMs for sequence tasks. It relies heavily on self-attention mechanisms.

15
Q

What are the two main components of the Transformer architecture?

A
  1. encoder
  2. decoder
16
Q

How do modern Transformers (e.g., GPT) simplify the original structure?

A

Modern Transformers simplify the structure by focusing on either the encoder or the decoder (e.g., GPT uses primarily the decoder).

17
Q

What is self-attention in neural networks?

A

Self-attention computes interactions between all inputs and produces outputs as a weighted sum of the inputs.

In this basic form, the operation is deterministic (no learnable parameters).

18
Q

What are the learnable parameters in self-attention?

A

The learnable parameters are the linear transformations that produce Q, K, and V. Each input vector x is used in three ways:

  1. Query (Q): To compute the attention weights w for its own output.
  2. Key (K): To compute the attention weights w for the outputs of the other vectors.
  3. Value (V): As input to the weighted sum.
19
Q

How are Queries, Keys, and Values (Q, K, V) derived in self-attention?

A

Q, K, and V are obtained by applying three different learned linear transformations to the input vectors x.

20
Q

What is the matrix format of self-attention?

A
  1. The input X is transformed into Queries (Q), Keys (K), and Values (V) via learned linear transformations.
  2. Attention weights are computed as the softmax of K^T Q / sqrt(d_k).
  3. Final output: Y = V * softmax(K^T Q / sqrt(d_k)), i.e., a weighted sum of the value vectors (see the sketch below).
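
A minimal NumPy sketch of this matrix form, using the same column convention as the card (each column of X is one input vector); the sizes and random weight matrices are placeholders:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention, column convention:
    each column of X is the embedding of one input token."""
    Q, K, V = Wq @ X, Wk @ X, Wv @ X            # learned linear transformations
    d_k = Q.shape[0]
    scores = K.T @ Q / np.sqrt(d_k)             # (n_tokens, n_tokens) attention scores
    A = softmax(scores, axis=0)                 # weights: each column sums to 1
    return V @ A                                # output = weighted sum of the values

# Toy example: 5 tokens, embedding dimension 8, random placeholder weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)               # shape (8, 5): one output per token
```
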
21
Q

Why is self-attention highly parallelizable?

A

All words (vectors) are processed in one sweep, enabling parallel computations.

22
Q

What are the key properties of self-attention?

A
  1. Fully parallel: Matrix operations allow all computations at once.
  2. No problem looking far back: Captures relationships far back in the input.
  3. Relies on input embeddings: Outputs depend heavily on input vectors.
23
Q

What is the problem with self-attention for language models (LMs)?

A
  1. Insensitive to order: Self-attention is permutation invariant, so input order doesn’t matter.
  2. All vectors see all others: Without constraints, inputs can see future words, which is problematic for causal tasks such as next-word prediction.
24
Q

How can self-attention be made position sensitive?

A

By adding position embeddings to input vectors.

At the first layer, x becomes the sum of two embedding vectors:

  1. Embedding of the word (content).
  2. Embedding of the position (place in the sequence), as sketched below.
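
A minimal sketch of this combination (the sizes and token IDs are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
vocab_size, max_len, d_model = 10_000, 512, 64
word_emb = nn.Embedding(vocab_size, d_model)   # content: one vector per vocabulary item
pos_emb  = nn.Embedding(max_len, d_model)      # position: one vector per sequence position

token_ids = torch.tensor([[17, 942, 3051, 88]])
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # tensor([[0, 1, 2, 3]])
x = word_emb(token_ids) + pos_emb(positions)                # first-layer input: content + position
```
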
25
Q

How is self-attention made uni-directional?

A
  • By using masked (causal) self-attention.
  • A mask blocks attention to future words by zeroing out the attention weights for those positions (see the sketch below).
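
A sketch of the masking step, under the same column convention as the attention example in card 20: scores for future positions are set to −∞ before the softmax, which makes their attention weights exactly zero.

```python
import numpy as np

def causal_attention_weights(scores):
    """Column-wise softmax over attention scores with the future blocked.
    Convention as in the attention sketch above: scores[i, j] compares
    key i with query j, so "future" means key index i > query index j."""
    n = scores.shape[0]
    future = np.tril(np.ones((n, n), dtype=bool), k=-1)  # True where i > j
    scores = scores.copy()
    scores[future] = -np.inf              # exp(-inf) = 0, so these weights vanish
    scores -= scores.max(axis=0, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)              # each column sums to 1
```

Using this in place of the plain softmax in the earlier sketch yields uni-directional (causal) self-attention.
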
26
Q

Why is masked self-attention important for causal tasks?

A

It ensures that each word can only see words in the past, preserving the sequential nature of language.

27
Q

How is self-attention made expressive?

A

Multi-headed self-attention makes self-attention more expressive by using different QKV matrices for each head.

28
Q

How are the outputs of multi-headed self-attention combined?

A

The outputs of all attention heads are concatenated and then transformed to produce the final self-attention output.
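
A sketch of this combination step, reusing the self_attention helper from the card-20 sketch; the per-head weight matrices and the output matrix Wo are assumed placeholders:

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    """Sketch of multi-head self-attention (column convention, reusing the
    self_attention helper from the earlier card). `heads` is a list of
    (Wq, Wk, Wv) triples, one set of learned projections per head; Wo is the
    final linear transformation applied to the concatenated head outputs."""
    head_outputs = [self_attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    concat = np.concatenate(head_outputs, axis=0)   # stack along the feature dimension
    return Wo @ concat                              # mix heads back to the model dimension
```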

29
Q

What roles can different attention heads play?

A

Different heads can capture different types of relationships, including long-range dependencies.

30
Q

What is the Transformer block?

A

The Transformer block is a modular design that efficiently handles large models. It combines:

  1. Self-attention for contextual relationships.
  2. Normalization layers for stability.
  3. Residual connections to preserve information flow.
31
Q

What are the steps in the Transformer block?

A
  1. Input.
  2. Multi-head self-attention (with skip connection).
  3. Normalization layer.
  4. Fully connected neural network (with skip connection).
  5. Normalization layer.
  6. Output (see the sketch below).
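
A minimal sketch of these steps as a forward pass; the attention and feed-forward parts are passed in as placeholder callables, and the layer norm omits the learned scale and shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each column (token vector) to zero mean and unit variance
    (learned scale/shift omitted for brevity)."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)

def transformer_block(X, attention, ffn):
    """Steps 1-6 above: attention + skip connection, normalize,
    feed-forward + skip connection, normalize."""
    h = layer_norm(X + attention(X))   # 2-3: multi-head self-attention with residual, then norm
    return layer_norm(h + ffn(h))      # 4-5: fully connected network with residual, then norm
```
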
32
Q

What is the purpose of residual connections in the Transformer block?

A

Residual connections add the input back into the output of specific layers, ensuring that the original information is retained.

33
Q

Why are position embeddings added to word embeddings in GPT models?

A

Transformers are insensitive to order (permutation invariant), so position embeddings encode the order of words to generate coherent and sequential text.

34
Q

What do word and position embeddings represent in GPT models?

A

Word embeddings represent the content of each word, while position embeddings encode their order.

35
Q

What are the pros and cons of using words for embeddings?

A

Pros:

  1. Linguistically meaningful.
  2. Reduces input size (fewer tokens).

Cons:

  1. Results in large vocabulary.
  2. Cannot handle new words or misspellings (out-of-vocabulary errors).
36
Q

What are the pros and cons of character-based embeddings?

A

Pros:

  1. Can handle any string (no out-of-vocab errors).
  2. Requires a small vocabulary (e.g., a-z, punctuation).

Cons:

  1. Results in many x-vectors (quadratic attention cost).
  2. Not linguistically meaningful.
37
Q

What is the solution to balance word and character-based embeddings?

A

Use sub-word tokenization (e.g., byte pair encoding):

Words are split into smaller units (subwords) if necessary.
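
An illustration, assuming the Hugging Face transformers package and the pretrained GPT-2 byte-pair-encoding tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # GPT-2's byte-pair-encoding tokenizer
print(tok.tokenize("The students opened their MacBooks"))
# Common words stay whole; a rarer word like "MacBooks" is split into
# sub-word pieces (the exact split depends on the learned BPE merges).
```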

38
Q

What are the pros of sub-word tokenization?

A
  1. Operates mostly at the word level.
  2. Can handle new words (never out-of-vocab).
  3. Provides full control over vocabulary size.
39
Q

Why can’t large LLMs see letters within words?

A

LLMs split text into tokens rather than individual characters, so they cannot see letters within words.

40
Q

Why is tokenization important when evaluating LLM abilities?

A

LLM abilities are fundamentally tied to how input text is tokenized and understood as discrete units.

41
Q

How are GPT predictions refined over layers?

A

GPT predictions start rough at lower layers and are progressively refined through each layer of the transformer.

42
Q

Why are Transformers not cognitively plausible compared to humans?

A
  1. They are feedforward models and process thousands of words at once.
  2. They have perfect working memory and do not need to summarize or abstract information like humans do.
  • i.e., how they learn
43
Q

What makes Transformers cognitively plausible in some aspects?

A

They can learn complex relationships and achieve (super-)human performance on language tasks.

  • i.e., what they learn
44
Q

What is a masked language model (MLM), and what is an example?

A

A masked language model predicts a missing/masked word rather than the next word. Examples: BERT, RoBERTa.

45
Q

What are the pros and cons of masked language models (MLMs)?

A

Pro: Can learn bi-directional relations (context from both sides of a word).

Con: Cannot perform generation, mainly used for sentence embeddings or specialized tasks.

46
Q

What are the limitations of pure GPT models (e.g., GPT-3)?

A
  • Ignore intended instructions.
  • Mimic similar patterns from training data.
  • Often hallucinate responses unrelated to the task.
47
Q

How were pure GPT models adapted into chat-based LLMs like ChatGPT?

A

Three techniques were applied after training the base completion model:

  1. Instruction tuning: Fine-tune on instruction-following examples.
  2. RLHF/DPO: Fine-tune on human preferences (ranked outputs).
  3. Rule-based reward modeling: Add rules to make outputs more helpful.
48
Q

Does post-training (instruction tuning, RLHF) teach ChatGPT new knowledge?

A

No, these techniques make the model more helpful, but they do not teach it new knowledge.

49
Q

How are LLMs used to study language processing in cognitive science?

A
  1. Internal representations: Contain linguistic information, abstractions, and representations.
  2. Model predictions: Contain word predictabilities and linguistic expectations.
50
Q

What is the key idea behind the brain as a prediction machine?

A

The brain compares incoming signals to top-down predictions.

Evidence: predictable stimuli lead to weaker neural responses.

51
Q

How can LLM probabilities help study prediction in cognitive science?

A
  1. LLMs as Predictors: LLM output probabilities can serve as a quantitative measure of predictability.
  2. Link to Human Cognition: Brain responses to predictable/unpredictable words align with LLM prediction probabilities.
  3. LLMs can act as computational models for studying predictive processes in human brains.
52
Q

What effect does word predictability have on reading?

A

Less predictable words require longer fixation times, indicating greater cognitive processing.

53
Q

What is the relationship between humans and LLMs regarding word surprisal?

A
  1. Humans are sensitive to surprisal, the negative log probability of a word (i.e., they are logarithmically sensitive to word probability).
  2. This sensitivity aligns with LLMs like GPT, which track subtle log-probability fluctuations of words (see the sketch below).
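
A hedged sketch of how per-word surprisal (−log probability) could be read off a small causal LM, assuming the Hugging Face transformers package and GPT-2:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The students opened their laptops", return_tensors="pt").input_ids
with torch.no_grad():
    log_probs = model(ids).logits.log_softmax(dim=-1)    # (1, seq_len, vocab)

# Token t is predicted from position t-1; surprisal = -log p(token | context).
surprisal = -log_probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(1)
for token, s in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal):
    print(f"{token:>12}  {s.item():6.2f} nats")
```
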
54
Q

What are the two levels of hierarchical prediction in language comprehension?

A
  1. Context-level prediction: Predict what word might come next based on context.
  2. Lexical-level prediction: Narrow down possibilities to predict specific words.
55
Q

What is the hierarchical nature of human predictions in language?

A

Predictions occur continuously and hierarchically:

They are not isolated to individual words but happen across multiple levels of meaning (context, lexical).

56
Q

What internal representations do LLMs generate during processing?

A

LLMs generate representations of:

  1. Linguistic information.
  2. Abstractions.
  3. Relationships between words.
57
Q

How can LLMs be used to study brain responses to language?

A

LLMs can be used to:

  • Encode and decode brain responses to language.
  • i.e., relate LLM activations (vector representations) to BOLD signals from fMRI (see the sketch below).
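
A minimal sketch of such an encoding analysis, with random stand-in data and scikit-learn's ridge regression; real studies add cross-validation, hemodynamic-lag modelling, and per-voxel significance testing:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in data: per-word LLM hidden states and per-word BOLD responses.
n_words, d_model, n_voxels = 1000, 768, 200
rng = np.random.default_rng(0)
llm_activations = rng.normal(size=(n_words, d_model))
bold            = rng.normal(size=(n_words, n_voxels))

train, test = slice(0, 800), slice(800, None)
encoder = Ridge(alpha=10.0).fit(llm_activations[train], bold[train])
pred = encoder.predict(llm_activations[test])

# "Brain predictivity": correlation between predicted and observed BOLD per voxel.
r = [np.corrcoef(pred[:, v], bold[test, v])[0, 1] for v in range(n_voxels)]
print(np.mean(r))   # ~0 here, since the stand-in data are pure noise
```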
58
Q

How do LSTM, transformer activations and word embeddings compare in explaining brain activity?

A
  • LSTM activations outperform basic word embeddings in some brain regions.
  • Transformer activations likewise outperform basic word embeddings in some brain regions.
59
Q

What is the relationship between language models (LMs) and brain predictivity?

A

The better a language model predicts the next word, the better it predicts brain activity.

60
Q

How do different layers of a language model predict brain activity?

A

Different layers predict brain activity in distinct regions of the brain during language processing.

61
Q

What is an application example of decoding meaning using brain patterns?

A
  1. Encoding Model: Maps words (e.g., “It was a sunny day”) to brain responses recorded via fMRI.
  2. Decoding Model: Reconstructs words from these brain activity patterns.
62
Q

How does decoding work for language models and brain responses?

A
  1. A language model proposes potential word continuations: p(words).
  2. An encoding model assigns a likelihood to each candidate: p(BOLD | words).
  3. Brain responses (the BOLD signal) determine the best match between predictions and observed activity (see the sketch below).
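
A schematic sketch of one decoding step; propose and encode are hypothetical stand-ins for the language model and the fitted encoding model, and the Gaussian-noise form of p(BOLD | words) is a simplifying assumption:

```python
import numpy as np

def decode_step(context, observed_bold, propose, encode):
    """One step of the decoding loop sketched above.
    propose(context) -> [(word, log p(word | context)), ...]  from a language model
    encode(text)     -> predicted BOLD pattern for that text   from a fitted encoding model"""
    best_word, best_score = None, -np.inf
    for word, lm_logp in propose(context):
        predicted = encode(context + " " + word)
        bold_logp = -np.sum((observed_bold - predicted) ** 2)  # Gaussian log-likelihood, up to a constant
        score = lm_logp + bold_logp                            # combine LM prior and brain likelihood
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```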
63
Q

What kind of stimuli can be decoded into meaningful text using LLM-based models?

A

Stimuli such as text or visual activity can be decoded into meaningful textual descriptions.

64
Q

What is a key limitation of embedding-based brain predictions?

A

While predictions may give a numerical match between model outputs and brain responses, this match does not inherently provide insight into the brain’s workings.

65
Q

How can embedding-based brain prediction still be useful?

A

Brain prediction can be used as a tool, but it does not provide understanding on its own.

66
Q

How can researchers dissociate syntax and semantics in brain data?

A

By using brain predictions, researchers can disentangle syntax and semantics, showing that these components are processed differently in the brain and the model.

67
Q

What makes LLMs powerful tools for studying language processing in humans?

A

LLMs can be used in two ways:

  1. Using LLM probabilities (more interpretable and constrained).
  2. Using LLM representations (powerful but harder to interpret).
68
Q

What insight does existing research provide about language processing?

A
  1. Human language processing is predictive: Humans are sensitive to the negative log probability of words, the same signal used by LLMs.
  2. LLM representations can predict and decode brain responses to language very accurately but require additional analyses for deeper insight.