Lecture 10 - LLMs Flashcards
What is a language model?
A model (DNN or otherwise) that computes the probability P(w|context), where:
- “Context” typically refers to the previous words.
- A more general definition is P(symbol|context).
What does the probability P(w|context) represent?
It represents the probability of a string, given the previous or surrounding strings.
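For reference, this is what lets a language model score whole sequences: by the chain rule, the probability of a sentence factors into next-word probabilities (a standard identity, not specific to any particular model):

```latex
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
```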
What are neural language models?
They compute P(w|context) using a neural network to predict words based on the input context.
Why train a language model?
- (sometimes) for word or string prediction
- (usually) as a pre-training task: training to predict words lets the model learn general patterns and the structure of language (transfer learning).
What did Ada Lovelace recognize about machines and prediction?
Ada Lovelace, considered the first computer programmer, recognized that machines could go beyond calculations to generalized problem solving.
What are the benefits of learning from prediction?
- Prediction is challenging and invites learning at many levels.
- Prediction enables training on near limitless amounts of data.
What are embeddings in GPT language models?
Embeddings are numerical representations of words, symbols, or other data: vectors that capture meaning and relationships between words.
Why are embeddings important in language models?
Embeddings:
- Contain a rich internal structure, despite their fuzziness, that enables the model's success.
- Use distributed representation: information is spread across many dimensions.
- Can have arbitrary dimensionality (typically hundreds or thousands of dimensions).
- Exhibit graded relationships: words with similar meanings have similar embeddings.
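A minimal sketch of the "graded relationships" idea, using made-up 3-dimensional toy vectors (real embeddings have hundreds or thousands of dimensions, and these particular numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d embeddings (invented for illustration only).
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.1, 0.9, 0.3]),
}

print(cosine_similarity(emb["cat"], emb["dog"]))  # high: similar meanings
print(cosine_similarity(emb["cat"], emb["car"]))  # lower: different meanings
```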
What are the six steps in the general setup of ANN language modeling?
- Input: The model receives an input sentence (e.g., “The students opened their MacBooks”).
- Tokens: Words are represented as numerical IDs.
- Embedding: Each token is converted into an embedding vector.
- Model: The DNN learns patterns in language to predict the most likely next word.
- Output: The model generates probabilities over the vocabulary for each token position.
- Target: The model compares predictions with the correct token IDs and minimizes error.
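A minimal end-to-end sketch of these six steps in PyTorch (toy sizes and untrained weights; names like vocab_size and d_model, and the single linear layer standing in for the DNN, are illustrative assumptions, not the lecture's model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 32           # toy sizes (assumed)

# Input -> Tokens: words represented as numerical IDs.
token_ids = torch.tensor([[5, 42, 7, 19]])

# Embedding: each token ID becomes a d_model-dimensional vector.
embed = nn.Embedding(vocab_size, d_model)
x = embed(token_ids)                     # shape: (1, 4, d_model)

# Model: stand-in for the DNN (a single linear layer here, for brevity).
model = nn.Linear(d_model, vocab_size)
logits = model(x)                        # shape: (1, 4, vocab_size)

# Output: probabilities over the vocabulary at each token position.
probs = F.softmax(logits, dim=-1)

# Target: correct next-token IDs; training minimizes the prediction error.
targets = torch.tensor([[42, 7, 19, 88]])
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```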
What role do embeddings play in deep neural networks (DNNs)?
Embeddings are the input representations that allow DNNs to learn patterns and relationships in language, enabling the model to predict the next word or perform other language tasks.
What is the general setup for ANN language modeling?
A sequence of words is mapped to a sequence of probability distributions p(w|context), one per token position.
How does ANN language modeling generate predictions and calculate loss?
- Input: The context (e.g., “the students opened their”).
- Prediction vector p(w∣context): The model generates a vector of probabilities over the vocabulary.
- Target vector: a one-hot vector over the vocabulary, where a 1 marks the correct word's position.
- Loss: negative log-likelihood, loss = −log p(w|context).
- If the model assigns the correct word high probability, the loss is small; if it assigns low probability, the loss is large.
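A quick worked instance of this loss (the two probabilities are invented for illustration):

```python
import math

# If the model assigns the correct word probability 0.9, the loss is small:
print(-math.log(0.9))   # ~0.105

# If it assigns the correct word probability 0.01, the loss is large:
print(-math.log(0.01))  # ~4.605
```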
What is negative log-likelihood loss in ANN models?
- A loss function used to quantify prediction error.
- loss = −log p(w|context)
- It penalizes low probabilities assigned to the correct word.
What is the Transformer architecture, and what does it replace?
The Transformer is a neural network architecture introduced to replace traditional models like RNNs and LSTMs for sequence tasks. It relies heavily on self-attention mechanisms.
What are the two main components of the Transformer architecture?
- encoder
- decoder
How do modern Transformers (e.g., GPT) simplify the original structure?
Modern Transformers simplify the structure by focusing on either the encoder or the decoder (e.g., GPT uses primarily the decoder).
What is self-attention in neural networks?
Self-attention computes interactions between all inputs and produces outputs as a weighted sum of the inputs.
In its basic form, this is deterministic (no learnable parameters).
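A minimal NumPy sketch of this parameter-free form, where the attention weights come straight from dot products between the inputs (toy shapes, assumed for illustration):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax: turns scores into weights summing to 1.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.randn(4, 8)   # 4 input vectors, dimension 8

scores = X @ X.T            # interactions between all pairs of inputs
weights = softmax(scores)   # one weight distribution per input
output = weights @ X        # each output is a weighted sum of the inputs
```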
What are the learnable parameters in self-attention?
The three linear transformations (weight matrices) that produce queries, keys, and values. Each input vector x is used in three ways:
- Query (Q): to compute attention weights for its own output.
- Key (K): to compute attention weights for the other vectors' outputs.
- Value (V): as input to the weighted sum.
How are Queries, Keys, and Values (Q, K, V) derived in self-attention?
Q, K, and V are obtained by applying three different learned linear transformations to the input vectors x; these transformations are the learnable parameters (see the sketch after the next card).
What is the matrix format of self-attention?
- Input X is transformed into Queries (Q), Keys (K), and Values (V).
- The attention mechanism computes a weighted sum using the softmax of KᵀQ, scaled by √d_k.
- Final output: Attention(Q, K, V) = V · softmax(KᵀQ / √d_k).
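A minimal NumPy sketch of this computation (toy sizes; written in the row-vector convention, where the same formula reads softmax(QKᵀ / √d_k) · V):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k = 8, 8
X = np.random.randn(4, d_model)             # 4 input vectors (rows)

# The learnable parameters: three projection matrices.
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention
output = weights @ V                        # weighted sum of the values
```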
Why is self-attention highly parallelizable?
All words (vectors) are processed in one sweep, enabling parallel computations.
What are the key properties of self-attention?
- Fully parallel: Matrix operations allow all computations at once.
- No problem looking far back: captures long-range relationships anywhere in the input.
- Relies on input embeddings: Outputs depend heavily on input vectors.
What is the problem with self-attention for language models (LMs)?
- Insensitive to order: Self-attention is permutation invariant, so input order doesn’t matter.
- All vectors see all others: without constraints, inputs can attend to future words, which is problematic for causal tasks.
How can self-attention be made position sensitive?
By adding position embeddings to input vectors.
At the first layer, x becomes the combination (sum) of two vectors:
- Embedding of word (content).
- Embedding of position (position in the sequence).
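A minimal sketch of this combination as element-wise addition, the common choice in GPT-style models (sizes and token IDs are toy assumptions):

```python
import numpy as np

vocab_size, max_len, d_model = 1000, 64, 8       # toy sizes (assumed)

word_emb = np.random.randn(vocab_size, d_model)  # learned content embeddings
pos_emb = np.random.randn(max_len, d_model)      # learned position embeddings

token_ids = [5, 42, 7, 19]                       # e.g. "the students opened their"
# Input to the first layer: word embedding + position embedding, per token.
x = np.stack([word_emb[t] + pos_emb[i] for i, t in enumerate(token_ids)])
```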
How is self-attention made uni-directional?
- By using masked self-attention:
- A mask blocks attention to future words: attention scores for those positions are set to −∞ before the softmax, so their weights become zero.
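A minimal NumPy sketch of causal masking applied to the attention scores (toy sizes; random scores stand in for KᵀQ):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.randn(n, n)    # raw attention scores (toy values)

# Causal mask: -inf above the diagonal blocks attention to future positions.
mask = np.triu(np.full((n, n), -np.inf), k=1)
weights = softmax(scores + mask)  # masked positions get weight 0

print(np.round(weights, 2))       # lower-triangular weight matrix
```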
Why is masked self-attention important for causal tasks?
It ensures that each word can only see words in the past, preserving the sequential nature of language.
How is self-attention made expressive?
Multi-headed self-attention makes self-attention more expressive by using different QKV matrices for each head.
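A minimal NumPy sketch of multi-head attention (toy sizes; the head outputs are concatenated, and the original Transformer additionally applies a final output projection, omitted here for brevity):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # Scaled dot-product attention for one head (row-vector convention).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = np.random.randn(n, d_model)

# One set of QKV projection matrices per head.
heads = []
for _ in range(n_heads):
    W_q = np.random.randn(d_model, d_head)
    W_k = np.random.randn(d_model, d_head)
    W_v = np.random.randn(d_model, d_head)
    heads.append(attention(X, W_q, W_k, W_v))

# Concatenate head outputs back to d_model dimensions.
output = np.concatenate(heads, axis=-1)  # shape: (n, d_model)
```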