lecture 3 - reinforcement learning Flashcards

1
Q

value-based decision making

A

used because many decisions are about subjective stimulus values (preferences) rather than the objective stimulus properties modeled before (DDM, SDT)

2
Q

reinforcement learning in AI

A

how do agents learn to behave in an environment

3
Q

reinforcement learning in psychology

A

how do humans learn from rewards

4
Q

basic mechanisms of interest for RL in neural models for cognitive processes

A
  1. decision-making
  2. learning
5
Q

classical conditioning

A
  • a process where a new stimulus-response connection is formed through association: a previously neutral stimulus becomes associated with an unconditioned stimulus and comes to elicit a response
  • conditioning happens even without any action from the agent
6
Q

key stages of classical conditioning

A
  1. Before Conditioning: Unconditioned stimulus (US) elicits an unconditioned response (UR). Neutral stimulus (NS) produces no response.
  2. During Conditioning: Neutral stimulus (NS) is paired with US, leading to UR.
  3. After Conditioning: The NS becomes a conditioned stimulus (CS), eliciting a conditioned response (CR).
7
Q

acquisition

A

the process by which a neutral stimulus gains associative value, leading to a learned response

8
Q

extinction

A

the learned association can weaken and eventually disappear if the conditioned stimulus is not reinforced by the unconditioned stimulus

9
Q

Kamin blocking

A

a previously learned association prevents the formation of a new association with a second stimulus

10
Q

rescorla-wagner model

A

The model aims to predict rewards or punishments by learning from the prediction error (δ), which is the difference between the expected reward and the actual reward.

11
Q

rescorla-wagner model: formula

A

prediction error (δ) = actual reward - predicted reward
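A worked example with made-up numbers: if the predicted reward is 0.6 and the actual reward is 1, then δ = 1 − 0.6 = 0.4, a positive prediction error that strengthens the association; if the reward is omitted, δ = 0 − 0.6 = −0.6, a negative prediction error that weakens it. In the Rescorla-Wagner notation of the cards below, δ = λ − ΣV.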

12
Q

delta-rule

A

how much learning occurs on each trial, based on:

  1. prediction error
  2. learning rate
  3. stimulus salience
13
Q

delta-rule: formula

A

ΔV=αβ(λ−ΣV)

14
Q

name the components: ΔV=αβ(λ−ΣV)

A

ΔV: Amount of learning on a given trial.
α: Learning rate.
β: Salience of the stimulus.
λ: Asymptote of learning (maximum value).
ΣV: Total amount learned so far (expectation).

(λ−ΣV): prediction error (δ)
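
A minimal Python sketch of this delta rule, simulating acquisition over trials; the parameter values below are illustrative assumptions, not values from the lecture.

# Delta-rule (Rescorla-Wagner) acquisition sketch; all parameter values are illustrative.
alpha, beta = 0.1, 1.0   # learning rate and stimulus salience
lam = 1.0                # lambda: asymptote of learning (value of the feedback)
V = 0.0                  # sum V: total amount learned so far (expectation)

for trial in range(20):
    delta = lam - V                 # prediction error: lambda - sum V
    V = V + alpha * beta * delta    # delta V = alpha * beta * (lambda - sum V)
    print(trial + 1, round(V, 3))   # V rises steeply at first, then levels off near lambda

As V approaches lam, the prediction error shrinks, so the updates get smaller and learning levels off.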

15
Q

delta rule: prediction error

A
  • difference between the value of the feedback (λ) and the current expectation (ΣV) based on prior experience
  • when the prediction error is large, learning occurs more rapidly

if δ is large, learning occurs more rapidly on that trial than when, e.g., δ = 0, which indicates a perfect prediction and therefore no needed adjustment

16
Q

delta rule: value of the learning rate

A
  • determines the steepness of the learning curve
  1. higher α leads to faster increases in association strength.
  2. lower α results in slower learning over trials.
17
Q

delta rule: λ

A
  • asymptote of learning (the maximum associative value the CS can acquire, set by the feedback)
  • larger λ leads to a steeper curve, as the initial prediction error is larger
  • extinction happens when λ = 0 (downward curve; see the note below)
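
A usage note for the sketch above (illustrative values again): continuing the same loop after acquisition with lam set to 0 makes every prediction error negative, so V decays back toward zero, which is the downward extinction curve described here.

lam = 0.0                             # the CS is no longer followed by the US
for trial in range(20):
    V = V + alpha * beta * (lam - V)  # negative prediction errors drive V back toward 0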
18
Q

delta rule: What happens as ΣV approaches λ over trials

A

As the sum of learned values (ΣV) approaches the asymptote (λ), the prediction error decreases, learning slows down, and eventually stops when the asymptote is reached.

19
Q

limitation of rescorla-wagner model

A
  • RW model is designed to predict only the immediate reward based on the current conditioned stimulus (CS)
  • doesn’t capture more complicated situations such as higher order conditioning, which involves learning complex associations that extend beyond immediate, single-step predictions
  • limits its ability to model scenarios where rewards are delayed or involve a sequence of predictive cues.
  • we therefore need a model that predicts all possible future rewards, not just the immediate one
20
Q

temporal difference learning

A
  • extends RW model to cover all the time steps in the trial with an eligibility trace
  • we are now not only predicting the immediate reward but also accounting for future rewards
21
Q

eligibility trace

A
  • bookkeeping of all the time steps in the trial
  • projects a value back in time, which means that it can adjust earlier state values based on rewards that occur later in a sequence
  • enables the model to learn from delayed rewards
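
A minimal sketch of tabular temporal difference learning with an accumulating eligibility trace, for a made-up chain of states in which the reward only arrives at the end; the parameters and the task are illustrative assumptions, not the lecture's example.

# TD learning with an eligibility trace on a 5-state chain; everything here is illustrative.
n_states = 5
alpha, gamma, trace_decay = 0.1, 1.0, 0.9
V = [0.0] * n_states                 # value estimate for each state (time step) in the trial
for episode in range(50):
    e = [0.0] * n_states             # eligibility trace: bookkeeping of visited states
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0          # reward is delayed until the last state
        v_next = V[s + 1] if s + 1 < n_states else 0.0
        delta = r + gamma * v_next - V[s]              # temporal difference prediction error
        e[s] += 1.0                                    # mark the current state as eligible
        for i in range(n_states):
            V[i] += alpha * delta * e[i]               # credit propagates back to earlier states
            e[i] *= gamma * trace_decay                # the trace fades with time
print([round(v, 2) for v in V])      # earlier states acquire value despite the delayed reward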
22
Q

How does the eligibility trace improve learning in complex scenarios?

A

It allows the model to adjust earlier state values based on rewards that occur later, making it well-suited for sequential learning and tasks where actions in one state influence future rewards.

23
Q

How is an eligibility trace versatile?

A

It depends not only on time but also on the state of the world, allowing it to adapt to different learning contexts and scenarios.

24
Q

How does Temporal Difference (TD) learning extend the Rescorla-Wagner model?

A

TD learning accounts for rewards occurring at different times in a trial, updating state values incrementally

25
Q

What role does the ventral tegmental area (VTA) play in Temporal Difference (TD) learning?

A

The VTA contains dopaminergic neurons that project to the striatum, playing a key role in reward-based learning by signaling prediction errors.

26
Q

How do VTA neurons respond when there is no prediction of a reward?

A

When there is no prediction, the conditioned stimulus (CS) does not elicit a response. If a reward occurs, dopamine neuron activity increases, signaling a positive prediction error.

27
Q

What happens in VTA neurons when a reward is predicted and occurs?

A

Dopamine activity increases when the CS appears (predicting the reward), but there is no change when the reward itself is received, as it was already expected.

28
Q

What happens when a reward is predicted but does not occur?

A

Dopamine activity increases when the CS appears but decreases (a dip in activity) when the reward fails to occur, signaling a negative prediction error.

29
Q

How does VTA dopamine activity associate with the conditioned stimulus (CS) rather than the reward itself?

A

The dopamine response shifts from the reward to the CS over time, as the CS comes to predict the reward. This demonstrates TD learning’s ability to adjust expectations based on timing and prediction.

30
Q

How does TD learning in the brain relate to the prediction error concept?

A

Dopaminergic neurons in the VTA signal the difference between expected and actual rewards (prediction error), adjusting learning and behavior to align with future expectations.

31
Q

Q-learning

A
  • adds action: the agent now chooses among options
  • models both reward learning and value-based decision-making in humans
  • a type of reinforcement learning algorithm that helps agents learn the value of taking specific actions in given states to maximize rewards over time
32
Q

how does Q-learning relate to temporal difference learning, rescorla wagner, and DDM?

A
  1. Models (temporal) reward learning similar to TD/RW learning
  2. Models decision-making similar to DDM
33
Q

what is Q in Q-learning

A

the learned value of a stimulus (choice option): how much reward is expected from choosing it

34
Q

Q-learning algorithm

A
  1. initialize Q-table
  2. choose an action a
  3. perform action
  4. measure reward
  5. update Q
  • repeat steps 2-5 until training is done
  • over time, the Q-value for each action is updated as the agent learns from rewards received after taking actions
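
A minimal sketch of these steps for a two-armed bandit; the reward probabilities, the epsilon-greedy choice rule, and the learning rate are illustrative assumptions (the lecture's own choice rule is the softmax described in the next cards).

import random

p_reward = [0.8, 0.2]         # made-up reward probability of each option
Q = [0.0, 0.0]                # step 1: initialize the Q-table
alpha, epsilon = 0.1, 0.1     # learning rate and exploration rate (assumed values)

for trial in range(200):
    # step 2: choose an action (epsilon-greedy here, for brevity)
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = 0 if Q[0] >= Q[1] else 1
    # steps 3-4: perform the action and measure the reward
    r = 1.0 if random.random() < p_reward[a] else 0.0
    # step 5: update Q for the chosen action from the reward prediction error
    Q[a] += alpha * (r - Q[a])

print([round(q, 2) for q in Q])   # the Q-values approach the true reward probabilities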
35
Q

ΔQ-value

A
  • (Q_chosen) - (Q_unchosen)
  • if ΔQ is negative, the unchosen stimulus had the higher Q-value.
36
Q

choice function

A
  • determines the probability of selecting a particular option on trial t, based on its Q-value.
  • i.e., decides which option to pick based on the Q-values calculated by the value function
37
Q

choice function: beta parameter

A
  • (inverse-)temperature parameter, which controls the exploration-exploitation balance
  • controls how sensitive the model’s choices are to Q-value differences
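
A minimal sketch of a softmax choice rule for two options, which is one standard way to implement this choice function; the function name and the numbers are illustrative.

import math

def p_choose(q_chosen, q_unchosen, beta):
    # Probability of picking the first option: a sigmoid of the Q-value difference,
    # scaled by beta (how sensitive choices are to that difference).
    return 1.0 / (1.0 + math.exp(-beta * (q_chosen - q_unchosen)))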
38
Q

low beta

A
  • more exploration (random choice)
  • smooth sigmoid curve for x=Qvalue, y=p(correct)
  • fluctuating Q-values since it learns from a wider range of experiences
  • fluctuating RPEs due to random exploratory behavior
39
Q

high beta

A
  • more exploitation (deterministic, highest Q-value)
  • sharp sigmoid curve for x=Qvalue, y=p(correct)
  • smooth Q-value evolution
  • smooth RPE evolution since the model sticks to predictable decisions based on the highest Q-value.
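
Plugging numbers into the sketch above (illustrative values): for Q-values of 0.6 vs 0.4, a low beta such as 1 gives p_choose ≈ 0.55 (close to random, exploratory), while a high beta such as 20 gives p_choose ≈ 0.98 (near-deterministic exploitation).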
40
Q

choice function: trait & choice

A

the exploration-exploitation balance can be influenced by both traits (individual differences in decision-making styles) and choices (specific decisions made during a task).

41
Q

value function

A

updates the “Q-value” for the chosen action based on whether the feedback was positive or negative

  1. If you pick an option and get a reward, the Q-value increases based on a learning rate (α_gain)
  2. If you pick an option and get no reward, the Q-value adjusts downward using a different learning rate (α_loss)
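
A minimal sketch of this value-function update with separate learning rates for positive and negative feedback; the rate values are illustrative assumptions.

def update_q(q, reward, alpha_gain=0.3, alpha_loss=0.1):
    rpe = reward - q                                   # reward prediction error: received - expected
    alpha = alpha_gain if reward > 0 else alpha_loss   # different speeds for gains and losses
    return q + alpha * rpe

q = 0.5
q = update_q(q, reward=1.0)    # rewarded: Q moves up by alpha_gain * RPE
q = update_q(q, reward=0.0)    # not rewarded: Q moves down by alpha_loss * RPE
print(round(q, 3))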
42
Q

2 learning rates to update Q through the value function

A
  1. a_{gain}: speed of learning from positive feedback
  2. a_{loss}: speed of learning from negative feedback
  • i.e., these determine the speed of learning/updating Q-values
43
Q

increasing a

A

Q-values adjust more quickly, leading to larger swings in value estimates

44
Q

decreasing a

A

results in slower updates, producing smoother, more gradual changes in Q-values

45
Q

increasing a_{gain}

A
  • speeds learning up
  • Q-value differences reach asymptote earlier in the trials, showing faster differentiation between choice options
46
Q

decreasing a_{gain}

A
  • leads to slower learning
  • Q-values take longer to reach their asymptotes; the Q-value difference line rises more gradually, indicating slower value differentiation
47
Q

increasing a_{loss}

A
  • makes the model ‘unlearn’ Q-values quickly after a negative outcome
  • the asymptotes in the Q-values go down
  • the Q-value differences line becomes more jagged, as strong updates from negative feedback create more fluctuations in learning
48
Q

decreasing a_{loss}

A

leads to negative feedback having less influence, so Q-values decrease more slowly after non-rewards. this makes Q-value differences smoother and less volatile over time.

49
Q

reward prediction error (RPE)

A
  • difference between expected and received feedback
  • also drives learning
50
Q

What changes when running the Q-learning simulation multiple times with the same parameter settings?

A
  • sequence of trials and specific trial outcomes (e.g., rewards and the precise pattern of reward prediction errors)
  • this randomness reflects the fact that a single parameter setting generates data from a distribution of possible outcomes rather than a fixed dataset.
51
Q

What remains the same when running the Q-learning simulation multiple times with the same parameter settings?

A
  • Q-value differences and evolution of reward prediction errors
  • these consistent patterns confirm that the parameter settings shape the underlying probabilities governing learning and decision-making.
52
Q

DRL: reinforcement learning

A
  • an agent interacts with an environment by taking actions and receiving rewards based on the state of the environment.
  • This approach works well for simple problems with a limited number of states but becomes infeasible as the environment’s complexity grows.
53
Q

DRL: deep learning

A

used for categorization problems, where a (supervised) classifier learns to categorize inputs based on labeled examples.

54
Q

DRL: deep reinforcement learning

A
  • combines deep learning and reinforcement learning to handle complex RL problems.
  • Instead of using a tabular approach, DRL uses deep neural networks to approximate the action-value function or policy, making it scalable to environments with high-dimensional state spaces.
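
A minimal numpy sketch of the core idea: a small neural network, rather than a Q-table, maps a state vector to one Q-value per action and is nudged toward a temporal-difference target. The network size, learning rate, and the random "environment" below are illustrative assumptions, and practical DRL agents add many components omitted here.

import numpy as np

# A tiny Q-network in place of a Q-table; every detail below is an illustrative assumption.
rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 4, 16, 2
W1 = rng.normal(0, 0.1, (n_state, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_actions)); b2 = np.zeros(n_actions)
alpha, gamma, epsilon = 0.01, 0.95, 0.1

def q_values(s):
    h = np.maximum(0.0, s @ W1 + b1)   # hidden layer (ReLU)
    return h, h @ W2 + b2              # one Q-value per action for this state

for step in range(1000):
    s = rng.normal(size=n_state)                     # placeholder state from a made-up environment
    h, q = q_values(s)
    a = int(np.argmax(q)) if rng.random() > epsilon else int(rng.integers(n_actions))
    r = 1.0 if (s[0] > 0) == (a == 0) else 0.0       # made-up reward rule
    _, q_next = q_values(rng.normal(size=n_state))   # next (placeholder) state
    target = r + gamma * np.max(q_next)              # TD target, as in tabular Q-learning
    err = q[a] - target                              # only the chosen action's output is trained
    # one gradient step on the squared TD error (manual backprop through the small network)
    grad_q = np.zeros(n_actions); grad_q[a] = err
    grad_h = (W2[:, a] * err) * (h > 0)
    W2 -= alpha * np.outer(h, grad_q); b2 -= alpha * grad_q
    W1 -= alpha * np.outer(s, grad_h); b1 -= alpha * grad_h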