lecture 3 - reinforcement learning Flashcards

1
Q

value-based decision making

A

used because many decisions are about subjective stimulus values (preferences) rather than the objective stimulus properties modeled before (DDM, SDT)

2
Q

reinforcement learning in AI

A

how do agents learn to behave in an environment

3
Q

reinforcement learning in psychology

A

how do humans learn from rewards

4
Q

basic mechanisms of interest for RL in neural models for cognitive processes

A
  1. decision-making
  2. learning
5
Q

classical conditioning

A
  • a process where a new stimulus-response connection is formed through association: a previously neutral stimulus becomes associated with an unconditioned stimulus and comes to elicit a response
  • conditioning happens even without any action from the agent
6
Q

key stages of classical conditioning

A
  1. Before Conditioning: Unconditioned stimulus (US) elicits an unconditioned response (UR). Neutral stimulus (NS) produces no response.
  2. During Conditioning: Neutral stimulus (NS) is paired with US, leading to UR.
  3. After Conditioning: The NS becomes a conditioned stimulus (CS), eliciting a conditioned response (CR).
7
Q

acquisition

A

the process by which a neutral stimulus gains associative value, leading to a learned response

8
Q

extinction

A

the learned association can weaken and eventually disappear if the conditioned stimulus is not reinforced by the unconditioned stimulus

9
Q

Kamin blocking

A

a previously learned association prevents the formation of a new association with a second stimulus

10
Q

rescorla-wagner model

A

The model aims to predict rewards or punishments by learning from the prediction error (δ), which is the difference between the expected reward and the actual reward.

11
Q

rescorla-wagner model: formula

A

prediction error (δ) = actual reward - predicted reward
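A worked example with made-up numbers: if the predicted reward is 0.6 and the actual reward is 1, then δ = 1 − 0.6 = 0.4, a positive prediction error that strengthens the association; if the reward is omitted, δ = 0 − 0.6 = −0.6, a negative prediction error that weakens it. In the Rescorla-Wagner notation of the cards below, δ = λ − ΣV.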

12
Q

delta-rule

A

how much learning occurs on each trial, based on:

  1. prediction error
  2. learning rate
  3. stimulus salience
13
Q

delta-rule: formula

A

ΔV=αβ(λ−ΣV)

14
Q

name the components: ΔV=αβ(λ−ΣV)

A

ΔV: Amount of learning on a given trial.
α: Learning rate.
β: Salience of the stimulus.
λ: Asymptote of learning (maximum value).
ΣV: Total amount learned so far (expectation).

(λ−ΣV): prediction error (δ)
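
A minimal Python sketch of this delta rule, simulating acquisition over trials; the parameter values below are illustrative assumptions, not values from the lecture.

# Delta-rule (Rescorla-Wagner) acquisition sketch; all parameter values are illustrative.
alpha, beta = 0.1, 1.0   # learning rate and stimulus salience
lam = 1.0                # lambda: asymptote of learning (value of the feedback)
V = 0.0                  # sum V: total amount learned so far (expectation)

for trial in range(20):
    delta = lam - V                 # prediction error: lambda - sum V
    V = V + alpha * beta * delta    # delta V = alpha * beta * (lambda - sum V)
    print(trial + 1, round(V, 3))   # V rises steeply at first, then levels off near lambda

As V approaches lam, the prediction error shrinks, so the updates get smaller and learning levels off.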

15
Q

delta rule: prediction error

A
  • difference between the value of the feedback (λ) and the current expectation (ΣV) based on prior experience
  • when the prediction error is large, learning occurs more rapidly

if δ is large, learning occurs more rapidly on that trial than when, e.g., δ = 0, which indicates a perfect prediction and therefore no needed adjustment

16
Q

delta rule: value of the learning rate

A
  • determines the steepness of the learning curve
  1. higher α leads to faster increases in association strength.
  2. lower α results in slower learning over trials.
17
Q

delta rule: λ

A
  • asymptote of learning (the maximum associative value the CS can acquire, set by the feedback)
  • larger λ leads to a steeper curve, as the initial prediction error is larger
  • extinction happens when λ = 0 (downward curve; see the note below)
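
A usage note for the sketch above (illustrative values again): continuing the same loop after acquisition with lam set to 0 makes every prediction error negative, so V decays back toward zero, which is the downward extinction curve described here.

lam = 0.0                             # the CS is no longer followed by the US
for trial in range(20):
    V = V + alpha * beta * (lam - V)  # negative prediction errors drive V back toward 0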
18
Q

delta rule: What happens as ΣV approaches λ over trials

A

As the sum of learned values (ΣV) approaches the asymptote (λ), the prediction error decreases, learning slows down, and eventually stops when the asymptote is reached.

19
Q

limitation of rescorla-wagner model

A
  • RW model is designed to predict only the immediate reward based on the current conditioned stimulus (CS)
  • doesn’t capture more complicated situations such as higher order conditioning, which involves learning complex associations that extend beyond immediate, single-step predictions
  • limits its ability to model scenarios where rewards are delayed or involve a sequence of predictive cues.
  • we therefore need a model that predicts all possible future rewards, not just the immediate one
20
Q

temporal difference learning

A
  • extends RW model to cover all the time steps in the trial with an eligibility trace
  • we are now not only predicting the immediate reward but also accounting for future rewards
21
Q

eligibility trace

A
  • bookkeeping of all the time steps in the trial
  • projects a value back in time, which means that it can adjust earlier state values based on rewards that occur later in a sequence
  • enables the model to learn from delayed rewards
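
A minimal sketch of tabular temporal difference learning with an accumulating eligibility trace, for a made-up chain of states in which the reward only arrives at the end; the parameters and the task are illustrative assumptions, not the lecture's example.

# TD learning with an eligibility trace on a 5-state chain; everything here is illustrative.
n_states = 5
alpha, gamma, trace_decay = 0.1, 1.0, 0.9
V = [0.0] * n_states                 # value estimate for each state (time step) in the trial
for episode in range(50):
    e = [0.0] * n_states             # eligibility trace: bookkeeping of visited states
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0          # reward is delayed until the last state
        v_next = V[s + 1] if s + 1 < n_states else 0.0
        delta = r + gamma * v_next - V[s]              # temporal difference prediction error
        e[s] += 1.0                                    # mark the current state as eligible
        for i in range(n_states):
            V[i] += alpha * delta * e[i]               # credit propagates back to earlier states
            e[i] *= gamma * trace_decay                # the trace fades with time
print([round(v, 2) for v in V])      # earlier states acquire value despite the delayed reward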
22
Q

How does the eligibility trace improve learning in complex scenarios?

A

It allows the model to adjust earlier state values based on rewards that occur later, making it well-suited for sequential learning and tasks where actions in one state influence future rewards.

23
Q

How is an eligibility trace versatile?

A

It depends not only on time but also on the state of the world, allowing it to adapt to different learning contexts and scenarios.

24
Q

How does Temporal Difference (TD) learning extend the Rescorla-Wagner model?

A

TD learning accounts for rewards occurring at different times in a trial, updating state values incrementally

25
Q

What role does the ventral tegmental area (VTA) play in Temporal Difference (TD) learning?

A

The VTA contains dopaminergic neurons that project to the striatum, playing a key role in reward-based learning by signaling prediction errors.

26
Q

How do VTA neurons respond when there is no prediction of a reward?

A

When there is no prediction, the conditioned stimulus (CS) does not elicit a response. If a reward occurs, dopamine neuron activity increases, signaling a positive prediction error.

27
Q

What happens in VTA neurons when a reward is predicted and occurs?

A

Dopamine activity increases when the CS appears (predicting the reward), but there is no change when the reward itself is received, as it was already expected.

28
Q

What happens when a reward is predicted but does not occur?

A

Dopamine activity increases when the CS appears but decreases (a dip in activity) when the reward fails to occur, signaling a negative prediction error.

29
Q

How does VTA dopamine activity associate with the conditioned stimulus (CS) rather than the reward itself?

A

The dopamine response shifts from the reward to the CS over time, as the CS comes to predict the reward. This demonstrates TD learning’s ability to adjust expectations based on timing and prediction.

30
Q

How does TD learning in the brain relate to the prediction error concept?

A

Dopaminergic neurons in the VTA signal the difference between expected and actual rewards (prediction error), adjusting learning and behavior to align with future expectations.

31
Q

Q-learning

A
  • adds action: the agent now chooses among options
  • models both reward learning and value-based decision-making in humans
  • a type of reinforcement learning algorithm that helps agents learn the value of taking specific actions in given states to maximize rewards over time
32
Q

how does Q-learning relate to temporal difference learning, rescorla wagner, and DDM?

A
  1. Models (temporal) reward learning similar to TD/RW learning
  2. Models decision-making similar to DDM
33
Q

what is Q in Q-learning

A

the learned value of a stimulus (choice option): how much reward is expected from choosing it

34
Q

Q-learning algorithm

A
  1. initialize Q-table
  2. choose an action a
  3. perform action
  4. measure reward
  5. update Q
  • repeat steps 2-5 until training is done
  • over time, the Q-value for each action is updated as the agent learns from rewards received after taking actions
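
A minimal sketch of these steps for a two-armed bandit; the reward probabilities, the epsilon-greedy choice rule, and the learning rate are illustrative assumptions (the lecture's own choice rule is the softmax described in the next cards).

import random

p_reward = [0.8, 0.2]         # made-up reward probability of each option
Q = [0.0, 0.0]                # step 1: initialize the Q-table
alpha, epsilon = 0.1, 0.1     # learning rate and exploration rate (assumed values)

for trial in range(200):
    # step 2: choose an action (epsilon-greedy here, for brevity)
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = 0 if Q[0] >= Q[1] else 1
    # steps 3-4: perform the action and measure the reward
    r = 1.0 if random.random() < p_reward[a] else 0.0
    # step 5: update Q for the chosen action from the reward prediction error
    Q[a] += alpha * (r - Q[a])

print([round(q, 2) for q in Q])   # the Q-values approach the true reward probabilities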
35
Q

ΔQ-value

A
  • (Q_chosen) - (Q_unchosen)
  • if ΔQ is negative, the unchosen stimulus had the higher Q-value.
36
Q

choice function

A
  • determines the probability of selecting a particular option on trial t, based on its Q-value.
  • i.e., decides which option to pick based on the Q-values calculated by the value function
37
Q

choice function: beta parameter

A
  • (inverse-)temperature parameter, which controls the exploration-exploitation balance
  • controls how sensitive the model’s choices are to Q-value differences
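
A minimal sketch of a softmax choice rule for two options, which is one standard way to implement this choice function; the function name and the numbers are illustrative.

import math

def p_choose(q_chosen, q_unchosen, beta):
    # Probability of picking the first option: a sigmoid of the Q-value difference,
    # scaled by beta (how sensitive choices are to that difference).
    return 1.0 / (1.0 + math.exp(-beta * (q_chosen - q_unchosen)))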
38
Q

low beta

A
  • more exploration (random choice)
  • smooth sigmoid curve for x=Qvalue, y=p(correct)
  • fluctuating Q-values since it learns from a wider range of experiences
  • fluctuating RPEs due to random exploratory behavior
39
Q

high beta

A
  • more exploitation (deterministic, highest Q-value)
  • sharp sigmoid curve for x=Qvalue, y=p(correct)
  • smooth Q-value evolution
  • smooth RPE evolution since the model sticks to predictable decisions based on the highest Q-value.
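
Plugging numbers into the sketch above (illustrative values): for Q-values of 0.6 vs 0.4, a low beta such as 1 gives p_choose ≈ 0.55 (close to random, exploratory), while a high beta such as 20 gives p_choose ≈ 0.98 (near-deterministic exploitation).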
40
Q

choice function: trait & choice

A

the exploration-exploitation balance can be influenced by both traits (individual differences in decision-making styles) and choices (specific decisions made during a task).

41
Q

value function

A

updates the “Q-value” for the chosen action based on whether the feedback was positive or negative

  1. If you pick an option and get a reward, the Q-value increases based on a learning rate (α_gain)
  2. If you pick an option and get no reward, the Q-value adjusts downward using a different learning rate (α_loss)
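
A minimal sketch of this value-function update with separate learning rates for positive and negative feedback; the rate values are illustrative assumptions.

def update_q(q, reward, alpha_gain=0.3, alpha_loss=0.1):
    rpe = reward - q                                   # reward prediction error: received - expected
    alpha = alpha_gain if reward > 0 else alpha_loss   # different speeds for gains and losses
    return q + alpha * rpe

q = 0.5
q = update_q(q, reward=1.0)    # rewarded: Q moves up by alpha_gain * RPE
q = update_q(q, reward=0.0)    # not rewarded: Q moves down by alpha_loss * RPE
print(round(q, 3))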
42
Q

2 learning rates to update Q through the value function

A
  1. a_{gain}: speed of learning from positive feedback
  2. a_{loss}: speed of learning from negative feedback
  • i.e., these determine the speed of learning/updating Q-values
43
Q

increasing a

A

Q-values adjust more quickly, leading to larger swings in value estimates

44
Q

decreasing a

A

results in slower updates, producing smoother, more gradual changes in Q-values

45
Q

increasing a_{gain}

A
  • speeds learning up
  • Q-value differences reach asymptote earlier in the trials, showing faster differentiation between choice options
46
Q

decreasing a_{gain}

A
  • leads to slower learning
  • Q-values take longer to reach their asymptotes; the Q-value difference line rises more gradually, indicating slower value differentiation
47
Q

increasing a_{loss}

A
  • makes the model ‘unlearn’ Q-values quickly after a negative outcome
  • the asymptotes in the Q-values go down
  • the Q-value differences line becomes more jagged, as strong updates from negative feedback create more fluctuations in learning
48
Q

decreasing a_{loss}

A

leads to negative feedback having less influence, so Q-values decrease more slowly after non-rewards. this makes Q-value differences smoother and less volatile over time.

49
Q

reward prediction error (RPE)

A
  • difference between expected and received feedback
  • also drives learning
50
Q

What changes when running the Q-learning simulation multiple times with the same parameter settings?

A
  • sequence of trials and specific trial outcomes (e.g., rewards and the precise pattern of reward prediction errors)
  • this randomness reflects the fact that a single parameter setting generates data from a distribution of possible outcomes rather than a fixed dataset.
51
Q

What remains the same when running the Q-learning simulation multiple times with the same parameter settings?

A
  • Q-value differences and evolution of reward prediction errors
  • these consistent patterns confirm that the parameter settings shape the underlying probabilities governing learning and decision-making.
52
Q

DRL: reinforcement learning

A
  • an agent interacts with an environment by taking actions and receiving rewards based on the state of the environment.
  • This approach works well for simple problems with a limited number of states but becomes infeasible as the environment’s complexity grows.
53
Q

DRL: deep learning

A

used for categorization problems, where a (supervised) classifier learns to categorize inputs based on labeled examples.

54
Q

DRL: deep reinforcement learning

A
  • combines deep learning and reinforcement learning to handle complex RL problems.
  • Instead of using a tabular approach, DRL uses deep neural networks to approximate the action-value function or policy, making it scalable to environments with high-dimensional state spaces.
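
A minimal numpy sketch of the core idea: a small neural network, rather than a Q-table, maps a state vector to one Q-value per action and is nudged toward a temporal-difference target. The network size, learning rate, and the random "environment" below are illustrative assumptions, and practical DRL agents add many components omitted here.

import numpy as np

# A tiny Q-network in place of a Q-table; every detail below is an illustrative assumption.
rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 4, 16, 2
W1 = rng.normal(0, 0.1, (n_state, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_actions)); b2 = np.zeros(n_actions)
alpha, gamma, epsilon = 0.01, 0.95, 0.1

def q_values(s):
    h = np.maximum(0.0, s @ W1 + b1)   # hidden layer (ReLU)
    return h, h @ W2 + b2              # one Q-value per action for this state

for step in range(1000):
    s = rng.normal(size=n_state)                     # placeholder state from a made-up environment
    h, q = q_values(s)
    a = int(np.argmax(q)) if rng.random() > epsilon else int(rng.integers(n_actions))
    r = 1.0 if (s[0] > 0) == (a == 0) else 0.0       # made-up reward rule
    _, q_next = q_values(rng.normal(size=n_state))   # next (placeholder) state
    target = r + gamma * np.max(q_next)              # TD target, as in tabular Q-learning
    err = q[a] - target                              # only the chosen action's output is trained
    # one gradient step on the squared TD error (manual backprop through the small network)
    grad_q = np.zeros(n_actions); grad_q[a] = err
    grad_h = (W2[:, a] * err) * (h > 0)
    W2 -= alpha * np.outer(h, grad_q); b2 -= alpha * grad_q
    W1 -= alpha * np.outer(s, grad_h); b1 -= alpha * grad_h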