Neural networks for reinforcement learning Flashcards

1
Q

What neurological substrate do reinforcement learning models typically concern?

A

Unit recordings from mesencephalic DA neurons in monkeys:
Can we explain their firing from models of Reinforcement Learning?

2
Q

How do these DA neurons behave before learning?

A

DA cell responds to reward but not to the predictive CS (sound)

3
Q

How do DA neurons behave following learning?

A

DA cell does not respond to reward when it is predicted by the CS; Backwards shift of response towards the CS itself!

4
Q

Does this change if the reward comes unexpectedly?

A

Cells still responsive to reward when it comes unexpectedly

5
Q

What temporal dynamics are at play here?

A

There is a fixed interval between the sound and the liquid reward; the sound is predictive of the reward.

6
Q

What type of neural network architecture does this model employ?

A

Hybrid neural network: feed-forward and feedback/recurrent connections
* Important subclass: reinforcement learning

7
Q

Describe the architecture of a simple reinforcement learning network subclass of hybrid neural networks

A

Inputs (P1-P5) provide semi-connected inputs to the
hidden layer (3 nodes), which provides semi-connected inputs to the
output layer (a1, a2), which provides input to the environment.

The environment both:
delivers neutral sensory stimuli to the input layer, and
delivers reinforcement (punishment or reward) to the layers beyond the input layer.
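A minimal Python sketch may make this layout concrete. The layer sizes and the scalar reinforcement follow the description above; the weights, stimulus and toy environment rule are assumptions for illustration only.

```python
# Hedged sketch of the pictured architecture: inputs P1-P5, a 3-node hidden layer,
# outputs a1-a2, and an environment returning one scalar reinforcement.
import numpy as np

rng = np.random.default_rng(0)
W_in_hid = rng.normal(scale=0.1, size=(3, 5))   # P1-P5 -> hidden (semi-connected in the slide)
W_hid_out = rng.normal(scale=0.1, size=(2, 3))  # hidden -> a1, a2

def forward(p):
    """p: sensory pattern from the environment (5 values); returns output activities a1, a2."""
    h = np.tanh(W_in_hid @ p)
    return np.tanh(W_hid_out @ h)

def environment(a):
    """Toy environment (assumed): delivers a scalar reinforcement, 1 (reward) or 0."""
    return 1.0 if np.argmax(a) == 0 else 0.0

p = rng.random(5)      # neutral sensory stimulus
a = forward(p)         # network output (e.g. a motor command)
r = environment(a)     # scalar reinforcement fed back to the network for learning
```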

8
Q

What could be the biological correspondence of these variables?

A

p = sensory patterns such as sound
a = output, let’s say motor output
reward could be prey that was caught and eaten; punishment could be pain

9
Q

What are some key features of reinforcement learning? (5)

A
• The instructive signal for learning is one scalar value for the whole network: the reinforcement signal
• The scalar value can be 1 bit ('right or wrong', 0 or 1) or can be graded ('pretty good … very good')
• Reinforcement learning follows operant (instrumental) conditioning, but can also be applied to Pavlovian conditioning:
  Stimulus -> Action -> Reinforcement => modification of network connections
• Reinforcement learning relies on "learning with a critic" (was the action good or bad?)
• Scalar feedback only tells how good/bad the action was, given the stimulus (see the toy example below)
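The toy example below illustrates "learning with a critic": the only teaching signal per trial is one scalar reinforcement, never the correct output. The stimuli, actions, reward rule and learning rate are invented assumptions.

```python
# Hedged sketch: a network learns stimulus-action values from scalar feedback alone.
import numpy as np

rng = np.random.default_rng(1)
n_stimuli, n_actions, eta = 3, 2, 0.1
Q = np.zeros((n_stimuli, n_actions))          # learned value of each stimulus-action pair

def critic(stimulus, action):
    """Scalar feedback only: 'right or wrong' (1 or 0), with no hint of the optimal action."""
    return 1.0 if action == stimulus % n_actions else 0.0

for trial in range(2000):
    s = rng.integers(n_stimuli)                        # stimulus
    probs = np.exp(Q[s]) / np.exp(Q[s]).sum()          # softmax action selection
    a = rng.choice(n_actions, p=probs)                 # action
    r = critic(s, a)                                   # scalar reinforcement
    Q[s, a] += eta * (r - Q[s, a])                     # move the value toward the feedback

print(np.round(Q, 2))   # the rewarded action per stimulus ends up with the higher value
```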
10
Q

What can this learning with a critic be contrasted with? What method does this correspond to?

A

Reinforcement learning relies on "learning with a critic" (was the action good or bad?).
This contrasts with "learning with a teacher" (being told what was right or wrong in each trial), which corresponds to supervised learning with backpropagation.

11
Q

What can this scalar feedback not tell us?

A

Scalar feedback only tells how good/bad the action was given the stimulus (just one number, e.g. 0 or 1), not what the optimal output would have been.

12
Q

What is meant by the credit assignment problem in reinforcement learning? How can this be subdivided? (2)

A
• In real (and artificial) life, a reinforcement is usually obtained only after a long sequence of actions (e.g. playing chess – win/lose)
• Temporal credit assignment problem: which individual move was particularly good or bad?
• Structural credit assignment problem: which individual neuron (unit) behaved correctly or erroneously?
13
Q

With regard to the taxonomy of mammalian memory systems, where does reinforcement learning belong?

A

Non-declarative (implicit) memory: RL falls under stimulus-response skill learning (procedural learning) and classical conditioning.

14
Q

What neural substrates are often assigned to these kinds of learning?

A

Procedural learning: Striatum
Classical conditioning:
Emotional responses: Amygdala
Skeletal musculature: Cerebellum

15
Q

Give the learning sequences for these types of learning respectively

A

skill: stimulus => action => outcome
classical cond: stimulus => outcome

16
Q

Classical reinforcement learning captures only two elements of the complex processes underlying operant conditioning; what are these? Describe their learning sequences.

A

Stimulus-response (operant) learning and Pavlovian association.

Both concern the transition from a stimulus to a reinforcer.

Pavlovian learning assigns motivational value to the stimulus and elicits an automated ('reflexive') reaction (no instrumental action is needed to obtain the outcome).

Stimulus-response (operant) learning concerns the transition from a stimulus to a response, and from that response to a reinforcer.

17
Q

Experimental psychology produced evidence for additional learning processes within stimulus-response (operant) learning; what are these (the transitional processes)?

A
• Habits: in real life, stimulus-response learning eventually leads to habit formation (= weakly sensitive to reinforcement)
• Action-outcome learning: associating a response with a reinforcer

18
Q

What behaviour is associated with action-outcome learning according to Pennartz?

A

Goal-directedness: determining whether a response is needed to obtain the reinforcer.

In Pavlovian learning the response is not important.

19
Q

How does this learning process contrast with backpropagation?

A

The critic evaluates whether the action as a whole was good or bad, rather than specifying errors at the level of individual units. Backpropagation, by contrast, would be learning with a teacher, which is biologically quite artificial.

20
Q

What is action-outcome learning about? (3)

A

• Knowing what you need (most) and how to get it
• Representing your goal before undertaking action
• Knowing whether your action is relevant or not

21
Q

Illustrate the importance of action outcome learning with a dilemma

A

“Castaway’s dilemma”; Why stimulus-response learning is not sufficient

A castaway on an island sees palm trees, what action should he carry out?
One stimulus, multiple options:
1) Search for coconuts?
2) Burn trees to get warm?
3) Build a raft to escape?

Stimulus-response learning does not solve the problem; the stimulus alone does not tell you what to do. Once the goal is identified, you need to know the action required to achieve that goal.

22
Q

What conclusion can be derived from this dilemma?

A

Conclusion: simple (Pavlovian) RL captures only part of complex forms of learning (but it’s a good start)

23
Q

Name a different type of reinforcement learning related to GDP

A

Temporal Difference (TD) Learning

24
Q

What is TD-RL learning suitable for?

A

multi-step tasks

25
Give an example illustrating an application of TD-RL networks
Example of a multi-step task: the Tower of London task
* Requires planning of multi-step operations
* Typically depends on prefrontal cortex
* Can be successfully modelled using 'Temporal Difference' RL models
26
Describe the use of error in TD-RL networks
Error in reward prediction: instead of direct feedback by the actual reward, it is more efficient to use an internal feedback signal. In practice, each task step can be associated with a reward prediction; the internal feedback signal is often an error in reward prediction:
δ = R − V
with:
δ: error in reward prediction
R: real, actual reward (at the end of the task trial)
V: predicted reward (= expected reward, expected value, or just: value)
27
Where does the concept behind TD-RL networks date back to? What was posited?
Concepts date back to Robert Rescorla & Allan Wagner (1971, 1973)
* The expected reward fluctuates over time: the value function, V(t)
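A minimal sketch of a Rescorla-Wagner-style trial-by-trial update, with an arbitrary assumed learning rate and a CS that is always rewarded; it only illustrates how the prediction error drives the expected reward V.

```python
# The reward prediction V is nudged toward the actual reward R by the error (R - V).
alpha = 0.2     # learning rate (assumed)
V = 0.0         # predicted reward for the CS before learning

for trial in range(15):
    R = 1.0                 # the CS is always followed by reward in this toy example
    delta = R - V           # error in reward prediction
    V += alpha * delta      # V climbs toward 1.0, so delta shrinks across trials
    print(trial, round(V, 3))
```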
28
Describe the structure of the temporal difference learning network
A visual stimulus enters the sensory network. The resulting reward prediction is sent to a comparison node, alongside the actual reward. The error in reward prediction is sent to the motor network, so that a response can be carried out, AND back into the sensory network, where a learning rule (synaptic change = pre * post * error) is applied; see the sketch below. The change in reward prediction V(t) over time is then projected back to the node that also receives the actual reward.
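A hedged sketch of the three-factor learning rule named above (synaptic change = pre * post * error), reduced to a single prediction unit. The sigmoid unit, stimulus and learning rate are assumptions, not part of the lecture model.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(scale=0.1, size=5)     # weights from 5 sensory inputs onto the prediction unit
eta = 0.5                              # learning rate (assumed)

def run_trial(stimulus, actual_reward):
    global w
    pre = stimulus                                      # presynaptic (sensory) activity
    post = 1.0 / (1.0 + np.exp(-(w @ stimulus)))        # postsynaptic activity of the unit
    V = post                                            # reward prediction sent to the error node
    delta = actual_reward - V                           # error in reward prediction
    w += eta * pre * post * delta                       # three-factor rule: pre * post * error
    return V, delta

stim = np.array([1.0, 0.0, 1.0, 0.0, 0.0])              # fixed toy stimulus
for _ in range(200):
    V, delta = run_trial(stim, actual_reward=1.0)
print(round(V, 2), round(delta, 2))                     # V approaches 1, delta approaches 0
```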
29
What applications can TD learning have outside neuroscience?
Computing values and errors, now and in the future; e.g. finance, COVID case numbers.
30
How is the value function defined?
V(t) = E[γ^0·r(t) + γ^1·r(t+1) + γ^2·r(t+2) + ...]
with:
E[..]: expected value of the sum of all current and future rewards
r(t): actual reward at time t
γ: discount factor; makes 'early' rewards (at t) more important than rewards that arrive later (t+1, ...); smaller than 1.0
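A minimal worked example of this definition, assuming γ = 0.9 as an illustrative value:

```python
# Discounted sum of current and future rewards; gamma = 0.9 is an assumed example value.
def value(rewards, gamma=0.9):
    """rewards: [r(t), r(t+1), ...]; returns sum of gamma**k * r(t+k)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(value([0.0, 0.0, 1.0]))   # a reward two steps ahead is worth 0.9**2 = 0.81 now
print(value([1.0, 0.0, 0.0]))   # the same reward right now is worth the full 1.0
```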
31
What does the value function allow the agent to do?
Agents may not know exactly what kind of rewards will come, but can estimate the future value of the situation at t: V(t)
32
How is the value function based on a cue and reinforcer?
Value function: based on the temporal relationship between the sensory cue (CS) and the reinforcer (r)
CS => delay: 2 sec. => reinforcer (classical conditioning)
* The cue directly activates the network, which tries to generate a good estimate of future rewards: V(t)
* This prediction can only be improved when learning takes place (changing synaptic weights)
33
The function V(t) is consistent over time; give an equation that demonstrates this
V(t) = E[r(t) + γV(t+1)]
V(t+1) represents all rewards expected after time t.
The equation describes a fully learned situation: you know how much reward to expect after time t.
34
How do you compute the error in the reward prediction?
Bring V(t) to the right-hand side; in the case of perfect learning, zero would remain on the left-hand side. Imperfect learning means there is an error in the reward prediction:
δ(t) = r(t) + γV(t+1) − V(t)
with:
δ(t): error in the reward prediction at time t
r(t): actual reward at time t
γ: discount factor (makes t+1 slightly less important than t)
V(t): estimate of all future reward at time t
V(t+1): estimate of all future reward after time t (from t+1 onwards)
This is why the algorithm is called Temporal Difference learning (t vs. t+1).
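A hedged sketch of tabular TD learning on a short chain of task steps, using exactly this error term; the chain length, γ, learning rate and reward placement are assumptions:

```python
import numpy as np

n_steps, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_steps + 1)                    # one value per step; V[n_steps] is terminal (0)

for episode in range(300):
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0          # reward only at the end of the trial
        delta = r + gamma * V[t + 1] - V[t]           # temporal-difference error (t vs. t+1)
        V[t] += alpha * delta                         # update the reward prediction at step t

print(np.round(V[:n_steps], 2))              # values rise toward gamma**(steps until reward)
```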
35
Give 3 relevant cases if no reward is expected after time t
1) Reward present, but not predicted (V(t) = 0) => δ(t) > 0
2) Reward absent, although predicted => δ(t) < 0
3) Learning complete: reward correctly predicted => δ(t) = 0
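A quick numeric check of the three cases, assuming γ = 0.9 and V(t+1) = 0 (no reward expected after t):

```python
gamma = 0.9
td_error = lambda r, V_next, V: r + gamma * V_next - V

print(td_error(1.0, 0.0, 0.0))   # 1) reward present but not predicted:  delta = +1.0
print(td_error(0.0, 0.0, 1.0))   # 2) reward absent although predicted:  delta = -1.0
print(td_error(1.0, 0.0, 1.0))   # 3) reward correctly predicted:        delta =  0.0
```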
36
How can you relate these rules to the network structure?
As long as the error is non-zero, the prediction from the sensory network does not match the actual reward; synaptic connections are strengthened (or weakened) until prediction and reward match and the error becomes 0.
37
What is still missing from this network compared to the equations?
Time is still a problem: usually the stimulus comes earlier than the reward (there could be hours in between).
A significant δ (error) can occur even before an actual reward has been received; this is called a 'surrogate prediction error'.
E.g. r(t) = 0 but γV(t+1) is large and positive, and V(t) = 0; learning will occur!
38
What could fulfil this role in the network?
Temporal difference: the role of dopamine? Could the role of 'error-coding unit' be fulfilled by dopamine cells? (hypothesis)
39
Relate this concept to an application
An animal has learned that cue X (at t+1; deer) precedes reward by a short delay, but not yet that another cue Y (at t; scent) precedes reward by a slightly longer delay
40
What happens in the network as a result of the surrogate prediction error (δ(t) > 0)?
- A value will be assigned to cue Y occurring at time t (input layer; cue Y = a node)
- This assignment results in altered synaptic weights of neurons responsive to Y (in the hidden layer):
  synaptic change = presynaptic input Y * postsynaptic activity * error
41
What does this mean temporally during learning?
Thus, during learning, attribution of value occurs ‘backwards in time’ (“backwards referral”; because Y precedes X)
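A hedged simulation of this backwards referral, with assumed learning rate, γ and episode count: cue Y at t = 0, cue X at t = 1, reward at t = 2. The positive δ (and hence value) propagates backwards from the reward time towards the earliest cue, Y.

```python
import numpy as np

gamma, alpha = 1.0, 0.2                 # assumed values
V = np.zeros(4)                         # V[t] for t = 0..3; t=3 is after the trial
rewards = [0.0, 0.0, 1.0]               # reward only at the reward time (t=2)

for episode in range(40):
    deltas = []
    for t, r in enumerate(rewards):
        delta = r + gamma * V[t + 1] - V[t]
        V[t] += alpha * delta
        deltas.append(round(delta, 2))
    if episode in (0, 2, 10):
        print(f"episode {episode:2d}  deltas at [Y, X, reward]: {deltas}")

print("learned values [Y, X, reward]:", np.round(V[:3], 2))
```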
42
What physiological network could this be mimicking?
Mesolimbic dopamine projections from the VTA (ventral tegmental area) to the striatum and frontal cortex; ventral striatal neurons then project back to the VTA.
43
Describe research that investigates this relationship between reinforcement learning and this network
The same unit recordings from mesencephalic DA neurons in monkeys as before:
* Before learning: the DA cell responds to the reward but not to the predictive CS (sound)
* After learning: the DA cell does not respond to the reward when it is predicted by the CS; backwards shift of the response towards the CS+!
=> Evidence in favour of the TD model
44
Fully trained situation: Do the findings confirm a TD model of DA cells?
Reward at the expected time does not elicit a response in DA cells.
* Reward omission at the expected time: a decrease in firing => agrees with a 'negative error in reward prediction'
* Reward shifted to an unexpected time: an increase in firing => agrees with a 'positive error in reward prediction'
Overall: strong evidence in favour of predictive coding-in-time.
45
Give some uncertainties about dopamine & reinforcement Learning models in terms of neuroanatomy
The learning rule supposes convergence of DA terminals with glutamatergic inputs on dendrites; is this found? => perhaps in the striatum (such triads are rare).
'Triadic' configuration: cortical and dopamine afferents both inputting on striatal neuron dendrites.
synaptic change = presynaptic activity * postsynaptic activity * error
46
Give some uncertainties about dopamine & reinforcement Learning models in terms of neurophysiology (2)
* DA neurons can fire in advance of eye saccades, i.e. before the animal may even have begun to identify a cue (CS)
* DA neurons are more broadly tuned than just to reward or reward-predicting stimuli (e.g. novelty, uncertain outcome, movement, pain)
47
Give an uncertainty about dopamine & reinforcement learning models in terms of dopamine's broader functional role
DA may be involved in associative learning because of a more general role in sensorimotor processes (e.g. Parkinson’s)
48
Give some uncertainties about dopamine & reinforcement Learning models in terms of cellular neurophysiology
does DA affect synaptic plasticity (e.g., LTP) according to the TD learning rule? => still debated
49
Overall what could you conclude based on this evidence?
DA may have a more general sensorimotor function – reacting to unexpected salient events; still uncertain whether it literally mediates TD-RL
50
What is an alternative to dopamine for these proposed circuits?
Pyramidal cell networks in cortex & amygdala:
* Reinforcement learning is also possible using glutamatergic signalling (-> pyramidal cells)
* Brain networks (using Glu) need to store more than just stimulus value, e.g. knowledge of outcome identities ('what')
=> Model-based learning: learning an internal model of the causal relationships between specific stimuli, actions and outcomes
51
In this glutamatergic model, where could reinforcement learning take place?
* Medial PFC
* Orbitofrontal cortex
* Amygdala: can have error-coding properties according to research
52
Describe the study design for coding of reward and reward prediction in this network
Evidence for coding of reward and reward prediction in dorsolateral prefrontal cortex.
Paradigm for monkeys: a lever-press task in which the monkey can choose left or right for reward.
Spatial delayed-response task: an instruction cue is coupled to objects differing in reward value (raisin, apple and cabbage; the monkey prefers raisin or apple over cabbage)
=> testing the relative reward value of objects
53
Describe the study results for coding of reward and reward prediction in this network
There is coupling between the instruction cue and the specific outcome. With 2 different rewards, one more preferred (A > B, 'high'), there is high firing activity for the more-preferred stimulus but lower firing activity for the less-preferred stimulus when there is a choice.
Once B becomes the preferred food (over an even less preferred food C), this is reflected in the firing activity by an increase in firing rate.
Therefore, PFC cells can code relative expected reward.
54
How compatible are these findings with the proposed DA network?
Compatible with the DA model: PFC sends projections to the VTA and SN.
Compatible with:
a) the dopamine model of TD-RL (here, PFC may code V(t))
b) the glutamatergic model of RL