Neural networks for reinforcement learning Flashcards
What neurological substrate do reinforcement learning models typically concern?
Unit recordings from mesencephalic DA neurons in monkeys:
Can we explain their firing from models of Reinforcement Learning?
How do these DA neurons behave before learning?
DA cell responds to reward but not to the predictive CS (sound)
How do DA neurons behave following learning?
DA cell does not respond to reward when it is predicted by the CS; Backwards shift of response towards the CS itself!
Does this change if the reward comes unexpectedly?
Cells still responsive to reward when it comes unexpectedly
What temporal dynamics are at play here?
There is a fixed interval between the sound and the liquid reward; the sound is predictive of the reward
What type of neural network architecture does this model employ?
(iii) Hybrid neural network: feed-forward and feedback/recurrent connections
* Important subclass: reinforcement learning
Describe the architecture of a simple reinforcement learning network subclass of hybrid neural networks
Inputs (p1–p5) provide semi-connected inputs to:
Hidden layer (3 nodes), which provides semi-connected inputs to:
Output layer (a1, a2), which provides input to:
The environment, which both:
delivers neutral sensory stimuli (to the input layer) and
delivers reinforcement (punishment or reward) to the post-input layers
(see the sketch below)
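A minimal Python sketch of such a hybrid reinforcement network (the layer sizes, dense connectivity and tanh activations are illustrative assumptions, not specified in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 5 input units (p1-p5), 3 hidden units, 2 output units (a1, a2)
W_in = rng.normal(scale=0.1, size=(3, 5))   # input -> hidden (dense here for simplicity; "semi-connected" in the slides)
W_out = rng.normal(scale=0.1, size=(2, 3))  # hidden -> output

def forward(p):
    """Feed-forward pass from sensory pattern p to output activities a."""
    h = np.tanh(W_in @ p)
    a = np.tanh(W_out @ h)
    return h, a

# The environment delivers a neutral sensory stimulus and, after the action, a scalar reinforcement
p = rng.integers(0, 2, size=5).astype(float)  # sensory pattern (e.g. a sound)
h, a = forward(p)
action = int(np.argmax(a))                    # choose a1 or a2 (e.g. a motor output)
reward = 1.0 if action == 0 else 0.0          # toy scalar reinforcement from the environment
```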
What could be biological correspondence to these variables?
p = sensory patterns such as sound
a = output, let’s say motor output
reward could be prey that was caught and eaten; punishment could be pain
What are some key features of reinforcement learning? (5)
- Instructive signal for learning is one scalar value for the whole network: the reinforcement signal
- The scalar value can be 1 bit ('right or wrong', 0 or 1) or can be graded ('pretty good…very good')
- Reinforcement learning follows operant (instrumental) conditioning, but can also be applied to Pavlovian conditioning: Stimulus -> Action -> Reinforcement => modification of network connections
- Reinforcement learning relies on "learning with a critic" (was the action good or bad?)
- Scalar feedback: only tells how good/bad the action given the stimulus was (see the sketch below)
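A minimal sketch of "learning with a critic" using only scalar feedback (the two-action setup, exploration rate and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
q = np.zeros(2)  # learned "goodness" of actions a1, a2 for one stimulus

for trial in range(500):
    # mostly exploit the best-looking action, sometimes explore
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(q))
    r = 1.0 if a == 1 else 0.0    # the critic only returns a scalar: good (1) or bad (0)
    q[a] += alpha * (r - q[a])    # no teacher ever says which action *should* have been chosen

print(q)  # q[1] approaches 1.0: the better action is inferred from scalar feedback alone
```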
What can this learning with a critic be contrasted with? What method does this correspond to?
Reinforcement learning relies on "learning with a critic" (was the action good or bad?)
Contrasts with: "learning with a teacher" (what was right or wrong in any trial), i.e. supervised learning with backpropagation
What can this scalar feedback not tell us?
Scalar feedback only tells how good/bad the action given the stimulus was (just one number, e.g. 0 or 1), not what the optimal output would have been
What is meant by the credit assignment problem in reinforcement learning? How can this be subdivided? (2)
- In real (and artificial) life, a reinforcement is usually obtained only after a long sequence of actions (e.g. playing chess – win/lose)
- temporal credit assignment problem: which individual move was particularly good or bad?
- structural credit assignment problem: which individual neuron
(unit) behaved correctly or erroneously?
In regards to the taxonomy of mammalian memory systems, where does reinforcement learning concern?
Non-declarative (implicit) memory: RL falls under stimulus-response skill learning (procedural learning) and classical conditioning
What neural substrates are often assigned to these kinds of learning?
Procedural learning: Striatum
Classical conditioning:
Emotional responses: Amygdala
Skeletal musculature: Cerebellum
Give the learning sequences for these types of learning respectively
skill: stimulus => action => outcome
classical cond: stimulus => outcome
Classical reinforcement learning captures only two elements of the complex processes underlying operant conditioning; what are these? Describe their learning sequences.
Stimulus-response (operant) learning and Pavlovian association
Both concern the transition of a stimulus to a reinforcer
Pavlovian learning assigns motivational value to stimulus and elicits automated (‘reflexive’) reaction (no instrumental action needed to obtain outcome)
Stimulus-response (operant) learning concerns first the transition of a stimulus to a response, and then of the response to a reinforcer
Experimental psychology produced evidence for additional learning processes within stimulus-response (operant) learning; what are these? (the transitional processes)
- Habits: in real life, stimulus-response learning eventually leads to habit formation (= weakly sensitive to reinforcement)
- Action-outcome learning: associating a response with a reinforcer
What behaviour is associated with action-outcome learning according to Pennartz?
Goal-directedness: determining whether the response is needed to obtain the reinforcer
(in Pavlovian conditioning the response is not important for obtaining the outcome)
How does this learning process contrast with backpropagation?
The critic only evaluates whether the action as a whole was good or bad, not what each individual unit should have done. Backpropagation would be "learning with a teacher", which is quite artificial biologically.
What is action-outcome learning about? (3)
:: Knowing what you need (most) and how to get it
:: Representing your goal before undertaking action
:: Knowing whether your action is relevant or not
Illustrate the importance of action-outcome learning with a dilemma
“Castaway’s dilemma”; Why stimulus-response learning is not sufficient
A castaway on an island sees palm trees, what action should he carry out?
One stimulus, multiple options:
1) Search for coconuts?
2) Burn trees to get warm?
3) Build a raft to escape?
Stimulus-response learning does not solve the problem; Stimulus does not tell you what to do. Once goal is identified, you need to know the associated action required to achieve that goal
What conclusion can be derived from this dilemma?
Conclusion: simple (Pavlovian) RL captures only part of complex forms of learning (but it’s a good start)
Name a different type of reinforcement learning related to goal-directed planning
Temporal Difference (TD) Learning
What is TD-RL learning suitable for?
multi-step tasks
Give an example illustrating an application of TD-RL networks
Example of multi-step task: Tower of London task
- Requires planning of multi-step operations
- Typically depends on prefrontal cortex
- Can be successfully modelled using ‘Temporal Difference’ RL models
Describe the use of error in TD-RL networks
Error in reward prediction
Instead of direct feedback by actual reward, it is more efficient to use an Internal feedback signal. In practice, each task step can be associated with a reward prediction; the internal feedback signal is often an error in reward prediction:
𝛿 = (R –V)
with 𝛿: error in reward prediction
R: real, actual reward (at the end of the task trial)
V: predicted reward (=expected reward; expected value, or just: value)
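A minimal sketch of this single-step prediction error driving learning (the learning rate alpha and the number of trials are assumed for illustration):

```python
alpha = 0.1   # assumed learning rate
V = 0.0       # predicted reward (value) associated with the CS

for trial in range(20):
    R = 1.0                 # real, actual reward at the end of the trial
    delta = R - V           # error in reward prediction
    V = V + alpha * delta   # prediction moves toward the actual reward

print(round(V, 3))          # V approaches 1.0 as the prediction error shrinks toward 0
```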
Where does this concept of TD-RL networks date back to? What was posited?
Concepts date back to Robert Rescorla & Allan Wagner (1971, 1973)
- The expected reward fluctuates over time: Value function, V(t)
Describe the structure of the temporal difference learning network
The visual stimulus enters the sensory network. The resulting reward prediction is sent to a node that also receives the actual reward. The error in reward prediction is sent to the motor network, so a response can be carried out, AND back into the sensory network, where a learning rule (pre × post × error) is applied. The change in reward prediction V(t) over time is then projected back to the node that receives the actual reward.
What applications can TD learning have outside neuro?
Computing values & errors, now and in the future; e.g. finance, COVID case numbers
How is the value function defined?
V(t) = E[γ⁰·r(t) + γ¹·r(t+1) + γ²·r(t+2) + …]
E[..]: expected value of the sum of all current and future rewards
r(t): actual reward at time t
γ: discount factor; makes "early" rewards (at t) more important than rewards that arrive later (t+1); is smaller than 1.0
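A small sketch of how the discounted sum inside E[..] is computed for one concrete reward sequence (the rewards and γ below are illustrative):

```python
gamma = 0.9                      # discount factor, smaller than 1.0
rewards = [0.0, 0.0, 1.0, 0.5]   # illustrative r(t), r(t+1), r(t+2), r(t+3)

# V(t) as the discounted sum: gamma^0*r(t) + gamma^1*r(t+1) + ...
V_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(V_t)  # early rewards weigh more than later ones
```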
What does the value function allow the agent to do?
Agents may not know exactly what kind of rewards will come, but can estimate the future value of the situation at t: V(t)
How is the value function based on a cue and reinforcer?
Value function: based on temporal relationship between sensory cue (CS) and reinforcer (r)
CS => delay: 2 sec. => Reinforcer (classical conditioning)
- The cue directly activates the network, which is trying to generate a good estimate of future rewards: V(t)
- This prediction can only be improved when learning takes place (change synaptic weights)
The function V(t) is consistent over time, give an equation that demonstrates this
V(t) = E[r(t) + γV(t+1)]
V(t+1) represents all rewards expected after time t
The equation presents a fully learned situation: you know how much reward to expect after time = t
How do you compute the error in the reward prediction?
Bring the V(t) to the right-hand side; in case of perfect learning, zero would remain at the left-hand side
Imperfect learning means there is error in the reward prediction:
𝛿(t) = r(t) + γV(t+1) - V(t)
with:
𝛿(t): error in the reward prediction at time t
r(t): actual reward at time t
γ: discount factor (makes t+1 slightly less important than t)
V(t): estimate of all future reward at time t
V(t+1): estimate of all future reward after time t (from t+1 onwards)
This is why the algorithm is called Temporal Difference Learning (t vs. t+1)
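A minimal TD(0) sketch of this update over a short trial in which the reward arrives a few steps after the cue (learning rate, γ and trial length are assumed):

```python
alpha, gamma = 0.1, 0.95
T = 5                       # time steps per trial; reward arrives at the last step
V = [0.0] * (T + 1)         # V[t]: estimate of all future reward from time t (V[T] stays 0)

for trial in range(200):
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0       # actual reward only at the end of the trial
        delta = r + gamma * V[t + 1] - V[t]  # temporal-difference error (t vs. t+1)
        V[t] += alpha * delta                # move the prediction toward r + gamma*V(t+1)

print([round(v, 2) for v in V])  # value propagates backwards in time from the reward toward earlier steps
```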
Give 3 relevant cases if no reward is expected after time t
1) reward present, but not predicted (V(t)=0) => 𝛿(t) > 0
2) reward absent, although predicted => 𝛿(t) < 0
3) learning complete: reward correctly predicted => 𝛿(t) = 0
How can you relate these rules to the network structure?
If the error is 0, the prediction from the sensory network equals the actual reward.
A positive error –> stronger synaptic connections are needed –> until there is a match
What is still missing from this network compared to the equations?
Time is still a problem: usually the stimulus occurs earlier than the reward (there could be hours in between).
A significant delta (error) can occur even before an actual reward has been received; this is called a "surrogate prediction error",
e.g. r(t) = 0 but γV(t+1) is large & positive and V(t) = 0; learning will occur!
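A worked example with illustrative numbers: take γ = 0.9, r(t) = 0, V(t+1) = 1 and V(t) = 0; then δ(t) = 0 + 0.9·1 - 0 = 0.9 > 0, so learning occurs at time t even though no actual reward was delivered there.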
What could fulfil this role in the network?
Temporal difference: role of dopamine? Could the role of "error-coding unit" be fulfilled by dopamine cells? (hypothesis)
Relate this concept to an application
An animal has learned that cue X (at t+1; deer) precedes reward by a short delay, but not yet that another cue Y (at t; scent) precedes reward by a slightly longer delay
What happens in the network as a result of the surrogate prediction error (d(t) > 0)?
- a value will be assigned to cue Y occurring at time t (input layer; cue Y = one node)
- this assignment results in altered synaptic weights of neurons responsive to Y (in the hidden layer):
synaptic change = presynaptic input (Y) * postsynaptic activity * error
(see the sketch below)
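A minimal sketch of this three-factor update for the synapses of a hidden unit responsive to cue Y (weights, input pattern and error value are illustrative assumptions):

```python
import numpy as np

alpha = 0.05
rng = np.random.default_rng(1)
w = rng.uniform(0.05, 0.15, size=3)  # small initial synaptic weights onto one hidden unit

pre = np.array([1.0, 0.0, 1.0])      # presynaptic input pattern carrying cue Y
post = np.tanh(w @ pre)              # postsynaptic activity of the hidden unit
delta = 0.8                          # positive (surrogate) prediction error broadcast by the error-coding unit

# three-factor rule: synaptic change = presynaptic input * postsynaptic activity * error
w += alpha * pre * post * delta      # only synapses with active presynaptic input (cue Y) are strengthened
```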
What does this mean temporally during learning?
Thus, during learning, attribution of value occurs ‘backwards in time’
(“backwards referral”; because Y
precedes X)
What physiological network could this be mimicking?
Mesolimbic dopamine projections (VTA, ventral tegmental area) to striatum and frontal cortex
Ventral striatal projections then feed back to the VTA
Describe research that investigates this relationship between reinforcement learning and this network
Same as previously with the unit recordings from mesencephalic DA neurons in monkeys:
Before learning:
DA cell responds to reward but not to the predictive CS (sound)
After learning:
DA cell does not respond to reward when it is predicted by the CS
Backwards shift of response towards the CS+!
Evidence in favour of the TD model
Fully trained situation: Do the findings confirm a TD model of DA cells?
Reward at the expected time does not elicit a response in DA cells
*Reward omission at expected time: ~ decrease in firing
=>agrees with “negative error in reward prediction”
*Reward shifts to an unexpected time: increase in firing
=>agrees with “positive error in reward prediction”
Overall: strong evidence in favour of predictive coding-in-time
Give some uncertainties about dopamine & reinforcement Learning models in terms of neuroanatomy
learning rule supposes convergence of DA terminals with glutamatergic inputs on dendrites – is this found?
=> perhaps in the striatum (triads are rare)
“Triadic” configuration: Cortical and dopamine afferents inputting on striatal neuron dendrites
synaptic change = presynaptic activity * postsynaptic activity * error
Give some uncertainties about dopamine & reinforcement Learning models in terms of neurophysiology (2)
DA neurons can fire in advance of eye saccades, so before the animal may begin to identify a cue (CS)
DA neurons are more broadly tuned than just to reward or reward-predicting stimuli (e.g., novelty; uncertain outcome; movement; pain)
Give some uncertainties about dopamine & reinforcement learning models in terms of general function (behaviour)
DA may be involved in associative learning because of a more general role in sensorimotor processes (e.g. Parkinson’s)
Give some uncertainties about dopamine & reinforcement Learning models in terms of cellular neurophysiology
does DA affect synaptic plasticity (e.g., LTP) according to the TD learning rule? => still debated
Overall what could you conclude based on this evidence?
DA may have a more general sensorimotor function – reacting to unexpected salient events; still uncertain whether it literally mediates TD-RL
What is an alternative to dopamine for these proposed circuits?
Pyramidal Cell networks in cortex & amygdala:
*Reinforcement Learning also possible using Glutamatergic signalling (-> pyr. cells)
*Brain networks (using Glu) need to store more than just stimulus value, e.g. knowledge of outcome identities ("what") =>
model-based learning: learning an internal model of the causal relationships between specific stimuli, actions and outcomes
In this glutamatergic model, where could reinforcement learning take place?
Medial PFC
Orbitofrontal PFC
Amygdala: can have error-coding properties according to research
Describe study structure for coding of reward and reward prediction in this network
Evidence for coding of reward and reward prediction in dorsolateral prefrontal cortex:
Paradigm for monkeys –> lever-press task –> can choose left or right for reward
Spatial delayed response task: instruction cue coupled to objects differing in reward value (raisin, apple and cabbage; the monkey prefers raisin or apple over cabbage) => testing the relative reward value of objects
Describe study results for coding of reward and reward prediction in this network
There is coupling between instruction cue and specific outcome observed
If there are 2 different rewards, one more preferred (A > B, 'high') –> when there is a choice, high firing activity for the highly preferred stimulus but lower firing activity for the less preferred stimulus
Once B becomes the preferred food (over an even less preferred food C) –> this is reflected in the firing activity through an increase in firing rate
Therefore PFC cells can code relative expected reward
How compatible are these findings with the proposed DA networks?
compatible with DA model –> PFC sends projections to VTA and SN
Compatible with:
a) the dopamine model of TD-RL (here, PFC may code V(t))
b) the glutamatergic model of RL