Neural networks for reinforcement learning Flashcards

1
Q

What neurological substrate do reinforcement learning models typically concern?

A

Unit recordings from mesencephalic DA neurons in monkeys:
Can we explain their firing from models of Reinforcement Learning?

2
Q

How do these DA neurons behave before learning?

A

DA cell responds to reward but not to the predictive CS (sound)

3
Q

How do DA neurons behave following learning?

A

DA cell does not respond to reward when it is predicted by the CS; Backwards shift of response towards the CS itself!

4
Q

Does this change if the reward comes unexpectedly?

A

Cells still responsive to reward when it comes unexpectedly

5
Q

What temporal dynamics are at play here?

A

There is a fixed interval between the sound and the liquid reward; the sound is predictive of the reward.

6
Q

What type of neural network architecture does this model employ?

A

Hybrid neural network: feed-forward and feedback/recurrent connections
* Important subclass: reinforcement learning

7
Q

Describe the architecture of a simple reinforcement learning network (a subclass of hybrid neural networks)

A

Input units (p1–p5) send semi-connected (partially connected) projections to:
a hidden layer (3 nodes), which sends semi-connected projections to:
an output layer (a1, a2), which acts on:

the environment, which both:
delivers neutral sensory stimuli to the input layer and
delivers reinforcement (reward or punishment) to the post-input layers (see the sketch below)
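A minimal sketch of such a network in code (layer sizes follow the card; the dense connectivity, tanh activations and toy environment are my own illustrative assumptions — the card's layers are only partially connected):

```python
# Sketch: a small feed-forward RL network whose output acts on an environment
# that returns a single scalar reinforcement signal.
import numpy as np

rng = np.random.default_rng(0)
W_in_hid = rng.normal(scale=0.1, size=(3, 5))   # input (p1-p5) -> hidden (3 nodes)
W_hid_out = rng.normal(scale=0.1, size=(2, 3))  # hidden -> output (a1, a2)

def forward(p):
    """Propagate a 5-element stimulus pattern p through the network."""
    h = np.tanh(W_in_hid @ p)        # hidden-layer activity
    a = np.tanh(W_hid_out @ h)       # output-layer activity (a1, a2)
    return h, a

def environment(a):
    """Toy environment: delivers reward 1 if a1 is chosen over a2 (illustrative)."""
    return 1.0 if a[0] > a[1] else 0.0

p = rng.random(5)      # neutral sensory stimulus delivered by the environment
h, a = forward(p)
r = environment(a)     # scalar reinforcement delivered back to the network
```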

8
Q

What could be biological correspondence to these variables?

A

p = sensory patterns such as sound
a = output, let’s say motor output
reward could be prey that was caught and eaten; punishment could be pain

9
Q

What are some key features of reinforcement learning? (5)

A
  • The instructive signal for learning is one scalar value for the whole
    network: the Reinforcement signal
  • The scalar value can be 1 bit (‘right or wrong’, 0 or 1) or can be
    graded (‘pretty good…very good’)
  • Reinforcement Learning follows operant (instrumental) conditioning, but can also be applied to Pavlovian conditioning:
    Stimulus -> Action -> Reinforcement => modification of
    network connections
  • Reinforcement Learning relies on “learning with a critic” (was
    the action good or bad?)
  • Scalar feedback only tells how good or bad the action was, given the stimulus
10
Q

What can this learning with a critic be contrasted with? What method does this correspond to?

A

Reinforcement Learning relies on “learning with a critic” (was the action good or bad?)
This contrasts with “learning with a teacher” (what exactly was right or wrong in each trial); this corresponds to supervised learning with backpropagation.

11
Q

What can this scalar feedback not tell us?

A

Scalar feedback only tells how good or bad the action was, given the stimulus (just one number, e.g. 0 or 1); it does not tell what the optimal output would have been.

12
Q

What is meant by the credit assignment problem in reinforcement learning? How can this be subdivided? (2)

A
  • In real (and artificial) life, a reinforcement is usually obtained only after a long sequence of actions (e.g. playing chess – win/lose)
  • temporal credit assignment problem: which individual move was particularly good or bad?
  • structural credit assignment problem: which individual neuron
    (unit) behaved correctly or erroneously?
13
Q

In regards to the taxonomy of mammalian memory systems, where does reinforcement learning concern?

A

Non-declarative (implicit) memory: RL falls under stimulus-response skill learning (procedural learning) and classical conditioning

14
Q

What neural substrates are often assigned to these kinds of learning?

A

Procedural learning: Striatum
Classical conditioning:
Emotional responses: Amygdala
Skeletal musculature: Cerebellum

15
Q

Give the learning sequences for these types of learning respectively

A

skill: stimulus => action => outcome
classical cond: stimulus => outcome

16
Q

Classical Reinforcement Learning captures only two elements of the complex processes underlying operant conditioning; what are these? Describe their respective learning sequences.

A

Stimulus-response (operant) learning and Pavlovian association

Both concern the transition of a stimulus to a reinforcer

Pavlovian learning assigns motivational value to stimulus and elicits automated (‘reflexive’) reaction (no instrumental action needed to obtain outcome)

Stimulus-response (operant) learning first concerns the transition of a stimulus to a response and a response to a reinforcer

17
Q

Experimental psychology produced evidence for additional learning processes within stimulus-response (operant) learning; what are these (the transitional processes)?

A
  • Habits: in real life, stimulus-response learning eventually leads to habit formation (= weakly sensitive to reinforcement)

  • Action-outcome learning: associating a response with a reinforcer

18
Q

What behaviour is associated with action-outcome learning according to Pennartz?

A

Goal-directedness: determining whether a response is needed to obtain the reinforcer

In Pavlovian conditioning the response is not important

19
Q

How does this learning process contrast with backpropagation?

A

The critic evaluates whether the action as a whole was good or bad, rather than evaluating performance at the level of individual units. Backpropagation would be ‘learning with a teacher’, which is quite artificial biologically.

20
Q

What is action-outcome learning about? (3)

A

  • Knowing what you need (most) and how to get it
  • Representing your goal before undertaking action
  • Knowing whether your action is relevant or not

21
Q

Illustrate the importance of action outcome learning with a dilemma

A

“Castaway’s dilemma”; Why stimulus-response learning is not sufficient

A castaway on an island sees palm trees, what action should he carry out?
One stimulus, multiple options:
1) Search for coconuts?
2) Burn trees to get warm?
3) Build a raft to escape?

Stimulus-response learning does not solve the problem; the stimulus does not tell you what to do. Once the goal is identified, you need to know the associated action required to achieve that goal.

22
Q

What conclusion can be derived from this dilemma?

A

Conclusion: simple (Pavlovian) RL captures only part of complex forms of learning (but it’s a good start)

23
Q

Name a different type of reinforcement learning related to GDP

A

Temporal Difference (TD) Learning

24
Q

What is TD-RL learning suitable for?

A

multi-step tasks

25
Q

Give an example illustrating an application of TD-RL networks

A

Example of multi-step task: Tower of London task

  • Requires planning of multi-step operations
  • Typically depends on prefrontal cortex
  • Can be successfully modelled using ‘Temporal Difference’ RL models
26
Q

Describe the use of error in TD-RL networks

A

Error in reward prediction

Instead of direct feedback by the actual reward, it is more efficient to use an internal feedback signal. In practice, each task step can be associated with a reward prediction; the internal feedback signal is often the error in reward prediction:

δ = R – V

with 𝛿: error in reward prediction
R: real, actual reward (at the end of the task trial)
V: predicted reward (=expected reward; expected value, or just: value)
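A minimal numeric sketch of this one-step prediction-error update (a Rescorla–Wagner-style rule; the learning rate α and the reward value are illustrative assumptions, not from the card):

```python
# Sketch: update a reward prediction V using the error delta = R - V.
alpha = 0.1   # learning rate (illustrative assumption)
V = 0.0       # predicted reward before learning
R = 1.0       # actual reward at the end of the task trial

for trial in range(5):
    delta = R - V            # error in reward prediction
    V = V + alpha * delta    # prediction moves towards the actual reward
    print(trial, round(delta, 3), round(V, 3))
```

Over trials, δ shrinks towards zero as V approaches R, i.e. the reward becomes fully predicted.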

27
Q

Where does this concept of TD-RL networks date back to? What was posited?

A

Concepts date back to Robert Rescorla & Allan Wagner (1971, 1973)

  • The expected reward fluctuates over time: Value function, V(t)
28
Q

Describe the structure of the temporal difference learning network

A

A visual stimulus enters the sensory network. The resulting reward prediction is sent to an error-computing node alongside the actual reward. The error in reward prediction is sent both to the motor network, so a response can be carried out, and back into the sensory network, where a learning rule (presynaptic activity × postsynaptic activity × error) is applied. The change in reward prediction V(t) over time is then projected back to the node receiving the actual reward.

29
Q

What applications can TD learning have outside neuro?

A

Computing values and errors, now and in the future; e.g. in finance or forecasting COVID case numbers.

30
Q

How is the value function defined?

A

V(t) = E[γ^0·r(t) + γ^1·r(t+1) + γ^2·r(t+2) + …]

E[…]: expected value of the sum of all current and future rewards

r(t): actual reward at time t

γ: discount factor; makes “early” rewards (at t) more important than rewards that arrive later (t+1, t+2, …); smaller than 1.0
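A small sketch of this definition in code (the reward sequence and discount factor are made up for illustration; for a single known sequence the expectation reduces to a plain sum):

```python
# Sketch: V(t) as the discounted sum of the current and all future rewards.
gamma = 0.9                       # discount factor, smaller than 1.0
rewards = [0.0, 0.0, 1.0, 0.5]    # r(t), r(t+1), r(t+2), r(t+3) -- illustrative

V_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(V_t)   # early rewards contribute more than later ones
```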

31
Q

What does the value function allow the agent to do?

A

Agents may not know exactly what kind of rewards will come, but can estimate the future value of the situation at t: V(t)

32
Q

How is the value function based on a cue and reinforcer?

A

Value function: based on temporal relationship between sensory cue (CS) and reinforcer (r)

CS => delay: 2 sec. => Reinforcer (classical conditioning)

  • The cue directly activates the network, which is trying to generate a good estimate of future rewards: V(t)
  • This prediction can only be improved when learning takes place (change synaptic weights)
33
Q

The function V(t) is consistent over time; give an equation that demonstrates this

A

V(t) = E[r(t) + γV(t+1)]
V(t+1) represents all rewards expected after time t

The equation presents a fully learned situation: you know how much reward to expect after time = t
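A quick check of this consistency on a concrete (illustrative) reward sequence:

```python
# Sketch: in a fully learned situation, V(t) = r(t) + gamma * V(t+1) holds.
gamma = 0.9
rewards = [0.0, 0.0, 1.0]   # illustrative reward sequence

# V(t): discounted sum of rewards from t onwards
V = [sum(gamma**k * r for k, r in enumerate(rewards[t:])) for t in range(len(rewards))]

for t in range(len(rewards) - 1):
    assert abs(V[t] - (rewards[t] + gamma * V[t + 1])) < 1e-9   # recursion holds
print(V)
```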

34
Q

How do you compute the error in the reward prediction?

A

Move V(t) to the right-hand side; in the case of perfect learning, zero would remain on the left-hand side.

Imperfect learning means there is an error in the reward prediction:
δ(t) = r(t) + γV(t+1) - V(t)

with:
δ(t): error in the reward prediction at time t
r(t): actual reward at time t
γ: discount factor (makes t+1 slightly less important than t)
V(t): estimate of all future reward at time t
V(t+1): estimate of all future reward after time t (from t+1 onwards)

This is why the algorithm is called Temporal Difference Learning (t vs. t+1)
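A minimal tabular TD(0) sketch of this update rule (the three-step trial structure, learning rate and reward values are illustrative assumptions):

```python
# Sketch: tabular TD learning with delta(t) = r(t) + gamma*V(t+1) - V(t).
alpha, gamma = 0.1, 0.9
V = [0.0, 0.0, 0.0]          # value estimates for steps t = 0 (cue), 1 (delay), 2 (reward)
rewards = [0.0, 0.0, 1.0]    # reward delivered only at the last step (illustrative)

for episode in range(50):
    for t in range(3):
        V_next = V[t + 1] if t + 1 < 3 else 0.0     # no reward expected after the trial
        delta = rewards[t] + gamma * V_next - V[t]  # error in reward prediction
        V[t] += alpha * delta                       # learning: reduce the prediction error

print([round(v, 2) for v in V])
```

Over episodes, value assigned to the rewarded step spreads backwards towards the earlier cue steps, matching the backwards shift seen in the DA recordings.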

35
Q

Give 3 relevant cases if no reward is expected after time t

A

1) reward present, but not predicted (V(t)=0) => 𝛿(t) > 0
2) reward absent, although predicted => 𝛿(t) < 0
3) learning complete: reward correctly predicted => 𝛿(t) = 0

36
Q

How can you relate these rules to the network structure?

A

If the error is 0, the prediction from the sensory network is equal to the actual reward.

A positive error –> synaptic connections are strengthened –> weights change until the prediction matches the reward.

37
Q

What is still missing from this network compared to the equations?

A

Time is still a problem: usually the stimulus comes earlier than the reward (there could be hours in between).

A significant δ (error) can occur even before an actual reward has been received; this is called a “surrogate prediction error”.

e.g. r(t) = 0 but γV(t+1) is large and positive while V(t) = 0; learning will occur!
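A tiny numeric illustration of such a surrogate prediction error (the numbers are made up):

```python
# Sketch: a positive prediction error can arise without any actual reward at time t.
gamma = 0.9
r_t, V_t, V_t1 = 0.0, 0.0, 1.0      # no reward yet; cue Y (t) has no value, cue X (t+1) already does
delta_t = r_t + gamma * V_t1 - V_t  # = 0.9 > 0: "surrogate prediction error"
print(delta_t)                      # this positive error drives learning about cue Y
```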

38
Q

What could fulfil this role in the network?

A

Temporal Difference: the role of dopamine? Could the role of “error-coding unit” be fulfilled by dopamine cells? (hypothesis)

39
Q

Relate this concept to an application

A

An animal has learned that cue X (at t+1; deer) precedes reward by a short delay, but not yet that another cue Y (at t; scent) precedes reward by a slightly longer delay

40
Q

What happens in the network as a result of the surrogate prediction error (δ(t) > 0)?

A
  • a value will be assigned to cue Y occurring at time t (input layer; cue Y = one node)
  • this assignment results in altered synaptic weights of the neurons responsive to Y (in the hidden layer):

synaptic change = presynaptic input (Y) * postsynaptic activity * error
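A minimal sketch of this three-factor (‘pre × post × error’) weight update (the layer sizes, activities and learning rate are illustrative assumptions, not from the card):

```python
# Sketch: synaptic change = presynaptic input * postsynaptic activity * error.
import numpy as np

alpha = 0.05                        # learning rate (illustrative)
pre = np.array([0.0, 1.0, 0.0])     # input-layer activity; only the cue-Y node is active
post = np.array([0.3, 0.8])         # hidden-layer (postsynaptic) activity
delta = 0.9                         # surrogate prediction error, one scalar for the whole network

W = np.zeros((2, 3))                # hidden x input weight matrix
W += alpha * delta * np.outer(post, pre)   # only synapses from the active cue-Y node change
print(W)
```

Because δ is a single scalar broadcast to every synapse, this matches the ‘one scalar reinforcement signal for the whole network’ idea from the earlier cards.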

41
Q

What does this mean temporally during learning?

A

Thus, during learning, attribution of value occurs ‘backwards in time’

(“backwards referral”; because Y
precedes X)

42
Q

What physiological network could this be mimicking?

A

Mesolimbic dopamine projections (VTA, ventral tegmental area) to striatum and frontal cortex

Ventral striatal projections then feed back to the VTA

43
Q

Describe research that investigates this relationship between reinforcement learning and this network

A

Same as previously with the unit recordings from mesencephalic DA neurons in monkeys:

Before learning:
DA cell responds to reward but not to the predictive CS (sound)

After learning:
DA cell does not respond to reward when it is predicted by the CS
Backwards shift of response towards the CS+ !

Evidence in favour of the TD model

44
Q

Fully trained situation: Do the findings confirm a TD model of DA cells?

A

Reward at the expected time does not elicit a response in DA cells

*Reward omission at expected time: ~ decrease in firing
=>agrees with “negative error in reward prediction”

*Reward shifts to an unexpected time: increase in firing
=>agrees with “positive error in reward prediction”

Overall: strong evidence in favour of predictive coding-in-time

45
Q

Give some uncertainties about dopamine & reinforcement Learning models in terms of neuroanatomy

A

The learning rule supposes convergence of DA terminals with glutamatergic inputs on dendrites – is this found?
=> perhaps in the striatum (triads are rare)

“Triadic” configuration: cortical and dopamine afferents both inputting onto striatal neuron dendrites
synaptic change = presynaptic activity * postsynaptic activity * error

46
Q

Give some uncertainties about dopamine & reinforcement Learning models in terms of neurophysiology (2)

A

DA neurons can fire in advance of eye saccades, so before the animal may begin to identify a cue (CS)

DA neurons are more broadly tuned than just to reward or reward-predicting stimuli (e.g., novelty; uncertain outcome; movement; pain)

47
Q

Give some uncertainties about dopamine & reinforcement Learning models in terms of neuroanatomy

A

DA may be involved in associative learning because of a more general role in sensorimotor processes (e.g. Parkinson’s)

48
Q

Give some uncertainties about dopamine & reinforcement Learning models in terms of cellular neurophysiology

A

does DA affect synaptic plasticity (e.g., LTP) according to the TD learning rule? => still debated

49
Q

Overall what could you conclude based on this evidence?

A

DA may have a more general sensorimotor function – reacting to unexpected salient events; still uncertain whether it literally mediates TD-RL

50
Q

What is an alternative to dopamine for these proposed circuits?

A

Pyramidal Cell networks in cortex & amygdala:

*Reinforcement Learning is also possible using glutamatergic signalling (-> pyramidal cells)
*Brain networks (using Glu) need to store more than just stimulus value, e.g. knowledge of outcome identities (“what”) =>
model-based learning: learning an internal model of the causal relationships between specific stimuli, actions and outcomes

51
Q

In this glutamatergic model, where could reinforcement learning take place?

A

Medial PFC
Orbitofrontal PFC
Amygdala: can have error coding property according to research

52
Q

Describe study structure for coding of reward and reward prediction in this network

A

Evidence for coding of reward and reward prediction in dorsolateral prefrontal cortex:

Paradigm for monkeys –> lever-press task –> can choose left or right for reward

Spatial delayed response task: an instruction cue is coupled to objects differing in reward value (raisins, apple and cabbage; the monkey prefers raisins or apple over cabbage) => testing the relative reward value of objects

53
Q

Describe study results for coding of reward and reward prediction in this network

A

Coupling between the instruction cue and the specific outcome is observed.

With 2 different rewards, one more preferred (A > B, ‘high’) –> given a choice, there is high firing activity for the highly preferred stimulus but lower firing activity for the less preferred stimulus.

Once B becomes the preferred food (over an even less preferred food C), this is reflected in an increased firing rate.

Therefore PFC cells can code relative expected reward.

54
Q

How compatible are these findings with the proposed DA network?

A

Compatible with the DA model –> PFC sends projections to VTA and SN

Compatible with:
a) the dopamine model of TD-RL (here, PFC may code V(t))
b) the glutamatergic model of RL