Neural networks for reinforcement learning Flashcards
What neurological substrate do reinforcement learning models typically concern?
Unit recordings from mesencephalic DA neurons in monkeys:
Can we explain their firing from models of Reinforcement Learning?
How do these DA neurons behave before learning?
DA cell responds to reward but not to the predictive CS (sound)
How do DA neurons behave following learning?
DA cell does not respond to reward when it is predicted by the CS; Backwards shift of response towards the CS itself!
Does this change if the reward comes unexpectedly?
Cells still responsive to reward when it comes unexpectedly
What temporal dynamics are at play here?
There is a fixed interval between the sound and the liquid reward; the sound is predictive of the reward
What type of neural network architecture does this model employ?
(iii) Hybrid neural network: feed-forward and feedback/recurrent connections
* Important subclass: reinforcement learning
Describe the architecture of a simple reinforcement learning network subclass of hybrid neural networks
Inputs (p1–p5) provide semi-connected inputs to:
Hidden layer (3 nodes), which provides semi-connected inputs to:
Output layer (a1, a2), which provides input to:
The environment, which both:
delivers neutral sensory stimuli (to the input layer) and
delivers reinforcement (punishment or reward) to the post-input layers
(see the sketch below)
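A minimal Python sketch of such a hybrid reinforcement network (the layer sizes, dense connectivity and tanh activations are illustrative assumptions, not specified in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 5 input units (p1-p5), 3 hidden units, 2 output units (a1, a2)
W_in = rng.normal(scale=0.1, size=(3, 5))   # input -> hidden (dense here for simplicity; "semi-connected" in the slides)
W_out = rng.normal(scale=0.1, size=(2, 3))  # hidden -> output

def forward(p):
    """Feed-forward pass from sensory pattern p to output activities a."""
    h = np.tanh(W_in @ p)
    a = np.tanh(W_out @ h)
    return h, a

# The environment delivers a neutral sensory stimulus and, after the action, a scalar reinforcement
p = rng.integers(0, 2, size=5).astype(float)  # sensory pattern (e.g. a sound)
h, a = forward(p)
action = int(np.argmax(a))                    # choose a1 or a2 (e.g. a motor output)
reward = 1.0 if action == 0 else 0.0          # toy scalar reinforcement from the environment
```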
What could be biological correspondence to these variables?
p = sensory patterns such as sound
a = output, let’s say motor output
reward could be prey that was caught and eaten; punishment could be pain
What are some key features of reinforcement learning? (5)
- Instructive signal for learning is one scalar value for the whole network: the reinforcement signal
- The scalar value can be 1 bit ('right or wrong', 0 or 1) or can be graded ('pretty good…very good')
- Reinforcement learning follows operant (instrumental) conditioning, but can also be applied to Pavlovian conditioning: Stimulus -> Action -> Reinforcement => modification of network connections
- Reinforcement learning relies on "learning with a critic" (was the action good or bad?)
- Scalar feedback: only tells how good/bad the action given the stimulus was (see the sketch below)
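A minimal sketch of "learning with a critic" using only scalar feedback (the two-action setup, exploration rate and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
q = np.zeros(2)  # learned "goodness" of actions a1, a2 for one stimulus

for trial in range(500):
    # mostly exploit the best-looking action, sometimes explore
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(q))
    r = 1.0 if a == 1 else 0.0    # the critic only returns a scalar: good (1) or bad (0)
    q[a] += alpha * (r - q[a])    # no teacher ever says which action *should* have been chosen

print(q)  # q[1] approaches 1.0: the better action is inferred from scalar feedback alone
```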
What can this learning with a critic be contrasted with? What method does this correspond to?
Reinforcement learning relies on "learning with a critic" (was the action good or bad?)
Contrasts with: "learning with a teacher" (what was right or wrong in any trial), i.e. supervised learning with backpropagation
What can this scalar feedback not tell us?
Scalar feedback only tells how good/bad the action given the stimulus was (just one number, e.g. 0 or 1), not what the optimal output would have been
What is meant by the credit assignment problem in reinforcement learning? How can this be subdivided? (2)
- In real (and artificial) life, a reinforcement is usually obtained only after a long sequence of actions (e.g. playing chess – win/lose)
- temporal credit assignment problem: which individual move was particularly good or bad?
- structural credit assignment problem: which individual neuron
(unit) behaved correctly or erroneously?
In regards to the taxonomy of mammalian memory systems, where does reinforcement learning concern?
Non-declarative (implicit) memory: RL falls under stimulus-response skill learning (procedural learning) and classical conditioning
What neural substrates are often assigned to these kinds of learning?
Procedural learning: Striatum
Classical conditioning:
Emotional responses: Amygdala
Skeletal musculature: Cerebellum
Give the learning sequences for these types of learning respectively
skill: stimulus => action => outcome
classical cond: stimulus => outcome
Classical reinforcement learning captures only two elements of the complex processes underlying operant conditioning; what are these? Describe their learning sequences.
Stimulus-response (operant) learning and Pavlovian association
Both concern the transition of a stimulus to a reinforcer
Pavlovian learning assigns motivational value to stimulus and elicits automated (‘reflexive’) reaction (no instrumental action needed to obtain outcome)
Stimulus-response (operant) learning concerns first the transition of a stimulus to a response, and then of the response to a reinforcer
Experimental psychology produced evidence for additional learning processes within stimulus-response (operant) learning; what are these? (the transitional processes)
- Habits: in real life, stimulus-response learning eventually leads to habit formation (= weakly sensitive to reinforcement)
- Action-outcome learning: associating a response with a reinforcer
What behaviour is associated with action-outcome learning according to Pennartz?
Goal-directedness: determining whether the response is needed to obtain the reinforcer
(in Pavlovian conditioning the response is not important for obtaining the outcome)
How does this learning process contrast with backpropagation?
The critic only evaluates whether the action as a whole was good or bad, not what each individual unit should have done. Backpropagation would be "learning with a teacher", which is quite artificial biologically.
What is action-outcome learning about? (3)
:: Knowing what you need (most) and how to get it
:: Representing your goal before undertaking action
:: Knowing whether your action is relevant or not
Illustrate the importance of action-outcome learning with a dilemma
“Castaway’s dilemma”; Why stimulus-response learning is not sufficient
A castaway on an island sees palm trees, what action should he carry out?
One stimulus, multiple options:
1) Search for coconuts?
2) Burn trees to get warm?
3) Build a raft to escape?
Stimulus-response learning does not solve the problem; Stimulus does not tell you what to do. Once goal is identified, you need to know the associated action required to achieve that goal
What conclusion can be derived from this dilemma?
Conclusion: simple (Pavlovian) RL captures only part of complex forms of learning (but it’s a good start)
Name a different type of reinforcement learning related to goal-directed planning
Temporal Difference (TD) Learning
What is TD-RL learning suitable for?
multi-step tasks
Give an example illustrating an application of TD-RL networks
Example of multi-step task: Tower of London task
- Requires planning of multi-step operations
- Typically depends on prefrontal cortex
- Can be successfully modelled using ‘Temporal Difference’ RL models
Describe the use of error in TD-RL networks
Error in reward prediction
Instead of direct feedback by actual reward, it is more efficient to use an Internal feedback signal. In practice, each task step can be associated with a reward prediction; the internal feedback signal is often an error in reward prediction:
𝛿 = (R –V)
with 𝛿: error in reward prediction
R: real, actual reward (at the end of the task trial)
V: predicted reward (=expected reward; expected value, or just: value)
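A minimal sketch of this single-step prediction error driving learning (the learning rate alpha and the number of trials are assumed for illustration):

```python
alpha = 0.1   # assumed learning rate
V = 0.0       # predicted reward (value) associated with the CS

for trial in range(20):
    R = 1.0                 # real, actual reward at the end of the trial
    delta = R - V           # error in reward prediction
    V = V + alpha * delta   # prediction moves toward the actual reward

print(round(V, 3))          # V approaches 1.0 as the prediction error shrinks toward 0
```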
Where does this concept of TD-RL networks date back to? What was posited?
Concepts date back to Robert Rescorla & Allan Wagner (1971, 1973)
- The expected reward fluctuates over time: Value function, V(t)
Describe the structure of the temporal difference learning network
The visual stimulus enters the sensory network. The resulting reward prediction is sent to a node that also receives the actual reward. The error in reward prediction is sent to the motor network, so a response can be carried out, AND back into the sensory network, where a learning rule (pre × post × error) is applied. The change in reward prediction V(t) over time is then projected back to the node that receives the actual reward.
What applications can TD learning have outside neuro?
Computing values & errors, now and in the future; e.g. finance, COVID case numbers
How is the value function defined?
V(t) = E[γ⁰·r(t) + γ¹·r(t+1) + γ²·r(t+2) + …]
E[..]: expected value of the sum of all current and future rewards
r(t): actual reward at time t
γ: discount factor; makes "early" rewards (at t) more important than rewards that arrive later (t+1); is smaller than 1.0
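A small sketch of how the discounted sum inside E[..] is computed for one concrete reward sequence (the rewards and γ below are illustrative):

```python
gamma = 0.9                      # discount factor, smaller than 1.0
rewards = [0.0, 0.0, 1.0, 0.5]   # illustrative r(t), r(t+1), r(t+2), r(t+3)

# V(t) as the discounted sum: gamma^0*r(t) + gamma^1*r(t+1) + ...
V_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(V_t)  # early rewards weigh more than later ones
```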
What does the value function allow the agent to do?
Agents may not know exactly what kind of rewards will come, but can estimate the future value of the situation at t: V(t)
How is the value function based on a cue and reinforcer?
Value function: based on temporal relationship between sensory cue (CS) and reinforcer (r)
CS => delay: 2 sec. => Reinforcer (classical conditioning)
- The cue directly activates the network, which is trying to generate a good estimate of future rewards: V(t)
- This prediction can only be improved when learning takes place (change synaptic weights)
The function V(t) is consistent over time, give an equation that demonstrates this
V(t) = E[r(t) + γV(t+1)]
V(t+1) represents all rewards expected after time t
The equation presents a fully learned situation: you know how much reward to expect after time = t
How do you compute the error in the reward prediction?
Bring the V(t) to the right-hand side; in case of perfect learning, zero would remain at the left-hand side
Imperfect learning means there is error in the reward prediction:
𝛿(t) = r(t) + γV(t+1) - V(t)
with:
𝛿(t): error in the reward prediction at time t
r(t): actual reward at time t
γ: discount factor (makes t+1 slightly less important than t)
V(t): estimate of all future reward at time t
V(t+1): estimate of all future reward after time t (from t+1 onwards)
This is why the algorithm is called Temporal Difference Learning (t vs. t+1)
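A minimal TD(0) sketch of this update over a short trial in which the reward arrives a few steps after the cue (learning rate, γ and trial length are assumed):

```python
alpha, gamma = 0.1, 0.95
T = 5                       # time steps per trial; reward arrives at the last step
V = [0.0] * (T + 1)         # V[t]: estimate of all future reward from time t (V[T] stays 0)

for trial in range(200):
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0       # actual reward only at the end of the trial
        delta = r + gamma * V[t + 1] - V[t]  # temporal-difference error (t vs. t+1)
        V[t] += alpha * delta                # move the prediction toward r + gamma*V(t+1)

print([round(v, 2) for v in V])  # value propagates backwards in time from the reward toward earlier steps
```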
Give 3 relevant cases if no reward is expected after time t
1) reward present, but not predicted (V(t)=0) => 𝛿(t) > 0
2) reward absent, although predicted => 𝛿(t) < 0
3) learning complete: reward correctly predicted => 𝛿(t) = 0
How can you relate these rules to the network structure?
If the error is 0, the prediction from the sensory network equals the actual reward.
A positive error –> stronger synaptic connections are needed –> until there is a match
What is still missing from this network compared to the equations?
Time is still a problem: usually the stimulus occurs earlier than the reward (there could be hours in between).
A significant delta (error) can occur even before an actual reward has been received; this is called a "surrogate prediction error",
e.g. r(t) = 0 but γV(t+1) is large & positive and V(t) = 0; learning will occur!
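A worked example with illustrative numbers: take γ = 0.9, r(t) = 0, V(t+1) = 1 and V(t) = 0; then δ(t) = 0 + 0.9·1 - 0 = 0.9 > 0, so learning occurs at time t even though no actual reward was delivered there.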
What could fulfil this role in the network?
Temporal difference: role of dopamine? Could the role of "error-coding unit" be fulfilled by dopamine cells? (hypothesis)
Relate this concept to an application
An animal has learned that cue X (at t+1; deer) precedes reward by a short delay, but not yet that another cue Y (at t; scent) precedes reward by a slightly longer delay
What happens in the network as a result of the surrogate prediction error (d(t) > 0)?
- a value will be assigned to cue Y occurring at time t (input layer; cue Y = one node)
- this assignment results in altered synaptic weights of neurons responsive to Y (in the hidden layer):
synaptic change = presynaptic input (Y) * postsynaptic activity * error
(see the sketch below)
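A minimal sketch of this three-factor update for the synapses of a hidden unit responsive to cue Y (weights, input pattern and error value are illustrative assumptions):

```python
import numpy as np

alpha = 0.05
rng = np.random.default_rng(1)
w = rng.uniform(0.05, 0.15, size=3)  # small initial synaptic weights onto one hidden unit

pre = np.array([1.0, 0.0, 1.0])      # presynaptic input pattern carrying cue Y
post = np.tanh(w @ pre)              # postsynaptic activity of the hidden unit
delta = 0.8                          # positive (surrogate) prediction error broadcast by the error-coding unit

# three-factor rule: synaptic change = presynaptic input * postsynaptic activity * error
w += alpha * pre * post * delta      # only synapses with active presynaptic input (cue Y) are strengthened
```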
What does this mean temporally during learning?
Thus, during learning, attribution of value occurs ‘backwards in time’
(“backwards referral”; because Y
precedes X)
What physiological network could this be mimicking?
Mesolimbic dopamine projections (VTA, ventral tegmental area) to striatum and frontal cortex
Ventral striatal projections then feed back to the VTA
Describe research that investigates this relationship between reinforcement learning and this network
Same as previously with the unit recordings from mesencephalic DA neurons in monkeys:
Before learning:
DA cell responds to reward but not to the predictive CS (sound)
After learning:
DA cell does not respond to reward when it is predicted by the CS
Backwards shift of response towards the CS+!
Evidence in favour of the TD model
Fully trained situation: Do the findings confirm a TD model of DA cells?
Reward at the expected time does not elicit a response in DA cells
*Reward omission at expected time: ~ decrease in firing
=>agrees with “negative error in reward prediction”
*Reward shifts to an unexpected time: increase in firing
=>agrees with “positive error in reward prediction”
Overall: strong evidence in favour of predictive coding-in-time
Give some uncertainties about dopamine & reinforcement Learning models in terms of neuroanatomy
learning rule supposes convergence of DA terminals with glutamatergic inputs on dendrites – is this found?
=> perhaps in the striatum (triads are rare)
“Triadic” configuration: Cortical and dopamine afferents inputting on striatal neuron dendrites
synaptic change = presynaptic activity * postsynaptic activity * error
Give some uncertainties about dopamine & reinforcement Learning models in terms of neurophysiology (2)
DA neurons can fire in advance of eye saccades, so before the animal may begin to identify a cue (CS)
DA neurons are more broadly tuned than just to reward or reward-predicting stimuli (e.g., novelty; uncertain outcome; movement; pain)
Give some uncertainties about dopamine & reinforcement learning models in terms of general function (behaviour)
DA may be involved in associative learning because of a more general role in sensorimotor processes (e.g. Parkinson’s)
Give some uncertainties about dopamine & reinforcement Learning models in terms of cellular neurophysiology
does DA affect synaptic plasticity (e.g., LTP) according to the TD learning rule? => still debated
Overall what could you conclude based on this evidence?
DA may have a more general sensorimotor function – reacting to unexpected salient events; still uncertain whether it literally mediates TD-RL
What is an alternative to dopamine for these proposed circuits?
Pyramidal Cell networks in cortex & amygdala:
*Reinforcement Learning also possible using Glutamatergic signalling (-> pyr. cells)
*Brain networks (using Glu) need to store more than just stimulus value, e.g. knowledge of outcome identities ("what") =>
model-based learning: learning an internal model of the causal relationships between specific stimuli, actions and outcomes
In this glutamatergic model, where could reinforcement learning take place?
Medial PFC
Orbitofrontal PFC
Amygdala: can have error-coding properties according to research
Describe study structure for coding of reward and reward prediction in this network
Evidence for coding of reward and reward prediction in dorsolateral prefrontal cortex:
Paradigm for monkeys –> lever-press task –> can choose left or right for reward
Spatial delayed response task: instruction cue coupled to objects differing in reward value (raisin, apple and cabbage; the monkey prefers raisin or apple over cabbage) => testing the relative reward value of objects
Describe study results for coding of reward and reward prediction in this network
There is coupling between instruction cue and specific outcome observed
If there are 2 different rewards, one more preferred (A > B, 'high') –> when there is a choice, high firing activity for the highly preferred stimulus but lower firing activity for the less preferred stimulus
Once B becomes the preferred food (over an even less preferred food C) –> this is reflected in the firing activity through an increase in firing rate
Therefore PFC cells can code relative expected reward
How compatible are these findings with the proposed DA networks?
compatible with DA model –> PFC sends projections to VTA and SN
Compatible with:
a) the dopamine model of TD-RL (here, PFC may code V(t))
b) the glutamatergic model of RL