Reinforcement + DA Flashcards
Main idea in reward-dependent learning
Learning depends on prediction: once a reward is entirely predicted by a sensory cue, no further learning takes place.
Why might learning be driven by error?
Once a rat learns that presentation of a light is consistently followed by food, no association develops to a new stimulus paired with the light (e.g. a sound) (Kamin, 1969); i.e. no further learning takes place. Learning therefore appears to be driven by deviations, or “errors”, between the predicted times and amounts of rewards and their actually experienced times and magnitudes.
DA neurons in which areas are associated with reward-prediction?
Dopamine neurons of the ventral tegmental area (VTA) and substantia nigra have long been identified with the processing of rewarding stimuli. These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example, the striatum, nucleus accumbens, and frontal cortex.
Multiple lines of evidence including ___ suggest DA has a role in reward
Evidence from drugs such as amphetamine and cocaine, which exert their addictive actions in part by prolonging the influence of dopamine on target neurons (Koob, 1992), and from electrical self-stimulation studies, in which rats press bars to excite dopamine neurons at the site of an implanted electrode (Phillips, 1975), implicates midbrain dopaminergic activity in reward-dependent learning.
How do single dopamine neurons in monkeys respond to appetitive stimuli?
When monkeys are presented with various appetitive stimuli, such as a morsel of apple (Schultz, 1986), dopamine neurons respond with short, phasic activations.
What are the characteristics of DA phasic activity?
These phasic activations do not, however, discriminate between different types of rewarding stimuli, and they are not elicited by aversive stimuli such as air puffs to the hand or drops of saline to the mouth. This homogeneous response occurs in the majority of dopamine neurons (55 to 80%).
What happens to DA firing when a reward behaviour is learned?
Once a reward behaviour is learned, two remarkable changes occur in the dopamine neuron output: (i) the primary reward no longer elicits a phasic response; and (ii) the onset of the (predictive) stimulus now causes a phasic activation in dopamine cell output.
What happens to DA firing when a predicted reward is not delivered?
In trials where the reward is not delivered at the appropriate time after the onset of the light, dopamine neurons are depressed markedly below their basal firing rate exactly at the time that the reward should have occurred.
What are the implications of these studies on DA firing?
These studies promote the idea that dopaminergic activity encodes expectations about external stimuli or reward.
Why has the TD algorithm been useful?
The TD algorithm is particularly well suited to understanding the functional role played by the dopamine signal in terms of the information it constructs and broadcasts.
What has TD work used to study DA?
This work has used fluctuations in dopamine activity in dual roles: (i) as a supervisory signal for synaptic weight changes, and (ii) as a signal that directly and indirectly influences the choice of behavioral actions in humans and bees.
What are the assumptions of TD?
First, the computational goal of learning is to use sensory cues to predict a discounted sum of all future rewards, V(t), within a learning trial.
The second main assumption is the Markovian one: the presentation of future sensory cues and rewards depends only on the immediate (current) sensory cues, not on past sensory cues.
What do the components of the TD algorithm represent?
V(t) = E[r(t) + γr(t+1) + γ²r(t+2) + …], where r(t) is the reward at time t and E[·] denotes the expected value of the sum of future rewards up to the end of the trial. The discount factor 0 ≤ γ ≤ 1 makes rewards that arrive sooner more important than rewards that arrive later.
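As a concrete illustration, the discounted sum can be computed directly from a trial's reward sequence. The Python sketch below is an added example, not part of the original card; the function name and sample rewards are assumptions.

# Minimal sketch of V(t) at t = 0 for a deterministic reward stream,
# where the expectation E[·] reduces to the sum itself.
def discounted_return(rewards, gamma=0.9):
    # sum of gamma**k * r(t+k) over the remainder of the trial
    return sum(gamma**k * r for k, r in enumerate(rewards))

# A trial whose single reward of 1.0 arrives on the third time step:
print(discounted_return([0.0, 0.0, 1.0, 0.0]))  # 0.81; the same reward delivered immediately would count as 1.0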
What does the definition of V(t) imply?
The definition of V(t) implies that it satisfies a condition of consistency through time: V(t) = E[r(t) + γV(t+1)]. Because of this recursion, there is information available at each instant in time that can act as a surrogate prediction error.
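A quick numerical check of this recursion (again an added sketch; deterministic rewards are assumed, so the expectation drops out):

# Verify the consistency condition V(t) = r(t) + gamma * V(t+1)
# along a deterministic reward stream.
def V(rewards, t, gamma=0.9):
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

rewards, gamma = [0.0, 0.0, 1.0, 0.0], 0.9
for t in range(len(rewards) - 1):
    assert abs(V(rewards, t, gamma) - (rewards[t] + gamma * V(rewards, t + 1, gamma))) < 1e-12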
What is another way of representing TD?
An error in the estimated predictions can now be defined using only information available at successive time steps:
δ(t) = r(t) + γV̂(t+1) − V̂(t), where V̂ denotes the current estimate of V.
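To tie this error signal back to the dopamine recordings described above, here is a hedged TD(0) sketch of the light-then-reward experiment. The state representation (one learnable prediction per post-cue time step), the parameter values, and all names are illustrative assumptions, not taken from the source. After training, δ is positive when the cue's prediction first arrives, near zero at the fully predicted reward, and negative at the time of an omitted reward, mirroring the transfer of the phasic response to the cue and the depression below baseline on omission trials.

import numpy as np

# Hedged TD(0) sketch: a cue (light) at step 2 is reliably followed by a
# reward at step 6. Each post-cue time step carries its own learnable
# prediction; steps before the cue cannot learn to predict, because the
# cue's own arrival is unpredicted.
T, cue_t, reward_t = 10, 2, 6
gamma, alpha = 1.0, 0.1
V_hat = np.zeros(T + 1)              # learned predictions V̂(t)

def run_trial(deliver_reward=True, learn=True):
    # One pass through a trial; returns the TD error δ(t) at each step.
    r = np.zeros(T)
    if deliver_reward:
        r[reward_t] = 1.0
    delta = np.zeros(T)
    for t in range(T):
        delta[t] = r[t] + gamma * V_hat[t + 1] - V_hat[t]
        if learn and t >= cue_t:     # only cue-driven states carry weights
            V_hat[t] += alpha * delta[t]
    return delta

for _ in range(1000):                # training: reward always follows the cue
    run_trial()

print(run_trial(learn=False))        # positive δ where the cue's prediction
                                     # first appears; ~0 at the predicted reward
print(run_trial(deliver_reward=False, learn=False))
                                     # negative δ exactly at the omitted reward time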