Week 9: Rescorla-Wagner and Temporal Difference Learning Flashcards
There are other related model-free algorithms, such as temporal difference learning
Temporal Difference Learning (2)
- It is also a model-free RL algorithm
- It is different from Q-learning
Definition of classical conditioning
a learning process that occurs when two stimuli are repeatedly paired; a response that is at first elicited by the second stimulus is eventually elicited by the first stimulus alone.
UCS
Unconditioned stimulus
UCS is a stimulus that
leads to an automatic response
Neutral stimulus is a
stimulus that does not trigger a response on its own
Conditioned stimulus is
a stimulus that was once neutral (did not trigger a response) but now leads to a response
Unconditioned response (UCR)
is an automatic response that occurs without thought when an unconditioned stimulus is present
Conditioned response (CR)
Is a learned response created where no response existed before
Pavlov’s Dog Experiment (4)
- Before conditioning, the dog was presented with food (UCS), which automatically triggered a salivation response (UCR)
- Before conditioning, the dog heard a bell ring (NS), which led to no response from the dog
- During conditioning, the dog was presented with food (UCS) together with the sound of the bell (NS), which led to salivation (UCR)
- After conditioning, the dog salivated (CR) when it heard the bell ring (CS)
Diagram of Pavlov’s dog: classical conditioning experiment
Now we view S as
the conditioned stimulus (the bell, after pairing)
Do not confuse ‘S’ as
state
Now we view R as
reinforcement (i.e., the food)
Table of acquisition, extinction, and partial reinforcement in classical conditioning (diagram)
Acquisition in the table is where (3)
S (CS) is paired with reward in Phase 1
Nothing in Phase 2
Then we get a response to S
Extinction in the classical conditioning table is where (3)
S (CS) is paired with R in Phase 1
S (CS) is presented on its own in Phase 2
Then we see no response to S
Partial reinforcement in the table of classical conditioning (2)
where we occasionally present S with R
This leads to a weak response to S
Simple way to model that table of classical conditioning (CC) (acquisition, extinction, and partial reinforcement) (2)
A stimulus neuron that has an input weight to a reward neuron
If the r neuron is active, we predict reinforcement (reward/punishment)
Use a simple delta-rule model in the simple model of CC (acquisition, extinction, and partial reinforcement)
If stimulus S is present… (2)
S=1 (S=0 if not present):
then update weight: w → w + εSδ
δ
is delta, the prediction error
ε
epsilon, a very small number (the learning rate)
What is δ in the simple delta-rule model of CC, w → w + εSδ?
(4)
δ = r − wS,
i.e. the difference between the actual reward and the predicted reward (wS).
This is error-driven learning: we change w such that δ = 0 in the presence of stimulus S.
When the error between predicted and actual reward is 0, the prediction is perfect and we know what to expect.
Diagram of a graph of the simple delta-rule model of CC (explaining acquisition, extinction, and partial reinforcement) (3)
When the weight grows slowly = acquisition phase
When we do not present the reward, the weight decays = extinction
When we occasionally present the reward, the weight hovers at a weak, intermediate value = partial reinforcement (see the sketch below)
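A minimal Python sketch of this behaviour, assuming an illustrative learning rate (eps = 0.1) and trial counts (these numbers are not from the notes):

```python
import random

def delta_update(w, S, r, eps=0.1):
    """One delta-rule step: delta = r - w*S, then w -> w + eps*S*delta."""
    delta = r - w * S
    return w + eps * S * delta

random.seed(0)
w = 0.0
history = []

# Acquisition: S is paired with reward (r = 1), so w grows slowly towards 1
for _ in range(50):
    w = delta_update(w, S=1, r=1)
    history.append(w)

# Extinction: S is presented without reward (r = 0), so w decays back towards 0
for _ in range(50):
    w = delta_update(w, S=1, r=0)
    history.append(w)

# Partial reinforcement: reward on roughly half the trials, so w hovers at a weak, intermediate value
for _ in range(50):
    w = delta_update(w, S=1, r=1 if random.random() < 0.5 else 0)
    history.append(w)

print(history[49], history[99], history[-1])   # ~1.0, ~0.0, somewhere around 0.5
```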
Diagram of the model where 2 stimuli predict reinforcement (2)
We have S1 and S2 predicting reinforcement
If this (reinforcement) neuron is active, then we predict reinforcement (reward/punishment)
V is the
expected reinforcement (reward) based on all stimuli
Two ways to model how 2 stimuli predict reinforcement? (2)
Do we calculate a separate delta for each S (stimulus; CS)?
Or do we calculate a single delta from the sum over the Ss? (see the sketch below)
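A rough sketch of the two candidate update rules, in Python; the function names and learning rate are illustrative assumptions:

```python
# Option 1: a separate error term for each stimulus (stimuli do not interact)
def update_per_stimulus(w, S, r, eps=0.1):
    # w: list of weights, S: list of 0/1 stimulus indicators, r: reinforcement
    return [wi + eps * Si * (r - wi * Si) for wi, Si in zip(w, S)]

# Option 2: a single shared error term based on the summed prediction V
def update_shared_error(w, S, r, eps=0.1):
    V = sum(wi * Si for wi, Si in zip(w, S))   # expected reinforcement from all stimuli
    delta = r - V                              # one error term for all stimuli
    return [wi + eps * Si * delta for wi, Si in zip(w, S)]
```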
Blocking experiments (3) by Kamin 1969
- if a dog is repeatedly exposed to a tone (the first conditioned stimulus, CS1), together with food (the unconditioned stimulus, US), the dog salivates when the tone is presented (conditioned response, CR).
- After several further conditioning trials, this time with the tone (CS1) and a light (CS2) together with the US, the dog does not salivate (or gives only a weak response) to the light (CS2) when tested separately later.
- Stimulus control by CS2 has then been blocked by the earlier pairing of CS1 with the US
Overshadowing vs Blocking
Overshadowing experiments (5)
In the example of blocking, the tone (CS1) was pretrained prior to being compounded with the light (CS2), and subjects learned little about the light, the added element.
Sometimes, even if there is no prior training of an element of a compound CS, subjects will still learn little about one of the elements.
This occurs if one element is more “salient” than the other (other things being equal, a subject trained with a more salient CS will learn more rapidly than a subject trained with a less salient CS).
If a light CS is more salient than a tone CS, the effect of pairing a UCS with the light + tone compound will be to strongly associate the light with the UCS (food), with little associative strength developing between the tone and the UCS (the light overshadows the tone)
There will then be little or no salivation response (CR) to S2 (the tone)
CC (classical conditioning) can be modelled with
R-W rule
Blocking and overshadowing experiments indicate that the way to model 2 stimuli predicting reinforcement is (3)
the second way: taking the difference between the reinforcement and the expected reinforcement (V) given all stimuli
(Reinforcement = general term; could be reward or punishment)
producing a single error term for all stimuli (this is known as the Rescorla-Wagner rule; see the sketch below)
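A minimal sketch of how the shared-error (Rescorla-Wagner) rule reproduces blocking; the learning rate and trial counts are assumed for illustration:

```python
def rw_update(w, S, r, eps=0.1):
    """Rescorla-Wagner step: one shared error term, with V = sum_i w_i * S_i."""
    V = sum(wi * Si for wi, Si in zip(w, S))
    delta = r - V
    return [wi + eps * Si * delta for wi, Si in zip(w, S)]

w = [0.0, 0.0]            # associative strengths for CS1 (tone) and CS2 (light)

# Phase 1: CS1 alone is paired with the reward -> w[0] climbs towards 1
for _ in range(100):
    w = rw_update(w, S=[1, 0], r=1)

# Phase 2: the CS1 + CS2 compound is paired with the reward.
# V is already ~1 from CS1 alone, so delta ~ 0 and CS2 gains almost nothing.
for _ in range(100):
    w = rw_update(w, S=[1, 1], r=1)

print(w)   # roughly [1.0, 0.0]: learning about CS2 has been blocked
```

A per-stimulus error term (one delta per CS) would instead drive both weights towards 1, so it fails to capture blocking.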
What is Rescorla-Wagner Model? (2)
formal model of the circumstances under which Pavlovian conditioning occurs.
It attempts to describe the changes in associative strength (V) between a signal (conditioned stimulus, CS) and the subsequent stimulus (unconditioned stimulus, US) as a result of a conditioning trial.
Expected reinforcement, V, has its own formula (see below)
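A standard way of writing that formula (the usual Rescorla-Wagner form, not copied verbatim from the notes): V is the summed prediction from all stimuli present, and the single error term and weight update follow from it.

```latex
V = \sum_i w_i S_i, \qquad
\delta = r - V, \qquad
w_i \rightarrow w_i + \varepsilon S_i \delta
```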
What about a temporal sequence of stimuli such that (4)
- Stimulus 1 (CS1) in Phase 1 leads to R (reinforcement = reward/punishment)
- In Phase 2, we introduce Stimulus 2 (S2, a CS) which predicts S1
- We test the response to S2
- This is 2nd-order conditioning
What is 2nd-order conditioning?
Second-order conditioning (SOC) describes a phenomenon whereby a conditioned stimulus (CS) acquires the ability to elicit a conditioned response (CR) without ever being directly paired with an unconditioned stimulus (US)
2nd order conditioning example (5)
For instance, second-order conditioning can be demonstrated using the following procedure:
a CS1 (e.g., a light) is paired with a UCS (e.g., food) in phase 1;
then CS2 (e.g., a tone) is paired with CS1 (the light) in phase 2.
Tested response to S2?
This will usually result in a CR relevant to the original UCS (food) being evoked by CS2, even though CS2 has never been directly paired with food (e.g., Rescorla, 1980; Rizley & Rescorla, 1972).
The Rescorla-Wagner rule only works for
direct associations of S (CS) with R
The R-W rule does not work for a temporal sequence of stimuli since (2)
there is no ‘r’ in Phase 2, and the delta rule depends on r!
Our delta is the difference between the received and the expected reward
The problem that the R-W rule does not work with a temporal sequence of stimuli relates to the temporal credit assignment problem again (2)
At the time of presentation of S2 we don’t know if it will lead to reward
We don’t know which past actions were pivotal for a good outcome
In order to know which of the past actions was pivotal for a good outcome (the temporal credit assignment problem), we need (2)
to consider the time at which a stimulus/state occurred that is predictive of future reward
i.e., linking classical conditioning with RL (reinforcement learning)
Model that links RL and CC together is called
temporal difference learning model
In the temporal difference learning model we would need
the value function V at time t, V(t), to predict the sum of future rewards, not just the immediate reward r(t), so we can learn S2 => S1 => R (reinforcement = general term, e.g. food reward or punishment)
i.e., we want (2)
the sum of all rewards at times τ from now (the present time t) onwards
to decompose V(t) into the current reward and an estimate of subsequent reward (see the equation below)
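Written out in standard TD notation (an assumed undiscounted form, consistent with the two points above):

```latex
V(t) = \Big\langle \sum_{\tau \ge t} r(\tau) \Big\rangle
     = \langle r(t) \rangle + V(t+1)
```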
To make V(t) predict the sum of all rewards from time t onwards, we use the (3) – Temporal Difference Learning
delta rule to ensure this happens:
Delta at time t is the current reward + the estimated future reward − the expected reinforcement (what I currently expect): δ(t) = r(t) + V(t+1) − V(t)
So delta becomes the difference between the current reward plus the estimate of all future reward, and the current prediction V(t) (see the sketch below)
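A minimal Python sketch of this TD delta rule on a single S2 → S1 → reward trial structure; the time indices, learning rate, and trial count are illustrative assumptions:

```python
eps = 0.1
V = [0.0, 0.0, 0.0, 0.0]    # value estimates: t=0 is S2, t=1 is S1, t=2 is reward time, t=3 is after the trial
rewards = [0.0, 0.0, 1.0]   # reinforcement is delivered only at the last step of the trial

for trial in range(200):
    for t in range(3):
        delta = rewards[t] + V[t + 1] - V[t]   # TD error: current reward + future estimate - current estimate
        V[t] += eps * delta                    # move V(t) towards the TD target

print(V)   # V[1] (S1) and, via S1, V[0] (S2) both approach 1:
           # the reward prediction propagates backwards to S2, capturing 2nd-order conditioning
```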
Comparing delta rule in temporal difference learning to R-W rule
In the R-W rule, delta is based on the current reward only: it is the difference between the current reward (reinforcement) and the expected reinforcement given all stimuli; the TD delta additionally includes the estimate of future reward
Our TD-learning delta equation looks suspiciously similar to the
Q-learning update formula (it maps well onto it)
In TD learning more generally we don’t need to be (2)
restricted to 1 time step forward
We can introduce further time steps in the future and discount them with a discount factor, since uncertainty increases the further into the future we look
TD(0), the 0-order version, is closest to
Q-learning
We can extend TD learning to include more states in the future, e.g., 2 time steps ahead (2) = similar to Q-learning
We can estimate two steps into the future, and if reward is to be expected we update our V (the current estimate of expected reinforcement)
This has the same form of update as we had in Q-learning, except it does not update state-action pairs (see the sketch below)
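A sketch of the discounted, two-step version of the update (gamma is the discount factor); this is the standard generalisation written under the same assumed notation as above, not the exact formula from the lecture:

```python
def td_two_step_update(V, rewards, t, eps=0.1, gamma=0.9):
    """Two-step TD target: r(t) + gamma*r(t+1) + gamma^2 * V(t+2)."""
    target = rewards[t] + gamma * rewards[t + 1] + gamma**2 * V[t + 2]
    delta = target - V[t]          # error between the two-step target and the current estimate
    V[t] += eps * delta            # same form of update as Q-learning, but on V rather than state-action pairs
    return delta

# Illustrative usage with assumed values
V = [0.0, 0.0, 0.0, 0.0]
rewards = [0.0, 0.0, 1.0, 0.0]
td_two_step_update(V, rewards, t=0)
```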
Blocking and overshadowing experiments demonstrate that
not all stimuli present during learning subsequently control behaviour