Week 9: Rescorla-Wagner and Temporal Difference Learning Flashcards

1
Q

There are other related model-free algorithms

A

such as temporal difference learning

2
Q

Temporal Difference Learning (2)

A
  • It is also a model-free RL algorithm
  • It is different from Q-learning
3
Q

Definition of classical conditioning

A

a learning process that occurs when two stimuli are repeatedly paired; a response that is at first elicited by the second stimulus is eventually elicited by the first stimulus alone.

4
Q

UCS

A

Unconditioned stimulus

5
Q

UCS is a stimulus that

A

leads to an automatic response

6
Q

Neutral stimulus is a

A

stimulus that does not trigger a response on its own

7
Q

Conditioned stimulus is

A

a stimulus that was once neutral (did not trigger a response) but now leads to a response

8
Q

Unconditioned response (UCR)

A

is an automatic response that occurs without thought when an unconditioned stimulus is present

9
Q

Conditioned response (CR)

A

is a learned response created where no response existed before

10
Q

Pavlov’s Dog Experiment (4)

A
  • Before conditioning, the dog was presented with food (UCS), which automatically triggered a salivation response (UCR)
  • Before conditioning, the dog heard a bell ring (NS), which led to no response from the dog
  • During conditioning, the dog was presented with food (UCS) together with the sound of the bell (NS), which led to salivation (UCR)
  • After conditioning, the dog salivated (CR) when it heard the bell ring (CS)
11
Q

Diagram of Pavlov’s dog: classical conditioning experiment

A
12
Q

Now we view S as

A

conditioned stimulus (the bell, after pairing)

13
Q

Do not confuse ‘S’ with

A

the state (here, S denotes the stimulus, as in the previous card)

14
Q

Now we view R as

A

reinforcement (i.e., the food)

15
Q

Table of acquisition, extinction, and partial reinforcement in classical conditioning (diagram)

A
16
Q

Acquisition in the table is where (3)

A

S (CS) is paired with reward (R) in Phase 1
Nothing happens in Phase 2
Then we get a response to S

17
Q

Extinction in the classical conditioning table is where (3)

A

S (CS) is paired with R in Phase 1
S (CS) is presented on its own in Phase 2
Then we see no response to S

18
Q

Partial reinforcement in the classical conditioning table (2)

A

where we occasionally present S with R
This leads to a weak response to S

19
Q

Simple way to model the classical conditioning (CC) table (of acquisition, extinction and partial reinforcement) (2)

A

A stimulus neuron with an input weight w onto a reward-prediction neuron r

If the r neuron is active, we predict reinforcement (reward/punishment)

20
Q

Use a simple delta-rule model in the simple model of CC (of acquisition, extinction and partial reinforcement)

If stimulus S is present… (2)

A

S=1 (S=0 if not present):

then update weight: w → w + εSδ

21
Q

δ

A

is delta, the prediction-error term

22
Q

ε

A

epsilon, a very small number (the learning rate)

23
Q

What is δ in the simple delta-rule model of CC

w → w + εSδ

(4)

A

δ = r − wS,

i.e., the difference between actual reward and predicted reward (wS).

This is error-driven learning: changing w such that δ = 0 in the presence of stimulus S.

When the error between predicted and actual reward is 0, the prediction is perfect and we know what to expect.

24
Q

Diagram of the graph of the simple delta-rule model of CC (explaining acquisition, extinction and partial reinforcement) (3)

A

When the weight grows slowly = acquisition phase.

When we do not present the reward, the weight decays = extinction.

Under partial reinforcement, the weight settles at an intermediate value, giving a weak response to S (see the sketch below).
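
A minimal Python sketch of these three phases, assuming the delta rule from the previous cards (the function name, trial counts, ε and reward probabilities are illustrative choices, not from the lecture):

import random

def run_phase(w, trials, reward_prob, eps=0.1):
    # Delta rule: w -> w + eps * S * delta, with delta = r - w*S
    for _ in range(trials):
        S = 1                                          # stimulus present on every trial
        r = 1 if random.random() < reward_prob else 0  # reinforcement on this trial
        delta = r - w * S                              # prediction error: actual - predicted
        w += eps * S * delta                           # error-driven weight update
    return w

random.seed(0)
w = 0.0
w = run_phase(w, 100, reward_prob=1.0)   # acquisition: w grows towards 1
print("after acquisition:", round(w, 2))
w = run_phase(w, 100, reward_prob=0.0)   # extinction: w decays towards 0
print("after extinction:", round(w, 2))
w = run_phase(w, 100, reward_prob=0.5)   # partial reinforcement: w hovers near 0.5
print("after partial reinforcement:", round(w, 2))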

25
Q

Diagram of the model in which 2 stimuli predict reinforcement (2)

A

We have S1 and S2, which both predict reinforcement

If this (reinforcement) neuron is active, then we predict reinforcement (reward/punishment)

26
Q

V is the

A

expected reinforcement (reward) based on all stimuli

27
Q

Two ways to model how 2 stimuli predict reinforcement? (2)

A

Do we calculate a delta for each S (stimulus; CS)?

Or do we calculate a single delta from the sum over the Ss?

28
Q

Blocking experiments (Kamin, 1969) (3)

A
  • If a dog is repeatedly exposed to a tone (the first conditioned stimulus, CS1) together with food (the unconditioned stimulus, US), the dog salivates when the tone is presented (conditioned response, CR).
  • After several further conditioning trials, this time with the tone (CS1) and a light (CS2) presented together with the US, the dog does not salivate (or gives only a weak response) to the light (CS2) when it is tested separately later.
  • Stimulus control by CS2 has then been blocked by the earlier pairing of CS1 with the US.
29
Q

Overshadowing vs Blocking

Overshadowing experiments (5)

A

In the example of blocking, one stimulus (CS1) was pretrained prior to being compounded with the other (CS2), and subjects learned little about the added element.

Sometimes, even if there is no prior training of an element of a compound CS, subjects will still learn little about one of the elements.

This occurs if one element is more “salient” than the other (other things being equal, a subject trained with a more salient CS will learn more rapidly than a subject trained with a less salient CS).

If a light CS is more salient than a tone CS, the effect of pairing a UCS with the light + tone compound will be to strongly associate the light with the UCS (food), with little associative strength developing between the tone and the UCS (the light overshadows the tone).

There will then be no salivation response to S2 (the tone) when it is tested alone.

30
Q

CC (classical conditioning) can be modelled with

A

the R-W (Rescorla-Wagner) rule

31
Q

Blocking and overshadowing experiments indicate that the way to model 2 stimuli predicting reinforcement is (3)

A

The second way: taking the difference between reinforcement and the expected reinforcement (V) given all stimuli.

(Reinforcement is the general term; it could be reward or punishment.)

This produces a single error term for all stimuli (this is known as the Rescorla-Wagner rule); see the sketch below.
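
A short Python sketch of the Rescorla-Wagner rule reproducing blocking, assuming V = Σᵢ wᵢSᵢ and one shared error δ = r − V (phase lengths and ε are illustrative assumptions):

import numpy as np

eps = 0.2
w = np.zeros(2)                        # weights for S1 and S2

def rw_trial(w, S, r):
    # One R-W trial: a single error term, shared by all present stimuli
    V = np.dot(w, S)                   # expected reinforcement from all stimuli
    delta = r - V                      # one error for the whole compound
    return w + eps * S * delta         # each present stimulus updated by the same delta

for _ in range(50):                    # Phase 1: S1 alone paired with reward
    w = rw_trial(w, np.array([1.0, 0.0]), r=1.0)
for _ in range(50):                    # Phase 2: S1 + S2 compound paired with reward
    w = rw_trial(w, np.array([1.0, 1.0]), r=1.0)

print(w)   # w[0] ≈ 1, w[1] ≈ 0: S2 gains little strength, i.e. blocking

Because δ is already near 0 when the compound phase starts (S1 fully predicts r), S2’s weight barely moves; computing a separate delta per stimulus would not reproduce this.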

32
Q

What is Rescorla-Wagner Model? (2)

A

formal model of the circumstances under which Pavlovian conditioning occurs.

It attempts to describe the changes in associative strength (V) between a signal (conditioned stimulus, CS) and the subsequent stimulus (unconditioned stimulus, US) as a result of a conditioning trial.

33
Q

Expected reinforcement, V, has its

A

own formula: V = Σᵢ wᵢSᵢ, the weighted sum over all stimuli present (the standard R-W form)

34
Q

What about a temporal sequence of stimuli, such that (4)

A
  • Stimulus 1 (CS) in Phase 1 leads to R (reinforcement = reward/punishment)
  • Introduce Stimulus 2 (S2), a CS which predicts S1
  • Test the response to S2
  • This is 2nd-order conditioning
35
Q

What is 2nd-order conditioning?

A

Second-order conditioning (SOC) describes a phenomenon whereby a conditioned stimulus (CS) acquires the ability to elicit a conditioned response (CR) without ever being directly paired with an unconditioned stimulus (US)

36
Q

2nd order conditioning example (5)

A

For instance, second-order conditioning can be demonstrated using the following procedure:

a CS1 (e.g., a light) is paired with a UCS (e.g., food) in phase 1;

then CS2 (e.g., a tone) is paired with CS1 (the light) in phase 2.

The response to CS2 is then tested.

This will usually result in a CR relevant to the original UCS (food) being evoked by CS2, even though CS2 has never been directly paired with food (e.g., Rescorla, 1980; Rizley & Rescorla, 1972).

37
Q

The Rescorla-Wagner rule only works for

A

direct associations of S (CS) with R

38
Q

The R-W rule does not work for a temporal sequence of stimuli since (2)

A

there is no ‘r’ in Phase 2, and the delta rule depends on r!

Our delta is the difference between received and expected reward.

39
Q

The problem that the R-W rule does not work with a temporal sequence of stimuli relates to the temporal credit assignment problem again (2)

A

At the time S2 is presented, we don’t know whether it will lead to reward.

We don’t know which past actions were pivotal for a good outcome.

40
Q

In order to know which of the past actions was pivotal for a good outcome (the temporal credit assignment problem), we need (2)

A

to consider the time at which a stimulus/state occurred that is predictive of future reward

i.e., linking classical conditioning with RL (reinforcement learning)

41
Q

Model that links RL and CC together is called

A

temporal difference learning model

42
Q

In the temporal difference learning model we would need

A

the value function V at time t, V(t), to predict the sum of future rewards, not just the immediate reward, so we can learn that S2 predicts S1, which predicts R (reinforcement in the general sense: e.g., a food reward, or a punishment)

43
Q

In the temporal difference learning model we need V(t) to predict the sum of future rewards, not just the immediate r(t), so we can learn S2 => S1 => R

i.e., we want (2)

A

the sum of all rewards at times τ (tau) greater than now, the present time t

to decompose V(t) into the current reward and an estimate of the subsequent reward
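
Written out (a standard TD formulation, consistent with this card): V(t) = ⟨ r(t) + r(t+1) + r(t+2) + … ⟩ = r(t) + V(t+1), i.e., the value now is the current reward plus the value carried over from the next time step.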

44
Q

To get the sum of all rewards at times τ greater than now (time t), we use the (3) - Temporal Difference Learning

A

delta rule to ensure this happens:

Delta at time t is the current reward plus the estimated future reward, minus the expected reinforcement (what I expect): δ(t) = r(t) + V(t+1) − V(t)

So delta becomes the difference between the reward received now plus the estimate of all future reward, and the current expectation; see the sketch below.
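
A minimal Python sketch of this TD delta rule on a short trial (the timing, ε and trial count are illustrative assumptions): S2 occurs at t=0, S1 at t=1, and the reward arrives at t=2; over repeated trials, value propagates backwards from the reward to S1 and then to S2.

import numpy as np

eps = 0.1
T = 3                            # time steps per trial: S2 at t=0, S1 at t=1, r at t=2
V = np.zeros(T + 1)              # V[t]: estimate of the sum of future reward from time t
r = np.array([0.0, 0.0, 1.0])    # reward arrives only at the final step

for trial in range(200):
    for t in range(T):
        delta = r[t] + V[t + 1] - V[t]   # TD error: reward now + next estimate - expectation
        V[t] += eps * delta              # move V[t] towards r(t) + V(t+1)

print(V[:T])   # V[0] and V[1] approach 1: S2 and S1 both come to predict the reward

This is exactly what the R-W rule cannot do: at t=0 and t=1 there is no r, but here V(t+1) stands in for it, so value still flows backwards.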

45
Q

Comparing delta rule in temporal difference learning to R-W rule

A

In the R-W rule, delta is based on the current reward only: it is the difference between the current reward (reinforcement) and the expected reinforcement given all stimuli. The TD delta additionally includes the estimate of future reward, V(t+1).

46
Q

Our TD-learning delta equation looks suspiciously similar to the

A

Q-learning update formula; it maps well onto it

47
Q

In T-D learning more generally we don’t need to be (2)

A

restricted to 1 time step forward

We can introduce further time steps in the future and discount them using a discount factor (γ), since the further we look into the future, the more uncertainty increases

48
Q

TD(0) first; the 0-order version is closest to

A

Q-learning

49
Q

We can extend TD learning to include more states in the future, e.g., 2 time steps ahead (2) = similar to Q-learning

A

we can estimate two steps into the future, and if reward is to be expected we update our V (the current estimated value of expected reinforcement)

This is the same form of update as we had in Q-learning, except it does not update state-action pairs; see the comparison below
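
Side by side (standard textbook forms, with γ the discount factor from the earlier card; not copied from the lecture):

Q-learning:  Q(s, a) → Q(s, a) + ε [ r + γ max_a' Q(s', a') − Q(s, a) ]
TD(0):       V(t) → V(t) + ε [ r(t) + γ V(t+1) − V(t) ]

Both move an estimate towards “reward received now plus discounted estimate of what follows”; only the quantity being estimated differs (state-action values vs. state/time values).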

50
Q

Blocking and overshadowing experiments demonstrate that

A

not all stimuli present during learning subsequently control behaviour