Week 9: Rescorla-Wagner and Temporal Difference Learning Flashcards
There are other related model-free algorithms, such as temporal difference learning
Temporal Difference Learning (2)
- It is also a model-free RL algorithm
- It is different from Q-learning
Definition of classical conditioning
a learning process that occurs when two stimuli are repeatedly paired; a response that is at first elicited by the second stimulus is eventually elicited by the first stimulus alone.
UCS
Unconditioned stimulus
UCS is a stimulus that
leads to an automatic response
Neutral stimulus is a
stimulus that does not trigger a response on its own
Conditioned stimulus is
a stimulus that was once neutral (did not trigger a response) but now leads to a response
Unconditioned response (UCR)
is an automatic response that occurs without thought when an unconditioned stimulus is present
Conditioned response (CR)
Is a learned response created where no response existed before
Pavlov’s Dog Experiment (4)
- Before conditioning, the dog was presented with food (UCS), which automatically triggered a salivation response (UCR)
- Before conditioning, the dog heard a bell ring (NS), which led to no response from the dog
- During conditioning, the dog was presented with food (UCS) together with the sound of the bell (NS), which led to salivation (UCR)
- After conditioning, the dog salivated (CR) when it heard the bell ring (CS)
Diagram of Pavlov’s dog: classical conditioning experiment
Now we view S as
the conditioned stimulus (the bell, after pairing)
Do not confuse ‘S’ as
state
Now we view R as
reinforcement (i.e., the food)
Table of acquisition, extinction, and partial reinforcement in classical conditioning (diagram)
Acquisition in the table is where (3)
S (CS) is paired with reward in Phase 1
Nothing in Phase 2
Then we get a response to S
Extinction in the classical conditioning table is where (3)
S (CS) is paired with R in Phase 1
S (CS) is presented on its own in Phase 2
Then we see no response to S
Partial reinforcement in the table of classical conditioning (2)
where we occasionally present S with R
This leads to a weak response to S
Simple way to model that table of classical conditioning (CC) (acquisition, extinction, and partial reinforcement) (2)
A stimulus neuron that has an input weight to a reward neuron
If the r neuron is active, we predict reinforcement (reward/punishment)
Use a simple delta-rule model in the simple model of CC (acquisition, extinction, and partial reinforcement)
If stimulus S is present… (2)
S=1 (S=0 if not present):
then update weight: w → w + εSδ
δ
is delta, the prediction error
ε
epsilon, a very small number (the learning rate)
What is δ in the simple delta-rule model of CC, w → w + εSδ?
(4)
δ = r − wS,
i.e. the difference between the actual reward and the predicted reward (wS).
This is error-driven learning: we change w such that δ = 0 in the presence of stimulus S.
When the error between predicted and actual reward is 0, the prediction is perfect and we know what to expect.
Diagram of a graph of the simple delta-rule model of CC (explaining acquisition, extinction, and partial reinforcement) (3)
When the weight grows slowly = acquisition phase
When we do not present the reward, the weight decays = extinction
When we occasionally present the reward, the weight hovers at a weak, intermediate value = partial reinforcement (see the sketch below)
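A minimal Python sketch of this behaviour, assuming an illustrative learning rate (eps = 0.1) and trial counts (these numbers are not from the notes):

```python
import random

def delta_update(w, S, r, eps=0.1):
    """One delta-rule step: delta = r - w*S, then w -> w + eps*S*delta."""
    delta = r - w * S
    return w + eps * S * delta

random.seed(0)
w = 0.0
history = []

# Acquisition: S is paired with reward (r = 1), so w grows slowly towards 1
for _ in range(50):
    w = delta_update(w, S=1, r=1)
    history.append(w)

# Extinction: S is presented without reward (r = 0), so w decays back towards 0
for _ in range(50):
    w = delta_update(w, S=1, r=0)
    history.append(w)

# Partial reinforcement: reward on roughly half the trials, so w hovers at a weak, intermediate value
for _ in range(50):
    w = delta_update(w, S=1, r=1 if random.random() < 0.5 else 0)
    history.append(w)

print(history[49], history[99], history[-1])   # ~1.0, ~0.0, somewhere around 0.5
```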
Diagram of the model where 2 stimuli predict reinforcement (2)
We have S1 and S2 predicting reinforcement
If this (reinforcement) neuron is active, then we predict reinforcement (reward/punishment)
V is the
expected reinforcement (reward) based on all stimuli
Two ways to model how 2 stimuli predict reinforcement? (2)
Do we calculate a separate delta for each S (stimulus; CS)?
Or do we calculate a single delta from the sum over the Ss? (see the sketch below)
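A rough sketch of the two candidate update rules, in Python; the function names and learning rate are illustrative assumptions:

```python
# Option 1: a separate error term for each stimulus (stimuli do not interact)
def update_per_stimulus(w, S, r, eps=0.1):
    # w: list of weights, S: list of 0/1 stimulus indicators, r: reinforcement
    return [wi + eps * Si * (r - wi * Si) for wi, Si in zip(w, S)]

# Option 2: a single shared error term based on the summed prediction V
def update_shared_error(w, S, r, eps=0.1):
    V = sum(wi * Si for wi, Si in zip(w, S))   # expected reinforcement from all stimuli
    delta = r - V                              # one error term for all stimuli
    return [wi + eps * Si * delta for wi, Si in zip(w, S)]
```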
Blocking experiments (3) by Kamin 1969
- if a dog is repeatedly exposed to a tone (the first conditioned stimulus, CS1), together with food (the unconditioned stimulus, US), the dog salivates when the tone is presented (conditioned response, CR).
- After several further conditioning trials, this time with the tone (CS1) and a light (CS2) together with the US, the dog does not salivate (or gives only a weak response) to the light (CS2) when tested separately later.
- Stimulus control by CS2 has then been blocked by the earlier pairing of CS1 with the US
Overshadowing vs Blocking
Overshadowing experiments (5)
In the example of blocking, the tone (CS1) was pretrained prior to being compounded with the light (CS2), and subjects learned little about the light, the added element.
Sometimes, even if there is no prior training of an element of a compound CS, subjects will still learn little about one of the elements.
This occurs if one element is more “salient” than the other (other things being equal, a subject trained with a more salient CS will learn more rapidly than a subject trained with a less salient CS).
If a light CS is more salient than a tone CS, the effect of pairing a UCS with the light + tone compound will be to strongly associate the light with the UCS (food), with little associative strength developing between the tone and the UCS (the light overshadows the tone)
There will then be little or no salivation response (CR) to S2 (the tone)
CC (classical conditioning) can be modelled with
R-W rule
Blocking and overshadowing experiments indicate that the way to model 2 stimuli predicting reinforcement is (3)
the second way: taking the difference between the reinforcement and the expected reinforcement (V) given all stimuli
(Reinforcement = general term; could be reward or punishment)
producing a single error term for all stimuli (this is known as the Rescorla-Wagner rule; see the sketch below)
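A minimal sketch of how the shared-error (Rescorla-Wagner) rule reproduces blocking; the learning rate and trial counts are assumed for illustration:

```python
def rw_update(w, S, r, eps=0.1):
    """Rescorla-Wagner step: one shared error term, with V = sum_i w_i * S_i."""
    V = sum(wi * Si for wi, Si in zip(w, S))
    delta = r - V
    return [wi + eps * Si * delta for wi, Si in zip(w, S)]

w = [0.0, 0.0]            # associative strengths for CS1 (tone) and CS2 (light)

# Phase 1: CS1 alone is paired with the reward -> w[0] climbs towards 1
for _ in range(100):
    w = rw_update(w, S=[1, 0], r=1)

# Phase 2: the CS1 + CS2 compound is paired with the reward.
# V is already ~1 from CS1 alone, so delta ~ 0 and CS2 gains almost nothing.
for _ in range(100):
    w = rw_update(w, S=[1, 1], r=1)

print(w)   # roughly [1.0, 0.0]: learning about CS2 has been blocked
```

A per-stimulus error term (one delta per CS) would instead drive both weights towards 1, so it fails to capture blocking.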
What is Rescorla-Wagner Model? (2)
formal model of the circumstances under which Pavlovian conditioning occurs.
It attempts to describe the changes in associative strength (V) between a signal (conditioned stimulus, CS) and the subsequent stimulus (unconditioned stimulus, US) as a result of a conditioning trial.
Expected reinforcement, V, has its own formula (see below)
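A standard way of writing that formula (the usual Rescorla-Wagner form, not copied verbatim from the notes): V is the summed prediction from all stimuli present, and the single error term and weight update follow from it.

```latex
V = \sum_i w_i S_i, \qquad
\delta = r - V, \qquad
w_i \rightarrow w_i + \varepsilon S_i \delta
```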
What about a temporal sequence of stimuli such that (4)
- Stimulus 1 (CS1) in Phase 1 leads to R (reinforcement = reward/punishment)
- In Phase 2, we introduce Stimulus 2 (S2, a CS) which predicts S1
- We test the response to S2
- This is 2nd-order conditioning
What is 2nd-order conditioning?
Second-order conditioning (SOC) describes a phenomenon whereby a conditioned stimulus (CS) acquires the ability to elicit a conditioned response (CR) without ever being directly paired with an unconditioned stimulus (US)
2nd order conditioning example (5)
For instance, second-order conditioning can be demonstrated using the following procedure:
a CS1 (e.g., a light) is paired with a UCS (e.g., food) in phase 1;
then CS2 (e.g., a tone) is paired with CS1 (the light) in phase 2.
Tested response to S2?
This will usually result in a CR relevant to the original UCS (food) being evoked by CS2, even though CS2 has never been directly paired with food (e.g., Rescorla, 1980; Rizley & Rescorla, 1972).
The Rescorla-Wagner rule only works for
direct associations of S (CS) with R
The R-W rule does not work for a temporal sequence of stimuli since (2)
there is no ‘r’ in Phase 2, and the delta rule depends on r!
Our delta is the difference between the received and the expected reward
The problem that the R-W rule does not work with a temporal sequence of stimuli relates to the temporal credit assignment problem again (2)
At the time of presentation of S2 we don’t know if it will lead to reward
We don’t know which past actions were pivotal for a good outcome
In order to know which of the past actions was pivotal for a good outcome (the temporal credit assignment problem), we need (2)
to consider the time at which a stimulus/state occurred that is predictive of future reward
i.e., linking classical conditioning with RL (reinforcement learning)
Model that links RL and CC together is called
temporal difference learning model
In the temporal difference learning model we would need
the value function V at time t, V(t), to predict the sum of future rewards, not just the immediate reward r(t), so we can learn S2 => S1 => R (reinforcement = general term, e.g. food reward or punishment)
i.e., we want (2)
the sum of all rewards at times τ from now (the present time t) onwards
to decompose V(t) into the current reward and an estimate of subsequent reward (see the equation below)
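Written out in standard TD notation (an assumed undiscounted form, consistent with the two points above):

```latex
V(t) = \Big\langle \sum_{\tau \ge t} r(\tau) \Big\rangle
     = \langle r(t) \rangle + V(t+1)
```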
To make V(t) predict the sum of all rewards from time t onwards, we use the (3) – Temporal Difference Learning
delta rule to ensure this happens:
Delta at time t is the current reward + the estimated future reward − the expected reinforcement (what I currently expect): δ(t) = r(t) + V(t+1) − V(t)
So delta becomes the difference between the current reward plus the estimate of all future reward, and the current prediction V(t) (see the sketch below)
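A minimal Python sketch of this TD delta rule on a single S2 → S1 → reward trial structure; the time indices, learning rate, and trial count are illustrative assumptions:

```python
eps = 0.1
V = [0.0, 0.0, 0.0, 0.0]    # value estimates: t=0 is S2, t=1 is S1, t=2 is reward time, t=3 is after the trial
rewards = [0.0, 0.0, 1.0]   # reinforcement is delivered only at the last step of the trial

for trial in range(200):
    for t in range(3):
        delta = rewards[t] + V[t + 1] - V[t]   # TD error: current reward + future estimate - current estimate
        V[t] += eps * delta                    # move V(t) towards the TD target

print(V)   # V[1] (S1) and, via S1, V[0] (S2) both approach 1:
           # the reward prediction propagates backwards to S2, capturing 2nd-order conditioning
```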
Comparing delta rule in temporal difference learning to R-W rule
In the R-W rule, delta is based on the current reward only: it is the difference between the current reward (reinforcement) and the expected reinforcement given all stimuli; the TD delta additionally includes the estimate of future reward
Our TD-learning delta equation looks suspiciously similar to the
Q-learning update formula (it maps well onto it)
In TD learning more generally we don’t need to be (2)
restricted to 1 time step forward
We can introduce further time steps in the future and discount them with a discount factor, since uncertainty increases the further into the future we look
TD(0), the 0-order version, is closest to
Q-learning
We can extend TD learning to include more states in the future, e.g., 2 time steps ahead (2) = similar to Q-learning
We can estimate two steps into the future, and if reward is to be expected we update our V (the current estimate of expected reinforcement)
This has the same form of update as we had in Q-learning, except it does not update state-action pairs (see the sketch below)
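A sketch of the discounted, two-step version of the update (gamma is the discount factor); this is the standard generalisation written under the same assumed notation as above, not the exact formula from the lecture:

```python
def td_two_step_update(V, rewards, t, eps=0.1, gamma=0.9):
    """Two-step TD target: r(t) + gamma*r(t+1) + gamma^2 * V(t+2)."""
    target = rewards[t] + gamma * rewards[t + 1] + gamma**2 * V[t + 2]
    delta = target - V[t]          # error between the two-step target and the current estimate
    V[t] += eps * delta            # same form of update as Q-learning, but on V rather than state-action pairs
    return delta

# Illustrative usage with assumed values
V = [0.0, 0.0, 0.0, 0.0]
rewards = [0.0, 0.0, 1.0, 0.0]
td_two_step_update(V, rewards, t=0)
```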
Blocking and overshadowing experiments demonstrate that
not all stimuli present during learning subsequently control behaviour