8.1 Temporal Difference Learning Flashcards
Credit assignment problem
Example: Rat in maze (maze has one branch point)
One decision, one step to the outcome, one observation-outcome pair. Learning depends on knowing the outcome.
^ Something we can handle with the RW model, but in that model learning only happens once we find out the outcome of the decision.
However, most mazes are much more complex and involve multiple decisions. The rat only finds out at the end whether it was correct. How do we learn from the early choices?
Other examples:
- Game of chess. Many moves, but winner only decided at end
- Predicting the weather next week. We continually update the prediction as the sky changes.
- Almost anything we do in real life…
Summary: How do we evaluate the value of intermediate decisions when we only find out at the end if they are beneficial or not?
Temporal difference learning
Equation:
ΔV_t = α · (R_{t+1} + γ · V_{t+1} − V_t)
Similar to the RW model: we model a change in value, a learning rate, and an expectation at time t (rather than trial n).
DIFFERENCE:
RW - surprise is the difference between what we observe and what we expect
TDL - surprise combines the reward obtained at t+1 and the value expected at t+1, compared against the value expected at t. We learn from the changes in our expectations.
Example:
On Thursday, we have an expectation of Friday's weather (V_t and V_{t+1}).
On Friday, we still have that expectation, but we also observe the Friday weather (R_{t+1} and V_{t+1}).
TDL retrospectively revises our model of how the weather works: the Thursday estimate V_t changes by ΔV_t.
ΔV_t = change in estimated value at time t
α = learning rate
R_{t+1} = reward at time t+1
V_{t+1} = estimated value at time t+1
V_t = estimated value at time t
Summary: RW vs TDL (TDL is more like real life)
- One step vs multiple steps
- Discrete vs continuous
- Per trial vs temporal
- Outcome vs changes in expectation
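A minimal numerical sketch of the two update rules (Python; the learning rate and discount factor are illustrative choices, not values from the course):

```python
# Sketch contrasting a Rescorla-Wagner update with a TD(0) update.
alpha = 0.1   # learning rate (alpha in the equation)
gamma = 0.9   # discount factor (the gamma / "y" in the TD equation)

# Rescorla-Wagner: one step, learn only from the observed outcome R.
def rw_update(V, R):
    surprise = R - V                              # observed minus expected
    return V + alpha * surprise

# TD(0): multiple steps, learn from the change in expectation.
def td_update(V_t, R_next, V_next):
    td_error = R_next + gamma * V_next - V_t      # prediction error
    return V_t + alpha * td_error

# An intermediate state can gain value before any reward arrives,
# simply because the next state's value already predicts the reward.
print(td_update(V_t=0.0, R_next=0.0, V_next=0.5))  # 0.045 > 0, learned without reward
```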
Dopamine codes prediction errors
Dopamine neurons mimic the error function from TDL
Experiment conducted with monkeys with micro-electrodes
Tone --(2 s)--> Juice. The response depends on the learning phase:
- Before learning: firing of dopamine neurons increases just after the reward (juice).
- After learning: the firing burst migrates to just after the tone, and there is no response after the juice. It is as if the satisfaction of the juice is received at the cue, not at the actual reward.
- After learning, with no juice: early burst after the tone, then a dip below baseline at the time the juice was expected, reflecting the absence of the expected reward.
The cue becomes more important than the reward; expectation is what matters here.
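A small simulation (a sketch, not the monkey data) of TD learning on a cue-then-reward sequence reproduces this pattern: the prediction error starts at the reward, migrates to the cue after learning, and dips below zero when the expected juice is omitted. Holding the baseline state's value at 0 is a simplification standing in for the tone arriving at unpredictable times.

```python
# Sketch: TD(0) on a baseline -> tone -> (delay) -> juice sequence,
# showing where the prediction error (the dopamine-like signal) appears.
import numpy as np

alpha, gamma = 0.2, 1.0
N = 5                       # steps: 0 = baseline, 1 = tone, ..., juice after the last step
V = np.zeros(N + 1)         # value of each step; V[0] stays 0 (tone is unpredictable)

def run_trial(V, juice=True, learn=True):
    deltas = []
    for t in range(N):
        R = 1.0 if (t == N - 1 and juice) else 0.0
        delta = R + gamma * V[t + 1] - V[t]       # TD prediction error
        if learn and t > 0:                       # baseline value is not updated
            V[t] += alpha * delta
        deltas.append(delta)
    return np.round(deltas, 2)

print("before learning:", run_trial(V.copy(), learn=False))              # error at juice
for _ in range(500):
    run_trial(V)                                                          # training trials
print("after learning: ", run_trial(V.copy(), learn=False))              # error at tone
print("juice omitted:  ", run_trial(V.copy(), juice=False, learn=False)) # dip at juice time
```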
Circling from value to learning
Reward and probability coding
More studies support this link between dopamine neurons and prediction errors
Experiment conducted with monkeys with micro-electrodes
Image --(2 s)--> Juice, but the chance of juice and its amount depend on which picture is shown. The amount of reward and its likelihood are manipulated independently.
Conclusion: Response is proportional to reward times probability
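In other words, the cue response tracks the expected value of the reward. A tiny sketch with made-up magnitudes and probabilities (not the study's values):

```python
# Sketch: cue response proportional to expected reward = magnitude x probability.
pictures = {
    "picture_A": {"magnitude": 0.5, "probability": 1.0},
    "picture_B": {"magnitude": 0.5, "probability": 0.5},
    "picture_C": {"magnitude": 0.2, "probability": 0.5},
}
for name, cue in pictures.items():
    expected_value = cue["magnitude"] * cue["probability"]
    print(f"{name}: expected value = {expected_value:.2f}")
```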
Delay coding
Dopamine neurons also code for delay
Experiment in rats: odours predict delayed rewards.
The odours are only somewhat predictive of the reward and of the delay.
Multiple odours --(variable delay)--> reward (multiple possible delays)
Conclusion: the response to the anticipated reward (at the cue) is influenced by temporal discounting.
The response at reward delivery is independent of the delay.
Review Graph
Why is there still a response on delivery even though there is a delay?
Because the odour is only somewhat predictive of the reward
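A tiny sketch of the discounting part of this conclusion, assuming exponential discounting; the discount factor and delays are illustrative, not the study's values:

```python
# Sketch: the value signalled at the odour cue falls as the reward delay grows.
gamma = 0.8            # discount factor per second (illustrative)
reward = 1.0
for delay_s in (2, 4, 8, 16):
    cue_value = reward * gamma ** delay_s     # temporally discounted anticipated value
    print(f"delay {delay_s:>2}s -> discounted anticipated value {cue_value:.2f}")
```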
Intrinsic reward for learning
Learning is satisfying
Computer simulation of learning and evolution:
- Rewards linked to fitness vs
- Rewards for learning
Simulation showed agents evolved reward functions that were not directly related to fitness. They adapted individually to their environment through learning based on their reward function
The reward functions evolved globally at the level of the population to value learning
Most likely answer: Maximises long-term evolutionary fitness in changing environments
There is a parallel here with the distinction between RW and TDL learning. What did we say we learn from in each case?
RW: outcomes -> survival
TDL: changes in expectation -> learning
Simulation of task choice
What does intrinsic reward for learning mean?
The amount of improvement in our predictions, i.e. the decrease in surprise / prediction error = ΔV
4 types of activities (graph: prediction error over time):
1. Too easy: error stays low (flat)
2. Too difficult: error stays high (flat)
3. Initial task: error falls early (steep S-curve)
4. Next task: error falls later (shallow S-curve)
Review graph on errors in prediction
Review the "% of time spent in each activity" graph:
1. Too easy: little time
2. Too difficult: little time
3. Initial task: much time at first, then little
4. Next task: little time at first, then much
We spend time on something while we are learning a lot from it, but as learning slows down we move on to something else.
This can be challenging when we want to master something.
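A rough sketch of the task-choice idea: the intrinsic reward is the recent drop in prediction error (learning progress), and time goes to whichever activity currently offers the most of it. The error dynamics below are invented to mimic the four activity types from the notes, not taken from the simulation study.

```python
# Rough sketch: allocate time by learning progress (drop in prediction error).
import numpy as np

def practice(errors, progress_rates, choice):
    """One unit of practice: the chosen activity's error shrinks by its own rate."""
    drop = errors[choice] * progress_rates[choice]
    errors[choice] -= drop
    return drop                        # intrinsic reward = decrease in error (delta V)

# Initial prediction error, and how fast practice reduces it (0 = no learning possible).
names          = ["too easy", "too difficult", "initial task", "next task"]
errors         = np.array([0.05, 0.95, 0.90, 0.90])
progress_rates = np.array([0.00, 0.00, 0.20, 0.05])
time_spent     = np.zeros(4)

for step in range(200):
    # Pick the activity whose next practice step promises the biggest error drop.
    expected_drop = errors * progress_rates
    choice = int(np.argmax(expected_drop))
    practice(errors, progress_rates, choice)
    time_spent[choice] += 1

for name, t in zip(names, time_spent):
    print(f"{name:>13}: {int(t)} steps")
```

Time concentrates first on the initial task, then shifts to the next task once the initial task's error has shrunk, while the too-easy and too-difficult activities (which offer no learning progress) are largely ignored.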