SECTION 5: Reinforcement Learning Flashcards
Describe Reinforcement learning in relation to: Pavlovian conditioning (2p)
In reinforcement learning the agent/network learns to predict a reward from a stimulus. After training it has learned that a certain stimulus is followed by a certain reward. This is similar to Pavlovian conditioning, although Pavlovian conditioning also covers conditioning with punishments (aversive stimuli), not only rewards.
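As a rough illustration (my own sketch, not part of the original answer), here is a minimal Python version of this stimulus-reward prediction, using a Rescorla-Wagner-style delta rule; the learning rate and reward value are made-up example numbers:

    V = 0.0          # the network's prediction of reward following the stimulus
    alpha = 0.2      # learning rate (illustrative)
    reward = 1.0     # reward that actually follows the stimulus (illustrative)
    for trial in range(30):
        error = reward - V   # prediction error: actual reward minus predicted reward
        V += alpha * error   # learning: nudge the prediction toward the outcome
    print(round(V, 3))       # ~1.0: after training the stimulus predicts the reward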
Describe Reinforcement learning in relation to:
Dopamine (prediction) error activity in the brain (2p)
Dopamine is the biological brain’s “reward neurotransmitter”. Dopamine activity in the brain and the prediction-error activity in a reinforcement learning network follow the same pattern. S = stimulus, R = reward.
Before training, the dopamine level rises when the reward (R) arrives, and so does the prediction error, since the network did not predict that a reward was coming.
After training, the dopamine level and the prediction error rise at the time of the presented (conditioned) stimulus (S), because the network has now learned that S means R. At the time of the reward itself there is no longer any error, since the reward is fully predicted.
When, after conditioning, R is withheld, there is a negative dip in dopamine (a negative prediction error) at the time the reward should have arrived. (The rise in dopamine at S still occurs.)
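A small simulation sketch (my own illustration, using the TD update described in the next card; the trial length, learning rate and gamma are assumed values) that reproduces this pattern of prediction-error activity:

    import numpy as np

    T, r = 5, 1.0                 # 5 time steps between stimulus (t=0) and reward
    alpha, gamma = 0.1, 0.9
    V = np.zeros(T)               # learned value of each time step within the trial

    def run_trial(reward_given=True, learn=True):
        """Return the prediction error at stimulus onset and at reward time."""
        # The stimulus arrives unpredictably, so the pre-stimulus (baseline) value is 0:
        # the error at stimulus onset is how much reward the stimulus itself announces.
        delta_stimulus = gamma * V[0] - 0.0
        for t in range(T - 1):    # within-trial transitions, no reward yet
            delta = 0.0 + gamma * V[t + 1] - V[t]
            if learn:
                V[t] += alpha * delta
        reward = r if reward_given else 0.0
        delta_reward = reward - V[T - 1]          # last step: reward (or omission)
        if learn:
            V[T - 1] += alpha * delta_reward
        return float(delta_stimulus), float(delta_reward)

    print(run_trial(learn=False))                  # before training: (0.0, 1.0), error at R
    for _ in range(500):                           # training: S repeatedly paired with R
        run_trial()
    print(run_trial(learn=False))                  # after training: error now at S, ~0 at R
    print(run_trial(reward_given=False, learn=False))  # omit R: negative dip at reward time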
Describe the temporal difference (TD) learning algorithm and write up the TD prediction error equation. (3p)
δ_t = r_t + γ V(s_{t+1}) − V(s_t)
TD learning is a development of the Rescorla-Wagner rule. Like Rescorla-Wagner, it means that learning happens when the predicted reward does not equal the actual reward. But TD also takes into account that time is a factor that can change the value of a reward: a reward now is worth more than the same reward tomorrow. TD therefore uses a discount factor (gamma) to express how much a reward’s value decreases per time step, and the prediction error compares the current estimate V(s_t) with the immediate reward plus the discounted value of the next state.
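The update can be written in a few lines of Python (a sketch, with V as a simple table of state values and learning rate/gamma as assumed defaults):

    def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        """One TD(0) step: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
        delta = r + gamma * V[s_next] - V[s]   # TD prediction error
        V[s] += alpha * delta                  # move V(s) toward the target
        return delta

    V = {"A": 0.0, "B": 0.5}
    td_update(V, "A", r=0.0, s_next="B")       # no immediate reward, but A gains value
                                               # because it leads to the valuable state B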
What is the significance of the gamma term in the equation for learning the value function (that values the states / time steps between the ‘start’ signal and the reward/’goal’ signal)? (2p)
If gamma is high (close to 1), the agent does not mind waiting for the reward: states far before the reward are still valued highly.
If gamma is lower, the agent is less willing to wait: the reward loses value with every time step it lies in the future, and it may become better to choose a smaller reward that is closer in time/space.
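A quick numeric illustration (my own example, with a reward of 1.0 that lies k steps in the future):

    reward = 1.0
    for gamma in (0.9, 0.5):                   # high vs low discount factor
        values = [round(gamma ** k * reward, 3) for k in range(5)]
        print(gamma, values)
    # gamma = 0.9 -> [1.0, 0.9, 0.81, 0.729, 0.656]: waiting costs little
    # gamma = 0.5 -> [1.0, 0.5, 0.25, 0.125, 0.062]: distant rewards lose value fast,
    #                so a smaller but closer reward can be preferred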
TD learning can also be applied to action selection architectures. One such architecture is the Actor-Critic architecture. Describe how the TD prediction error is used in this architecture. (2p)
The Actor in this model is the module that chooses which action to take in a given state. The rules for choosing actions are called the policy. The Actor can change a poor policy into a better one by receiving error information from the Critic. The Critic evaluates outcomes: it uses TD learning to compute a prediction error between the value it predicted for the state and the actual outcome (reward plus the value of the next state). This error is used both to update the Critic’s own value estimates and to update the Actor’s policy, so the next time the Actor is in the same state it behaves according to the improved policy. This resembles how humans and animals regulate their behaviour according to new experience.
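A tabular Actor-Critic sketch in Python (my own illustration; the softmax action preferences, the state/action counts and the learning rates are assumptions, not specified in the course material):

    import numpy as np

    n_states, n_actions = 5, 2
    V = np.zeros(n_states)               # Critic: value of each state
    H = np.zeros((n_states, n_actions))  # Actor: action preferences (the policy parameters)
    alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.9

    def policy(s):
        """Softmax over the Actor's preferences in state s."""
        p = np.exp(H[s] - H[s].max())
        return p / p.sum()

    def actor_critic_update(s, a, r, s_next, done):
        # Critic: TD prediction error for the observed transition
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]
        V[s] += alpha_v * delta              # Critic improves its own value estimate
        # Actor: the same TD error adjusts the policy, making action a more likely
        # in state s if delta > 0 and less likely if delta < 0
        p = policy(s)
        grad = -p
        grad[a] += 1.0
        H[s] += alpha_pi * delta * grad
        return delta

    a = np.random.choice(n_actions, p=policy(0))
    actor_critic_update(s=0, a=a, r=1.0, s_next=1, done=False)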
Describe how a ‘deep’ reinforcement learning network (DRLN), i.e. one that uses a multi-layer perceptron, can increase the capability of a reinforcement learning architecture (like Actor-Critic or Q-Learning based) to make actions in real world environments (i.e. not simple grid worlds). (3p)
A DRLN works in dynamic “real” worlds because the multi-layer perceptron acts as a function approximator: the capacity of the state-value (or Q-value) function grows with the number of hidden nodes, and the network can generalise across states instead of storing a separate table entry for each one. A DRLN is therefore better at evaluating states, which is highly important in the real world, where states are continuous and high-dimensional and do not have the clear boundaries of a simple grid world. Because the states are less clearly delimited, handling “time” also becomes less trivial than in the grid world, where it reduces to “1 time step = 1 state transition”.
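A minimal deep Q-learning sketch (assuming PyTorch; the observation size, layer widths and hyperparameters are placeholders chosen for illustration, and a real agent would also use a replay buffer and a target network):

    import torch
    import torch.nn as nn

    obs_dim, n_actions = 8, 4         # illustrative sizes for a feature-based state
    gamma = 0.99

    # The MLP replaces the value table: it maps a raw state vector to one Q-value per action.
    q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def q_learning_step(state, action, reward, next_state, done):
        """One TD update on a single transition."""
        q_sa = q_net(state)[action]                          # Q(s, a) from the network
        with torch.no_grad():                                # bootstrap target, no gradient
            target = reward + (0.0 if done else gamma * q_net(next_state).max())
        loss = (q_sa - target) ** 2                          # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss)

    # States are continuous feature vectors, not grid-cell indices:
    s, s_next = torch.randn(obs_dim), torch.randn(obs_dim)
    q_learning_step(s, action=2, reward=1.0, next_state=s_next, done=False)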
Describe how using a multi-layer perceptron for processing the state space, can increase the capability of a reinforcement learning architecture (like Actor-Critic or Q-Learning based) to make actions in real world environments with many features (i.e. not simple grid worlds as shown above). (3p)
When the state is described by many features, a table with one value per state combination explodes combinatorially, and most states are never visited during training. A multi-layer perceptron that processes the state (feature vector) instead learns a mapping from features to values / action preferences, so it can generalise: states that share features get similar values even if they have never been seen before, which is what makes action selection feasible in real-world environments with many features.