Reinforcement Learning: Actor/Critic Flashcards
What is the problem being solved in reinforcement learning?
how to assign credit across a sequence of actions that together lead to a delayed, cumulative reward
What is the role of the teacher?
The role of the teacher in reinforcement learning tasks is more evaluative than instructional, and the teacher is sometimes called a critic because of this role
What does the critic provide?
evaluations of the learning system’s actions as training information, leaving it to the learning system to determine how to modify its actions so as to obtain better evaluations in the future.
What does the critic not do?
does not tell the learning system what to do to improve performance
What do reinforcement learning methods have to incorporate due to the role of the critic?
have to incorporate an additional exploration process that can be used to discover the appropriate actions to store.
What did the adaptive critic element construct?
an evaluation of different states of the environment, using a temporal difference-like learning rule from which the TD learning rule was later developed
How was ACE evaluation used in reinforcement learning?
used to augment the external reinforcement signal and train, through a trial-and-error process, a second unit, the “associative search element (ASE)”, to select the correct action at each state
What insight first gave rise to the ACE-ASE model?
Sutton, 1978
even when the external reinforcement for a task is delayed (as when playing checkers), a temporal difference prediction error can convey to the action just chosen, at every timestep, a surrogate ‘reinforcement’ signal that embodies both immediate outcomes and future prospects
What happens in the absence of external reinforcement in the ACE-ASE model?
in the absence of external reinforcement (i.e., rt = 0), the prediction error δt becomes γV(St+1) − V(St); that is, it compares the values of two consecutive states and conveys information regarding whether the chosen action has led to a state with a higher value than the previous state (i.e., to a state predictive of more future reward) or not
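This prediction error can be sketched in a few lines of Python (a hypothetical example; the tabular value function V, the state names, and the discount factor are illustrative assumptions, not from the source):

```python
# TD prediction error: delta_t = r_t + gamma * V(s_next) - V(s_t)
def td_error(r_t, V, s_t, s_next, gamma=0.9):
    return r_t + gamma * V[s_next] - V[s_t]

# Illustrative tabular values for two states.
V = {"A": 0.0, "B": 1.0}

# With no external reward (r_t = 0), delta reduces to
# gamma * V(s_next) - V(s_t): here 0.9 * 1.0 - 0.0 = 0.9 > 0,
# so moving from A to B improved the prospects for future reward.
delta = td_error(0.0, V, "A", "B")
```

A positive δt here means the transition reached a state predictive of more future reward, even though no reward was delivered on that step.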
What are the implications of state changes in the ACE-ASE model?
whenever a positive prediction error is encountered, the current action has improved prospects for future rewards, and should be repeated
What happens when there is a negative prediction error?
The opposite is true for negative prediction errors, which signal that the action should be chosen less often in the future.
What do prediction errors allow the agent to do?
Thus the agent can learn an explicit policy – a probability distribution over all available actions at each state π(S,a) = p(a|S), by using the following learning rule at every timestep
Write the equation for the ACE-ASE model
π(S,a)new = π(S,a)old + ηπ δt
where ηπ is the policy learning rate and δt is the TD prediction error, δt = rt + γV(St+1) − V(St)
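A minimal sketch of this learning rule, assuming a tabular policy stored as (state, action) entries (the state/action names and learning rate are illustrative assumptions):

```python
# ACE-ASE-style policy update: pi(S, a) <- pi(S, a) + eta_pi * delta_t,
# applied only to the action just taken at the current state.
def update_policy(pi, state, action, delta_t, eta_pi=0.1):
    pi[(state, action)] += eta_pi * delta_t
    return pi

# Illustrative policy over two actions at one state.
pi = {("S1", "left"): 0.5, ("S1", "right"): 0.5}

# A positive prediction error (delta_t = 0.9) raises the propensity
# of the chosen action; a negative one would lower it.
pi = update_policy(pi, "S1", "right", delta_t=0.9)
```

In practice the updated values would be renormalized (e.g., via a softmax over action preferences) so that π(S,a) remains a probability distribution.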
What does the critic use in the actor/critic model?
a Critic module uses TD learning to estimate state values V(S) from experience with the environment, and the same TD prediction error is also used to train the Actor module, which maintains and learns a policy π
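The interaction of the two modules can be sketched on a toy chain environment (everything here is a hypothetical illustration: the 3-state chain, tabular critic, softmax actor, and all parameter values are assumptions, not from the source):

```python
import math
import random

# Toy actor/critic on a 3-state chain: 0 -> 1 -> 2 (terminal).
# Action "right" advances one state (reward 1 on reaching state 2);
# action "left" stays in place with reward 0.
random.seed(0)
gamma, eta_v, eta_pi = 0.9, 0.1, 0.1
V = [0.0, 0.0, 0.0]                      # Critic: state values V(S)
pref = {(s, a): 0.0 for s in (0, 1)      # Actor: action preferences
        for a in ("left", "right")}

def choose_action(s):
    """Softmax over the actor's action preferences at state s."""
    weights = [math.exp(pref[(s, a)]) for a in ("left", "right")]
    r = random.random() * sum(weights)
    return "left" if r < weights[0] else "right"

for _ in range(500):
    s = 0
    while s != 2:
        a = choose_action(s)
        s_next = s + 1 if a == "right" else s
        r = 1.0 if s_next == 2 else 0.0
        v_next = 0.0 if s_next == 2 else V[s_next]
        delta = r + gamma * v_next - V[s]   # shared TD prediction error
        V[s] += eta_v * delta               # Critic: update V(S)
        pref[(s, a)] += eta_pi * delta      # Actor: update policy
        s = s_next
```

Note that one and the same prediction error δ drives both updates, which is the defining feature of the actor/critic architecture described above.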
What has the actor/critic model been related to?
to policy improvement methods in dynamic programming (Sutton, 1988) and to policy-gradient methods (Williams, 1992)