Reinforcement Learning: Actor/Critic Flashcards
What is the problem being solved in reinforcement learning?
how to assign credit to individual actions within a sequence of actions that together lead to a cumulative reward
What is the role of the teacher?
The role of the teacher in reinforcement learning tasks is more evaluative than instructional, and the teacher is sometimes called a critic because of this role
What does the critic provide?
evaluations of the learning system’s actions as training information, leaving it to the learning system to determine how to modify its actions so as to obtain better evaluations in the future.
What does the critic not do?
does not tell the learning system what to do to improve performance
What do reinforcement learning methods have to incorporate due to the role of the critic?
have to incorporate an additional exploration process that can be used to discover the appropriate actions to store.
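As one common illustration of such an exploration process (this particular scheme, ε-greedy action selection, is an assumption for the sketch and is not prescribed by the text; the function name and parameters are hypothetical):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon try a random action (explore);
    otherwise take the action currently valued highest (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])
```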
What did the adaptive critic element (ACE) construct?
an evaluation of different states of the environment, using a temporal difference-like learning rule from which the TD learning rule was later developed
How was the ACE's evaluation used in reinforcement learning?
it was used to augment the external reinforcement signal and to train, through a trial-and-error process, a second unit, the "associative search element" (ASE), to select the correct action at each state
What insight first gave rise to the ACE-ASE model?
the insight (Sutton, 1978) that even when the external reinforcement for a task is delayed (as when playing checkers), a temporal difference prediction error can convey to the just-chosen action, at every timestep, a surrogate 'reinforcement' signal that embodies both immediate outcomes and future prospects
What happens in the absence of external reinforcement in the ACE-ASE model?
in the absence of external reinforcement (i.e., $r_t = 0$), the prediction error $\delta_t = r_t + \gamma V(S_{t+1}) - V(S_t)$ reduces to $\gamma V(S_{t+1}) - V(S_t)$; that is, it compares the values of two consecutive states and conveys whether the chosen action has led to a state with a higher value than the previous one (i.e., to a state predictive of more future reward) or not
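A minimal sketch of this computation (Python; the function and variable names are my own, and the discount factor value is arbitrary):

```python
def td_error(V, s, r, s_next, gamma=0.9):
    """Temporal difference prediction error:
    delta_t = r_t + gamma * V(S_{t+1}) - V(S_t).
    With r = 0 this reduces to gamma * V(S_{t+1}) - V(S_t),
    a comparison of two consecutive state values."""
    return r + gamma * V[s_next] - V[s]

V = {"A": 3.0, "B": 5.0}           # toy state values
print(td_error(V, "A", 0.0, "B"))  # 0.9 * 5.0 - 3.0 = 1.5 > 0: improved prospects
```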
What are the implications of state changes in the ACE-ASE model?
whenever a positive prediction error is encountered, the current action has improved prospects for future rewards, and should be repeated
What happens when there is a negative prediction error?
The opposite is true for negative prediction errors, which signal that the action should be chosen less often in the future.
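As a concrete illustration (numbers invented): with $\gamma = 1$ and $r_t = 0$, moving from a state with $V(S_t) = 3$ to one with $V(S_{t+1}) = 5$ gives $\delta_t = 5 - 3 = 2 > 0$, so the action is reinforced; moving instead to a state with $V(S_{t+1}) = 1$ gives $\delta_t = -2 < 0$, so the action is made less likely.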
What do prediction errors allow the agent to do?
the agent can thus learn an explicit policy, that is, a probability distribution over all available actions at each state, $\pi(S,a) = p(a|S)$, by using the following learning rule at every timestep
Write the policy learning rule for the ACE-ASE model
$\pi(S,a)_{\text{new}} = \pi(S,a)_{\text{old}} + \eta_\pi \delta_t$
where $\eta_\pi$ is the policy learning rate and $\delta_t$ is the temporal difference prediction error defined above
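A sketch of this learning rule in code (the names are mine; note that the raw update does not by itself keep $\pi(S,\cdot)$ a valid probability distribution, so the sketch clips and renormalizes afterwards, an implementation detail the text does not specify):

```python
def update_policy(pi, s, a, delta, eta_pi=0.1):
    """ACE-ASE-style update: pi(S,a) <- pi(S,a) + eta_pi * delta_t.
    A positive delta makes action a more likely in state s;
    a negative delta makes it less likely."""
    pi[s][a] = max(pi[s][a] + eta_pi * delta, 1e-8)  # keep probability positive
    total = sum(pi[s].values())
    for action in pi[s]:  # renormalize so pi(S, .) sums to 1
        pi[s][action] /= total

# usage: two actions, initially equiprobable; a positive delta favors "right"
pi = {"S0": {"left": 0.5, "right": 0.5}}
update_policy(pi, "S0", "right", delta=1.5)
print(pi["S0"])  # "right" now has probability > 0.5
```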
What does the critic use in the actor/critic model?
a Critic module uses TD learning to estimate state values V(S) from experience with the environment, and the same TD prediction error is also used to train the Actor module, which maintains and learns a policy π
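Putting the two modules together, here is a self-contained tabular sketch (the toy chain environment and all names are invented for illustration; also, the Actor here stores action preferences passed through a softmax, a common implementation choice, rather than updating $\pi$ directly as on the earlier card):

```python
import math
import random

# Toy chain: states 0..4, actions "left"/"right"; reaching state 4
# yields reward 1 and ends the episode. (Invented for illustration.)
def step(s, a):
    s_next = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4

states, actions = range(5), ["left", "right"]
V = {s: 0.0 for s in states}                            # Critic: state values
prefs = {s: {a: 0.0 for a in actions} for s in states}  # Actor: preferences

def policy(s):
    """Softmax over the Actor's preferences -> pi(S, a)."""
    exps = {a: math.exp(prefs[s][a]) for a in actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

gamma, eta_v, eta_pi = 0.9, 0.1, 0.1
for _ in range(500):                                    # episodes
    s, done = 0, False
    while not done:
        p = policy(s)
        a = random.choices(actions, weights=[p[x] for x in actions])[0]
        s_next, r, done = step(s, a)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]  # TD error
        V[s] += eta_v * delta           # Critic learns state values
        prefs[s][a] += eta_pi * delta   # the same TD error trains the Actor
        s = s_next

print({s: round(v, 2) for s, v in V.items()})
print(policy(0))  # should strongly favor "right"
```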
What has the actor/critic model been related to?
to policy improvement methods in dynamic programming (Sutton, 1988) and to Williams' (1992) REINFORCE algorithm
What is a specific example of the use of actor/critic models?
Sutton et al. (2000) have shown that in some cases the Actor/Critic can be construed as a gradient-climbing algorithm (stochastic gradient ascent on the expected reward) for learning a parameterized policy, one that converges to a local maximum (see also Dayan & Abbott, 2001)
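For reference, the Sutton et al. (2000) result is the policy gradient theorem; a sketch of its statement (notation mine, with the expectation over the policy's state-action distribution and other technical conditions omitted):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \,\big]$$

Ascending this gradient, with the Critic supplying the estimate of $Q^{\pi_\theta}$, climbs the expected return $J(\theta)$ toward a local maximum.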
What are the limitations of actor/critic models?
in the general case Actor/Critic methods are not guaranteed to converge on an optimal behavioral policy (cf. Baird, 1995; Konda & Tsitsiklis, 2003)
Biological plausibility of actor/critic models
some of the strongest links between RL methods and neurobiological data regarding animal and human decision making have been related to the Actor/Critic framework
What have actor/critic models been used to study in animals?
Actor/Critic methods have been extensively linked to instrumental action selection and Pavlovian prediction learning in the basal ganglia (e.g., Barto, 1995; Houk et al., 1995; Joel et al., 2002)