Reinforcement Learning: Actor/Critic Flashcards
What is the problem being solved in reinforcement learning?
how to assign credit to individual actions within a sequence of actions that together lead to a cumulative reward
What is the role of the teacher?
The role of the teacher in reinforcement learning tasks is more evaluative than instructional, and the teacher is sometimes called a critic because of this role
What does the critic provide?
evaluations of the learning system’s actions as training information, leaving it to the learning system to determine how to modify its actions so as to obtain better evaluations in the future.
What does the critic not do?
does not tell the learning system what to do to improve performance
What do reinforcement learning methods have to incorporate due to the role of the critic?
have to incorporate an additional exploration process that can be used to discover the appropriate actions to store.
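As one common illustration of such an exploration process (this particular scheme, ε-greedy action selection, is an assumption for the sketch and is not prescribed by the text; the function name and parameters are hypothetical):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon try a random action (explore);
    otherwise take the action currently valued highest (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])
```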
What did the adaptive critic element (ACE) construct?
an evaluation of different states of the environment, using a temporal difference-like learning rule from which the TD learning rule was later developed
How was the ACE's evaluation used in reinforcement learning?
it was used to augment the external reinforcement signal and to train, through a trial-and-error process, a second unit, the "associative search element" (ASE), to select the correct action at each state
What insight first gave rise to the ACE-ASE model?
the insight (Sutton, 1978) that even when the external reinforcement for a task is delayed (as when playing checkers), a temporal difference prediction error can convey to the just-chosen action, at every timestep, a surrogate 'reinforcement' signal that embodies both immediate outcomes and future prospects
What happens in the absence of external reinforcement in the ACE-ASE model?
in the absence of external reinforcement (i.e., $r_t = 0$), the prediction error $\delta_t = r_t + \gamma V(S_{t+1}) - V(S_t)$ reduces to $\gamma V(S_{t+1}) - V(S_t)$; that is, it compares the values of two consecutive states and conveys whether the chosen action has led to a state with a higher value than the previous one (i.e., to a state predictive of more future reward) or not
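A minimal sketch of this computation (Python; the function and variable names are my own, and the discount factor value is arbitrary):

```python
def td_error(V, s, r, s_next, gamma=0.9):
    """Temporal difference prediction error:
    delta_t = r_t + gamma * V(S_{t+1}) - V(S_t).
    With r = 0 this reduces to gamma * V(S_{t+1}) - V(S_t),
    a comparison of two consecutive state values."""
    return r + gamma * V[s_next] - V[s]

V = {"A": 3.0, "B": 5.0}           # toy state values
print(td_error(V, "A", 0.0, "B"))  # 0.9 * 5.0 - 3.0 = 1.5 > 0: improved prospects
```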
What are the implications of state changes in the ACE-ASE model?
whenever a positive prediction error is encountered, the current action has improved prospects for future rewards, and should be repeated
What happens when there is a negative prediction error?
The opposite is true for negative prediction errors, which signal that the action should be chosen less often in the future.
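As a concrete illustration (numbers invented): with $\gamma = 1$ and $r_t = 0$, moving from a state with $V(S_t) = 3$ to one with $V(S_{t+1}) = 5$ gives $\delta_t = 5 - 3 = 2 > 0$, so the action is reinforced; moving instead to a state with $V(S_{t+1}) = 1$ gives $\delta_t = -2 < 0$, so the action is made less likely.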
What do prediction errors allow the agent to do?
the agent can thus learn an explicit policy, that is, a probability distribution over all available actions at each state, $\pi(S,a) = p(a|S)$, by using the following learning rule at every timestep
Write the policy learning rule for the ACE-ASE model
$\pi(S,a)_{\text{new}} = \pi(S,a)_{\text{old}} + \eta_\pi \delta_t$
where $\eta_\pi$ is the policy learning rate and $\delta_t$ is the temporal difference prediction error defined above
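A sketch of this learning rule in code (the names are mine; note that the raw update does not by itself keep $\pi(S,\cdot)$ a valid probability distribution, so the sketch clips and renormalizes afterwards, an implementation detail the text does not specify):

```python
def update_policy(pi, s, a, delta, eta_pi=0.1):
    """ACE-ASE-style update: pi(S,a) <- pi(S,a) + eta_pi * delta_t.
    A positive delta makes action a more likely in state s;
    a negative delta makes it less likely."""
    pi[s][a] = max(pi[s][a] + eta_pi * delta, 1e-8)  # keep probability positive
    total = sum(pi[s].values())
    for action in pi[s]:  # renormalize so pi(S, .) sums to 1
        pi[s][action] /= total

# usage: two actions, initially equiprobable; a positive delta favors "right"
pi = {"S0": {"left": 0.5, "right": 0.5}}
update_policy(pi, "S0", "right", delta=1.5)
print(pi["S0"])  # "right" now has probability > 0.5
```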
What does the critic use in the actor/critic model?
a Critic module uses TD learning to estimate state values V(S) from experience with the environment, and the same TD prediction error is also used to train the Actor module, which maintains and learns a policy π
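Putting the two modules together, here is a self-contained tabular sketch (the toy chain environment and all names are invented for illustration; also, the Actor here stores action preferences passed through a softmax, a common implementation choice, rather than updating $\pi$ directly as on the earlier card):

```python
import math
import random

# Toy chain: states 0..4, actions "left"/"right"; reaching state 4
# yields reward 1 and ends the episode. (Invented for illustration.)
def step(s, a):
    s_next = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == 4 else 0.0), s_next == 4

states, actions = range(5), ["left", "right"]
V = {s: 0.0 for s in states}                            # Critic: state values
prefs = {s: {a: 0.0 for a in actions} for s in states}  # Actor: preferences

def policy(s):
    """Softmax over the Actor's preferences -> pi(S, a)."""
    exps = {a: math.exp(prefs[s][a]) for a in actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

gamma, eta_v, eta_pi = 0.9, 0.1, 0.1
for _ in range(500):                                    # episodes
    s, done = 0, False
    while not done:
        p = policy(s)
        a = random.choices(actions, weights=[p[x] for x in actions])[0]
        s_next, r, done = step(s, a)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]  # TD error
        V[s] += eta_v * delta           # Critic learns state values
        prefs[s][a] += eta_pi * delta   # the same TD error trains the Actor
        s = s_next

print({s: round(v, 2) for s, v in V.items()})
print(policy(0))  # should strongly favor "right"
```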
What has the actor/critic model been related to?
to policy improvement methods in dynamic programming (Sutton, 1988) and to Williams' (1992) REINFORCE algorithm
What is a specific example of the use of actor/critic models?
Sutton et al. (2000) have shown that in some cases the Actor/Critic can be construed as a gradient-climbing algorithm (stochastic gradient ascent on the expected reward) for learning a parameterized policy, one that converges to a local maximum (see also Dayan & Abbott, 2001)
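For reference, the Sutton et al. (2000) result is the policy gradient theorem; a sketch of its statement (notation mine, with the expectation over the policy's state-action distribution and other technical conditions omitted):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \,\big]$$

Ascending this gradient, with the Critic supplying the estimate of $Q^{\pi_\theta}$, climbs the expected return $J(\theta)$ toward a local maximum.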
What are the limitations of actor/critic models?
in the general case Actor/Critic methods are not guaranteed to converge on an optimal behavioral policy (cf. Baird, 1995; Konda & Tsitsiklis, 2003)
Biological plausibility of actor/critic models
some of the strongest links between RL methods and neurobiological data regarding animal and human decision making have been related to the Actor/Critic framework
What have actor/critic models been used to study in animals?
Actor/Critic methods have been extensively linked to instrumental action selection and Pavlovian prediction learning in the basal ganglia (e.g., Barto, 1995; Houk et al., 1995; Joel et al., 2002)