C4 Flashcards
policy-based methods
do not use a separate value function but find the policy directly. They start with a policy function which they improve, episode by episode, with policy gradient methods
why do we need policy-based methods?
for environments with discrete actions, selecting the action with the best value in a state works well, because that action is clearly separated from the next-best action; for continuous action spaces this becomes unstable, because with value-based methods small perturbations in the Q-values may lead to large changes in the policy
why do stochastic policies not need separate exploration methods?
they perform exploration by their nature, because they return a distribution over actions
what is a potential disadvantage of purely episodic policy-based methods?
they have high variance, may find a local instead of the global optimum, and converge more slowly than value-based methods
how do policy-based methods learn?
they learn a parameterized policy that selects actions without consulting a value function; the policy function is represented directly, which allows the policy to select actions in a continuous space
so, the policy is represented by a set of parameters θ, which map the states S to action probabilities A. We randomly sample a new policy, and if it is better, adjust the parameters in the direction of this new policy
how do we measure the quality of a policy?
the value of the start state:
J(θ) = V^π(s_0)
how do we maximize the objective function J?
we apply gradient ascent, so in each time step we do this update:
θ_{t+1} = θ_t + α · ∇_θ J(θ)
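a minimal Python sketch of this update; the gradient estimator grad_J is a hypothetical placeholder, not part of the card:

```python
def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """One gradient-ascent step: theta_{t+1} = theta_t + alpha * grad_J(theta_t)."""
    return theta + alpha * grad_J(theta)
```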
how is the following update rule derived?
θ_{t+1} = θ_t + α · Q̂(s, a) · ∇_θ log π_θ(a | s)
from θ_{t+1} = θ_t + α · ∇_θ J(θ), we can fill in this:
θ_{t+1} = θ_t + α · ∇_θ π_{θ_t}(a* | s), because we want to push the parameters in the direction of the optimal action a*
we don't know which action is best, but we can take a sample trajectory and use estimates (Q̂) of the action values of the sample. Then we get θ_{t+1} = θ_t + α · Q̂(s, a) · ∇_θ π_{θ_t}(a | s)
Problem: we push harder on actions with a high value, but we also push more often on actions the policy already selects frequently, so frequently chosen actions are doubly favored. We fix this by dividing by the probability of the action, π_θ(a | s). The fraction ∇_θ π_θ(a | s) / π_θ(a | s) can be written as ∇_θ log π_θ(a | s), which gives the update rule above
This formula is the core of REINFORCE
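a minimal sketch of REINFORCE for a policy represented as a linear softmax over state features; the classic Gym-style reset()/step() interface, the features function, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode(env, theta, features, n_actions, alpha=0.01, gamma=0.99):
    """Sample one trajectory, then apply theta += alpha * G_t * grad log pi per step."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta @ features(s))          # pi_theta(.|s)
        a = np.random.choice(n_actions, p=probs)      # sampling gives natural exploration
        s_next, r, done, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    G = 0.0
    for t in reversed(range(len(states))):            # accumulate returns backwards
        G = rewards[t] + gamma * G                    # G_t estimates Q(s_t, a_t)
        x = features(states[t])
        probs = softmax(theta @ x)
        # gradient of log pi_theta(a_t|s_t) for a linear softmax policy
        grad_log_pi = np.outer(np.eye(n_actions)[actions[t]] - probs, x)
        theta += alpha * G * grad_log_pi              # REINFORCE update
    return theta, sum(rewards)
```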
what is meant with the online approach?
the updates are performed as the timesteps of the trajectory are traversed, so information is used as soon as it is known. This is the opposite of the batch approach, where gradients are summed over the states and actions and updates are performed at the end of the trajectory
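a small sketch of the two styles, assuming the per-step gradient estimates (e.g. Q̂(s, a) · ∇_θ log π_θ(a | s) at each step) for one trajectory are already collected in step_grads:

```python
import numpy as np

def online_update(theta, step_grads, alpha=0.01):
    """Online: apply each per-step gradient as soon as it becomes available."""
    for g in step_grads:
        theta = theta + alpha * g
    return theta

def batch_update(theta, step_grads, alpha=0.01):
    """Batch: sum the gradients over the trajectory and update once at the end."""
    return theta + alpha * np.sum(step_grads, axis=0)
```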
name 3 advantages of policy-based methods
- parameterization is at the core of policy-based methods, making them a good match for deep learning (no stability problems)
- they can easily find stochastic policies (value-based methods find deterministic policies), and exploration comes naturally, so there is no need for epsilon-greedy
- they are effective in large or continuous action spaces: small changes in θ lead to small changes in π, so they do not suffer from convergence and stability issues
what are the disadvantages of policy-based methods?
they are high-variance, because a full trajectory is generated randomly (no guidance at each step). Consequences:
- policy improvement happens infrequently, leading to slow convergence compared to value-based methods
- often a local optimum is found, since convergence to the global optimum takes too long
why do we need Actor-Critic bootstrapping?
we want to combine the advantage of the value-based approach (low variance) with the advantage of the policy-based approach (low bias)
bootstrapping gives us a better reward estimate, so it reduces the variance that comes from the cumulative-reward estimate. It uses the value function to compute intermediate n-step values within the episode. These n-step targets sit in between the full-episode Monte Carlo return and the single-step temporal difference target. We compute the n-step target (instead of the full-trace return):
Q̂_n(s_t, a_t) = Σ_{i=0}^{n-1} r_{t+i} + V_φ(s_{t+n})
and use this improved estimate to update the policy
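a sketch of this n-step target, assuming a value-function approximator V; the card's formula is undiscounted, which corresponds to gamma=1.0 here:

```python
def n_step_target(rewards, states, t, n, V, gamma=1.0):
    """Q_hat_n(s_t, a_t) = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * V(s_{t+n})."""
    steps = min(n, len(rewards) - t)                  # truncate near the episode end
    target = sum(gamma**i * rewards[t + i] for i in range(steps))
    if t + steps < len(rewards):                      # bootstrap with the critic only if
        target += gamma**steps * V(states[t + steps]) # the episode continues beyond t+n
    return target
```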
why baseline subtraction?
it reduces the variance but leaves the expectation unaffected: we only push up on actions that are better than average and push down on actions that are worse than average, instead of pushing up everything that is positive. As a baseline we choose the value function. We obtain the advantage function:
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
it estimates how much better a particular action is, compared to the expected value of the state in which it is taken
we can now fill in the estimated cumulative reward and use the estimated advantage to update the policy
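a sketch of this advantage-based policy update, assuming hypothetical helpers V(s) and grad_log_pi(s, a) (for example taken from the REINFORCE sketch above) and an illustrative learning rate:

```python
def advantage_update(theta, V, grad_log_pi, s_t, a_t, q_hat_n, alpha=0.01):
    """Policy update scaled by the advantage A(s_t, a_t) = Q_hat_n - V(s_t)."""
    advantage = q_hat_n - V(s_t)                      # baseline subtraction
    theta = theta + alpha * advantage * grad_log_pi(s_t, a_t)
    return theta, advantage
```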
what is A3C?
multiple actor-learners are dispatched to separate instantiations of the environment. They all interact with their own copy of the environment, collect experience, and asynchronously push their gradient updates to a central global network. This has a stabilizing effect on training
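a heavily reduced sketch of the A3C structure: several worker threads asynchronously push updates into one shared parameter vector; the gradient is a random placeholder here, since only the asynchronous-update architecture is illustrated:

```python
import threading
import numpy as np

shared_theta = np.zeros(8)                            # central (global) parameters
lock = threading.Lock()

def worker(worker_id, n_updates=100, alpha=0.01):
    rng = np.random.default_rng(worker_id)            # each worker has its own experience
    for _ in range(n_updates):
        local_theta = shared_theta.copy()             # pull the current global parameters
        grad = rng.normal(size=local_theta.shape)     # placeholder for a locally computed gradient
        with lock:                                    # push the update to the global network
            shared_theta[:] = shared_theta + alpha * grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for th in threads: th.start()
for th in threads: th.join()
```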
what is TRPO?
Trust Region Policy Optimization: it aims to further reduce the high variability in the policy parameters.
We want to take the largest possible improvement step on a policy parameter without causing performance collapse, so we use an adaptive step size (the trust region) whose size depends on how well the optimization is progressing.
If the quality of the approximation is still good, we can expand the region; if the divergence between the new and the current policy becomes too large, we shrink it
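a rough sketch of the adaptive trust-region rule, measuring how far the new policy diverges from the current one with a KL divergence; this illustrates only the expand/shrink logic, not TRPO's actual constrained optimization:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def adapt_trust_region(region_size, kl, max_kl=0.01, grow=1.5, shrink=0.5):
    """Expand the region while the new policy stays close, shrink it otherwise."""
    if kl <= max_kl:
        return region_size * grow    # approximation still good: allow larger steps
    return region_size * shrink      # new policy diverged too far: take smaller steps
```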