C4 Flashcards
policy-based methods
do not use a separate value function but find the policy directly. They start with a policy function which they improve, episode by episode, with policy gradient methods
why do we need policy-based methods?
for environments with discrete actions, selecting the action with the best value in a state works well, because that action is clearly separated from the next-best action; for continuous action spaces this becomes unstable, because with value-based methods small perturbations in the Q-values may lead to large changes in the policy
why do stochastic policies not need separate exploration methods?
they perform exploration by their nature, because they return a distribution over actions
what is a potential disadvantage of purely episodic policy-based methods?
they have high variance, may find a local instead of the global optimum, and converge more slowly than value-based methods
how do policy-based methods learn?
they learn a parameterized policy that selects actions without consulting a value function; the policy function is represented directly, which allows the policy to select actions in a continuous space
so, the policy is represented by a set of parameters θ, which map the states S to action probabilities A. We randomly sample a new policy, and if it is better, adjust the parameters in the direction of this new policy
how do we measure the quality of a policy?
the value of the start state:
J(θ) = V^π(s_0)
how do we maximize the objective function J?
we apply gradient ascent, so in each time step we do this update:
θ_{t+1} = θ_t + α · ∇_θ J(θ)
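a minimal Python sketch of this update; the gradient estimator grad_J is a hypothetical placeholder, not part of the card:

```python
def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """One gradient-ascent step: theta_{t+1} = theta_t + alpha * grad_J(theta_t)."""
    return theta + alpha * grad_J(theta)
```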
how is the following update rule derived?
θ_{t+1} = θ_t + α · Q̂(s, a) · ∇_θ log π_θ(a | s)
from θ_{t+1} = θ_t + α · ∇_θ J(θ), we can fill in this:
θ_{t+1} = θ_t + α · ∇_θ π_{θ_t}(a* | s), because we want to push the parameters in the direction of the optimal action a*
we don't know which action is best, but we can take a sample trajectory and use estimates (Q̂) of the action values of the sample. Then we get θ_{t+1} = θ_t + α · Q̂(s, a) · ∇_θ π_{θ_t}(a | s)
Problem: we push harder on actions with a high value, but we also push more often on actions the policy already selects frequently, so frequently chosen actions are doubly favored. We fix this by dividing by the probability of the action, π_θ(a | s). The fraction ∇_θ π_θ(a | s) / π_θ(a | s) can be written as ∇_θ log π_θ(a | s), which gives the update rule above
This formula is the core of REINFORCE
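a minimal sketch of REINFORCE for a policy represented as a linear softmax over state features; the classic Gym-style reset()/step() interface, the features function, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode(env, theta, features, n_actions, alpha=0.01, gamma=0.99):
    """Sample one trajectory, then apply theta += alpha * G_t * grad log pi per step."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta @ features(s))          # pi_theta(.|s)
        a = np.random.choice(n_actions, p=probs)      # sampling gives natural exploration
        s_next, r, done, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    G = 0.0
    for t in reversed(range(len(states))):            # accumulate returns backwards
        G = rewards[t] + gamma * G                    # G_t estimates Q(s_t, a_t)
        x = features(states[t])
        probs = softmax(theta @ x)
        # gradient of log pi_theta(a_t|s_t) for a linear softmax policy
        grad_log_pi = np.outer(np.eye(n_actions)[actions[t]] - probs, x)
        theta += alpha * G * grad_log_pi              # REINFORCE update
    return theta, sum(rewards)
```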
what is meant with the online approach?
the updates are performed as the timesteps of the trajectory are traversed, so information is used as soon as it is known. This is the opposite of the batch approach, where gradients are summed over the states and actions and updates are performed at the end of the trajectory
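a small sketch of the two styles, assuming the per-step gradient estimates (e.g. Q̂(s, a) · ∇_θ log π_θ(a | s) at each step) for one trajectory are already collected in step_grads:

```python
import numpy as np

def online_update(theta, step_grads, alpha=0.01):
    """Online: apply each per-step gradient as soon as it becomes available."""
    for g in step_grads:
        theta = theta + alpha * g
    return theta

def batch_update(theta, step_grads, alpha=0.01):
    """Batch: sum the gradients over the trajectory and update once at the end."""
    return theta + alpha * np.sum(step_grads, axis=0)
```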
name 3 advantages of policy-based methods
- parameterization is at the core of policy-based methods, making them a good match for deep learning (no stability problems)
- they can easily find stochastic policies (value-based methods find deterministic policies), and exploration comes naturally, so there is no need for epsilon-greedy
- they are effective in large or continuous action spaces: small changes in θ lead to small changes in π, so they do not suffer from convergence and stability issues
what are the disadvantages of policy-based methods?
they are high-variance, because a full trajectory is generated randomly (no guidance at each step). Consequences:
- policy improvement happens infrequently, leading to slow convergence compared to value-based methods
- often a local optimum is found, since convergence to the global optimum takes too long
why do we need Actor-Critic bootstrapping?
we want to combine the advantage of the value-based approach (low variance) with the advantage of the policy-based approach (low bias)
bootstrapping gives us a better reward estimate, so it reduces the variance that comes from the cumulative-reward estimate. It uses the value function to compute intermediate n-step values within the episode. These n-step targets sit in between the full-episode Monte Carlo return and the single-step temporal difference target. We compute the n-step target (instead of the full-trace return):
Q̂_n(s_t, a_t) = Σ_{i=0}^{n-1} r_{t+i} + V_φ(s_{t+n})
and use this improved estimate to update the policy
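a sketch of this n-step target, assuming a value-function approximator V; the card's formula is undiscounted, which corresponds to gamma=1.0 here:

```python
def n_step_target(rewards, states, t, n, V, gamma=1.0):
    """Q_hat_n(s_t, a_t) = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * V(s_{t+n})."""
    steps = min(n, len(rewards) - t)                  # truncate near the episode end
    target = sum(gamma**i * rewards[t + i] for i in range(steps))
    if t + steps < len(rewards):                      # bootstrap with the critic only if
        target += gamma**steps * V(states[t + steps]) # the episode continues beyond t+n
    return target
```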
why baseline subtraction?
it reduces the variance but leaves the expectation unaffected: we only push up on actions that are better than average and push down on actions that are worse than average, instead of pushing up everything that is positive. As a baseline we choose the value function. We obtain the advantage function:
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
it estimates how much better a particular action is, compared to the expected value of the state in which it is taken
we can now fill in the estimated cumulative reward and use the estimated advantage to update the policy
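a sketch of this advantage-based policy update, assuming hypothetical helpers V(s) and grad_log_pi(s, a) (for example taken from the REINFORCE sketch above) and an illustrative learning rate:

```python
def advantage_update(theta, V, grad_log_pi, s_t, a_t, q_hat_n, alpha=0.01):
    """Policy update scaled by the advantage A(s_t, a_t) = Q_hat_n - V(s_t)."""
    advantage = q_hat_n - V(s_t)                      # baseline subtraction
    theta = theta + alpha * advantage * grad_log_pi(s_t, a_t)
    return theta, advantage
```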
what is A3C?
multiple actor-learners are dispatched to separate instantiations of the environment. They all interact with their own copy of the environment, collect experience, and asynchronously push their gradient updates to a central global network. This has a stabilizing effect on training
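a heavily reduced sketch of the A3C structure: several worker threads asynchronously push updates into one shared parameter vector; the gradient is a random placeholder here, since only the asynchronous-update architecture is illustrated:

```python
import threading
import numpy as np

shared_theta = np.zeros(8)                            # central (global) parameters
lock = threading.Lock()

def worker(worker_id, n_updates=100, alpha=0.01):
    rng = np.random.default_rng(worker_id)            # each worker has its own experience
    for _ in range(n_updates):
        local_theta = shared_theta.copy()             # pull the current global parameters
        grad = rng.normal(size=local_theta.shape)     # placeholder for a locally computed gradient
        with lock:                                    # push the update to the global network
            shared_theta[:] = shared_theta + alpha * grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for th in threads: th.start()
for th in threads: th.join()
```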
what is TRPO?
Trust Region Policy Optimization: it aims to further reduce the high variability in the policy parameters.
We want to take the largest possible improvement step on a policy parameter without causing performance collapse, so we use an adaptive step size (the trust region) whose size depends on how well the optimization is progressing.
If the quality of the approximation is still good, we can expand the region; if the divergence between the new and the current policy becomes too large, we shrink it
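a rough sketch of the adaptive trust-region rule, measuring how far the new policy diverges from the current one with a KL divergence; this illustrates only the expand/shrink logic, not TRPO's actual constrained optimization:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def adapt_trust_region(region_size, kl, max_kl=0.01, grow=1.5, shrink=0.5):
    """Expand the region while the new policy stays close, shrink it otherwise."""
    if kl <= max_kl:
        return region_size * grow    # approximation still good: allow larger steps
    return region_size * shrink      # new policy diverged too far: take smaller steps
```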