10 - Policy Optimization Flashcards
Value Function Approximation (VFA) goes from tabular to parameterized value estimates. Policy Search goes from _____________ to _____________ policy estimates.
Value Function Approximation (VFA) goes from tabular to parameterized value estimates. Policy Search goes from tabular to parameterized policy estimates.
Policy search typically results in __________ {low, high} variance estimates of the policy's value.
Policy search typically results in high variance estimates of the policy's value.
In general, one of the biggest reasons to choose gradient-free optimization over gradient-based optimization is that the error function is not ____________.
In general, one of the biggest reasons to choose gradient-free optimization over gradient-based optimization is that the error function is not differentiable (or even continuous).
One way to improve policy search is to use domain knowledge to ____________ the space of possible policies. (from ~26:25)
One way to improve policy search is to use domain knowledge to restrict the space of possible policies. (from ~26:25)
In order to analytically calculate the gradient, the policy has to be ____________ whenever it is non-zero.
In order to analytically calculate the gradient, the policy has to be differentiable whenever it is non-zero.
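For example, with a softmax policy over linear features \phi(s,a) (a standard textbook parameterization, used here only as an illustration), the score function has a closed form:

```latex
\pi_\theta(a \mid s) = \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_{a'} \exp\!\big(\theta^\top \phi(s,a')\big)}
\qquad\Rightarrow\qquad
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s)}\big[\phi(s,a')\big]
```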
What are the 3 steps in all RL algorithms?
- __________ the policy
- __________ the policy
- __________ the policy
- Run the policy: actually act in env
- Evaluate the policy: estimate V^π or Q^π
- Improve the policy: do something which lets you pick better actions
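A minimal runnable sketch of how the three steps fit together, using REINFORCE on a made-up 2-armed bandit (all names and constants here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])   # hidden reward means of the two arms (made up)
prefs = np.zeros(2)                 # policy parameters: softmax action preferences
baseline = 0.0                      # running estimate of the policy's value

for _ in range(2000):
    probs = np.exp(prefs) / np.exp(prefs).sum()     # current softmax policy
    a = rng.choice(2, p=probs)                      # 1. run the policy: act in the env
    r = rng.normal(true_means[a], 0.1)              # observe a reward
    baseline += 0.05 * (r - baseline)               # 2. evaluate: update value estimate
    grad = (r - baseline) * (np.eye(2)[a] - probs)  # score-function gradient estimate
    prefs += 0.1 * grad                             # 3. improve: gradient ascent on prefs

print(np.exp(prefs) / np.exp(prefs).sum())          # most probability should be on the higher-reward (second) arm
```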
Briefly, what is the log-derivative trick?
It is a way of deriving the policy gradient: since the gradient of the policy equals the policy times the gradient of its log probability, the gradient of the expected return becomes an expectation over the score function (the gradient of the log probability), which has the nice property that its expectation is 0.
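Written out (a standard form of the derivation, over trajectories τ):

```latex
\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)
\;\;\Rightarrow\;\;
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\big],
\qquad
\mathbb{E}_{\tau \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(\tau)\big] = 0.
```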
Briefly, what is the advantage function?
This is the difference between the Q value of a given state-action pair and the value function of that state.
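In symbols (standard definition, for a policy π):

```latex
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
```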
Policy gradient methods are ______________ {off policy, on policy}.
Policy gradient methods are on policy.
Estimating the value of a policy is typically a ___________ {regression, classification} supervised machine learning problem.
Estimating the value of a policy is typically a regression supervised machine learning problem.
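A minimal sketch of that regression, assuming (purely for illustration) a linear value function over made-up features and Monte Carlo returns as targets:

```python
import numpy as np

# Fit V(s) ≈ w^T phi(s) by least squares, with Monte Carlo returns G as the
# regression targets. The feature vectors and returns below are invented.
phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])        # phi(s) for three visited states
G = np.array([1.0, 0.6, 0.1])       # sampled returns observed from those states

w, *_ = np.linalg.lstsq(phi, G, rcond=None)  # regression: minimize ||phi w - G||^2
print(phi @ w)                               # fitted value estimates V(s)
```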
The bias-variance trade-off for policy gradient methods can be managed with _____________ .
The bias-variance trade-off for policy gradient methods can be managed with the choice of n, the number of steps used in the (n-step) return estimate.
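Concretely, the n-step return (standard definition) is the knob: small n bootstraps sooner (lower variance, more bias), while large n relies more on sampled rewards (higher variance, less bias):

```latex
G_t^{(n)} = r_{t+1} + \gamma\, r_{t+2} + \cdots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n}\, \hat{V}(s_{t+n})
```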