10 - Policy Optimization Flashcards
Value Function Approximation (VFA) goes from tabular to parameterized value estimates. Policy Search goes from _____________ to _____________ policy estimates.
Value Function Approximation (VFA) goes from tabular to parameterized value estimates. Policy Search goes from tabular to parameterized policy estimates.
Policy search typically results in __________ {low, high} variance estimates of the policy's value.
Policy search typically results in high variance estimates of the policy's value.
In general, one of the biggest reasons to choose gradient-free optimization over gradient-based optimization is that the error function is not ____________.
In general, one of the biggest reasons to choose gradient-free optimization over gradient-based optimization is that the error function is not differentiable (or even continuous).
One way to improve policy search is to use domain knowledge to ____________ the space of possible policies. (from ~26:25)
One way to improve policy search is to use domain knowledge to restrict the space of possible policies. (from ~26:25)
In order to analytically calculate the gradient, the policy has to be ____________ whenever it is non-zero.
In order to analytically calculate the gradient, the policy has to be differentiable whenever it is non-zero.
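For example, with a softmax policy over linear features \phi(s,a) (a standard textbook parameterization, used here only as an illustration), the score function has a closed form:

```latex
\pi_\theta(a \mid s) = \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_{a'} \exp\!\big(\theta^\top \phi(s,a')\big)}
\qquad\Rightarrow\qquad
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s)}\big[\phi(s,a')\big]
```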
What are the 3 steps in all RL algorithms?
- __________ the policy
- __________ the policy
- __________ the policy
- Run the policy: actually act in env
- Evaluate the policy: estimate V^π or Q^π
- Improve the policy: do something which lets you pick better actions
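A minimal runnable sketch of how the three steps fit together, using REINFORCE on a made-up 2-armed bandit (all names and constants here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])   # hidden reward means of the two arms (made up)
prefs = np.zeros(2)                 # policy parameters: softmax action preferences
baseline = 0.0                      # running estimate of the policy's value

for _ in range(2000):
    probs = np.exp(prefs) / np.exp(prefs).sum()     # current softmax policy
    a = rng.choice(2, p=probs)                      # 1. run the policy: act in the env
    r = rng.normal(true_means[a], 0.1)              # observe a reward
    baseline += 0.05 * (r - baseline)               # 2. evaluate: update value estimate
    grad = (r - baseline) * (np.eye(2)[a] - probs)  # score-function gradient estimate
    prefs += 0.1 * grad                             # 3. improve: gradient ascent on prefs

print(np.exp(prefs) / np.exp(prefs).sum())          # most probability should be on the higher-reward (second) arm
```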
Briefly, what is the log-derivative trick?
It is a way of deriving the policy gradient: since the gradient of the policy equals the policy times the gradient of its log probability, the gradient of the expected return becomes an expectation over the score function (the gradient of the log probability), which has the nice property that its expectation is 0.
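Written out (a standard form of the derivation, over trajectories τ):

```latex
\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)
\;\;\Rightarrow\;\;
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\big],
\qquad
\mathbb{E}_{\tau \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(\tau)\big] = 0.
```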
Briefly, what is the advantage function?
This is the difference between the Q value of a given state-action pair and the value function of that state.
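In symbols (standard definition, for a policy π):

```latex
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
```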
Policy gradient methods are ______________ {off policy, on policy}.
Policy gradient methods are on policy.
Estimating the value of a policy is typically a ___________ {regression, classification} supervised machine learning problem.
Estimating the value of a policy is typically a regression supervised machine learning problem.
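A minimal sketch of that regression, assuming (purely for illustration) a linear value function over made-up features and Monte Carlo returns as targets:

```python
import numpy as np

# Fit V(s) ≈ w^T phi(s) by least squares, with Monte Carlo returns G as the
# regression targets. The feature vectors and returns below are invented.
phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])        # phi(s) for three visited states
G = np.array([1.0, 0.6, 0.1])       # sampled returns observed from those states

w, *_ = np.linalg.lstsq(phi, G, rcond=None)  # regression: minimize ||phi w - G||^2
print(phi @ w)                               # fitted value estimates V(s)
```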
The bias-variance trade-off for policy gradient methods can be managed with _____________ .
The bias-variance trade-off for policy gradient methods can be managed with the choice of n, the number of steps used in the (n-step) return estimate.
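Concretely, the n-step return (standard definition) is the knob: small n bootstraps sooner (lower variance, more bias), while large n relies more on sampled rewards (higher variance, less bias):

```latex
G_t^{(n)} = r_{t+1} + \gamma\, r_{t+2} + \cdots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n}\, \hat{V}(s_{t+n})
```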