10 - Policy Optimization Flashcards

1
Q

Value Function Approximation (VFA) goes from tabular to parameterized value estimates. Policy Search goes from _____________ to _____________ policy estimates.

A

Value Function Approximation (VFA) goes from tabular to parameterized value estimates. Policy Search goes from tabular to parameterized policy estimates.
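
To make "tabular vs. parameterized" concrete, here is a minimal NumPy sketch (the sizes and the random feature map are made up for illustration): the tabular policy stores one probability row per state, while the parameterized policy computes probabilities from a weight matrix theta.

```python
import numpy as np

n_states, n_actions, n_features = 4, 2, 3

# Tabular policy: one probability row per state, stored explicitly.
tabular_policy = np.full((n_states, n_actions), 1.0 / n_actions)

# Parameterized policy: probabilities are computed from parameters theta and a
# (hypothetical) feature map phi(s), so storage scales with the number of
# parameters rather than the number of states.
theta = np.zeros((n_features, n_actions))
phi = np.random.rand(n_states, n_features)  # stand-in state features

def softmax_policy(state):
    """pi_theta(a | s) as a linear softmax over state features."""
    logits = phi[state] @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(softmax_policy(0))  # uniform, e.g. [0.5, 0.5], before any learning
```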

2
Q

Policy search typically results in __________ {low, high} variance estimates of the policy.

A

Policy search typically results in high variance estimates of the policy.

3
Q

In general, one of the biggest reasons to choose gradient-free optimization over gradient-based optimization is that the error function is not ____________.

A

In general, one of the biggest reasons to choose gradient-free optimization over gradient-based optimization is that the error function is not continuous (and convex).
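
For example, a gradient-free method like simple hill climbing only needs to evaluate returns, never differentiate them, so it does not care whether the objective is continuous. A minimal sketch, with a self-contained stand-in for `evaluate_policy` (in practice this would be rollouts of the policy in your environment):

```python
import numpy as np

def evaluate_policy(theta):
    """Hypothetical stand-in: run the policy parameterized by theta and
    return the average episode return. Here we just score theta against a
    fixed target so the example is self-contained and runnable."""
    target = np.ones_like(theta)
    return -np.sum((theta - target) ** 2)

rng = np.random.default_rng(0)
theta = rng.normal(size=4)          # current policy parameters
best_return = evaluate_policy(theta)

# Hill climbing: perturb the parameters and keep the perturbation only if the
# estimated return improves. No gradients are ever computed.
for _ in range(200):
    candidate = theta + 0.1 * rng.normal(size=theta.shape)
    candidate_return = evaluate_policy(candidate)
    if candidate_return > best_return:
        theta, best_return = candidate, candidate_return

print(best_return)
```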

4
Q

One way to improve policy search is to use domain knowledge to ____________ the space of possible policies. (from ~26:25)

A

One way to improve policy search is to use domain knowledge to decrease the space of possible policies. (from ~26:25)

5
Q

In order to analytically calculate the gradient, the policy has to be ____________ whenever it is non-zero.

A

In order to analytically calculate the gradient, the policy has to be differentiable whenever it is non-zero (and we need to be able to compute that gradient).

6
Q

What are the 3 steps in all RL algorithms?

  1. __________ the policy
  2. __________ the policy
  3. __________ the policy
A
  1. Run the policy: actually act in the environment
  2. Evaluate the policy: estimate V^π or Q*
  3. Improve the policy: do something that lets you pick better actions
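
A toy, self-contained illustration of those three steps (a 2-armed bandit with an epsilon-greedy policy; none of this is from the lecture, it just makes the loop concrete):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-state, 2-action "environment": action 1 pays more on average.
def step(action):
    return rng.normal(loc=float(action), scale=1.0)  # reward only

q_estimates = np.zeros(2)   # running estimates of Q(a)
counts = np.zeros(2)
epsilon = 0.1

for episode in range(500):
    # 1. Run the policy: act in the environment (epsilon-greedy here).
    if rng.random() < epsilon:
        action = rng.integers(2)
    else:
        action = int(np.argmax(q_estimates))
    reward = step(action)

    # 2. Evaluate the policy: update the estimate of Q for the taken action.
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]

    # 3. Improve the policy: on the next iteration we act greedily with
    #    respect to the updated estimates, which is the improvement step.

print(q_estimates)
```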
7
Q

Briefly, what is the log-derivative trick?

A

It is a way of deriving the policy gradient by rewriting the gradient of a probability as the probability times the gradient of its log. This gives us the score function, ∇θ log πθ, which has the nice property that its expectation under the policy is 0.
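
Written out (this is the standard identity, not anything course-specific):

```latex
\nabla_\theta \pi_\theta(a \mid s)
  = \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s)
\quad\Longrightarrow\quad
\nabla_\theta\,\mathbb{E}_{a \sim \pi_\theta}\!\big[R\big]
  = \mathbb{E}_{a \sim \pi_\theta}\!\big[R\,\nabla_\theta \log \pi_\theta(a \mid s)\big].
```

The zero-mean property of the score function follows from normalization:

```latex
\mathbb{E}_{a \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(a \mid s)\big]
  = \sum_a \pi_\theta(a \mid s)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = \nabla_\theta 1 = 0 .
```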

8
Q

Briefly, what is the advantage function?

A

This is the difference between the Q value of a given state-action pair and the value function of that state.
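
In symbols, for a policy π:

```latex
A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) - V^{\pi}(s)
```

Using A^π in place of the raw return in the policy gradient is the standard baseline trick: it leaves the expected gradient unchanged (the score function has zero mean) but typically reduces variance.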

9
Q

Policy gradient methods are ______________ {off policy, on policy}.

A

Policy gradient methods are on policy: the gradient is an expectation over trajectories sampled from the current policy, so each update needs fresh samples from that policy.
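
A bare-bones REINFORCE-style sketch of why: each update below uses an action sampled from the current πθ and the score function evaluated under that same πθ. The one-step "environment" `reward_for` is a made-up stand-in so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # logits of a 2-action softmax policy
alpha = 0.1                    # step size

def reward_for(action):
    # Toy stand-in for the environment: action 1 is better on average.
    return rng.normal(loc=float(action), scale=0.5)

for _ in range(1000):
    # On policy: the action is sampled from the *current* policy pi_theta ...
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    action = rng.choice(2, p=probs)
    reward = reward_for(action)

    # ... and the score function grad_theta log pi_theta(action) is computed
    # under that same policy, then scaled by the sampled return.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += alpha * reward * grad_log_pi

print(probs)                   # should put most of its mass on action 1
```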

10
Q

Estimating the value of a policy is typically a ___________ {regression, classification} supervised machine learning problem.

A

Estimating the value of a policy is typically a regression supervised machine learning problem.

11
Q

The bias-variance trade-off for policy gradient methods can be managed with _____________ .

A

The bias-variance trade-off for policy gradient methods can be managed with the choice of n, the number of steps used in the return estimate (n-step returns).
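
Concretely, the n-step return used as the target is (standard notation, with V̂ the current value estimate):

```latex
G_t^{(n)} \;=\; r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n}\,\hat{V}(s_{t+n})
```

Small n bootstraps heavily from V̂ (lower variance, more bias); as n grows toward the full Monte Carlo return the estimate becomes unbiased but higher variance.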
