CS7642_Week3 Flashcards

1
Q

What is the difference between RL with control and RL without control?

A

Control in RL just means that the actions are being chosen by the learner; without control, the learner only estimates values from transitions it observes but does not get to choose.

2
Q

What does a contraction mapping do when applied to functions F and G?

A

It brings them closer together: the distance between the mapped functions shrinks by at least a constant factor less than one.
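One standard way to write this, with B as the contraction operator, the max-norm as the distance, and gamma as the contraction factor (this notation is an assumption, not on the card):

```latex
\| BF - BG \|_{\infty} \;\le\; \gamma \, \| F - G \|_{\infty}, \qquad 0 \le \gamma < 1
```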

3
Q

What three things need to hold for the convergence theorem to guarantee that Q converges in the limit?

A
  1. The updates average over the noisy transitions (so the noise washes out in the limit).
  2. The update operator must be a non-expansion; the one-step lookahead we used in value iteration (the Bellman operator) satisfies this, and is in fact a contraction.
  3. The learning rate sequence must sum to infinity while its sum of squares stays finite (see the sketch below).
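A minimal sketch of how these conditions can show up in a tabular Q-learning update; the array sizes, names, and the 1/t learning-rate schedule are illustrative assumptions, not anything specified on the card:

```python
import numpy as np

# Illustrative tabular Q-learning update (hypothetical sizes and names).
n_states, n_actions, gamma = 10, 2, 0.9
Q = np.zeros((n_states, n_actions))
visit_count = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One stochastic-approximation step that averages over noisy transitions."""
    visit_count[s, a] += 1
    # alpha_t = 1/t per (s, a) satisfies condition 3: sum = inf, sum of squares < inf.
    alpha = 1.0 / visit_count[s, a]
    # One-step lookahead target; the max backup is a non-expansion (condition 2).
    target = r + gamma * np.max(Q[s_next])
    # Move a small step toward the noisy target, averaging it out over time (condition 1).
    Q[s, a] += alpha * (target - Q[s, a])
```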
4
Q

How might we compute how much we do (or don’t) care about future rewards?

A

It’s roughly the effective horizon H ~ 1 / (1 - gamma): the closer gamma is to 1, the further into the future rewards still matter.
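A quick worked instance of that relationship (the particular gamma values are just examples):

```latex
H \approx \frac{1}{1 - \gamma}: \qquad \gamma = 0.9 \;\Rightarrow\; H \approx 10, \qquad \gamma = 0.99 \;\Rightarrow\; H \approx 100
```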

5
Q

Why is it not a good idea to set gamma to very small values?

A

You end up with an agent that acts myopically, always seeking immediate gratification rather than playing “the long game” (the effective horizon 1 / (1 - gamma) collapses toward a single step).

6
Q

What are three important features of policy iteration (PI)? What is the tradeoff we make when using PI over VI? What is the most important feature of PI?

A
  1. Q -> Q* in the limit
  2. Convergence is exact and complete in finite time (assuming use of the greedy policy)
  3. Converges at least as fast as value iteration (VI)

The downside of PI is computational complexity: each iteration now works with a full policy (and must evaluate it completely) rather than just updating values at states / state-action pairs as in VI.

The most important thing about PI is that it can't get stuck in a local optimum, because each round either strictly improves the policy's value or leaves it unchanged at convergence (value improvement / non-improvement). A sketch of the loop follows below.
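A minimal sketch of the PI loop itself, under assumed shapes for a tabular transition model P and reward model R (the names, shapes, and the exact linear-solve evaluation step are illustrative assumptions, not from the card):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P: (A, S, S) transition probabilities, R: (S, A) rewards."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]        # (S, S) rows for the chosen actions
        R_pi = R[np.arange(n_states), policy]        # (S,) rewards under the policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy one-step lookahead on the evaluated V.
        Q = R.T + gamma * P @ V                      # (A, S) action-values
        new_policy = np.argmax(Q, axis=0)
        if np.array_equal(new_policy, policy):       # no improvement -> converged
            return policy, V
        policy = new_policy
```

Because each improvement step is greedy with respect to an exactly evaluated V, the policy's value can only go up or stay the same, which is the "can't get stuck" property above.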

7
Q

What is bounded loss/regret with respect to policies?

A

A policy that is epsilon-optimal, i.e. one whose value at each state/timestep is no more than epsilon away from what the optimal policy would have achieved.
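Written out (a standard way to state it, using value functions over states; the notation is an assumption, not from the card):

```latex
\bigl| V^{\pi}(s) - V^{*}(s) \bigr| \;\le\; \epsilon \quad \text{for all } s
```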
