CS7642_Week3 Flashcards

1
Q

What is the difference between RL with control and RL without control?

A

Control in RL just means that the actions are being chosen by the learner; without control, the learner only estimates values from transitions it observes but does not get to choose.

2
Q

What does a contraction mapping do when applied to functions F and G?

A

It brings them closer together: the distance between the mapped functions shrinks by at least a constant factor less than one.
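One standard way to write this, with B as the contraction operator, the max-norm as the distance, and gamma as the contraction factor (this notation is an assumption, not on the card):

```latex
\| BF - BG \|_{\infty} \;\le\; \gamma \, \| F - G \|_{\infty}, \qquad 0 \le \gamma < 1
```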

3
Q

What three things need to hold for the convergence theorem to guarantee that Q converges in the limit?

A
  1. The updates average over the noisy transitions (so the noise washes out in the limit).
  2. The update operator must be a non-expansion; the one-step lookahead we used in value iteration (the Bellman operator) satisfies this, and is in fact a contraction.
  3. The learning rate sequence must sum to infinity while its sum of squares stays finite (see the sketch below).
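A minimal sketch of how these conditions can show up in a tabular Q-learning update; the array sizes, names, and the 1/t learning-rate schedule are illustrative assumptions, not anything specified on the card:

```python
import numpy as np

# Illustrative tabular Q-learning update (hypothetical sizes and names).
n_states, n_actions, gamma = 10, 2, 0.9
Q = np.zeros((n_states, n_actions))
visit_count = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One stochastic-approximation step that averages over noisy transitions."""
    visit_count[s, a] += 1
    # alpha_t = 1/t per (s, a) satisfies condition 3: sum = inf, sum of squares < inf.
    alpha = 1.0 / visit_count[s, a]
    # One-step lookahead target; the max backup is a non-expansion (condition 2).
    target = r + gamma * np.max(Q[s_next])
    # Move a small step toward the noisy target, averaging it out over time (condition 1).
    Q[s, a] += alpha * (target - Q[s, a])
```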
4
Q

How might we compute how much we do (or don’t) care about future rewards?

A

It’s roughly the effective horizon H ~ 1 / (1 - gamma): the closer gamma is to 1, the further into the future rewards still matter.
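A quick worked instance of that relationship (the particular gamma values are just examples):

```latex
H \approx \frac{1}{1 - \gamma}: \qquad \gamma = 0.9 \;\Rightarrow\; H \approx 10, \qquad \gamma = 0.99 \;\Rightarrow\; H \approx 100
```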

5
Q

Why is it not a good idea to set gamma to very small values?

A

You end up with an agent that acts myopically, always seeking immediate gratification rather than playing “the long game” (the effective horizon 1 / (1 - gamma) collapses toward a single step).

6
Q

What are three important features of policy iteration (PI)? What is the tradeoff we make when using PI over VI? What is the most important feature of PI?

A
  1. Q -> Q* in the limit
  2. Convergence is exact and complete in finite time (assuming use of the greedy policy)
  3. Converges at least as fast as value iteration (VI)

The downside of PI is computational complexity: each iteration now works with a full policy (and must evaluate it completely) rather than just updating values at states / state-action pairs as in VI.

The most important thing about PI is that it can't get stuck in a local optimum, because each round either strictly improves the policy's value or leaves it unchanged at convergence (value improvement / non-improvement). A sketch of the loop follows below.
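A minimal sketch of the PI loop itself, under assumed shapes for a tabular transition model P and reward model R (the names, shapes, and the exact linear-solve evaluation step are illustrative assumptions, not from the card):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P: (A, S, S) transition probabilities, R: (S, A) rewards."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]        # (S, S) rows for the chosen actions
        R_pi = R[np.arange(n_states), policy]        # (S,) rewards under the policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy one-step lookahead on the evaluated V.
        Q = R.T + gamma * P @ V                      # (A, S) action-values
        new_policy = np.argmax(Q, axis=0)
        if np.array_equal(new_policy, policy):       # no improvement -> converged
            return policy, V
        policy = new_policy
```

Because each improvement step is greedy with respect to an exactly evaluated V, the policy's value can only go up or stay the same, which is the "can't get stuck" property above.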

7
Q

What is bounded loss/regret with respect to policies?

A

A policy that is epsilon-optimal, i.e. one whose value at each state/timestep is no more than epsilon away from what the optimal policy would have achieved.
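Written out (a standard way to state it, using value functions over states; the notation is an assumption, not from the card):

```latex
\bigl| V^{\pi}(s) - V^{*}(s) \bigr| \;\le\; \epsilon \quad \text{for all } s
```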
