CS7642_Week3 Flashcards
What is the difference between RL with control and RL without control?
Control in RL just means that the actions are being chosen by the learner (so the learner influences the data it sees); without control, the learner only observes transitions under a fixed behavior and predicts/evaluates values.
What does a contraction mapping do when applied to functions F and G?
It brings them closer together: if B is a contraction mapping with factor gamma < 1, then ||BF - BG|| <= gamma * ||F - G||, so repeatedly applying B drives F and G toward the same (unique) fixed point.
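As a concrete illustration (my own sketch, not from the lecture), the discounted policy-evaluation Bellman backup below shrinks the max-norm distance between two arbitrary value functions; the toy transition matrix and rewards are invented for the example.
```python
import numpy as np

# Toy MDP pieces (invented for illustration): 3 states, fixed policy.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],   # row s: P(s' | s) under the policy
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
R = np.array([1.0, 0.0, 2.0])    # expected reward per state

def backup(V):
    """One application of the policy-evaluation Bellman operator."""
    return R + gamma * P @ V

F = np.random.randn(3)
G = np.random.randn(3)

# The operator is a gamma-contraction in the max norm:
# ||B(F) - B(G)||_inf <= gamma * ||F - G||_inf
before = np.max(np.abs(F - G))
after = np.max(np.abs(backup(F) - backup(G)))
print(before, after, after <= gamma * before + 1e-12)
```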
What three things need to hold for the convergence theorem to guarantee that Q converges in the limit?
- Averaging over noisy transitions
- Update operator must be a non-expansion (the Bellman operator actually satisfies the stronger contraction property); this is essentially the same one-step lookahead we did in value iteration
- Learning rate sequence must sum to infinity, sum of squares must be finite.
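As a quick numerical check of the learning-rate conditions (my own sketch, not course code): a harmonic schedule such as alpha_t = 1/t satisfies both conditions, while a constant learning rate does not, because its sum of squares keeps growing.
```python
import numpy as np

T = 100_000
t = np.arange(1, T + 1)

alpha_harmonic = 1.0 / t          # sum diverges (grows like log T), sum of squares converges
alpha_constant = np.full(T, 0.1)  # sum diverges, but sum of squares also diverges

print("harmonic:", alpha_harmonic.sum(), (alpha_harmonic ** 2).sum())  # ~12.1, ~1.64 (approaching pi^2/6)
print("constant:", alpha_constant.sum(), (alpha_constant ** 2).sum())  # 10000, 1000 (keeps growing with T)
```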
How might we compute how much we do (or don’t) care about future rewards?
The effective horizon is roughly H ~ 1 / (1 - gamma); e.g. gamma = 0.99 corresponds to caring about rewards over roughly the next 100 steps.
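A tiny sanity check of that formula (my own example): the discounted sum of a constant reward of 1 per step converges to 1/(1 - gamma), which is why 1/(1 - gamma) is read as an effective horizon.
```python
gamma = 0.99
horizon = 1.0 / (1.0 - gamma)              # ~100 steps of "effective" lookahead

# Discounted return of a constant reward of 1 per step:
discounted_sum = sum(gamma ** k for k in range(100_000))
print(horizon, discounted_sum)             # both ~100.0
```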
Why is it not a good idea to set gamma to very small values?
You end up with an agent that acts myopically, always seeking immediate gratification rather than playing “the long game”
What are three important features of policy iteration (PI)? What is the tradeoff we make when using PI over VI? What is the most important feature of PI?
- Q -> Q* in the limit
- Convergence is exact and complete in finite time (assuming use of the greedy policy)
- Converges in at least as few iterations as value iteration (VI)
The downside of PI is computational cost: each iteration requires a full policy evaluation (solving for the policy's value at every state), which is more expensive than the single one-step backup per state / state-action that VI performs.
The most important thing about PI is that it can't get stuck in a local optimum, thanks to the value improvement argument: each new policy's value is at least as good everywhere, and strictly better somewhere unless the policy is already optimal.
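For reference, here is a minimal policy-iteration sketch on a hypothetical toy MDP (the transition and reward numbers are invented): evaluate the current policy exactly, greedify with a one-step lookahead, and stop when the policy no longer changes.
```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions. P[a, s, s'] and R[s, a] are invented.
gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s] is a distribution over s'
R = rng.random((n_states, n_actions))

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    P_pi = P[policy, np.arange(n_states)]           # P_pi[s, s'] = P(s' | s, pi(s))
    R_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def improve(V):
    """Greedy policy with respect to the one-step lookahead of V."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)    # Q[s, a]
    return Q.argmax(axis=1)

policy = np.zeros(n_states, dtype=int)
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if np.array_equal(new_policy, policy):          # no further improvement -> optimal
        break
    policy = new_policy

print("optimal policy:", policy)
print("optimal values:", V)
```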
What is bounded loss/regret with respect to policies?
A policy that is epsilon-optimal, i.e. one whose value at every timestep/state is no more than epsilon worse than what would have been achieved by the optimal policy.
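Read concretely (my own illustration, with made-up numbers), epsilon-optimality amounts to comparing the policy's value function against the optimal one state by state:
```python
import numpy as np

def is_epsilon_optimal(V_pi, V_star, epsilon):
    """True if the policy's value is within epsilon of optimal at every state."""
    return np.max(np.abs(np.asarray(V_star) - np.asarray(V_pi))) <= epsilon

# Hypothetical values for illustration:
print(is_epsilon_optimal([9.8, 4.9, 7.0], [10.0, 5.0, 7.1], epsilon=0.25))  # True
print(is_epsilon_optimal([9.8, 4.9, 7.0], [10.0, 5.0, 7.1], epsilon=0.1))   # False
```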