CS7642_Week2 Flashcards
How do we evaluate a learner?
- Value of the returned policy
- Computational complexity (time)
- Experience complexity (i.e., how much data it needs)
What are the 3 “classes” of solution methods for solving RL problems? (bonus: what category do TD methods fall into?)
- Model-based
- Value-based (TD methods fall into this)
- Policy-based
What properties must the learning rate have for RL?
- The sum of the learning rates over time must be infinite
- The sum of the squared learning rates must be finite
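In symbols, with alpha_t the learning rate at step t (a decaying schedule such as alpha_t = 1/t satisfies both conditions, while a constant alpha violates the second):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```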
Name some of the differences between TD(0) and TD(1)
TD(0):
- Slow to propagate information
- High bias, low variance
- Converges to the maximum likelihood estimate (MLE) of the values given the observed data
TD(1):
- Equivalent to MC, samples full trajectories
- Requires full trajectory in order to update
- Low bias, high variance
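A minimal tabular sketch of the two update rules (the dict-based value table `V`, the helper names, and the constants are illustrative assumptions, not course code):

```python
GAMMA = 0.9  # discount factor (assumed for illustration)
ALPHA = 0.1  # learning rate (assumed for illustration)

def td0_update(V, s, r, s_next):
    """TD(0): bootstrap from the current estimate of the next state.
    Updates immediately, but information propagates one step at a time."""
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

def td1_update(V, trajectory):
    """TD(1) / Monte Carlo-style update: wait for the full trajectory, then
    move each visited state toward the actual observed return (unbiased,
    but high variance). trajectory is a list of (state, reward) pairs."""
    G = 0.0
    for s, r in reversed(trajectory):
        G = r + GAMMA * G          # return observed from state s onward
        V[s] += ALPHA * (G - V[s])
```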
What values of lambda tend to work well (empirically speaking) when used in TD(lambda)?
0.3-0.7
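For context, a sketch of one online TD(lambda) step with accumulating eligibility traces, where `lam` is the interpolation knob (lam = 0 recovers TD(0), lam = 1 behaves like TD(1)/MC); `V` and `E` are assumed to be dicts over the same state set, with `E` initialized to zeros:

```python
def td_lambda_update(V, E, s, r, s_next, lam=0.5, alpha=0.1, gamma=0.9):
    """One online TD(lambda) step with accumulating eligibility traces."""
    delta = r + gamma * V[s_next] - V[s]  # one-step TD error
    E[s] += 1.0                           # bump the trace of the visited state
    for state in V:                       # credit every state by its trace
        V[state] += alpha * delta * E[state]
        E[state] *= gamma * lam           # decay traces toward zero
```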
Does Q-learning always converge? If so, what does it converge to?
Yes. Provided the learning rate conditions above hold and every state-action pair is visited infinitely often, Q-learning converges to the optimal action-value function Q*
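A tabular Q-learning sketch showing the update and an epsilon-greedy behavior policy (the dict keyed by `(state, action)` pairs and the helper names are assumptions for illustration); the exploration is what keeps every state-action pair visited:

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward the off-policy target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behavior policy: explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])
```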
What are contractions and non-expansions?
TODO: Watch lesson 5 on convergence (need to particularly pay attention to stuff on contractions and non-expansion at a conceptual level)
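As a reference while reviewing that lesson, the standard definitions (stated in the max norm, the form used for Bellman operators): an operator B is a contraction if it shrinks distances by some factor gamma < 1, and a non-expansion if it never increases them:

```latex
\begin{align*}
\text{Contraction: } & \|BF - BG\|_{\infty} \le \gamma\,\|F - G\|_{\infty}
    \quad \text{for some fixed } \gamma \in [0,1) \text{ and all } F, G \\
\text{Non-expansion: } & \|BF - BG\|_{\infty} \le \|F - G\|_{\infty}
    \quad \text{for all } F, G
\end{align*}
```

The Bellman operator is a gamma-contraction in the max norm, which is what gives it a unique fixed point (Q*) and lets value iteration / Q-learning converge to it.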
What things are contraction mappings / non-expansions?
- Order statistics (e.g., max and min) are non-expansions
- Fixed convex combinations are non-expansions
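A quick sketch of why the max (an order statistic) is a non-expansion, following the standard argument:

```latex
% For every a: f(a) \le g(a) + |f(a) - g(a)| \le \max_a g(a) + \max_a |f(a) - g(a)|.
% Take the max over a on the left, then swap the roles of f and g:
\Bigl|\max_a f(a) - \max_a g(a)\Bigr| \;\le\; \max_a \bigl|f(a) - g(a)\bigr|
```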