Review Session #3 Flashcards
True or False: In general, an update rule which is not a non-expansion will not converge.
False. Coco-Q is a counterexample (as noted in Lecture 4): its update rule is not a non-expansion, yet it still converges. In general, though, you should expect the statement to hold.
True or False: MDPs are a type of Markov game.
True. An MDP is a single-player (single-agent) Markov game.
True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.
False. The two concepts are directly related: every contraction mapping is also a non-expansion. A contraction must strictly shrink distances by some factor γ < 1, while a non-expansion merely cannot increase them.
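For reference, the two definitions side by side (generic operator T over value functions, max norm; this notation is mine, not necessarily the lecture's):

$$
\|T u - T v\|_\infty \le \gamma\,\|u - v\|_\infty,\ \ 0 \le \gamma < 1 \quad \text{(contraction mapping)}
\qquad
\|T u - T v\|_\infty \le \|u - v\|_\infty \quad \text{(non-expansion)}
$$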
True or False: Linear programming is the only way we are able to solve MDPs in linear time.
False. Linear programming solves MDPs in polynomial time, not linear time; it is the only known way to solve MDPs in worst-case polynomial time. In practice, dynamic programming methods (value iteration, policy iteration) are typically used instead.
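As a concrete illustration, here is a minimal sketch of solving a tiny MDP with the primal LP — minimize the sum of state values subject to the Bellman-optimality inequalities — using scipy.optimize.linprog. The transition and reward arrays are invented for the example, not taken from lecture:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP (values invented for illustration).
# T[s, a, s'] = transition probability, R[s, a] = expected reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

# Primal LP: minimize sum_s V(s)
# subject to V(s) >= R(s,a) + gamma * sum_s' T(s,a,s') V(s') for all (s,a).
# Rewritten for linprog's A_ub @ x <= b_ub form:
#   gamma * T(s,a,:) @ V - V(s) <= -R(s,a)
c = np.ones(n_states)
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * T[s, a]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
print("Optimal values V*:", res.x)
```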
True or False: The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)
False, the objective is to maximize the “policy flow”.
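For context, one standard way to write that dual (my notation, which may differ in constants from the lecture's exact slides), with q(s, a) as the "policy flow" through state-action pair (s, a) and p₀ the initial state distribution:

$$
\max_{q \ge 0} \; \sum_{s,a} q(s,a)\, R(s,a)
\quad \text{subject to} \quad
\sum_{a'} q(s',a') \;=\; p_0(s') + \gamma \sum_{s,a} T(s,a,s')\, q(s,a) \quad \forall s'
$$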
True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP.
False. Only potential-based reward shaping is guaranteed to preserve the original MDP's optimal policy.
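"Potential-based" here means the shaping term is a difference of a state potential Φ (the standard Ng, Harada & Russell form; symbols are the usual ones, not taken from this document):

$$
R'(s, a, s') \;=\; R(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s)
$$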
True or False: Potential-based shaping will find an optimal policy faster than an unshaped MDP.
False, this depends on the selected potential; a poorly chosen potential can leave the learner stuck in a sub-optimal loop for a while before it eventually finds the optimal policy.
True or False: Rmax will always find the optimal policy for a properly tuned learning function.
False. Rmax is not guaranteed to find the optimal policy; its guarantee is that, with high probability, it achieves near-optimal performance within a polynomial number of steps.
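A minimal sketch of the core Rmax idea — treat any state-action pair visited fewer than m times as "unknown" and model it optimistically with reward Rmax. The counts/arrays and the self-loop simplification (in place of the fictitious absorbing max-reward state) are my own framing, not the lecture's exact presentation:

```python
import numpy as np

def rmax_model(counts, reward_sums, trans_counts, n_states, r_max, m):
    """Build the optimistic model that Rmax plans on.

    counts[s, a]        -- number of times (s, a) has been tried
    reward_sums[s, a]   -- total reward observed from (s, a)
    trans_counts[s, a]  -- observed next-state counts for (s, a)
    Unknown pairs (fewer than m visits) are modeled as a self-loop
    paying r_max, which makes them maximally attractive to explore.
    """
    n_actions = counts.shape[1]
    R = np.zeros((n_states, n_actions))
    T = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            if counts[s, a] >= m:
                # "Known" pair: use empirical estimates.
                R[s, a] = reward_sums[s, a] / counts[s, a]
                T[s, a] = trans_counts[s, a] / counts[s, a]
            else:
                # "Unknown" pair: optimism in the face of uncertainty.
                R[s, a] = r_max
                T[s, a, s] = 1.0
    return R, T  # plan (e.g., value iteration) on this model, act greedily
```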
True or False: Q-learning converges only under certain exploration decay conditions.
False. Because Q-learning is off-policy, it converges regardless of how actions are selected (even purely at random), provided every state-action pair is visited infinitely often and the learning rates satisfy the usual stochastic-approximation conditions.
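A minimal tabular Q-learning sketch illustrating the off-policy update; it samples transitions from toy T and R arrays (as in the LP example above), which are assumptions for the example rather than anything from lecture:

```python
import numpy as np

def q_learning(T, R, gamma=0.99, alpha=0.1, epsilon=0.1, steps=50_000):
    """Tabular Q-learning on a simulated MDP.

    T[s, a, s'] and R[s, a] are toy model arrays used only to sample
    transitions; the agent sees only the sampled experience.
    Off-policy: the target uses max over next actions, regardless of how
    the behavior policy (epsilon-greedy, or even purely random) acts.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        # Behavior policy: epsilon-greedy; any sufficiently exploratory
        # policy that visits all (s, a) pairs infinitely often will do.
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next = np.random.choice(n_states, p=T[s, a])
        # Off-policy update toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```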
True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.
False. The trade-off still applies: even in a finite bandit domain we must decide how many times to sample each arm before we are confident enough (at whatever confidence level we feel comfortable with) to stop exploring the arms we "believe" are sub-optimal and exploit the one we believe is best.
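For example, here is a minimal sketch of a Hoeffding-style stopping rule for a Bernoulli bandit — keep sampling arms until one arm's lower confidence bound exceeds every other arm's upper bound. The arm probabilities are made up, and delta plays the role of the "confidence level we feel comfortable with":

```python
import numpy as np

def confidence_bound_bandit(arm_means, delta=0.05, max_pulls=100_000):
    """Explore Bernoulli arms until (1 - delta)-confident in the best arm.

    arm_means are the true (hidden) success probabilities, used only to
    simulate pulls; the agent sees only samples. Confidence radii are the
    Hoeffding-style sqrt(log(2/delta) / (2 n)).
    """
    k = len(arm_means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(max_pulls):
        # Pull each arm once to start, then the least-sampled (widest-bound) arm.
        a = t % k if t < k else int(np.argmin(counts))
        sums[a] += np.random.rand() < arm_means[a]
        counts[a] += 1
        means = sums / np.maximum(counts, 1)
        radius = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
        best = int(np.argmax(means))
        others = [i for i in range(k) if i != best]
        # Stop exploring once the best arm's lower bound beats all others.
        if means[best] - radius[best] > max(means[i] + radius[i] for i in others):
            return best, counts
    return int(np.argmax(means)), counts

best_arm, pulls = confidence_bound_bandit([0.3, 0.5, 0.7])
print("Believed best arm:", best_arm, "pulls per arm:", pulls)
```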