Review Session #3 Flashcards
True or False: In general, an update rule which is not a non-expansion will not converge.
False. Coco-Q is a counterexample (as noted in Lecture 4): its update rule is not a non-expansion, yet it still converges. In general, though, you should expect the statement to hold.
True or False: MDPs are a type of Markov game.
True. An MDP is a single-player (single-agent) Markov game.
True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.
False. The two concepts are directly related: every contraction mapping is also a non-expansion. A contraction must strictly shrink distances by some factor γ < 1, while a non-expansion merely cannot increase them.
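For reference, the two definitions side by side (generic operator T over value functions, max norm; this notation is mine, not necessarily the lecture's):

$$
\|T u - T v\|_\infty \le \gamma\,\|u - v\|_\infty,\ \ 0 \le \gamma < 1 \quad \text{(contraction mapping)}
\qquad
\|T u - T v\|_\infty \le \|u - v\|_\infty \quad \text{(non-expansion)}
$$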
True or False: Linear programming is the only way we are able to solve MDPs in linear time.
False. Linear programming solves MDPs in polynomial time, not linear time; it is the only known way to solve MDPs in worst-case polynomial time. In practice, dynamic programming methods (value iteration, policy iteration) are typically used instead.
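As a concrete illustration, here is a minimal sketch of solving a tiny MDP with the primal LP — minimize the sum of state values subject to the Bellman-optimality inequalities — using scipy.optimize.linprog. The transition and reward arrays are invented for the example, not taken from lecture:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP (values invented for illustration).
# T[s, a, s'] = transition probability, R[s, a] = expected reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

# Primal LP: minimize sum_s V(s)
# subject to V(s) >= R(s,a) + gamma * sum_s' T(s,a,s') V(s') for all (s,a).
# Rewritten for linprog's A_ub @ x <= b_ub form:
#   gamma * T(s,a,:) @ V - V(s) <= -R(s,a)
c = np.ones(n_states)
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * T[s, a]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
print("Optimal values V*:", res.x)
```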
True or False: The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)
False, the objective is to maximize the “policy flow”.
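For context, one standard way to write that dual (my notation, which may differ in constants from the lecture's exact slides), with q(s, a) as the "policy flow" through state-action pair (s, a) and p₀ the initial state distribution:

$$
\max_{q \ge 0} \; \sum_{s,a} q(s,a)\, R(s,a)
\quad \text{subject to} \quad
\sum_{a'} q(s',a') \;=\; p_0(s') + \gamma \sum_{s,a} T(s,a,s')\, q(s,a) \quad \forall s'
$$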
True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP.
False. Only potential-based reward shaping is guaranteed to preserve the original MDP's optimal policy.
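"Potential-based" here means the shaping term is a difference of a state potential Φ (the standard Ng, Harada & Russell form; symbols are the usual ones, not taken from this document):

$$
R'(s, a, s') \;=\; R(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s)
$$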
True or False: Potential-based shaping will find an optimal policy faster than an unshaped MDP.
False, this depends on the selected potential; a poorly chosen potential can leave the learner stuck in a sub-optimal loop for a while before it eventually finds the optimal policy.
True or False: Rmax will always find the optimal policy for a properly tuned learning function.
False. Rmax is not guaranteed to find the optimal policy; its guarantee is that, with high probability, it achieves near-optimal performance within a polynomial number of steps.
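A minimal sketch of the core Rmax idea — treat any state-action pair visited fewer than m times as "unknown" and model it optimistically with reward Rmax. The counts/arrays and the self-loop simplification (in place of the fictitious absorbing max-reward state) are my own framing, not the lecture's exact presentation:

```python
import numpy as np

def rmax_model(counts, reward_sums, trans_counts, n_states, r_max, m):
    """Build the optimistic model that Rmax plans on.

    counts[s, a]        -- number of times (s, a) has been tried
    reward_sums[s, a]   -- total reward observed from (s, a)
    trans_counts[s, a]  -- observed next-state counts for (s, a)
    Unknown pairs (fewer than m visits) are modeled as a self-loop
    paying r_max, which makes them maximally attractive to explore.
    """
    n_actions = counts.shape[1]
    R = np.zeros((n_states, n_actions))
    T = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            if counts[s, a] >= m:
                # "Known" pair: use empirical estimates.
                R[s, a] = reward_sums[s, a] / counts[s, a]
                T[s, a] = trans_counts[s, a] / counts[s, a]
            else:
                # "Unknown" pair: optimism in the face of uncertainty.
                R[s, a] = r_max
                T[s, a, s] = 1.0
    return R, T  # plan (e.g., value iteration) on this model, act greedily
```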
True or False: Q-learning converges only under certain exploration decay conditions.
False. Because Q-learning is off-policy, it converges regardless of how actions are selected (even purely at random), provided every state-action pair is visited infinitely often and the learning rates satisfy the usual stochastic-approximation conditions.
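A minimal tabular Q-learning sketch illustrating the off-policy update; it samples transitions from toy T and R arrays (as in the LP example above), which are assumptions for the example rather than anything from lecture:

```python
import numpy as np

def q_learning(T, R, gamma=0.99, alpha=0.1, epsilon=0.1, steps=50_000):
    """Tabular Q-learning on a simulated MDP.

    T[s, a, s'] and R[s, a] are toy model arrays used only to sample
    transitions; the agent sees only the sampled experience.
    Off-policy: the target uses max over next actions, regardless of how
    the behavior policy (epsilon-greedy, or even purely random) acts.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        # Behavior policy: epsilon-greedy; any sufficiently exploratory
        # policy that visits all (s, a) pairs infinitely often will do.
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next = np.random.choice(n_states, p=T[s, a])
        # Off-policy update toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```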
True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.
False. The trade-off still applies: even in a finite bandit domain we must decide how many times to sample each arm before we are confident enough (at whatever confidence level we feel comfortable with) to stop exploring the arms we "believe" are sub-optimal and exploit the one we believe is best.
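For example, here is a minimal sketch of a Hoeffding-style stopping rule for a Bernoulli bandit — keep sampling arms until one arm's lower confidence bound exceeds every other arm's upper bound. The arm probabilities are made up, and delta plays the role of the "confidence level we feel comfortable with":

```python
import numpy as np

def confidence_bound_bandit(arm_means, delta=0.05, max_pulls=100_000):
    """Explore Bernoulli arms until (1 - delta)-confident in the best arm.

    arm_means are the true (hidden) success probabilities, used only to
    simulate pulls; the agent sees only samples. Confidence radii are the
    Hoeffding-style sqrt(log(2/delta) / (2 n)).
    """
    k = len(arm_means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(max_pulls):
        # Pull each arm once to start, then the least-sampled (widest-bound) arm.
        a = t % k if t < k else int(np.argmin(counts))
        sums[a] += np.random.rand() < arm_means[a]
        counts[a] += 1
        means = sums / np.maximum(counts, 1)
        radius = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
        best = int(np.argmax(means))
        others = [i for i in range(k) if i != best]
        # Stop exploring once the best arm's lower bound beats all others.
        if means[best] - radius[best] > max(means[i] + radius[i] for i in others):
            return best, counts
    return int(np.argmax(means)), counts

best_arm, pulls = confidence_bound_bandit([0.3, 0.5, 0.7])
print("Believed best arm:", best_arm, "pulls per arm:", pulls)
```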