Final Review pt. 6 Flashcards
True or False: Options over an MDP form another MDP. Why?
True. Options over an MDP form an SMDP, and an SMDP can be converted back into an equivalent MDP by treating each option as a single action with an appropriately discounted reward and transition model.
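As a sketch of why the conversion works (standard options-framework notation assumed, not part of the original card): treating each option o as a single action induces an MDP whose reward and transition model fold the option's variable duration k into the discount:

$$R(s,o) = \mathbb{E}\big[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid o \text{ initiated in } s \,\big]$$
$$P(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^{k}\, \Pr\big(o \text{ terminates in } s' \text{ after } k \text{ steps} \mid s\big)$$

With this model, option choices behave like ordinary actions and the usual Bellman machinery applies.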
Can we do Q learning on SMDP? If so, how do we do that?
Yes. We use options in place of actions, and the discounted cumulative reward accumulated while the option runs in place of the one-step reward.
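For concreteness, a common form of the SMDP Q-learning update (notation assumed, not from the card): if option o was started in state s, ran for k steps, accumulated discounted reward r, and ended in state s', then

$$Q(s,o) \leftarrow Q(s,o) + \alpha\big[\, r + \gamma^{k} \max_{o'} Q(s',o') - Q(s,o) \,\big], \qquad r = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}.$$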
Is value iteration guaranteed to converge to the optimal policy for an SMDP?
Not necessarily. Value iteration on an SMDP converges to the best policy achievable with the given options, but with a particular choice of options there is no guarantee that this matches the optimal policy of the underlying MDP.
Is value iteration guaranteed to converge for an SMDP?
Yes, value iteration is guaranteed to converge for an SMDP, but not necessarily to the optimal policy of the underlying MDP (only to the best policy over the given options).
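One way to see this (hedged sketch, using the option model above): SMDP value iteration repeatedly applies the Bellman optimality backup over the available options, which is still a contraction, so it converges; the fixed point is the optimal value with respect to the option set O, which need not equal the optimal value of the underlying MDP:

$$V^{*}_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}(s)} \Big[ R(s,o) + \sum_{s'} P(s' \mid s, o)\, V^{*}_{\mathcal{O}}(s') \Big]$$

(the gamma^k discount is already folded into P(s'|s,o)).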
Does SMDP generalize over MDP?
Yes, an SMDP is a generalization of an MDP. In an SMDP, the time between decisions is variable rather than a single fixed timestep.
How is SMDP defined? Why is it called semi-Markov?
SMDPs are Markov Decision Processes whose actions are options rather than discrete "atomic" actions, and an option may take a variable number of timesteps to complete. It is called semi-Markov because the process is Markov only at the decision points where options begin and end, not at every primitive timestep in between.
How is an option defined?
An option is a tuple <I, pi, beta>: I = the initiation set (the states in which the option may be started), pi = the option's internal policy, and beta = the termination condition (the probability of the option terminating in each state).
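A minimal code sketch of that tuple (Python; the class name, field names, and types are my own assumptions, not from the cards):

from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """An option <I, pi, beta> from the options framework."""
    initiation_set: Set[State]             # I: states in which the option may be started
    policy: Callable[[State], Action]      # pi: the option's internal policy (state -> action)
    termination: Callable[[State], float]  # beta: probability of terminating in each state

    def can_start(self, state: State) -> bool:
        # The option is only available in states belonging to its initiation set.
        return state in self.initiation_set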