Final Review Flashcards
True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?
False. Behaving greedily with respect to the optimal value function yields an optimal policy, because the optimal value function already encodes the long-term cumulative reward.
A greedy policy derived from the optimal value function is necessarily optimal. However, a greedy policy derived from an arbitrary value function is not necessarily optimal.
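As a quick reference (standard MDP notation, not part of the original notes), the greedy policy with respect to $V^*$ is

$$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^*(s')\big],$$

which is optimal precisely because $V^*$ already summarizes all future reward.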
True or False: The optimal policy for any MDP can be found in polynomial time. Why?
True for any finite MDP. The MDP can be reduced to a linear program, which can be solved in polynomial time in the number of states and actions.
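A minimal sketch of that reduction, assuming a small made-up MDP (the transition tensor P, reward matrix R, sizes, and discount below are illustrative, not from the notes):

```python
# Solve a tiny finite MDP as a linear program:
#   minimize sum_s V(s)  subject to  V(s) >= R(s,a) + gamma * sum_s' P(s,a,s') V(s')
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize into valid transition distributions
R = rng.random((n_states, n_actions))      # R[s, a] = expected immediate reward

c = np.ones(n_states)                      # objective: minimize sum of values
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        # V(s) - gamma * P[s,a] @ V >= R[s,a]  rewritten as  (gamma*P[s,a] - e_s) @ V <= -R[s,a]
        A_ub.append(gamma * P[s, a] - np.eye(n_states)[s])
        b_ub.append(-R[s, a])
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)

V_star = res.x
pi_star = np.argmax(R + gamma * np.einsum("sap,p->sa", P, V_star), axis=1)
print("V* =", V_star, "greedy policy =", pi_star)
```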
True or False: The value of the returned policy is the only way to evaluate a learner. Why?
False. You might also care about computational efficiency, memory efficiency, or sample efficiency (how much experience the learner needs to reach that policy).
True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?
False. The optimal state value is just the maximum of the optimal Q values over the actions available in that state, so no transition model is needed. (It is the other direction, recovering optimal Q values from optimal V values, that requires the transition function.)
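In standard notation, the two directions are

$$V^*(s) = \max_{a} Q^*(s, a) \qquad \text{(no model needed)},$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^*(s')\big] \qquad \text{(model needed)}.$$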
True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?
True. A fixed policy selects a specific action (or a fixed action distribution) in every state, so the MDP collapses into a Markov chain with rewards.
An MDP is described by (1) a set of states, (2) a set of actions, (3) transition probabilities, and (4) rewards for transitioning.
Applying a fixed policy (which defines what action to take in each state) eliminates the choice of action, which means we're left with (see the sketch below):
[(1) a set of states, (3) transition probabilities] (Markov chain)
[(4) rewards for transitioning] (with rewards)
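A minimal sketch of the collapse, assuming a small made-up MDP and a fixed deterministic policy (all values illustrative):

```python
# Collapse an MDP under a fixed policy into a Markov chain with rewards.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)      # P[s, a, s'] transition probabilities
R = rng.random((n_states, n_actions))  # R[s, a] expected immediate rewards
pi = np.array([0, 1, 0])               # fixed policy: the action chosen in each state

# Under the fixed policy, only the chosen action's row matters:
P_pi = P[np.arange(n_states), pi]      # (n_states, n_states) Markov chain transition matrix
R_pi = R[np.arange(n_states), pi]      # reward per state under the policy

# Policy evaluation then reduces to solving the linear system V = R_pi + gamma * P_pi @ V
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(P_pi, R_pi, V_pi)
```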
True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?
False. In a discounted continuing task, adding a constant to every reward only adds a constant to every state's value, leaving the optimal policy unchanged. This gridworld is episodic, however: adding +10 per step makes longer episodes more rewarding, so the agent can do better by avoiding the terminal states and wandering instead of heading to the +1 exit. The optimal policy can therefore change.
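One way to see the episodic case (a standard argument, not from the original notes): adding a constant $c$ to every reward shifts the return of an episode of length $T$ by

$$\sum_{k=0}^{T-1} \gamma^k \,(r_{t+k+1} + c) \;=\; \sum_{k=0}^{T-1} \gamma^k \, r_{t+k+1} \;+\; c\,\frac{1-\gamma^T}{1-\gamma}.$$

In a continuing task $T = \infty$ and the added term is the same constant $c/(1-\gamma)$ under every policy, so nothing changes; in an episodic task the added term grows with the episode length $T$, so when $c > 0$ policies that prolong the episode are rewarded more.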
True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?
False. It depends on the problem and on the discount factor: a small discount factor makes near-term rewards dominate, while a discount factor close to 1 spreads credit over long sequences of moves.
True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?
True. The Markov property means the next state and reward depend only on the current state (and action), so the agent does not need to remember the history that led to that state.
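In symbols (standard formulation):

$$\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t, A_t\} \;=\; \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, R_1, \ldots, S_t, A_t\}.$$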
Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks, and describe the type of state that allows you to unify them. (Hint: you may find section 3.4 of S&B helpful.)
These are episodic and continuing tasks. Their returns can be unified by treating the end of an episode as entering a special absorbing state that transitions only to itself and generates zero reward; with that convention, both task types use the same discounted sum-of-rewards return.
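The unified return (following S&B section 3.4) can be written as

$$G_t \;=\; \sum_{k=0}^{T-t-1} \gamma^k \, R_{t+k+1},$$

allowing either $T = \infty$ or $\gamma = 1$ (but not both).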
Does policy iteration or value iteration converge faster? If so, why?
Policy iteration typically converges in fewer iterations than value iteration, because each improvement step acts greedily with respect to a fully evaluated policy. The trade-off is that each policy-iteration step is more expensive, since it requires a complete policy evaluation.
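A minimal policy iteration sketch on a small made-up MDP (sizes and values are illustrative):

```python
# Policy iteration: alternate exact policy evaluation and greedy improvement.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

pi = np.zeros(n_states, dtype=int)         # start from an arbitrary policy
while True:
    # Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
    P_pi = P[np.arange(n_states), pi]
    R_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to V.
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    new_pi = np.argmax(Q, axis=1)
    if np.array_equal(new_pi, pi):         # stable policy => optimal
        break
    pi = new_pi
print("optimal policy:", pi)
```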
Is it possible that value iteration just never converges on your formulation of MDP with a proper convergence criterion?
No. Value iteration is guaranteed to converge (for a discount factor below 1), because the Bellman optimality backup is a contraction: every sweep shrinks the max-norm distance to the optimal value function by at least a factor of the discount.
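A minimal value iteration sketch with a proper convergence criterion (same kind of illustrative MDP setup as above):

```python
# Value iteration: repeatedly apply the Bellman optimality backup until the
# change between sweeps falls below a tolerance.
import numpy as np

n_states, n_actions, gamma, tol = 3, 2, 0.9, 1e-8
rng = np.random.default_rng(3)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
while True:
    # Bellman optimality backup: V_new(s) = max_a [ R(s,a) + gamma * P(s,a,:) @ V ]
    V_new = np.max(R + gamma * np.einsum("sap,p->sa", P, V), axis=1)
    if np.max(np.abs(V_new - V)) < tol:    # contraction guarantees this eventually holds
        break
    V = V_new
print("V* (approx):", V_new)
```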
What is the meaning of the Bellman equations? What are they saying? What is your way of visualizing them in your head?
The long-term return for an action is the sum of the immediate reward and all discounted future rewards. One way to visualize this is a one-step look-ahead tree: the value at the root equals the reward along each branch plus the discounted value at the leaf, weighted by the transition probabilities.
What are the Bellman equations? (Try to write them down without looking at notes, since this is the single most important equation to know for this class.)
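For checking your answer afterwards (standard state-value and action-value forms, using the transition-based notation from above):

$$V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^{\pi}(s')\big],$$

$$V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^{*}(s')\big],$$

$$Q^{*}(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\big].$$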
What is a policy?
A mapping from states to actions: for each state, the policy specifies which action to take (deterministic) or a probability distribution over actions (stochastic).
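In symbols: a deterministic policy is a function $\pi : S \to A$, while a stochastic policy gives probabilities $\pi(a \mid s)$ for each state-action pair.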