Review Questions Flashcards
True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?
False. Behaving greedily with respect to the optimal value function is an optimal policy, because the optimal values already account for the long-term cumulative reward; picking the action that looks best one step ahead under those values is therefore optimal.
True or False: The optimal policy for any MDP can be found in polynomial time. Why?
True for any finite MDP: we can reduce the MDP to a linear program and solve it in polynomial time (a sketch of the LP follows below).
False for an infinite MDP.
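A minimal sketch of that linear-programming reduction, assuming a made-up toy MDP stored as arrays T[s,a,s'] and R[s,a] and using scipy.optimize.linprog; the variable names are my own, not from the lectures.

```python
# Sketch: solving a small discounted MDP as a linear program (toy arrays, assumed shapes).
# Primal LP: minimize Sum_s V(s)  subject to  V(s) >= R(s,a) + gamma * Sum_s' T(s,a,s') * V(s')  for all s, a.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

# Rewrite each constraint as A_ub @ V <= b_ub, i.e.  gamma * T(s,a,:) @ V - V(s) <= -R(s, a).
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * T[s, a]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
V_star = res.x  # optimal state values; the greedy policy can be read off with one more backup
print(V_star)
```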
True or False: The value of the returned policy is the only way to evaluate a learner. Why?
False. You might also care about computational efficiency, memory efficiency, or sample efficiency (how much experience the learner needs).
True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?
False. V*(s) = max_a Q*(s,a), so the optimal state values follow directly from the optimal Q-values; no transition function is needed.
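A one-line illustration, assuming Q* is stored as a numpy array indexed [state, action] (the numbers are made up):

```python
# Sketch: recovering V* (and the greedy policy) from Q* without a transition model.
import numpy as np

Q = np.array([[1.0, 2.5],    # toy Q*-values: rows are states, columns are actions
              [0.3, 0.1]])
V = Q.max(axis=1)            # V*(s) = max_a Q*(s, a); no T(s,a,s') required
pi = Q.argmax(axis=1)        # the greedy (optimal) policy falls out the same way
print(V, pi)
```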
True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?
False. We can always convert a finite horizon MDP into an infinite horizon MDP by adding an absorbing terminal state with a zero-reward self-loop (folding the time step into the state if the horizon matters).
True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?
True. A fixed policy means the same action selection for each state, so the MDP collapses into a Markov chain with rewards (a Markov reward process), as sketched below.
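A minimal sketch of that collapse, assuming toy arrays T[s,a,s'] and R[s,a] and a fixed deterministic policy; the names P_pi and R_pi are my own notation.

```python
# Sketch: fixing a policy collapses an MDP into a Markov chain with rewards (toy arrays assumed).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
pi = np.array([0, 1, 0])                                          # a fixed deterministic policy

P_pi = T[np.arange(n_states), pi]     # chain transition matrix P_pi[s, s']
R_pi = R[np.arange(n_states), pi]     # reward collected in each state under pi
# Evaluating the policy is now plain linear algebra on the chain: V_pi = (I - gamma * P_pi)^(-1) R_pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```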
True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?
True. Adding 10 to every reward leaves the differential rewards across states unchanged, so the ordering of actions by value, and hence the optimal policy, stays the same.
True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?
False. How much a move influences the outcome depends on the reward structure and the discount factor, not simply on how recent the move is.
True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?
True. The Markov property means the future can be predicted from the present state alone, so the agent has no need to remember how it got there.
Does policy iteration or value iteration converge faster? If so, why?
Policy iteration is guaranteed to converge in at least as few iterations as value iteration, because each sweep performs a full policy evaluation before improving the policy; each iteration is correspondingly more expensive. A sketch of policy iteration follows below.
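A minimal sketch of policy iteration, assuming toy arrays T[s,a,s'] and R[s,a]; the function name and the exact-solve evaluation step are my own choices.

```python
# Sketch: policy iteration on a toy MDP with arrays T[s, a, s'] and R[s, a] (assumed shapes).
# Each sweep does a full, exact policy evaluation (a linear solve) followed by greedy improvement,
# so it usually needs far fewer sweeps than value iteration, at a higher cost per sweep.
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    n_states = T.shape[0]
    pi = np.zeros(n_states, dtype=int)                   # start from an arbitrary policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
        P_pi = T[np.arange(n_states), pi]                # chain induced by pi
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * T @ V                            # Q[s, a]
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V                                 # policy stable, hence optimal
        pi = new_pi
```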
Is it possible that value iteration just never converges on your formulation of MDP with a proper convergence criterion?
No. Value iteration is guaranteed to converge: for gamma < 1 the Bellman backup is a contraction mapping, so with a proper convergence criterion the iterates always settle.
What is the meaning of the Bellman equations? What is it saying? What is your way of visualizing this equation in your head?
The long-term value of a state is the immediate reward plus the discounted, expected value of the states that can follow. A way to visualize it: each state's value is defined recursively, one step ahead, in terms of its successors' values.
What are the Bellman equations? (Try to write it down without looking at notes, since this is the single most important equation to know for this class.)
U(s) = R(s) + gamma * max_a [ Sum_s' T(s,a,s') * U(s') ]
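A direct numpy transcription of that backup, using a made-up two-state, two-action toy model with a state-based reward R(s):

```python
# Sketch: the Bellman optimality backup written directly in numpy.
# The arrays are toy inputs; R is a state-based reward R(s), matching the equation above.
import numpy as np

T = np.array([[[0.8, 0.2], [0.1, 0.9]],   # T[s, a, s']: two states, two actions
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([0.0, 1.0])                  # R[s]
U = np.zeros(2)                           # current value estimates
gamma = 0.9

# U(s) <- R(s) + gamma * max_a Sum_s' T(s,a,s') * U(s')
U_new = R + gamma * (T @ U).max(axis=1)
print(U_new)
```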
What is a policy?
A mapping from states to actions (or to a distribution over actions): it tells the agent which action to choose in each state.
What is Value Iteration?
Value iteration is an iterative algorithm that computes the optimal value function by repeatedly applying the Bellman backup to a value function, from which we then read off a greedy policy. This contrasts with policy iteration, where we directly evaluate and update the policy itself. A sketch follows below.
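A minimal sketch of the loop, under the same toy assumptions as the backup sketch above (T[s,a,s'], state reward R[s]); the function name and stopping tolerance are my own choices, not the course's.

```python
# Sketch: value iteration = repeated Bellman backups until the values stop changing.
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    U = np.zeros(T.shape[0])
    while True:
        U_new = R + gamma * (T @ U).max(axis=1)        # one Bellman backup over all states
        if np.max(np.abs(U_new - U)) < tol:            # converged in max-norm
            # Read off the greedy policy (R(s) does not depend on a, so it can be dropped here).
            return U_new, (T @ U_new).argmax(axis=1)
        U = U_new
```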
What is the Markov property?
The Markov property is a memorylessness property: the future depends only on the present state, not on the history that led to it.
Is it possible to have multiple terminal states in an MDP?
Yes. Nothing prevents an MDP from having multiple terminal states, although they can always be abstracted into a single absorbing terminal state.
What is an MDP? What are the components of an MDP?
An MDP (Markov decision process) is a model of sequential decision making under uncertainty. Its components are: states, actions, a model (transition function and reward function), and a discount factor; a policy is the solution to the MDP.
True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?
False. A non-expansion is the more general case of a contraction mapping: a contraction requires |F(a) - F(b)| <= gamma * |a - b| with gamma strictly less than 1, while a non-expansion allows gamma = 1, so the distance may shrink or stay the same. The two concepts are therefore closely related.
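A small numerical check, on a made-up random MDP, that the Bellman optimality backup contracts max-norm distances by a factor of gamma (a mere non-expansion would only guarantee the ratio stays at or below 1):

```python
# Sketch: verify the gamma-contraction property of the Bellman backup on a toy random MDP.
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.normal(size=n_states)                                     # R[s]

def backup(U):
    return R + gamma * (T @ U).max(axis=1)

U1, U2 = rng.normal(size=n_states), rng.normal(size=n_states)
before = np.max(np.abs(U1 - U2))
after = np.max(np.abs(backup(U1) - backup(U2)))
print(after / before, "<=", gamma)   # the ratio never exceeds gamma
```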
True or False: An update rule which is not a non-expansion will not converge without exception. Why?
In general, True; but strictly speaking, False.
Non-expansion is a sufficient condition for convergence, not a necessary one: the lack of it does not guarantee a lack of convergence.
Coco-Q is a counterexample.
True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?
True. See Figure 4 in Sutton's 1988 TD(λ) paper.
True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?
False.
TD(1) is equivalent to Monte Carlo learning: it propagates information all the way down the chain within a single presentation, so it does not propagate slowly. It is TD(0) that moves information only one step at a time and therefore benefits most from repeated presentations.
TD(1)'s updates also have high variance, which is why it needs a lot of data to do well. A TD(λ) sketch follows below.
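A minimal sketch of tabular TD(λ) with accumulating eligibility traces; the episode format (a list of (state, reward, next_state, done) tuples) and the function name are assumptions of mine, not from the lectures. Setting lam=0 gives one-step TD(0); lam=1 credits every earlier state in the chain and behaves like Monte Carlo.

```python
# Sketch: tabular TD(lambda) with accumulating eligibility traces (episode format assumed).
import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    e = np.zeros_like(V)                              # eligibility trace per state
    for s, r, s_next, done in episode:
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]                         # one-step TD error
        e[s] += 1.0                                   # accumulating trace
        V += alpha * delta * e                        # every eligible state shares the error
        e *= gamma * lam                              # traces decay between steps
    return V

# e.g. V = td_lambda_episode(np.zeros(5), [(0, 0.0, 1, False), (1, 1.0, 1, True)])
```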
True or False: Given a model (T,R) we can also sample in, we should first try TD learning. Why?
False.
While TD learning was revolutionary and is used in many algorithms today, value or policy iteration has better convergence guarantees when the model (T, R) is known, so planning with the model should be tried first.
True or False: Offline algorithms are generally superior to online algorithms. Why?
False.
Online algorithms update values as soon as new information is presented, thereby making the most efficient use of experience; neither class is superior in every setting.