FinalExamExamples1 Flashcards
Markov means RL agents are amnesiacs and forget everything up until the current state.
True. A process is Markov if the distribution over the next state is fully determined by the current state (and action), not by the rest of the history. However, the state can be expanded to include past observations, which forces many non-Markov processes into a Markovian framework.
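As a rough sketch of that expansion (the window size k and the deque buffer are illustrative choices, not from the source), the last k observations can be folded into a single augmented state:

    from collections import deque

    # Fold the last k observations into one augmented state, so the transition
    # to the next augmented state depends only on the current augmented state.
    k = 3
    history = deque(maxlen=k)

    def augmented_state(obs):
        history.append(obs)
        return tuple(history)  # hashable, so it can be used as a tabular state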
In RL, recent moves influence outcomes more than moves further in the past.
False. A central idea in RL is that an action's consequences can appear arbitrarily far in the future, so a move made long ago can matter as much as, or more than, a recent one. The discount factor gamma controls this: values of gamma closer to one (but strictly less than one for infinite horizon situations) make the agent care more about long-term rewards.
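A quick worked example of how gamma weights a delayed reward (the reward sequence below is made up):

    # Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    def discounted_return(rewards, gamma):
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    rewards = [0, 0, 0, 10]                  # a single reward three steps in the future
    print(discounted_return(rewards, 0.5))   # 10 * 0.5**3  = 1.25
    print(discounted_return(rewards, 0.99))  # 10 * 0.99**3 ~= 9.70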
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s rewards (terminal and non-terminal) the optimal policy will not change
True. Adding the same scalar to every reward shifts the value of every policy by the same amount, so the relative ordering of policies, and hence the optimal policy, is unchanged.
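This can be checked numerically in the discounted, infinite-horizon case, where every policy's value shifts by exactly c/(1-gamma). A toy sketch with an invented two-state MDP:

    import numpy as np

    gamma, c = 0.9, 10.0
    # Made-up MDP: P[s, a, s'] transition probabilities, R[s, a] rewards
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.3, 0.7]]])
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])

    def evaluate(policy, R):
        # policy[s] is the action taken in state s
        P_pi = np.array([P[s, policy[s]] for s in range(2)])
        R_pi = np.array([R[s, policy[s]] for s in range(2)])
        return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

    for policy in [(0, 0), (1, 1)]:
        # Each policy's value shifts by c / (1 - gamma) = 100 in every state
        print(evaluate(policy, R + c) - evaluate(policy, R))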
An MDP given a fixed policy is a Markov Chain with rewards.
True. A fixed policy means the action (or action distribution) in each state is fixed, so the next state depends only on the current state; the MDP collapses to a Markov chain with rewards (a Markov reward process).
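A minimal sketch of the reduction, reusing the hypothetical P[s, a, s'] / R[s, a] layout from above, with a stochastic policy pi[s, a]:

    import numpy as np

    def to_markov_reward_process(P, R, pi):
        # P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a);  R_pi[s] = sum_a pi(a|s) * R(s, a)
        P_pi = np.einsum('sa,sat->st', pi, P)
        R_pi = np.einsum('sa,sa->s', pi, R)
        return P_pi, R_pi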
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. We can always convert to infinite horizon by adding a terminal state with a self-loop (with a reward of 0).
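A sketch of that construction with the same hypothetical P[s, a, s'] / R[s, a] layout as above, making each terminal state absorbing with a zero-reward self-loop:

    import numpy as np

    def make_infinite_horizon(P, R, terminal_states):
        P, R = P.copy(), R.copy()
        for s in terminal_states:
            P[s, :, :] = 0.0
            P[s, :, s] = 1.0   # every action self-loops in the terminal state
            R[s, :] = 0.0      # and collects no further reward
        return P, R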
If we know optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False. Knowing the optimal Q values is enough: V*(s) = max_a Q*(s, a), and the greedy policy argmax_a Q*(s, a) is optimal. No transition function is needed.
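Concretely (the Q values below are made up):

    import numpy as np

    Q_star = np.array([[1.0, 3.0],    # Q*(s0, a0), Q*(s0, a1)
                       [2.0, 0.5]])   # Q*(s1, a0), Q*(s1, a1)

    V_star  = Q_star.max(axis=1)      # V*(s) = max_a Q*(s, a)  ->  [3.0, 2.0]
    pi_star = Q_star.argmax(axis=1)   # greedy policy, no model needed  ->  [1, 0]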
The value of the returned policy is the only way to evaluate a learner.
False. Time and space complexities of the learner are also among the indicators.
The optimal policy for any MDP can be found in polynomial time.
True. For any (finite) MDP, we can form the associated linear program (LP) and solve it in polynomial time (via interior point methods, for example, although for large state spaces an LP may not be practical). Then we take the policy that is greedy with respect to the resulting value function, and voila!
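A sketch of the primal LP using scipy.optimize.linprog on an invented two-state MDP: minimize sum_s V(s) subject to V(s) >= R(s, a) + gamma * sum_s' P(s'|s, a) * V(s') for every (s, a).

    import numpy as np
    from scipy.optimize import linprog

    gamma = 0.9
    # Made-up MDP: P[s, a, s'], R[s, a]
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.05, 0.95]]])
    R = np.array([[1.0, 0.0],
                  [0.5, 2.0]])
    S, A, _ = P.shape

    # One inequality per (s, a):  (gamma * P[s, a] - e_s) . V  <=  -R(s, a)
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            row = gamma * P[s, a].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])

    res = linprog(c=np.ones(S), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * S, method="highs")
    V_star  = res.x
    pi_star = np.argmax(R + gamma * np.einsum('sat,t->sa', P, V_star), axis=1)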
A policy that is greedy - with respect to the optimal value function - is not necessarily an optimal policy.
False. Acting greedily with respect to the optimal value function is optimal by definition; that is exactly how the optimal policy is recovered from V* (or Q*).
In TD learning, the sum of the learning rates must converge for the value function to converge.
False. It is the other way around: the sum of the learning rates must diverge, while the sum of their SQUARES must converge (the Robbins-Monro conditions). For example, alpha_t = 1/t satisfies both; a constant learning rate satisfies the first but not the second.
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore it is the preferred algorithm when doing RL with episodic tasks.
False. TD(1) is actually equivalent to MC. It is true that MC is an unbiased estimator of the value function, but it has high VARIANCE. TD, on the other hand, is biased (it bootstraps off current estimates) but has low variance, which often makes it better for learning from sequential data.
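The two tabular update rules make the trade-off concrete (V can be a dict or an array indexed by state):

    def mc_update(V, s, G, alpha):
        # Monte Carlo target: the full observed return G (unbiased, high variance)
        V[s] += alpha * (G - V[s])

    def td0_update(V, s, r, s_next, gamma, alpha):
        # TD(0) target: one-step bootstrap r + gamma * V(s') (biased, low variance)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])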
Backward and forward TD(lambda) can be applied to the same problems.
True (S&B chapter 12 discusses these views and shows their equivalence). However, in practice backward TD(lambda) is usually easier to compute.
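A sketch of the backward (eligibility-trace) view for tabular prediction; the episode format here, a list of (s, r, s_next) steps, is an assumption for illustration:

    from collections import defaultdict

    def td_lambda_backward(episodes, gamma=0.99, lam=0.9, alpha=0.1):
        V = defaultdict(float)
        for episode in episodes:            # episode: list of (s, r, s_next) steps
            e = defaultdict(float)          # eligibility traces
            for s, r, s_next in episode:
                delta = r + gamma * V[s_next] - V[s]
                e[s] += 1.0                 # accumulating trace for the visited state
                for x in list(e):           # credit recently visited states, decayed
                    V[x] += alpha * delta * e[x]
                    e[x] *= gamma * lam
        return V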
Offline algorithms are generally superior to online algorithms.
False. It depends on the problem context. Online algorithms update values as soon as new information is available, which often makes the most efficient use of experience.
Given a model (T, R) we can also sample in, we should first try TD learning.
False. You have a model - use it! With T and R in hand, model-based methods such as value iteration or policy iteration compute values directly instead of estimating them from samples.
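For example, value iteration converges with no sampling at all (same hypothetical P[s, a, s'] / R[s, a] layout as above):

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        S, A, _ = P.shape
        V = np.zeros(S)
        while True:
            Q = R + gamma * np.einsum('sat,t->sa', P, V)   # backup using the known model
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)             # optimal values, greedy policy
            V = V_new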
TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.
False. It is TD(0) that propagates information slowly (one step per presentation) and therefore benefits from repeated presentations; TD(1) propagates information all the way back along the trajectory in a single presentation.