Final Flashcards
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and a reward of 0.
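A minimal sketch of that conversion for a tabular model; the dictionaries `T` and `R` and the function name are illustrative assumptions, not course code:

```python
# Illustrative tabular model: T[(s, a)] -> list of (next_state, probability),
# R[(s, a, s_next)] -> reward. These structures are assumptions for the sketch.
T = {}
R = {}

def make_absorbing(state, actions):
    """Turn a terminal state into an absorbing one: a self-loop with reward 0."""
    for a in actions:
        T[(state, a)] = [(state, 1.0)]     # always stay in `state`
        R[(state, a, state)] = 0.0         # and collect no further reward
```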
In RL, recent moves influence outcomes more than moves further in the past.
I’ll go with False: you can lose a game at the beginning (like the chess example Prof. Isbell mentioned in one of the earliest videos), and no matter how perfectly you play afterwards, you might still lose.
Depending on the game/environment, you might lose on the first move. False.
An MDP given a fixed policy is a Markov chain with rewards.
True
Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.
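A small sketch of that reduction, assuming the MDP is given as NumPy arrays (the shapes and names below are assumptions for illustration):

```python
import numpy as np

def induced_chain(T, R, pi):
    """Collapse an MDP under a fixed policy into a Markov chain with rewards.

    T: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    pi: (S, A) action probabilities of the fixed policy.
    """
    P_pi = np.einsum('sa,sax->sx', pi, T)  # chain transition matrix P_pi[s, s']
    r_pi = np.einsum('sa,sa->s', pi, R)    # expected reward per state under pi
    return P_pi, r_pi
```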
MDP: Markov Decision Process
Markov chain: a Markov process with no actions and no rewards (see above)
Markovian: the probability of future states depends only on the current state; "where I can go depends only on where I am"
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False. You can convert from Q-values to V-values without knowing the transition matrix.
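In standard Bellman notation, the direction the question asks about needs only a max over actions; it is the reverse direction (Q from V) that needs the model:

```latex
V^*(s) = \max_a Q^*(s, a)
\qquad \text{whereas} \qquad
Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
```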
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
Adding any scalar to every reward is no different from adding zero: min(a+100, b+100) picks the same argument as min(a, b), and similarly for max().
Markov means RL agents are amnesiacs and forget everything up until the current state.
“True”
Markov means future states only depends on current state
Answer depends on how you interpret state, does the agent count as part of the state?
Q-Table (agent) does retain information from previous experiences, but you don’t need to reference previous actions to decide the current action to take
A policy that is greedy (with respect to the optimal value function) is not necessarily an optimal policy.
False. A policy “which is greedy with respect to the optimal value function” is by definition optimal.
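In the usual notation, being greedy with respect to V* means:

```latex
\pi^*(s) = \arg\max_a Q^*(s, a)
         = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
```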
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False
The sum has to diverge, but the sum of squares has to converge.
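In symbols, the standard (Robbins-Monro) conditions on the learning rates are:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

For example, a learning rate of alpha_t = 1/t satisfies both conditions.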
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. Monte Carlo is unbiased but has high variance; TD's bias often trades off favorably in practice, so unbiasedness alone doesn't make MC the preferred algorithm.
The value of the returned policy is the only metric we care about when evaluating a learner.
False
False. This is somewhat subjective, but machine time, data efficiency, and the experience required from data scientists are all additional things to consider.
Backward and forward TD(lambda) can be applied to the same problems.
True; however, to do it forward you would have to wait for a complete episode, so it can't be done online. But it can be done both ways.
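A minimal sketch of the backward (online) view with accumulating eligibility traces, for tabular state-value prediction; `env` and `policy` are illustrative stand-ins, not course code:

```python
import numpy as np

def td_lambda_backward(env, policy, n_states, episodes=100,
                       alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) prediction with accumulating traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)               # eligibility traces
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + gamma * V[s_next] * (not done)
            delta = target - V[s]            # TD error
            e[s] += 1.0                      # accumulate trace for current state
            V += alpha * delta * e           # update every eligible state online
            e *= gamma * lam                 # decay all traces
            s = s_next
    return V
```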
Offline algorithms are generally superior to online algorithms.
Both on-policy/online and off-policy algorithms can do exploration, which is the key requirement for finding optimal policies. Ignoring the ill-defined superlative ("superior"), both algorithm styles can generate the same optimal policy, but there is no guarantee of it.
Considering the superlative, one must first define it. If by superior you mean the policy that doesn’t result in falling off a cliff, then online is superior. If step-wise superiority is desired, then offline is better, but you’ll fall off the cliff more often.
False
Given a model (T,R) we can also sample in, we should first try TD learning.
If we’re already given a model, then why default to model-free methods?
A good example is backgammon - a complete model is provided, but the state space is enormous (sweeping the state space would take too long) and the most successful methods have been model-free.
TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.
False
False. It's TD(0) that propagates information slowly; at every presentation, TD(1) propagates an outcome's information backwards to the full extent.