Final Exam Flashcards
What is an MDP?
A Markov decision process: a model of sequential decision making in which an agent in a state takes an action, receives a reward, and transitions to a new state according to probabilities that depend only on the current state and action.
What are the components of an MDP?
States, actions, a transition function (model), rewards, and (usually) a discount factor: (S, A, T, R, γ).
Is it possible to have multiple terminal states in an MDP?
Yes
What is the Markov property?
The next state depends only on the current state (and action), not on the history. Everything one needs to know about the environment is encoded in the current state; there is no need to access data from past steps.
What is Value Iteration?
Using the Bellman optimality equation to update the value of each state iteratively until convergence. Each sweep implicitly combines one step of policy evaluation with policy improvement; the explicit optimal policy is extracted greedily from the converged values.
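A minimal sketch of tabular value iteration, assuming a toy model format where P[s][a] is a list of (prob, next_state, reward) tuples (this representation and the function names are illustrative, not from the flashcards):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """P[s][a] is a list of (prob, next_state, reward) tuples (assumed toy format)."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup for state s
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # proper convergence criterion
            break
    # extract the greedy (optimal) policy from the converged values
    policy = [int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```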
What is a policy?
A policy is a mapping from states to actions for an MDP.
What is the meaning of the Bellman equations? What do they do?
“(the Bellman equation) expresses a relationship between the value of a state and the values of its successor states.” S&B 2020
OR
An update rule for the value of a state based on the seen reward and the expected value of the next state.
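For reference, the Bellman equation for the state-value function of a policy π (in S&B 2020 notation):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]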
Is it possible that value iteration never converges on your formulation of MDP with a proper convergence criterion?
No. “(Policy and Value Iteration)… All of these algorithms converge to an optimal policy for discounted finite MDPs.” S&B 2020.
Does policy iteration or value iteration converge faster? Give an explanation of your answer.
Both algorithms are guaranteed to converge to an optimal policy. Policy iteration typically converges in fewer iterations, though each of its iterations is more expensive because it includes a full policy-evaluation step; in practice policy iteration is often reported to finish faster than value iteration.
Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks, and describe the type of state that allows you to unify them. (Hint: you may find section 3.4 of S&B helpful.)
Episodic - sequences of states that terminate and then restart; for example, a card game reaches the end of a hand and then restarts with a shuffle/deal, so each episode is finite.
Continuing - sequences of states with no definable end, like predicting the weather: the days continue with new weather each time, and you never reach a last day and restart the episode.
You can unify them by making the last state in episodic tasks act as a sink that transitions to itself with probability 1.0 and reward = 0.
We use the discount factor gamma to keep the return of a continuing task finite; but discounting alone never makes a task episodic, since you never reset to the original state.
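The unified return from S&B 2020, Section 3.4, which covers both cases by allowing either T = ∞ or γ = 1 (but not both):

G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k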
True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?
True. The Markov property states that an agent only needs the data of the current state. However, past experiences/observations can themselves be encoded into the current state.
True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?
False. There is no general rule: how much a move influences the outcome depends on the structure of the MDP and the discount factor, and an early move can matter just as much as (or more than) a recent one.
True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?
True. Adding the same constant to every reward would not change the optimal policy; this is not considered reward shaping.
True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?
True. With a fixed policy, the action in each state is determined, so the state-to-state transition probabilities are fixed and the process reduces to a Markov chain with rewards attached to its transitions (a Markov reward process).
True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?
False. We can always convert one: make the last state an absorbing state that loops back onto itself with probability 1.0 and reward 0, so the process can go on forever.
True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?
False. V(s) = max_a Q(s,a); that is, V(s) is the value of the action that maximizes Q(s,a). If we already have the Q-values for every action in a given state, V(s) is simply the largest number in that set of Q-values. No transition matrix is needed.
True or False: The optimal policy for any MDP can be found in polynomial time. Why?
True - “A DP (Dynamic Programming) method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is k^n” S&B 2020
True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?
False. A policy that is greedy with respect to the optimal value function is an optimal policy by definition; this is exactly how an optimal policy is extracted from the optimal values.
What is TD? In which algorithms is TD used?
Temporal Difference. Classical TD includes TD(0) (one-step), TD(1) (Monte Carlo), and TD(λ) (weighted look-ahead). A one-step TD update is the basis of bootstrapping RL algorithms such as Q-learning, SARSA, and DQN, and TD-style targets also appear in actor-critic methods such as PPO.
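A minimal sketch of the one-step TD(0) value update (the function and variable names are illustrative, not from the flashcards):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Move V[s] toward the bootstrapped one-step target r + gamma * V[s_next]."""
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]   # the temporal-difference error (delta)
    V[s] += alpha * td_error
    return V
```

Here V is a dict or array of state-value estimates, updated after every observed transition (s, r, s_next).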
What are model-free methods?
Methods that do not use transition probabilities and reward functions
“Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning” - S&B 2020
What is the prediction problem?
Estimating/Predicting the value of each state.
As opposed to the Control Problem, in which we can take actions to impact the sequence of states.
What is a n-step estimator?
This is an estimator of a state's value that looks ahead n time steps: it uses the next n rewards plus the discounted value estimate of the state reached after n steps.
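The corresponding n-step return (in S&B 2020 notation, with V the current value estimate):

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})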
What is a Monte-Carlo return?
The return of a full episode: the (discounted) sum of all rewards received until termination.
What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?
0, because there are no future rewards for being in the terminal state.
What’s the formula to view TD(λ) as a function of n-step estimators?
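The λ-return, a geometrically weighted average of the n-step returns (S&B 2020):

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}

For an episode ending at time T, the remaining weight \lambda^{T-t-1} falls on the full Monte Carlo return G_t, which is why TD(1) reduces to Monte Carlo and TD(0) to the one-step estimator.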
What can we say about the bias-variance trade-off between a 1-step estimator versus a Monte Carlo estimator (in other words, which estimator has a higher variance and which estimator has a higher bias?)
The 1-step estimator has higher bias because it bootstraps from the current (possibly inaccurate) value estimate, whereas the Monte Carlo estimator has higher variance because the full return depends on many random rewards and transitions over an entire episode.
What are some advantages of TD methods compared to dynamic programming methods? to Monte Carlo methods?
With TD you do not need a model, whereas with dynamic programming you do. Compared with Monte Carlo, TD estimates of the return have lower variance and can be updated online, before an episode finishes.
Dynamic programming methods need a model, Monte Carlo methods will have high variance and TD methods can have high bias
True or False: In TD learning, the sum of the learning rates used must converge for the value function to converge. Why?
False. The sum of the learning rates must not converge; it must diverge to infinity. However, the sum of the squares of the learning rates must converge (be finite).
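These are the standard stochastic-approximation (Robbins–Monro) conditions on the learning rates \alpha_t:

\sum_{t} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t} \alpha_t^{2} < \infty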
True or False: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks. Why?
False. MC is an unbiased estimator, but it also has high variance. TD methods are often preferred because they reduce variance and converge toward the value estimates consistent with the maximum-likelihood model of the MDP; TD(λ) also allows a trade-off between bias and variance.
True or False: Backward and forward TD(λ) can be applied to the same problems. Why?
Technically True, but backward methods offer a convenient way of learning online when experience cannot be repeated or is expensive to generate. If episode data is cheap or stored, then using a backward view (with eligibility traces) does not have a major benefit.
True or False: Offline algorithms are generally superior to online algorithms. Why?
False, there are cases where either will do better. E.g. online algorithms can better deal with non-stationary tasks.
True or False: Given a model (T,R) we can also sample in, we should first try TD learning. Why?
False. If we already have the model (T, R), we can plan directly with dynamic programming methods such as value iteration rather than learning value estimates from samples. TD learning is most useful when the model is unknown, or too large to enumerate, and we can only learn from sampled transitions.
True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?
False. TD(1) is MC: at the end of an episode it propagates the full return back to every state visited, so it does not propagate information slowly. In Sutton (1988)’s experiments it actually did worse than TD(λ) with λ < 1 under repeated presentations, because it fits the observed returns rather than the maximum-likelihood model of the process.
True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?
False, empirically we see different curves. The best value of lambda will vary across problems. Short episodes may benefit from smaller values of lambda for example.
True or False: An update rule which is not a non-expansion will not converge without exception. Why?
False. Non-expansions and contractions are the operators whose repeated application is guaranteed to converge, because the distance between successive estimates never grows (and, for a contraction, shrinks toward 0).
Saying that an update rule is not a non-expansion only means there is no guarantee that the updates will approach each other. It is not certain that they won’t approach each other, so convergence may still happen in particular cases.
True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?
False. They are directly related: a contraction strictly shrinks the distance between estimates on every application, while a non-expansion only guarantees that the distance never grows; a non-expansion is the less restrictive version of a contraction. Both are used to prove the convergence of RL algorithms.
What is an eligibility trace?
An eligibility trace is a weight value corresponding to how many time steps have passed since that state “participated in producing an estimate value” (Sutton & Barto, p. 287).
The eligibility trace defines how much a given state estimate is updated.
It can be used to calculate TD updates more efficiently. It is used to generalize TD(0) and TD(1) methods to TD(λ) methods.
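A minimal sketch of one episode of backward-view TD(λ) with accumulating eligibility traces; the environment interface (reset/step under a fixed policy) and names are assumptions for illustration, not a real library API:

```python
import numpy as np

def td_lambda_episode(env, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of backward-view TD(lambda) on a tabular prediction task.

    V is a NumPy array of state-value estimates. The toy env is assumed to
    expose reset() -> state and step() -> (next_state, reward, done), where
    step() follows a fixed policy.
    """
    e = np.zeros_like(V)            # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step()
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]       # TD error for this transition
        e[s] += 1.0                 # accumulate the trace of the visited state
        V += alpha * delta * e      # credit every recently visited state
        e *= gamma * lam            # decay all traces
        s = s_next
    return V
```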
In Sutton 1988, what is “the maximum likelihood estimate of the underlying MDP” (quoted from the paper)?
“In the following section, we prove that in fact it is linear TD(0) that converges to what can be considered the optimal estimates for matching future experience - those consistent with the maximum-likelihood estimate of the underlying Markov process.”
In other words, it is the model estimated directly from the observed data: transition probabilities given by observed transition frequencies (and expected rewards by observed averages). Linear TD(0) converges to the value estimates that would be exactly correct if that estimated model were the true one (the certainty-equivalence estimate).