Exam Study Flashcards
A Markov Process is a process in which transitions between states do not depend on actions
True. The Markov property means you don't have to condition on anything beyond the most recent state. A Markov Decision Process is a set of Markov-property-compliant states together with actions, rewards, and values.
Decaying Reward encourages the agent to end the game quickly instead of running around and gathering more rewards
True. Because rewards are discounted, the same reward is worth less the later it arrives, so the agent maximizes its total discounted reward by ending the game quickly rather than wandering around collecting more.
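As a quick illustration (the episodes and gamma value below are made up), the same +10 reward is worth less the longer the agent takes to reach it:

```python
gamma = 0.9  # hypothetical discount factor

def discounted_return(rewards, gamma):
    """Total discounted reward: sum of gamma^t * r_t over an episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

short_episode = [0, 0, 10]       # reaches the +10 goal in 3 steps
long_episode = [0] * 10 + [10]   # wanders for 10 steps before the same goal

print(discounted_return(short_episode, gamma))  # ~8.10
print(discounted_return(long_episode, gamma))   # ~3.49
```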
R(s) and R(s,a) are equivalent
True. It just happens that it’s easier to think about one vs. the other in certain situations
Reinforcement Learning is harder to compute than a simple MDP
True. For an MDP you can just solve the Bellman equations, but Reinforcement Learning requires that you make observations and then summarize those observations as values.
An optimal policy is the best possible sequence of actions for an MDP
True, with a single caveat. The optimal policy maximizes reward over an entire episode by taking, in each state, the argmax over actions of the immediate reward plus the value of the resulting state. But MDPs are memoryless, so there is no concept of "sequence" for a policy; it is a mapping from states to actions.
Temporal Difference Learning is the difference in reward you see on subsequent time steps.
False. Temporal Difference Learning is the difference in value estimates on subsequent time steps.
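As a sketch (the variable names are illustrative, not from any particular lecture), the TD(0) error is exactly that difference between successive value estimates:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next].

    td_error is the "difference in value estimates on subsequent time steps"
    that defines temporal difference learning.
    """
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```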
RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based
True. Model-Based RL essentially uses the Bellman equations to solve the problem directly. Value-Based RL is Temporal Difference Learning. Policy-Based RL is similar to Value-Based, but it solves the problem in a finite amount of time with a certain amount of confidence (with a greedy policy it's guaranteed).
TD Learning is defined by Incremental Estimates that are Outcome Based
True. TD Learning thinks of learning in terms of "episodes", using observed outcomes to estimate values incrementally rather than relying on a predefined model of the transitions.
For learning rate to guarantee convergence, the sum of the learning rate must be infinite, and the sum of the learning rate squared must be finite.
True. These are the standard stochastic-approximation conditions on the learning rate (for example, alpha_t = 1/t satisfies both), and together with the contraction property of the update they guarantee convergence.
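Written out (these are the standard conditions; the alpha_t = 1/t example is just the classic case that satisfies them):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
% Example: \alpha_t = 1/t works, since the harmonic series diverges
% while \sum_t 1/t^2 = \pi^2/6 is finite.
```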
All of the TD learning methods have setbacks: TD(1) is inefficient because it requires too much data and has high variance, while TD(0) gives a maximum likelihood estimate but is hard to calculate for long episodes.
True. This is why we use TD(lambda), which has many of the benefits of TD(0) but is much more performant. Empirically, lambda values between 0.3 and 0.7 seem to perform best.
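A rough sketch of a tabular TD(lambda) update with accumulating eligibility traces (the hyperparameters and data layout are assumptions for illustration):

```python
import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.5):
    """Tabular TD(lambda) with accumulating eligibility traces.

    V: numpy array of state values (floats).
    episode: list of (state, reward, next_state) transitions.
    Every state is nudged by the current TD error in proportion to how
    recently (and how often) it was visited.
    """
    e = np.zeros_like(V)              # eligibility traces
    for s, r, s_next in episode:
        td_error = r + gamma * V[s_next] - V[s]
        e[s] += 1.0                   # bump the trace for the visited state
        V += alpha * td_error * e     # update all states by their trace
        e *= gamma * lam              # decay all traces
    return V
```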
To control learning, you simply have the operator choose actions in addition to learning.
True. States are experienced as observations during learning, so the operator can influence learning.
Q-Learning converges
True. The Bellman operator is a contraction mapping, and with learning rates whose sum is infinite but whose sum of squares is finite (and every state-action pair visited infinitely often), Q-learning always converges to Q.
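A minimal tabular Q-learning update sketch (assuming a small discrete environment with Q stored as an (n_states, n_actions) array; names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s_next, a'].

    With learning rates satisfying the conditions above and every (s, a)
    visited infinitely often, these updates converge.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```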
As long as the update operators for Q-learning or Value-iterations are non-expansions, then they will converge.
True. There are expansions that will converge, but only non-expansions are guaranteed to converge independent of their starting values.
A convex combination will converge.
False. It must be a fixed convex combination to converge. If the weights can change, as in Boltzmann exploration, then it is not guaranteed to converge.
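For context, a sketch of Boltzmann (softmax) exploration: the action probabilities are a convex combination whose weights move as the Q-values change, which is the non-fixed case described above (the temperature value is a placeholder):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature).

    The probabilities are non-negative and sum to 1 (a convex combination),
    but they shift every time the Q-values are updated, so the combination
    is not fixed.
    """
    prefs = np.array(q_values, dtype=float) / temperature
    prefs -= prefs.max()                         # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)
```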
For a greedy policy, if the difference between the true value and the current value estimate is less than some epsilon, the greedy policy's value is close to optimal.
True
It serves as a good check for how long we run value iteration until we’re pretty confident that we have the optimal policy.
True
For a set of linear equations, the solution can be found in polynomial time.
True
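This is why evaluating a fixed policy is cheap: V = R + gamma * P V is a linear system. A sketch assuming a small tabular MDP (the matrices are placeholders you would build from the policy):

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma=0.9):
    """Solve V = R + gamma * P V exactly, i.e. (I - gamma * P) V = R.

    P_pi: (n, n) transition matrix under the fixed policy.
    R_pi: (n,) expected immediate reward per state under the policy.
    Solving the linear system takes polynomial time in the number of states.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```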
For Policy Iteration, Convergence is exact and complete in finite time.
False. This is only guaranteed for greedy policy iteration. It is true, however, that the policy it converges to is the optimal policy.
The optimal policy will dominate all other policies in finite time.
True for the greedy policy. Policies dominate on a state-by-state basis (one policy's value is at least as high in every state), so we don't get stuck in local optima.
We get strict value improvement anytime we don’t have a policy that is already optimal.
True. If a policy is already optimal, a step of policy iteration results in no value decrease: it might not get better, but it won't get worse.
You can change the reward function only by multiplying the reward.
False. You can also change the reward function by shifting it by a constant or by using a potential function. Scaling must be by a positive constant, otherwise you may swap max and min by accident.
You can change the reward function only by adding a constant.
False. You can also change the reward function by multiplying the reward by a positive constant, or using a potential function.
You can change the reward function only using a potential function.
False, you can also change the reward function by multiplying the reward by a positive constant, or adding a constant.
You can change the reward function to learn faster only by using a potential function.
True. Multiplying by a positive constant or shifting by a constant will not speed things up; learning proceeds in roughly the same amount of time as before.
You can change the reward function to learn faster by multiplying by a positive constant or shifting by a constant.
False. You can only speed up learning with a potential function.
Even when you use reward shaping, you’ll still always find the optimal policy.
False. You can sometimes end up in sub-optimal feedback loops.
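For contrast, potential-based shaping is the form that avoids those loops: adding F(s, s') = gamma * phi(s') - phi(s) to the reward changes how fast learning goes but not which policy is optimal. A sketch (phi is an arbitrary placeholder potential function):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.9):
    """Potential-based reward shaping: add F(s, s') = gamma * phi(s') - phi(s).

    phi is any function from states to numbers (a placeholder here). Shaping
    of exactly this form leaves the optimal policy unchanged, whereas an
    arbitrary reward bonus can create the sub-optimal feedback loops above.
    """
    return r + gamma * phi(s_next) - phi(s)
```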
The Gittins Index can be used for all Q-Learning algorithms.
False. This actually only works for k-armed bandit problems, not for general or continuous spaces. This is because it depends on a ratio of observations to determine confidence, which cannot be maintained in spaces where that ratio changes.
The Gittins Index is the only metric for K-Armed Bandit problems.
False, there are actually several, including:
- Identify the optimal arm in the limit
- Maximize the expected reward over a finite horizon
- Find best
- Few mistakes
- Do well
(The last three are related and have equivalent algorithms.)
If you have an algorithm that can find the best arm, you can find an algorithm that can find an arm with few mistakes
True. If you have an algorithm that finds an epsilon-best arm with probability 1-delta in tau pulls, then you can find that same arm with that same probability, with the first algorithm's tau serving as an upper bound.
We can estimate what we know using Hoeffding bounds and Union bounds.
True. In stochastic decision problems, we can convince ourselves that we have a sufficiently accurate estimate using those bounds. This lets us get near optimal rewards.
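A sketch of the Hoeffding piece (for rewards bounded in [0, 1]; the sample count and delta below are placeholders):

```python
import math

def hoeffding_radius(n, delta):
    """Half-width of a (1 - delta) confidence interval around the sample mean
    of n i.i.d. samples bounded in [0, 1]: |sample_mean - true_mean| <= radius.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# e.g. after 100 pulls of an arm, with 95% confidence the true mean is within:
print(hoeffding_radius(100, 0.05))  # ~0.136
```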
In a deterministic MDP, Rmax will cause us to explore all states by assuming that unexplored or unknown states will have the maximum possible reward.
True. In a sequential decision process, we can convince ourselves that we have a sufficient amount of information by proving that we have visited all states.
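A rough sketch of that optimism in a deterministic MDP (the data structures are simplified assumptions, not the exact algorithm from the lectures): every unknown (s, a) pair is treated as leading to a fictitious state that pays the maximum reward forever, so planning pulls the agent toward it.

```python
R_MAX = 1.0  # assumed upper bound on the reward

def optimistic_model(known, n_states, n_actions):
    """Build the model Rmax plans with.

    known: dict mapping (s, a) -> (observed next state, observed reward).
    Unknown (s, a) pairs are routed to an extra absorbing "fantasy" state
    that pays R_MAX forever, which makes exploring them look attractive.
    """
    fantasy = n_states                          # index of the extra absorbing state
    T, R = {}, {}
    for s in range(n_states):
        for a in range(n_actions):
            if (s, a) in known:
                T[s, a], R[s, a] = known[s, a]
            else:
                T[s, a], R[s, a] = fantasy, R_MAX
    for a in range(n_actions):                  # the fantasy state loops to itself
        T[fantasy, a], R[fantasy, a] = fantasy, R_MAX
    return T, R
```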
Using Rmax, once all edges are known, the algorithm will no longer make mistakes.
True. Because in step 3 we solve the MDP using the Bellman equations, which will give us the optimal policy and therefore never make a mistake.
Using Rmax, you will always visit all unknown states.
False. Sometimes you will end up in a suboptimal reward loop if the discount factor is too low or the state is too far away.
Using Rmax, you will never make any mistakes.
False. In order to learn new states, you may make mistakes in order to discover those states. The maximum number of mistakes that Rmax will make over the course of its lifetime is O(n^2 k), where n is the number of states and k is the number of actions.
Using a simulation lemma (theorem), if you have an estimate of the real MDP, then optimizing the rewards in the estimate will be near optimal for the real MDP.
True. If the estimated transition probabilities and rewards are each within alpha of the true ones, then the value of a policy in the estimated MDP is within a bounded amount of its value in the real MDP, so optimizing in the estimate is near optimal in reality.
If all transitions are either accurately estimated or unknown, the optimal policy is either near optimal or an unknown state is reached quickly.
False. Using the explore-or-exploit lemma (theorem), Rmax guarantees that we will either explore or be near optimal with ***high probability***, ***not with certainty***. The number of steps before we are either near optimal or learn something new is bounded by the number of states times the number of actions (n*k).
There is both a need and a method to do generalizations in RL.
True. The need comes from the fact that there can be zillions upon zillions of states. In a continuous space, it’s even impossible to visit every state once. The method is to take what we learned in supervised learning, specifically linear function approximation.
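A sketch of that linear function approximation with a gradient-style TD update (the feature function and step sizes are placeholders):

```python
import numpy as np

def linear_td_update(w, features, s, r, s_next, alpha=0.01, gamma=0.9):
    """Approximate V(s) as w . features(s) and nudge w along the TD error.

    features: maps a state to a fixed-length numpy vector. Because states
    with similar features share weights, learning in one state generalizes
    to states the agent has never visited.
    """
    v_s, v_next = w @ features(s), w @ features(s_next)
    td_error = r + gamma * v_next - v_s
    w = w + alpha * td_error * features(s)
    return w
```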
Policy Gradients are the most popular for linear function approximation for generalized reinforcement learning.
False. Value Function Gradients are the most popular because there’s been a bunch of success there. However, in the Robotics sector, policy gradients have been very successful.
A Linear Value Approximator always works
False. Baird’s Counterexample.
POMDPs Generalize MDPs
True. All you need to do is add observables and an observation function to the MDP tuple. However, you cannot go in reverse.
You can always infer a state from observations.
False. You can infer the probability of being in a state, but you can't always say definitively which state you are in. A POMDP whose observations identify the state with probability 1 is just an MDP.
A POMDP has a finite number of states.
Somewhat false. The underlying MDP has a finite number of states, but because the belief states (probability distributions over those underlying states) form a continuous space, the effective number of states is infinite.
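A sketch of the belief-state update that takes the place of "knowing the state" (T and O are assumed here to be per-action numpy matrices; this is a plain Bayes update, not any particular lecture's notation):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Update the belief b after taking action a and observing o.

    b: (n,) probability distribution over the underlying MDP's states.
    T[a]: (n, n) transition matrix for action a.
    O[a]: (n, n_obs) observation probabilities given the resulting state.
    The result is again a distribution over n states; since such
    distributions form a continuous space, the belief-state space is
    infinite even though the underlying MDP is finite.
    """
    predicted = b @ T[a]                   # where we might be after taking a
    unnormalized = predicted * O[a][:, o]  # weight by likelihood of seeing o
    return unnormalized / unnormalized.sum()
```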