Final Flashcards

1
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. You can always convert a terminal state into an absorbing state that transitions to itself with reward 0.

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. Depending on the game/environment, an early move can decide the outcome: as Prof. Isbell noted with the chess example in one of the earliest videos, you can lose a game at the beginning, and no matter how perfectly you play afterwards, you may still lose.

3
Q

An MDP given a fixed policy is a Markov chain with rewards.

A

True
Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.
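A compact way to write this (standard notation, not from the card): fixing a deterministic policy π collapses the action choice, leaving a Markov chain with rewards:

T^\pi(s, s') = T(s, \pi(s), s'), \qquad R^\pi(s) = R(s, \pi(s))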

4
Q

Markovian

A

The probability of future states depends only on the current state.

Where I can go depends only on where I am.
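In symbols (a standard statement of the Markov property, added for reference):

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t)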

5
Q

If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False, you can convert from Q-values to V values without knowing the transition matrix
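The conversion only needs a max over actions (standard identity, for reference); it is the reverse direction, recovering Q from V, that requires the transition function:

V^*(s) = \max_a Q^*(s, a), \qquad Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s')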

6
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.

A

Adding the same scalar to every reward is no different than adding zero as far as the policy is concerned: min(a+100, b+100) selects the same argument as min(a, b), and similarly for max().
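The ordering argument above can be stated compactly (for any constant c, added here for reference):

\arg\max_a \left[ Q(s, a) + c \right] = \arg\max_a Q(s, a)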

7
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

“True”
Markov means future states depend only on the current state.
The answer depends on how you interpret “state”: does the agent count as part of the state?
The Q-table (the agent) does retain information from previous experiences, but you don’t need to reference previous actions to decide the current action to take.

8
Q

A policy that is greedy–with respect to the optimal value function–is not necessarily an optimal policy.

A

False. A policy “which is greedy with respect to the optimal value function” is by definition optimal.
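For reference, “greedy with respect to V*” means choosing actions by the standard Bellman form:

\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]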

9
Q

In TD learning, the sum of the learning rates used must converge for the value function to converge.

A

False

The sum has to diverge, but the sum of squares has to converge.
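The standard conditions on the learning rates α_t (the Robbins-Monro conditions), for reference:

\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty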

10
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.

A

False. Unbiased does not mean preferred: Monte Carlo estimates have higher variance than TD estimates, and TD methods often learn faster, so neither is universally better for episodic tasks.

11
Q

The value of the returned policy is the only metric we care about when evaluating a learner.

A

False. This is somewhat subjective, but machine time, data efficiency, and the experience required from data scientists are all additional things to consider.

12
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

True. However, to apply the forward view you have to wait for a complete episode, so it cannot be done online; but both views can be applied to the same problems.
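For reference, the standard backward-view (accumulating-trace) update from Sutton and Barto, which is what makes the online version possible:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbf{1}[s = s_t]
V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s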

13
Q

Offline algorithms are generally superior to online algorithms.

A

False.
Both on-policy/online and off-policy/offline algorithms can explore, which is the key requirement for finding optimal policies. Ignoring the ill-defined superlative (“superior”), both styles can produce the same optimal policy, though there is no guarantee of that.
To use the superlative, one must first define it. If by “superior” you mean the policy that doesn’t result in falling off a cliff, then online is superior. If step-wise optimality is desired, then offline is better, but you’ll fall off the cliff more often.

14
Q

Given a model (T,R) we can also sample in, we should first try TD learning.

A

If we’re already given a model, why default to model-free methods?
On the other hand, backgammon is a good example: a complete model is provided, but the state space is enormous (sweeping it would take too long), and the most successful methods have been model-free.

15
Q

TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. It is TD(0) that propagates information slowly. At every presentation, TD(1) propagates an outcome’s information backwards to the full extent, so it does comparatively better with single presentations.

16
Q

In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

A

False. The shapes are similar (convex), but the minima occur at completely different learning rates (Desperately Seeking Sutton!).

17
Q

In general, an update rule which is not a non-expansion will not converge.

A

True. A non-expansion is either a constant or a contraction; an update rule that is not a non-expansion is therefore an expansion, and expansions diverge.
Non-expansion property of the max operator: |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)|.

18
Q

MDPs are a type of Markov game

A

True. Markov games with only one player action space and a corresponding single reward function are MDPs.

19
Q

Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. They are closely related: composing a non-expansion with a contraction still yields a contraction, which is what provides the convergence guarantees.
A contraction mapping is an operator B that maps value functions to value functions while bringing them closer together, so repeated application converges.
Property 1: a fixed-point solution exists and is unique.
Property 2: value iteration converges to it.
Definition: |By - Bz| <= gamma * |y - z| for some gamma < 1.
Example with y = 6 and z = 8: B(x) = x + 1 and B(x) = x - 1 are not contraction mappings because the images stay the same distance apart (2). B(x) = x/2 is a contraction mapping because |B(6) - B(8)| = 1 <= (1/2) * 2.
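A minimal numeric check of the example above (values y = 6 and z = 8 from the card; a sketch, not part of any library):

```python
# Check which of the card's example operators shrink the distance |y - z|.
y, z = 6.0, 8.0

shift = lambda x: x + 1.0   # B(x) = x + 1: keeps points the same distance apart
halve = lambda x: x / 2.0   # B(x) = x / 2: a contraction with factor 1/2

print(abs(y - z))                  # 2.0  (original distance)
print(abs(shift(y) - shift(z)))    # 2.0  -> not a contraction
print(abs(halve(y) - halve(z)))    # 1.0  <= 0.5 * 2.0 -> contraction
```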

20
Q

Linear programming is the only way we are able to solve MDPs in linear time.

A

False. Linear programming is the only way we know to solve MDPs in polynomial time, but that is polynomial, not linear: the “linear” refers to the objective and constraints, not the running time.

21
Q

The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)

A

False. The objective of the dual LP is to maximize “policy flow”, subject to conservation-of-flow constraints. The minimization belongs to the primal LP, which seeks the least upper bound on the value function over states and actions.
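A sketch of the two formulations in standard form (the exact weighting \alpha(s) over initial states varies by presentation; q(s, a) is the “policy flow”, i.e. discounted state-action visitation):

Primal: \min_V \sum_s \alpha(s) V(s) \quad \text{s.t.} \quad V(s) \ge R(s, a) + \gamma \sum_{s'} T(s, a, s') V(s') \;\; \forall s, a

Dual: \max_{q \ge 0} \sum_{s, a} q(s, a) R(s, a) \quad \text{s.t.} \quad \sum_a q(s', a) = \alpha(s') + \gamma \sum_{s, a} T(s, a, s')\, q(s, a) \;\; \forall s'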

22
Q

Transition function T(S,A,S’)

A

The transition function T(S, A, S’) is the probability of ending up in state S’ after taking action A in state S, i.e. T(S, A, S’) = P(S’ | S, A).

23
Q

Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

False in general. This is true if the reward is shaped with a potential-based shaping function.
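For reference, the potential-based form from Ng, Harada, and Russell (1999), which preserves the optimal policy for any potential function Φ over states:

F(s, a, s') = \gamma \Phi(s') - \Phi(s), \qquad R'(s, a, s') = R(s, a, s') + F(s, a, s')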

24
Q

Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

“This paper examines potential-based shaping functions, which introduce artificial rewards in a particular form that is guaranteed to leave the optimal behavior unchanged yet can influence the agent’s exploration behavior to decrease the time spent trying suboptimal actions”

25
Q

Rmax will always find the optimal policy for a properly tuned learning function.

A

Technically False (although true in spirit). The problem is the word “always”: in these PAC-style bounds there is always a small probability that the desired goal isn’t reached, and even when it is reached the policy is only near-optimal.
Also False because Rmax finds the optimal policy only for the states the agent actually visits. Unvisited state-action pairs are modeled as self-loops with the maximum reward Rmax, so no meaningful policy is computed for states that are never visited.

26
Q

Q-learning converges only under certain exploration decay conditions.

A

False. Q-learning is off-policy, so sufficient exploration (visiting every state-action pair) is important, but the greediness of the behavior policy isn’t, and no particular exploration decay schedule is required.
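For reference, the Q-learning update; the max over a' in the target is what makes it off-policy, independent of how the behavior action was chosen:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]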

27
Q

The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options

A

False. You may still need to explore “non-optimal” arms of the bandit since there may be noise in the estimate of its value.

28
Q

Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.

A

False. Being online doesn’t by itself prevent learning the optimal policy; the amount of exploration the algorithm does determines its ability to find the optimal path. Case in point is the cliff-walking example, where the SARSA agent does not learn the step-wise optimal path but instead learns a safer path for the walker (a different definition of optimal). Cliff walking is in the Sutton and Barto book.
I’ll take “online” to mean “on-policy.” False: if the SARSA implementation is GLIE, then it can learn an optimal policy.
GLIE = Greedy in the Limit with Infinite Exploration

29
Q

RL with linear function approximation only works on toy problems.

A

False. If you engineer nonlinear features, linear function approximation can still represent quite expressive functions.
False. Nonlinear functions can also be approximated by piecewise-linear constructions.
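A minimal sketch of the first point (the feature map and state are made up for illustration): the approximator is nonlinear in the raw state but still linear in the weights, so the usual linear-FA machinery applies.

```python
import numpy as np

def features(state):
    """Hand-engineered nonlinear features of a 2-D state (illustrative only)."""
    x, y = state
    return np.array([1.0, x, y, x * y, x ** 2, y ** 2])

def v_hat(state, w):
    """Value estimate: linear in the weights w, nonlinear in the raw state."""
    return features(state) @ w

w = np.zeros(6)                 # weights to be learned by any linear-FA method
print(v_hat((0.5, -0.2), w))    # 0.0 before any learning
```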

30
Q

KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.

A

FALSE, this was described in the KWIK paper provided for the homework that we all loved doing.
“However, some hypothesis classes are exponentially harder to learn in the KWIK setting than in the MB setting.”

31
Q

With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.

A

False. Even combining LFA with off-policy TD methods can lead to divergence.
False. Baird’s counterexample shows how the classic update with linear function approximation can diverge.
False. Per Sutton and Barto, some uses of linear and non-linear function approximation can diverge to infinity. Also, “generalization can cause divergence in the case of repeated bootstrapped temporal-difference updates.”

32
Q

Applying generalization with an “averager” on an MDP results in another MDP.

A

True. When an averager is applied to V(s’), we can rearrange the summation terms and write a new MDP whose value function is expressed in terms of transitions between state S and the basis/anchor states.

33
Q

Q-learning has the following advantages and disadvantages compared to SARSA:

A

Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy whilst exploring. If you want to learn an optimal policy using SARSA, then you will need to decide on a strategy to decay ε in ε-greedy action choice, which may become a fiddly hyperparameter to tune.
Q-learning (and off-policy learning in general) has higher per-sample variance than SARSA, and may suffer from problems converging as a result. This turns up as a problem when training neural networks via Q-learning.
SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative - if there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. The classic toy problem that demonstrates this effect is called cliff walking.
In practice the last point can make a big difference if mistakes are costly - e.g. you are training a robot not in simulation, but in the real world. You may prefer a more conservative learning algorithm that avoids high risk, if there was real time and money at stake if the robot was damaged.
If your goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.
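A minimal sketch of the difference in the two update targets on a tabular Q (dict of dicts); alpha and gamma are assumed hyperparameters, and the states/actions in the example are illustrative only:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy target: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Same experience, different bootstrap targets.
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": -1.0}}
q_learning_update(Q, "s0", "right", 0.0, "s1")            # target uses max -> 1.0
sarsa_update(Q, "s0", "left", 0.0, "s1", a_next="right")  # target uses chosen a' -> -1.0
```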