Final Flashcards

1
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. You can always convert a terminal state into an absorbing state that transitions to itself with reward 0.

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. You can lose a game at the very beginning (like the chess example Prof. Isbell mentioned in one of the earliest videos), and no matter how perfectly you play afterwards, you may still lose.

3
Q

An MDP given a fixed policy is a Markov chain with rewards.

A

True. A fixed policy means the agent no longer chooses among actions in each state; it simply transitions from state to state according to that policy, which yields a Markov chain with rewards.

4
Q

If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False. You don’t need the transition function: the optimal V value of a state is just the max over actions of its optimal Q values.
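A minimal sketch of why (using a hypothetical tabular Q for illustration): the optimal state values fall out of the optimal Q values with a max over actions, no model required.

    import numpy as np

    # Hypothetical optimal Q-values for 2 states x 3 actions (illustration only).
    Q = np.array([[1.0, 2.5, 0.3],
                  [0.7, 0.1, 4.0]])

    # V*(s) = max_a Q*(s, a) -- no transition function needed.
    V = Q.max(axis=1)
    print(V)  # [2.5 4. ]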

5
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.

A

True. Assuming an infinite horizon, adding the same constant to every reward leaves the optimal policy unchanged.

6
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

True: the current state is all the agent needs to know.

Now if you want to discuss what a “current state” is… Well that can get more complicated.

7
Q

A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.

A

False. Acting greedily with respect to the optimal value function is, by definition, an optimal policy.

8
Q

In TD learning, the sum of the learning rates used must converge for the value function to converge.

A

False. For convergence, the sum of the learning rates must diverge (while the sum of their squares converges).

9
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.

A

False. Unbiasedness is not the only consideration: Monte Carlo returns have high variance, so TD methods can still be preferable even on episodic tasks.

10
Q

The value of the returned policy is the only metric we care about when evaluating a learner.

A

False. Computation time, data efficiency, and the expertise required to apply the learner are additional things to consider.

11
Q

T/F POMDPs allow us to strike a balance between actions that gain reward and actions that gain information

A

TRUE. This is all folded into one model; no special machinery is needed to trade off reward-gathering and information-gathering actions.

12
Q

T/F DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions that maximize utility

A

TRUE

13
Q

DEC-POMDP stands for

A

Decentralized Partially Observable Markov Decision Process

14
Q

The primary difference between POMDPs and DEC-POMDPS

A

Actions are taken simultaneously by a finite set of agents (not just one) in DEC-POMDPs

15
Q

DEC-POMDPS vs POSG

A

In a DEC-POMDP all agents share the reward function (they are working together); in a POSG each agent can have its own reward.

16
Q

T/F DEC-POMDPs are represented by Ri (diff reward for each agent)

A

FALSE. There is a shared reward (all agents working together). If the reward weren’t shared, the model would be a POSG.

17
Q

Properties of DEC-POMDP

A
1. Elements of game theory and POMDPs

2. NEXP-complete (for finite horizon)

18
Q

Inverse RL

A

Input: behavior and environment. Output: the reward function.

19
Q

MLRL

A

Guess R, compute the policy, measure the probability of the data given that policy, then compute the gradient with respect to R to see how to change the reward so it fits the data better; repeat.

20
Q

In policy shaping, if the human believes actions X, Y, Z should be chosen with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, what action should they choose?

A

Choose action Z because its pairwise product is highest: argmax_a p(a|policy1) * p(a|policy2)
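A quick check of the arithmetic (probabilities taken from the question; the dictionary names are just for illustration):

    human = {'X': 2/3, 'Y': 1/6, 'Z': 1/6}
    agent = {'X': 1/10, 'Y': 1/10, 'Z': 8/10}

    # Pairwise products: X ~ 0.067, Y ~ 0.017, Z ~ 0.133
    products = {a: human[a] * agent[a] for a in human}
    print(max(products, key=products.get))  # Z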

21
Q

Drama Management

A

There is a third agent, the “author,” who wants to build an agent that intervenes in the player’s experience. In the Pac-Man analogy: the author created Pac-Man, the agent is the game itself, and the player is the player.

22
Q

TTD-MDPs vs MDPs

A

Rather than states, TTD-MDPs have trajectories (the sequence so far), and rather than rewards, they have a target distribution over trajectories p(T).

23
Q

Value Iteration Algorithm

A

Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence
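A minimal sketch of that loop, assuming a hypothetical tabular MDP given as arrays T[s, a, s'] (transition probabilities) and R[s] (rewards):

    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-6):
        """T: (S, A, S) transition probs; R: (S,) rewards. Returns utilities."""
        U = np.zeros(R.shape[0])                 # start with arbitrary utilities
        while True:
            # reward + discounted expected utility of neighbors, maxed over actions
            U_new = (R[:, None] + gamma * T.dot(U)).max(axis=1)
            if np.max(np.abs(U_new - U)) < tol:  # repeat until convergence
                return U_new
            U = U_new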

24
Q

Supervised Learning

A

y = f(x). Function approximation: given (x, y) pairs, find the f that maps x to y.

25
Unsupervised Learning
f(x). Find a compact description of the data (clusters).
26
Reinforcement Learning
y = f(x), but we are given x and z. We are still trying to find f to generate y; here the reward r plays the role of z.
27
MDP stands for
Markov Decision Process.
28
MDPs are made up of
States, transitions (the model), actions, and rewards; solving an MDP produces a policy.
29
Markovian Property
Only the present matters AND things are stationary (rules/the world doesn't change over time)
30
Delayed Rewards
In the chess example, you might make a bad move early on that you can never recover from; that bad move needs to be reflected in the reward even though the loss comes much later.
31
Temporal Credit Assignment Problem
The problem of determining which actions in a sequence were responsible for an eventual outcome.
32
How to change policies to account for finite horizons
policy(s, t): the policy becomes a function of state AND time.
33
Utility of Sequences (stationary preferences)
if you prefer one sequence of states over another today, you prefer the same sequence tomorrow
34
How to calculate infinite horizons without infinity?
Use discounted future rewards (use gamma)
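Concretely, with rewards bounded by Rmax and 0 <= gamma < 1, the discounted sum is a geometric series and stays finite:

    sum_{t=0..inf} gamma^t * Rmax = Rmax / (1 - gamma)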
35
Reward vs Utility
Reward is the immediate payoff of a state. Utility is the long-term payoff of acting from a state; it takes delayed rewards into account.
36
Policy Iteration
Start with an initial policy (a guess). Evaluate: given the policy, calculate its utility. Improve: the policy at t+1 chooses, in each state, the action that maximizes that utility. Repeat until the policy stops changing.
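A minimal sketch of the evaluate/improve loop, under the same hypothetical tabular setup (arrays T[s, a, s'] and R[s]); evaluation here solves the linear system exactly:

    import numpy as np

    def policy_iteration(T, R, gamma=0.9):
        n_states = R.shape[0]
        policy = np.zeros(n_states, dtype=int)        # initial guess
        while True:
            # evaluate: solve U = R + gamma * T_pi U for the current policy
            T_pi = T[np.arange(n_states), policy]     # (S, S)
            U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
            # improve: act greedily with respect to those utilities
            new_policy = (R[:, None] + gamma * T.dot(U)).argmax(axis=1)
            if np.array_equal(new_policy, policy):
                return policy, U
            policy = new_policy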
37
T/F Policy Iteration won't converge
False. Policy iteration is guaranteed to converge (in a finite number of iterations for a finite MDP).
38
On Policy vs Off-Policy
Off-policy methods estimate the Q-values (state-action values) of the target/greedy policy regardless of the policy the agent is actually following (e.g., Q-learning). On-policy methods update Q-values using the next state s' and the action a' chosen by the current policy, estimating returns under the assumption that the current policy continues to be followed (e.g., SARSA).
39
T/F TD update rule always converges with any learning rate
False. It converges only if the learning rates sum to infinity while the sum of their squares is finite.
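For example, the schedule alpha_t = 1/t satisfies both conditions: sum_t 1/t diverges while sum_t 1/t^2 = pi^2/6 is finite. A constant learning rate violates the second condition, so convergence is not guaranteed (even though constant rates are common in practice).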
40
T/F TD(1) is the same as outcome-based updates (if no repeated states)
True, and with even more learning, because updates don't have to wait for the episode to be over.
41
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses only individual runs, so if something rare happens in a run the TD(1) estimate can be far off (high variance). This leads us to TD(lambda).
42
T/F TD(0) is the same as maximum likelihood estimate
TRUE, if we run over the data repeatedly (repeated presentations).
43
T/F TD(lambda) is weighted combination of k step estimators
True
44
T/F TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
45
T/F TD(0) has the least amount of error usually
False. Intermediate TD(lambda) performs best; lambda values of roughly 0.3-0.7 usually give the least error.
46
Temporal Difference is
The difference between value estimates as we go from one step to the next.
47
T/F reward must be scalar
TRUE
48
T/F environment is visible to the agent
False, usually
49
T/F all representations of state have same results
False. See the rat/cheese example in David Silver's lectures: a state representation that remembers the past 3 observations gives different results than one that remembers the past 4 or 5.
50
Partially Observable
The agent only indirectly observes the state (agent state != environment state), e.g., a robot with a camera. This setting is a POMDP.
51
T/F Model is the environment
False. Model is the agent's idea of the environment
52
RL vs Planning
RL: the model is unknown; the agent learns by performing actions. Planning: the model is known; the agent performs computations with the model.
53
B is a Contraction Mapping when
||BF - BG|| <= gamma * ||F - G|| for some gamma < 1, i.e., applying B to F and G shrinks the maximum difference between them.
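A toy instance: the one-dimensional backup B(x) = c + gamma * x with 0 <= gamma < 1 is a contraction, since |B(x) - B(y)| = gamma * |x - y|; each application of B shrinks distances by a factor gamma, so repeated application converges to a unique fixed point.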
54
Backward and forward TD(lambda) can be applied to the same problems.
True. However, the forward view requires a complete episode before it can update (i.e., it does not work "online").
55
gamma of 0 means
you care only about the current (short-sighted)
56
gamma of 1 means
you care about the future a lot (far-sighted)
57
why use gamma and discount future rewards
avoid infinite returns and also account for a model not being perfect (future reward is not guaranteed)
58
T/F You can always flatten an MDP to a Markov Chain
True
59
Difference between MDP and Markov Reward Process
An MDP is a Markov Reward Process with decisions. Includes actions and stochasticity
60
T/F it has been proven that both forward and backward view of TD(lambda) are equivalent
True
61
What is forward TD(lambda)
The forward view looks at all n-steps ahead and uses λ to essentially decay those future estimates.
62
What is backward TD(lambda)
The backward view of TD(λ) updates values at every step: after each step in the episode you update all previously visited states, weighted by their eligibility traces.
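A minimal tabular sketch of the backward view with accumulating eligibility traces; the episode format (s, r, s_next, done) is just an assumption for illustration:

    import numpy as np

    def td_lambda_backward(episodes, n_states, alpha=0.1, gamma=0.95, lam=0.8):
        """episodes: iterable of lists of (s, r, s_next, done) transitions."""
        V = np.zeros(n_states)
        for episode in episodes:
            e = np.zeros(n_states)                 # eligibility traces
            for s, r, s_next, done in episode:
                delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
                e[s] += 1.0                        # bump trace for the current state
                V += alpha * delta * e             # update all previously visited states
                e *= gamma * lam                   # decay traces
        return V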
63
Online learning vs offline learning
Online learning means that you are doing it as the data comes in. Offline means that you have a static dataset. Let's say you want to build a classifier to recognize spam. You can acquire a large corpus of e-mail, label it, and train a classifier on it. This would be offline learning. Or, you can take all the e-mail coming into your system, and continuously update your classifier (labels may be a bit tricky). This would be online learning
64
SARSA vs Q- Learning
In Q-learning, you update the estimate using the maximum over possible next actions, regardless of which action you actually take next. In SARSA, you update using the next action actually taken under the current policy.
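The two update rules side by side (a tabular sketch; Q is assumed to be an array indexed by state and action, and the variable names are only for illustration):

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy: bootstrap from the best next action, whatever is done next.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy: bootstrap from the action a_next the current policy actually takes.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])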
65
T/F Offline algorithms are generally superior to online algorithms.
False, generally; it depends on the problem and a number of other things.
66
T/F Given a model (T,R) we can also sample in, we should first try TD learning.
False. If you have the model, one would expect it to be more efficient to use it (e.g., with value iteration) than to rely on sample-based TD learning.
67
T/F TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.
False. It is TD(0) that propagates information slowly (one step at a time), so TD(0) is the one that benefits from repeated presentations.
68
T/F In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.
Generally True, same shape but different minima
69
T/F In general, an update rule which is not a non-expansion will not converge.
True. In the lecture's framing a non-expansion is either a constant map or a contraction; an update rule that is not a non-expansion therefore expands distances somewhere, and expansions diverge in general.
70
T/F MDPs are a type of Markov game.
True. Markov games with only one player action space and a corresponding single reward function are MDPs.
71
T/F Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.
False. They are closely related: composing a non-expansion with a contraction still yields a contraction, which is how convergence guarantees are obtained.
72
T/F Linear programming is the only way we are able to solve MDPs in linear time.
False. First, LP takes super-linear (polynomial) time. Second, LP is only one part of a potential RL algorithm. Third, there are many methods that solve MDPs without LP (though none in linear time).
73
T/F The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)
False. The objective of the dual LP is to maximize "policy flow", subject to conservation-of-flow constraints. The minimization belongs to the primal LP, which finds the least upper bound on the value function over states and actions.
74
T/F max is non-expansion
True
75
T/F Any optimal policy found with reward shaping is the optimal policy for the original MDP.
False. Only potential-based shaping is guaranteed to preserve the original optimal policy; other shaping functions can introduce suboptimal policies (e.g., reward loops).
76
Generalized MDPs
As long as the operators used in the updates are non-expansions, we get convergence of Q-learning, value iteration, and related algorithms.
77
examples of non-expansion
order statistics (max, min), fixed convex combinations
78
T/F If you get a good enough approximation of the value function, then you have a bound on how far from optimal the resulting greedy policy is
True
79
T/F Policy Iteration converges at least as fast as VI
True
80
T/F Policy Iteration is more computationally expensive than VI
True
81
Domination
policy 1 dominates policy 2 if for all states the value of that state for policy 1 is greater than or equal to the value of that state for policy 2
82
T/F policy iteration won't get stuck in a local optimum
True
83
T/F value non-deprovement (values never get worse) holds between steps of policy iteration
True
84
How can we change the MDP reward function w/o changing optimal policy (3 ways)
Multiply by a positive constant, shift by a constant, or add nonlinear potential-based shaping rewards.
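The potential-based case in formula form: for any potential function Phi over states, R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s) leaves the optimal policy unchanged (Ng, Harada & Russell, 1999).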
85
T/F initializing q values is the same as using potential based functions
True. Potential-based shaping is equivalent to Q-value initialization: initializing to 0 corresponds to a zero potential, while random initialization amounts to biasing the learner with some arbitrary potential.
86
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
False. If the chosen potential is poorly suited to the problem, shaping can actually slow learning down.
87
T/F potential functions can speed things up
True
88
What is the problem with reward shaping
We can screw things up and create suboptimal policies that optimize for the "helper" rewards rather than the true objective.
89
T/F Rmax will always find the optimal policy for a properly tuned learning function.
False (although true in spirit). The problem is the word 'always': in these PAC-style bounds there is always a small probability that the goal isn't reached, and even when it is, the resulting policy is only near-optimal.
90
Q-learning converges only under certain exploration decay conditions.
False. Convergence of the Q-values is guaranteed by the contraction property of the Bellman operator (given appropriate learning rates and that all state-action pairs are updated infinitely often); no particular exploration-decay schedule is required.
91
The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.
False. You may still need to explore "non-optimal" arms of the bandit since there may be noise in the estimate of its value.
92
T/F Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.
False. SARSA converges to an optimal policy as long as all state-action pairs are visited an infinite number of times and the policy converges to a greedy policy.
93
T/F RL with linear function approximation only works on toy problems.
False. Linear function approximation can work on real problems, and nonlinear functions can be approximated by piecewise-linear constructions.
94
T/F KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.
False. Some hypothesis classes are exponentially harder to learn in the KWIK setting than in the MB setting.
95
T/F With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.
False. Baird's counter-example shows how classic update using linear function approximation can lead to divergence.
96
T/F Applying generalization with an "averager" on an MDP results in another MDP.
True. As noted in Gordon (1995), the resulting MDP has fewer states.
97
Hoeffding and union bounds
A way to quantify how certain we are about an estimate and to know when we are certain enough.
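A small illustration of the Hoeffding piece, assuming rewards bounded in [0, 1]: after n samples, the true mean lies within epsilon of the sample mean with probability at least 1 - delta, where epsilon = sqrt(ln(2/delta) / (2n)). A tiny helper to compute it:

    import math

    def hoeffding_epsilon(n, delta):
        """Half-width of the (1 - delta) confidence interval for n samples in [0, 1]."""
        return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    print(hoeffding_epsilon(n=100, delta=0.05))  # ~0.136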
98
Rmax
optimism in the face of uncertainty
99
Can we combine Rmax and bandit ideas?
Yes: together they handle problems that are both stochastic and sequential.
100
When does generalization work
essentially anything that generalizes a linear model (ANN, CMAC, linear nets, deep nets)
101
Generalization
Some states look like other states, so take into account the features of a state and weight the importance of each feature when making decisions. Replace the tabular update rule with function approximation.
102
T/F averagers can converge to the optimal solution (but do not always)
True. They need anchor points to converge
103
T/F value iteration converges for POMDPs
False, but you can get near-optimal
104
T/F POMDPs are easy to solve
False. However, the value function is piecewise linear and convex, and you can throw away the many vectors that are strictly dominated.
105
Model based POMDP
use expectation maximization to learn the model
106
memoryless POMDP
With a memoryless policy it helps to be random (a stochastic policy can outperform any deterministic memoryless one).
107
Properties of Monte Carlo Tree Search
Useful for large state spaces; needs lots of samples to get good estimates; planning time is independent of the size of the state space; running time is exponential in the horizon.
108
Gittins index only works for
bandits
109
How does monte carlo tree search get q value estimates
random simulation
110
Goal abstraction
Modular RL and arbitration
111
Temporal abstractions (hierarchical RL)
options and semi-MDPs
112
Pavlov strategy
cooperate if agree, defect if disagree
113
Subgame perfect
Always a best response, independent of history.
114
Properties of minimax 1 for 2 player 0 sum games
Value iteration works, minimax-Q converges, and solutions can be computed efficiently.
115
Properties of Nash-Q for multiplayer games
Value iteration does not work and Nash-Q does not converge
116
Does coco generalize to n>2 players
no
117
what is coco
Cooperative-competitive values: decompose a game's payoff into a cooperative (shared) part plus a competitive (zero-sum) part.
118
T/F POMDPs can be planned
False (exact planning is intractable in general; we can only plan approximately)
119
Semi-MDP
Instead of working only with a small set of primitive actions, we can create new temporally extended actions (options) whose transitions reduce the number of steps needed to learn and to reach a goal.
120
T/F Because SMDPs are abstractions of MDPs, they inherit the hierarchical optimality, convergence, and stability of an MDP
FALSE; it holds only with the caveat that we choose the right options. Also, with options many states no longer matter, which helps with exploration.