Final Flashcards

1
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. You can lose a game (like the chess example Prof. Isbell mentioned in one of the earliest videos) with a bad move at the beginning, and no matter how perfectly you play afterwards, you may still lose.

3
Q

An MDP given a fixed policy is a Markov chain with rewards.

A

True. A fixed policy means the agent no longer chooses among actions in each state; it simply transitions from state to state according to that policy, which is exactly a Markov chain (with rewards).

4
Q

If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False, you don't need the transition function: V*(s) = max_a Q*(s, a).

5
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.

A

True. Assuming an infinite horizon, adding the same constant to every reward leaves the optimal policy unchanged.

6
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

True here: the current state is all the agent should know.

Now if you want to discuss what a “current state” is… Well that can get more complicated.

7
Q

A policy that is greedy–with respect to the optimal value function–is not necessarily an optimal policy.

A

False. Acting greedily with respect to the optimal value function is, by definition, optimal.

8
Q

In TD learning, the sum of the learning rates used must converge for the value function to converge.

A

False. The sum of the learning rates must actually diverge; it is the sum of their squares that must be finite.

9
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.

A

False. Being unbiased is not the only consideration; as pointed out, there are other things (such as variance and data efficiency) you might want to consider as well…

10
Q

The value of the returned policy is the only metric we care about when evaluating a learner.

A

False. This is somewhat subjective, but machine time, data efficiency, and the expertise required from data scientists are all additional things to consider.

11
Q

T/F POMDPs allow us to strike a balance between actions that gain reward and actions that gain information

A

TRUE. This is all folded into one model (there is no special mechanism needed to do this).

12
Q

T/F DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions to maximize utility

A

TRUE

13
Q

DEC-POMDP stands for

A

Decentralized Partially Observable Markov Decision Process.

14
Q

The primary difference between POMDPs and DEC-POMDPs

A

Actions are taken simultaneously by a finite set of agents (not just 1) in DEC-POMDPs

15
Q

DEC-POMDPs vs POSGs

A

In a DEC-POMDP, all agents share the same reward function (they are working together); in a POSG, each agent has its own reward.

16
Q

T/F DEC-POMDPs are represented by Ri (diff reward for each agent)

A

FALSE. There is a shared reward (all agents work together). If the reward weren’t shared, the model would be a POSG.

17
Q

Properties of DEC-POMDP

A
1. Elements of game theory and POMDPs
2. NEXP-complete (for finite horizon)

18
Q

Inverse RL

A

Input: behavior and environment. Output: the reward function.

19
Q

MLRL

A

Guess R, compute the policy, measure the probability of the data given that policy, then compute the gradient with respect to R to find how to change the reward so it fits the data better.

20
Q

In reward shaping, if the human believes X, Y, Z with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, what action should they choose?

A

Choose action Z because its pairwise product is highest: argmax_a p(a|policy1) * p(a|policy2).
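A quick check of that arithmetic (a minimal sketch; the action names and probabilities are just the ones from the question):

    # Pairwise products of the human's and the algorithm's action probabilities.
    human = {"X": 2/3, "Y": 1/6, "Z": 1/6}
    algo = {"X": 1/10, "Y": 1/10, "Z": 8/10}

    products = {a: human[a] * algo[a] for a in human}   # X: 0.067, Y: 0.017, Z: 0.133
    print(max(products, key=products.get))              # -> Z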

21
Q

Drama Management

A

There is a 3rd agent, the “author”, that wants to build an agent that influences the player. For example, the author created Pac-Man, the agent is the game itself, and the player is the player.

22
Q

TTD-MDPs vs MDPs

A

Rather than states, TTD-MDPs have trajectories (the sequence so far), and rather than rewards, they have a target distribution P(T) over trajectories.

23
Q

Value Iteration Algorithm

A

Start with arbitrary utilities.
Update each state’s utility based on its reward plus the discounted utilities of its neighbors.
Repeat until convergence. (See the sketch below.)
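A minimal sketch of that loop, assuming a small tabular MDP with transition probabilities T[s][a][s'] and state rewards R[s] (the toy MDP and names below are made up for illustration, not from the lectures):

    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-6):
        """T: (S, A, S) transition probabilities, R: (S,) state rewards."""
        U = np.zeros(R.shape[0])                     # start with arbitrary utilities
        while True:
            # reward plus discounted expected utility of neighbors, maximized over actions
            U_new = R + gamma * np.max(T @ U, axis=1)
            if np.max(np.abs(U_new - U)) < tol:      # repeat until convergence
                return U_new
            U = U_new

    T = np.array([[[0.9, 0.1], [0.2, 0.8]],          # tiny 2-state, 2-action example
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([0.0, 1.0])
    print(value_iteration(T, R))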

24
Q

Supervised Learning

A

y = f(x). Function approximation: find the f that maps x to y.

25
Q

Unsupervised Learning

A

f(x): find clusters / a compact description of the data.

26
Q

Reinforcement Learning

A

y = f(x), but we are given x’s and z’s. We are still trying to find the f that generates y; here the reward r plays the role of z.

27
Q

MDP stands for

A

Markov Decision Process.

28
Q

MDPs are made up of

A

States, actions, transitions (a model), and rewards; solving an MDP yields a policy.

29
Q

Markovian Property

A

Only the present matters AND things are stationary (rules/the world doesn’t change over time)

30
Q

Delayed Rewards

A

In the chess example, you might make a bad move early on that you can never recover from; that bad move needs to be reflected in the (much later) reward.

31
Q

Temporal Credit Assignment Problem

A

The problem of determining which actions in a sequence led to (and deserve credit for) a given outcome.

32
Q

How to change policies to account for finite horizons

A

policy(s, t): the policy becomes a function of state AND time.

33
Q

Utility of Sequences (stationary preferences)

A

if you prefer one sequence of states over another today, you prefer the same sequence tomorrow

34
Q

How to calculate infinite horizons without infinity?

A

Use discounted future rewards (a discount factor gamma < 1). With rewards bounded by Rmax, the infinite sum is then finite: sum over t of gamma^t * Rmax = Rmax / (1 - gamma).

35
Q

Reward vs Utility

A

Reward is the immediate payoff of a state. Utility is the long-term payoff from that point on; it takes delayed rewards into account.

36
Q

Policy Iteration

A

Start with an initial policy (a guess).

Evaluate: given the policy, calculate the utility of each state.
Improve: the new policy picks, in each state, the action that maximizes that utility.
Repeat until the policy stops changing. (See the sketch below.)
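A minimal sketch of those two steps under the same hypothetical tabular setup (T[s][a][s'], R[s]); here the evaluation step solves the linear system exactly rather than iterating:

    import numpy as np

    def policy_iteration(T, R, gamma=0.9):
        n_states = R.shape[0]
        policy = np.zeros(n_states, dtype=int)               # initial guess
        while True:
            # evaluate: solve U = R + gamma * T_pi U for the current policy
            T_pi = T[np.arange(n_states), policy]
            U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
            # improve: in each state pick the action that maximizes expected utility
            new_policy = np.argmax(T @ U, axis=1)
            if np.array_equal(new_policy, policy):
                return policy, U
            policy = new_policy

    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([0.0, 1.0])
    print(policy_iteration(T, R))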

37
Q

T/F Policy Iteration won’t converge

A

False. Policy iteration is guaranteed to converge (in a finite number of iterations for a finite MDP).

38
Q

On Policy vs Off-Policy

A

Off-policy: estimates the Q-values (state-action values) of the greedy target directly, regardless of the policy the agent is actually following (Q-learning).
On-policy: updates its Q-values using the Q-value of the next state s′ and the action a′ the current policy actually takes; it estimates the return assuming the current policy continues to be followed (SARSA). (See the sketch below.)
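A minimal sketch of the two update rules side by side (hypothetical names; Q maps states to dictionaries of action values, alpha is the learning rate, gamma the discount):

    # Off-policy (Q-learning): bootstrap from the best next action,
    # regardless of which action the behavior policy actually takes next.
    def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
        target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])

    # On-policy (SARSA): bootstrap from the action a_next the current policy takes next.
    def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
        target = r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])

    Q = {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 0.0}}
    q_learning_update(Q, "s0", "a1", 1.0, "s1", alpha=0.1, gamma=0.9)
    sarsa_update(Q, "s0", "a1", 1.0, "s1", "a0", alpha=0.1, gamma=0.9)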

39
Q

T/F TD update rule always converges with any learning rate

A

False. It converges only if the learning rates satisfy sum over t of alpha_t = infinity and sum over t of alpha_t^2 < infinity.
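A rough numeric illustration of those two conditions for the common choice alpha_t = 1/t (used here purely for illustration): the plain sum keeps growing without bound, while the sum of squares levels off near pi^2/6 ≈ 1.645.

    partial_sum, partial_sum_sq = 0.0, 0.0
    for t in range(1, 1_000_001):
        alpha = 1.0 / t
        partial_sum += alpha          # diverges (grows like log t)
        partial_sum_sq += alpha ** 2  # converges toward pi^2 / 6
    print(partial_sum, partial_sum_sq)   # roughly 14.39 and 1.645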

40
Q

T/F TD(1) is the same as outcome-based updates (if no repeated states)

A

True, and with even more learning, because updates don’t have to wait for the episode to be over.

41
Q

Maximum Likelihood Estimate vs outcome-based estimate (TD(1))

A

Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare thing happens in a run, TD(1) can be skewed by it (high variance). (This leads us to TD(lambda).)

42
Q

T/F TD(0) is the same as maximum likelihood estimate

A

TRUE, if we run over the data repeatedly (repeated presentations), TD(0) converges to the maximum likelihood estimate.

43
Q

T/F TD(lambda) is weighted combination of k step estimators

A

True

44
Q

T/F TD(1) typically has less error than TD(0)

A

False. TD(1) typically has more error than TD(0)

45
Q

T/F TD(0) has the least amount of error usually

A

False. An intermediate TD(lambda) usually performs best; lambda values around 0.3-0.7 typically give the lowest error.

46
Q

Temporal Difference is

A

The difference between value estimates (of reward to come) as we go from one step to the next.
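In code form (a minimal sketch with hypothetical names), the temporal difference at one step, and the TD(0) update built on it, look like:

    def td0_update(V, s, r, s_next, alpha, gamma):
        # temporal difference: value estimate after the step minus the estimate before it
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta
        return delta

    V = {"s0": 0.0, "s1": 1.0}
    print(td0_update(V, "s0", 0.5, "s1", alpha=0.1, gamma=0.9))   # delta = 1.4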

47
Q

T/F reward must be scalar

A

TRUE

48
Q

T/F environment is visible to the agent

A

False, usually

49
Q

T/F all representations of state have same results

A

False. See the cheese/rat example in David Silver’s lectures: remembering the past 3 states is different from remembering the past 4 or 5.

50
Q

Partially Observable

A

The agent only indirectly observes the state (agent state != environment state), e.g., a robot with a camera. This is the POMDP setting.

51
Q

T/F Model is the environment

A

False. Model is the agent’s idea of the environment

52
Q

RL vs Planning

A

RL: model is unknown, the agent performs actions
Planning: the model is known, and the agent performs computations with it (no interaction needed).

53
Q

B is a Contraction Mapping when

A

||BF - BG|| <= gamma * ||F - G|| (max norm) for some 0 <= gamma < 1; i.e., applying B to F and to G shrinks the maximum difference between them by at least a factor of gamma.
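A small numeric check of what that property buys us, using a made-up Bellman-style backup B(F) = R + gamma * T F (the matrix and vectors below are illustrative only): applying B to two different value functions shrinks their max-norm distance by at least a factor of gamma.

    import numpy as np

    gamma = 0.9
    T = np.array([[0.5, 0.5], [0.1, 0.9]])   # row-stochastic transition matrix
    R = np.array([1.0, 0.0])

    def B(F):
        return R + gamma * T @ F

    F, G = np.array([3.0, -1.0]), np.array([0.0, 5.0])
    before = np.max(np.abs(F - G))
    after = np.max(np.abs(B(F) - B(G)))
    print(before, after, after <= gamma * before)   # the distance shrinks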

54
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

True. As also noted, you would have to wait for a complete episode to compute the forward view (i.e., the forward view does not work “online”)…

55
Q

gamma of 0 means

A

you care only about the immediate reward (short-sighted)

56
Q

gamma of 1 means

A

you care about the future a lot (far-sighted)

57
Q

why use gamma and discount future rewards

A

avoid infinite returns and also account for a model not being perfect (future reward is not guaranteed)

58
Q

T/F You can always flatten an MDP to a Markov Chain

A

True

59
Q

Difference between MDP and Markov Reward Process

A

An MDP is a Markov reward process with decisions: it adds actions the agent can choose to influence the (stochastic) transitions.

60
Q

T/F it has been proven that both forward and backward view of TD(lambda) are equivalent

A

True

61
Q

What is forward TD(lambda)

A

The forward view looks at all n-steps ahead and uses λ to essentially decay those future estimates.

62
Q

What is backward TD(lambda)

A

The backward view of TD(λ) updates values at each step: after each step in the episode you update all previously visited states, using an eligibility trace.
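A minimal sketch of one backward-view step with an accumulating eligibility trace (hypothetical tabular setup, not code from the course):

    def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam):
        """One backward-view TD(lambda) step over tabular values V and traces e."""
        delta = r + gamma * V[s_next] - V[s]
        e[s] += 1.0                        # bump the trace for the state just visited
        for state in V:                    # update all prior states through their traces
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam        # decay every trace

    V = {"s0": 0.0, "s1": 0.0}
    e = {"s0": 0.0, "s1": 0.0}
    td_lambda_step(V, e, "s0", 1.0, "s1", alpha=0.1, gamma=0.9, lam=0.7)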

63
Q

Online learning vs offline learning

A

Online learning means that you are doing it as the data comes in. Offline means that you have a static dataset.
Let’s say you want to build a classifier to recognize spam. You can acquire a large corpus of e-mail, label it, and train a classifier on it. This would be offline learning. Or, you can take all the e-mail coming into your system, and continuously update your classifier (labels may be a bit tricky). This would be online learning

64
Q

SARSA vs Q- Learning

A

In Q-learning, you update the estimate from the maximum over possible next actions, regardless of which action you actually took. In SARSA, you update the estimate using the same action that the current policy then takes.

65
Q

T/F Offline algorithms are generally superior to online algorithms.

A

False, generally; it depends on the problem and a number of other things.

66
Q

T/F Given a model (T, R) that we can also sample from, we should first try TD learning.

A

False. If you have the model, one would expect it to be more efficient to use it directly (e.g., with value iteration).

67
Q

T/F TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. It is TD(0) that propagates information slowly and therefore benefits from repeated presentations.

68
Q

T/F In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

A

Generally True, same shape but different minima

69
Q

T/F In general, an update rule which is not a non-expansion will not converge.

A

True.

A non-expansion is {constant, contraction}, and a not non-expansion is not a constant and not a contraction, therefore it is an expansion and expansions diverge.

70
Q

T/F MDPs are a type of Markov game.

A

True. Markov games with only one player action space and a corresponding single reward function are MDPs.

71
Q

T/F Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. They are closely related: a non-expansion combined with a contraction still yields a contraction, which is what provides the convergence guarantees.

72
Q

T/F Linear programming is the only way we are able to solve MDPs in linear time.

A

False. First, LP takes super-linear (polynomial) time. Second, LP is only one part of a potential RL algorithm. Third, there are many methods that solve MDPs without LP (though none in linear time).

73
Q

T/F The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)

A

False. The objective of the dual LP is to maximize “policy flow”, subject to conservation-of-flow constraints. The minimization belongs to the primal LP, which finds the least upper bound of the value function over states and actions.

74
Q

T/F max is non-expansion

A

True

75
Q

T/F Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

False. Only potential-based shaping is guaranteed to preserve the original optimal policy; other shaping can introduce sub-optimal policy loops.

76
Q

Generalized MDPs

A

As long as the operators doing the updates are non-expansions, we get convergence of Q-learning, value iteration, etc.

77
Q

examples of non-expansion

A

order statistics (max, min), fixed convex combinations

78
Q

T/F If you get a good enough approximation of the value function, then you have a bound on how far off you are.

A

True

79
Q

T/F Policy Iteration converges at least as fast as VI

A

True

80
Q

T/F Policy Iteration is more computationally expensive than VI

A

True

81
Q

Domination

A

policy 1 dominates policy 2 if for all states the value of that state for policy 1 is greater than or equal to the value of that state for policy 2

82
Q

T/F policy iteration won’t get stuck in a local optimal

A

True

83
Q

T/F Value “non-deprovement” holds between steps of policy iteration (the value never gets worse)

A

true

84
Q

How can we change the MDP reward function w/o changing optimal policy (3 ways)

A

Multiply by a positive constant, shift by a constant, or add nonlinear potential-based shaping rewards.
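For the third option, a minimal sketch of the potential-based form, with a made-up potential function phi (the goal-at-10 potential is purely illustrative):

    def shaped_reward(r, s, s_next, phi, gamma):
        # F(s, s') = gamma * phi(s') - phi(s); adding F preserves the optimal policy
        return r + gamma * phi(s_next) - phi(s)

    phi = lambda s: -abs(10 - s)             # prefers states closer to a goal at 10
    print(shaped_reward(0.0, s=3, s_next=4, phi=phi, gamma=0.99))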

85
Q

T/F initializing q values is the same as using potential based functions

A

True. Initializing the Q-values acts like a potential function: initializing to 0 is neutral, while a random initialization means you are biasing the learner.

86
Q

Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

False… if the selected potential is wrong, it will slow things down.

87
Q

T/F potential functions can speed things up

A

True

88
Q

What is the problem with reward shaping

A

We can screw things up and create suboptimal policies that are optimized for these “helper” rewards (for example, looping forever to collect a shaping bonus).

89
Q

T/F Rmax will always find the optimal policy for a properly tuned learning function.

A

False (although true in spirit). The problem is the word ‘always.’ As Jacob alludes to, in these PAC-style bounds, there is always a small probability that the desired goal isn’t reached, and if reached it’s not quite optimal.

90
Q

Q-learning converges only under certain exploration decay conditions.

A

False. Convergence is guaranteed by the contraction property of the Bellman operator (given the usual learning-rate and visitation conditions), which does not include any assumptions on the exploration decay rate.

91
Q

The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.

A

False. You may still need to explore “non-optimal” arms of the bandit since there may be noise in the estimate of its value.

92
Q

T/F Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.

A

False. SARSA converges to an optimal policy as long as all state-action pairs are visited an infinite number of times and the policy converges to the greedy policy.

93
Q

T/F RL with linear function approximation only works on toy problems.

A

False. Nonlinear functions can be approximated by piecewise-linear constructions.

94
Q

T/F KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.

A

False. Some hypothesis classes are exponentially harder to learn in the KWIK setting than in the MB setting.

95
Q

T/F With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.

A

False. Baird’s counter-example shows how classic update using linear function approximation can lead to divergence.

96
Q

T/F Applying generalization with an “averager” on an MDP results in another MDP.

A

True. As noted in Gordon (1995), you will have fewer states…

97
Q

Hoeffding and union bounds

A

A way to quantify how much we know (how certain we are), and to know when we are certain enough.

98
Q

Rmax

A

optimism in the face of uncertainty

99
Q

Can we combine Rmax and bandits?

A

Yes; the combination handles the stochastic + sequential setting.

100
Q

When does generalization work

A

essentially anything that generalizes a linear model (ANN, CMAC, linear nets, deep nets)

101
Q

Generalization

A

Some states may look like other states, so take into account features of states and weight the importance of each feature when making a decision. The tabular update rule is replaced with a function-approximation update.
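A minimal sketch of what replacing the tabular update with function approximation looks like for a linear value function (the feature function and names are hypothetical):

    import numpy as np

    def linear_td0_update(w, features, s, r, s_next, alpha, gamma):
        """TD(0) update on the weights w of a linear value function V(s) = w . features(s)."""
        x, x_next = features(s), features(s_next)
        delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)
        w += alpha * delta * x               # each feature is weighted by its importance
        return w

    features = lambda s: np.array([1.0, float(s)])   # made-up 2-feature representation
    w = np.zeros(2)
    print(linear_td0_update(w, features, s=1, r=1.0, s_next=2, alpha=0.1, gamma=0.9))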

102
Q

T/F Averagers can converge to the optimal solution (but do not always)

A

True. Need anchor points to converge

103
Q

T/F value iteration converges for POMDPs

A

False, but you can get near-optimal

104
Q

T/F POMDPs are easy to solve

A

False. Piecewise-linear and convex value functions can represent them, and you can throw away a lot of the pieces that are strictly dominated.

105
Q

Model based POMDP

A

use expectation maximization to learn the model

106
Q

memoryless POMDP

A

It helps to be random (use a stochastic policy).

107
Q

Properties of Monte Carlo Tree Search

A

Useful for large state spaces; needs lots of samples to get good estimates; planning time is independent of the size of the state space; running time is exponential in the horizon.

108
Q

Gittens index only works for

A

bandits

109
Q

How does monte carlo tree search get q value estimates

A

random simulation

110
Q

Goal abstraction

A

Modular RL and arbitration

111
Q

Temporal abstractions (hierarchical RL)

A

options and semi-MDPs

112
Q

Pavlov strategy

A

cooperate if agree, defect if disagree

113
Q

Subgame perfect

A

always best response independent of history

114
Q

Properties of minimax-Q for 2-player zero-sum games

A

Value iteration works, minimax-Q converges, and it is efficient.

115
Q

Properties of Nash-Q for multiplayer (general-sum) games

A

Value iteration does not work, and Nash-Q does not converge.

116
Q

Does coco generalize to n>2 players

A

no

117
Q

what is coco

A

cooperative-competitive values

118
Q

T/F POMDPs can be planned

A

False

119
Q

Semi-MDP

A

Instead of looking only at a very small set of primitive actions, we can create new, temporally extended actions whose transitions help reduce the number of steps needed to learn and move toward a goal.

120
Q

T/F Because SMDPs abstract an MDP, they inherit the hierarchical optimality, convergence, and stability of the MDP

A

FALSE; it is true only with the caveat that we choose the right options. Also, the abstracted-away states don’t matter as much, which helps with exploration.