Final Flashcards
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0
In RL, recent moves influence outcomes more than moves further in the past.
False. You can lose a game at the very beginning (like the chess example Prof. Isbell mentioned in one of the earliest videos), and no matter how perfectly you play afterwards, you might still lose.
An MDP given a fixed policy is a Markov chain with rewards.
True. A fixed policy means the agent has no choice of action in each state; it transitions from state to state according to that policy, which is exactly a Markov chain (with rewards).
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you don’t need the transition function.
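For reference, the relation that makes the transition model unnecessary:

```latex
V^{*}(s) = \max_{a} Q^{*}(s, a)
```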
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True, assuming an infinite horizon the optimal policy will be unchanged
Markov means RL agents are amnesiacs and forget everything up until the current state.
True: the current state is all the agent needs to know.
Now if you want to discuss what a “current state” is… Well that can get more complicated.
A policy that is greedy–with respect to the optimal value function–is not necessarily an optimal policy.
False. Taking the greedy action with respect to the optimal value function is, by definition, optimal.
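One common way to write the greedy extraction this card refers to (T is the transition model):

```latex
\pi^{*}(s) = \operatorname*{argmax}_{a} \sum_{s'} T(s, a, s')\,\big[ R(s, a, s') + \gamma V^{*}(s') \big]
```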
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False. The learning rates must sum to infinity; it is the sum of their squares that must be finite.
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. Monte Carlo is unbiased but typically has higher variance than TD, so there are other trade-offs to consider as well.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. This is somewhat subjective, but computation time, data efficiency, and the experience required of the data scientist are all additional things to consider.
T/F POMDPs allow us to strike a balance between actions to gain reward and actions to gain information
TRUE. This is all folded into one model (no special mechanism is needed to trade off gaining reward against gaining information)
T/F DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions to maximize utility
TRUE
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPS
Actions are taken simultaneously by a finite set of agents (not just 1) in a DEC-POMDP
DEC-POMDPs vs POSG
In a DEC-POMDP, all agents share one reward function (they are working together); in a POSG, each agent has its own reward function.
T/F DEC-POMDPs are represented by Ri (diff reward for each agent)
FALSE. There is one shared reward (all agents are working together). If the reward weren’t shared, the model would be a POSG.
Properties of DEC-POMDP
- Elements of game theory and POMDPs
- NEXP-complete (for finite horizon)
Inverse RL
Input: behavior and the environment. Output: the reward function.
MLIRL (Maximum Likelihood Inverse Reinforcement Learning)
Guess R, compute the policy, measure the probability of the data given that policy, then compute the gradient with respect to R to find how to change the reward so it fits the data better.
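A minimal sketch of that loop, assuming a small tabular MDP with a known transition tensor P[s, a, s'], a Boltzmann (softmax) policy, and a finite-difference gradient standing in for the analytic one; all names here (P, demos, soft_value_iteration) are illustrative, not from the lectures:

```python
# Sketch of the guess-R / fit-policy / climb-likelihood loop (illustrative).
import numpy as np

def soft_value_iteration(R, P, gamma=0.95, beta=5.0, iters=100):
    """Boltzmann (softmax) policy for a candidate reward vector R[s]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R[:, None] + gamma * (P @ V)                    # Q[s, a]
        V = np.log(np.exp(beta * Q).sum(axis=1)) / beta     # soft max over actions
    Q = R[:, None] + gamma * (P @ V)
    pi = np.exp(beta * Q)
    return pi / pi.sum(axis=1, keepdims=True)               # pi[s, a]

def log_likelihood(R, P, demos):
    """Log-probability of demonstrated (state, action) pairs under pi_R."""
    pi = soft_value_iteration(R, P)
    return sum(np.log(pi[s, a]) for s, a in demos)

def mlirl(P, demos, steps=200, lr=0.1, eps=1e-4):
    """Guess R, then repeatedly nudge it so the demos become more likely."""
    R = np.zeros(P.shape[0])                                # initial guess for the reward
    for _ in range(steps):
        base = log_likelihood(R, P, demos)
        grad = np.zeros_like(R)
        for s in range(len(R)):                             # finite-difference gradient in R
            Rp = R.copy(); Rp[s] += eps
            grad[s] = (log_likelihood(Rp, P, demos) - base) / eps
        R += lr * grad                                      # change R to fit the data better
    return R
```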
In policy shaping, if the human believes actions X, Y, Z with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, what action should they choose?
Choose action Z, because its pairwise product is highest: argmax_a p(a|policy1) * p(a|policy2)
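Worked out with the numbers on this card (hypothetical dictionaries, just to show the pairwise-product rule):

```python
# Combine the two belief distributions by pairwise product, then take the argmax.
human = {"X": 2/3, "Y": 1/6, "Z": 1/6}
agent = {"X": 0.1, "Y": 0.1, "Z": 0.8}
combined = {a: human[a] * agent[a] for a in human}
# combined ≈ {'X': 0.067, 'Y': 0.017, 'Z': 0.133} -> Z wins
best_action = max(combined, key=combined.get)
```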
Drama Management
There is a 3rd agent, the “author,” who wants to build an agent that intervenes in the player’s experience. In the Pac-Man analogy: the author created Pac-Man, the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
Rather than states, TTD-MDPs have trajectories (the sequence so far), and rather than rewards, they have a target distribution p(T) over trajectories.
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence
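A minimal sketch of those three steps, assuming a tabular MDP with a transition tensor P[s, a, s'] and state rewards R[s] (names here are illustrative, not from the lectures):

```python
# Minimal value iteration sketch for a tabular MDP (illustrative names).
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s'] = transition probabilities, R[s] = state rewards."""
    n_states, n_actions, _ = P.shape
    U = np.zeros(n_states)                     # start with arbitrary utilities
    while True:
        # update utilities from reward + discounted expected neighbor utility
        Q = R[:, None] + gamma * (P @ U)       # Q[s, a]
        U_new = Q.max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:    # repeat until convergence
            return U_new
        U = U_new
```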
Supervised Learning
y = f(x). Function approximation: find the f that maps x to y.
Unsupervised Learning
f(x). Find a concise description (e.g., clusters) of the data.
Reinforcement Learning
y = f(x), but we are given x and z. We are still trying to find f to generate y; the reward r plays the role of z.
MDP stands for
Markov Decision Processes.
MDPs are made up of
States, actions, transitions (the model), and rewards; from these we derive a policy.
Markovian Property
Only the present matters AND things are stationary (rules/the world doesn’t change over time)
Delayed Rewards
In the chess example, you might make a bad move early on that you can never recover from; that bad move needs to be reflected in the reward, even though the reward arrives much later.
Temporal Credit Assignment Problem
The problem of determining which actions in a sequence led to a certain outcome.
How to change policies to account for finite horizons
π(s, t): the policy becomes a function of state AND time.
Utility of Sequences (stationary preferences)
if you prefer one sequence of states over another today, you prefer the same sequence tomorrow
How to calculate infinite horizons without infinity?
Use discounted future rewards (use gamma)
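The standard bound showing why the discounted sum stays finite (R_max is the largest single-step reward):

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1 - \gamma}, \qquad 0 \le \gamma < 1
```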
Reward vs Utility
Reward is the immediate payoff of a state; utility is the long-term payoff, which takes delayed rewards into account.
Policy Iteration
Start with an initial policy (a guess)
Evaluate: given the policy, calculate its utility
Improve: the policy at t+1 picks, in each state, the action that maximizes utility
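A minimal sketch of the evaluate/improve loop, under the same tabular-MDP assumptions as the value iteration sketch above (P[s, a, s'] and R[s] are illustrative):

```python
# Minimal policy iteration sketch for a tabular MDP (illustrative names).
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)         # start with an initial (guessed) policy
    while True:
        # evaluate: solve U = R + gamma * P_pi @ U for the current policy
        P_pi = P[np.arange(n_states), pi]      # P_pi[s, s']
        U = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # improve: pick, in each state, the action that maximizes utility
        Q = R[:, None] + gamma * (P @ U)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):         # converged: the policy stopped changing
            return pi, U
        pi = pi_new
```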
T/F Policy Iteration won’t converge
False. It is guaranteed to converge: there are finitely many policies and each iteration improves (or keeps) the current one.
On Policy vs Off-Policy
Off-policy methods estimate the Q values (state-action values) directly, regardless of the policy the agent is actually following (e.g., Q-learning).
On-policy methods update their Q values using the Q value of the next state s′ and the action a′ chosen by the current policy; they estimate the return for state-action pairs assuming the current policy continues to be followed (e.g., SARSA).
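The two update rules side by side, in standard notation (α is the learning rate):

```latex
\text{Q-learning (off-policy):}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]
\text{SARSA (on-policy):}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma\, Q(s',a') - Q(s,a) \,\big], \quad a' \text{ chosen by the current policy}
```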
T/F TD update rule always converges with any learning rate
False. It converges only if the learning rates sum to infinity while the sum of their squares is finite (conditions written out below).
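The two step-size conditions, written out (α_t is the learning rate at step t):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```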
T/F TD(1) is the same as outcome-based updates (if no repeated states)
True, and with even more learning, because updates don’t have to wait for the episode to end.
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare event happens in a run, the TD(1) estimate can be thrown off (high variance). (This leads us to TD(lambda).)
T/F TD(0) is the same as maximum likelihood estimate
TRUE, if we run over the data repeatedly.
T/F TD(lambda) is weighted combination of k step estimators
True
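The weighting, written out (E_k is the k-step estimator of the return):

```latex
V^{\lambda}(s_t) = (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{\,k-1} E_k(s_t)
```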
T/F TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
T/F TD(0) has the least amount of error usually
False. TD(lambda) with an intermediate lambda usually performs best; lambda values around 0.3–0.7 typically work well.
Temporal Difference is
The difference between value estimates (the reward plus the discounted next-state value vs. the current state’s value) as we go from one step to the next.
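In symbols, the temporal difference at step t:

```latex
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)
```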
T/F reward must be scalar
TRUE
T/F environment is visible to the agent
False, usually