Final Exam Flashcards
What is an MDP?
A Markov decision process: a model of sequential decision making in which an agent in a state takes an action, receives a reward, and transitions to a new state according to probabilities that depend only on the current state and action.
What are the components of an MDP?
States, actions, a transition function (model), rewards, and (usually) a discount factor: (S, A, T, R, γ).
Is it possible to have multiple terminal states in an MDP?
Yes
What is the Markov property?
The next state depends only on the current state (and action), not on the history. Everything one needs to know about the environment is encoded in the current state; there is no need to access data from past steps.
What is Value Iteration?
Using the Bellman optimality equation to update the value of each state iteratively until convergence. Each sweep implicitly combines one step of policy evaluation with policy improvement; the explicit optimal policy is extracted greedily from the converged values.
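A minimal sketch of tabular value iteration, assuming a toy model format where P[s][a] is a list of (prob, next_state, reward) tuples (this representation and the function names are illustrative, not from the flashcards):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """P[s][a] is a list of (prob, next_state, reward) tuples (assumed toy format)."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup for state s
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # proper convergence criterion
            break
    # extract the greedy (optimal) policy from the converged values
    policy = [int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```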
What is a policy?
A policy is a mapping from states to actions for an MDP.
What is the meaning of the Bellman equations? What do they do?
“(the Bellman equation) expresses a relationship between the value of a state and the values of its successor states.” S&B 2020
OR
An update rule for the value of a state based on the seen reward and the expected value of the next state.
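For reference, the Bellman equation for the state-value function of a policy π (in S&B 2020 notation):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]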
Is it possible that value iteration never converges on your formulation of MDP with a proper convergence criterion?
No. “(Policy and Value Iteration)… All of these algorithms converge to an optimal policy for discounted finite MDPs.” S&B 2020.
Does policy iteration or value iteration converge faster? Give an explanation of your answer.
Both algorithms are guaranteed to converge to an optimal policy. Policy iteration typically converges in fewer iterations, though each of its iterations is more expensive because it includes a full policy-evaluation step; in practice policy iteration is often reported to finish faster than value iteration.
Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks, and describe the type of state that allows you to unify them. (Hint: you may find section 3.4 of S&B helpful.)
Episodic - sequences of states that terminate and then restart; for example, a card game reaches the end of a hand and then restarts with a shuffle/deal, so each episode is finite.
Continuing - sequences of states with no definable end, like predicting the weather: the days continue with new weather each time, and you never reach a last day and restart the episode.
You can unify them by making the last state in episodic tasks act as a sink that transitions to itself with probability 1.0 and reward = 0.
We use the discount factor gamma to keep the return of a continuing task finite; but discounting alone never makes a task episodic, since you never reset to the original state.
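The unified return from S&B 2020, Section 3.4, which covers both cases by allowing either T = ∞ or γ = 1 (but not both):

G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k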
True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?
True. The Markov property states that an agent only needs the data of the current state. However, past experiences/observations can themselves be encoded into the current state.
True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?
False. There is no general rule: how much a move influences the outcome depends on the structure of the MDP and the discount factor, and an early move can matter just as much as (or more than) a recent one.
True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?
True. Adding the same constant to every reward would not change the optimal policy; this is not considered reward shaping.
True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?
True. With a fixed policy, the action in each state is determined, so the state-to-state transition probabilities are fixed and the process reduces to a Markov chain with rewards attached to its transitions (a Markov reward process).
True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?
False. We can always convert one: make the last state an absorbing state that loops back onto itself with probability 1.0 and reward 0, so the process can go on forever.
True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?
False. V(s) = max_a Q(s,a); that is, V(s) is the value of the action that maximizes Q(s,a). If we already have the Q-values for every action in a given state, V(s) is simply the largest number in that set of Q-values. No transition matrix is needed.
True or False: The optimal policy for any MDP can be found in polynomial time. Why?
True - “A DP (Dynamic Programming) method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is k^n” S&B 2020
True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?
False. A policy that is greedy with respect to the optimal value function is an optimal policy by definition; this is exactly how an optimal policy is extracted from the optimal values.
What is TD? In which algorithms is TD used?
Temporal Difference. Classical TD includes TD(0) (one-step), TD(1) (Monte Carlo), and TD(λ) (weighted look-ahead). A one-step TD update is the basis of bootstrapping RL algorithms such as Q-learning, SARSA, and DQN, and TD-style targets also appear in actor-critic methods such as PPO.
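A minimal sketch of the one-step TD(0) value update (the function and variable names are illustrative, not from the flashcards):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Move V[s] toward the bootstrapped one-step target r + gamma * V[s_next]."""
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]   # the temporal-difference error (delta)
    V[s] += alpha * td_error
    return V
```

Here V is a dict or array of state-value estimates, updated after every observed transition (s, r, s_next).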
What are model-free methods?
Methods that do not use transition probabilities and reward functions
“Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning” - S&B 2020
What is the prediction problem?
Estimating/Predicting the value of each state.
As opposed to the Control Problem, in which we can take actions to impact the sequence of states.
What is a n-step estimator?
This is an estimator of a state's value that looks ahead n time steps: it uses the next n rewards plus the discounted value estimate of the state reached after n steps.
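The corresponding n-step return (in S&B 2020 notation, with V the current value estimate):

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})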
What is a Monte-Carlo return?
The return of a full episode: the (discounted) sum of all rewards received until termination.
What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?
0, because there are no future rewards for being in the terminal state.
What’s the formula to view TD(λ) as a function of n-step estimators?
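The λ-return, a geometrically weighted average of the n-step returns (S&B 2020):

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}

For an episode ending at time T, the remaining weight \lambda^{T-t-1} falls on the full Monte Carlo return G_t, which is why TD(1) reduces to Monte Carlo and TD(0) to the one-step estimator.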
What can we say about the bias-variance trade-off between a 1-step estimator versus a Monte Carlo estimator (in other words, which estimator has a higher variance and which estimator has a higher bias?)
The 1-step estimator has higher bias because it bootstraps from the current (possibly inaccurate) value estimate, whereas the Monte Carlo estimator has higher variance because the full return depends on many random rewards and transitions over an entire episode.
What are some advantages of TD methods compared to dynamic programming methods? to Monte Carlo methods?
With TD you do not need a model, whereas with dynamic programming you do. Compared with Monte Carlo, TD estimates of the return have lower variance and can be updated online, before an episode finishes.
Dynamic programming methods need a model, Monte Carlo methods will have high variance and TD methods can have high bias
True or False: In TD learning, the sum of the learning rates used must converge for the value function to converge. Why?
False. The sum of the learning rates must not converge; it must diverge to infinity. However, the sum of the squares of the learning rates must converge (be finite).
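These are the standard stochastic-approximation (Robbins–Monro) conditions on the learning rates \alpha_t:

\sum_{t} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t} \alpha_t^{2} < \infty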
True or False: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks. Why?
False. MC is an unbiased estimator, but it also has high variance. TD methods are often preferred because they reduce variance and converge toward the value estimates consistent with the maximum-likelihood model of the MDP; TD(λ) also allows a trade-off between bias and variance.
True or False: Backward and forward TD(λ) can be applied to the same problems. Why?
Technically True, but backward methods offer a convenient way of learning online when experience cannot be repeated or is expensive to generate. If episode data is cheap or stored, then using a backward view (with eligibility traces) does not have a major benefit.
True or False: Offline algorithms are generally superior to online algorithms. Why?
False, there are cases where either will do better. E.g. online algorithms can better deal with non-stationary tasks.
True or False: Given a model (T,R) we can also sample in, we should first try TD learning. Why?
False. If we already have the model (T, R), we can plan directly with dynamic programming methods such as value iteration rather than learning value estimates from samples. TD learning is most useful when the model is unknown, or too large to enumerate, and we can only learn from sampled transitions.
True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?
False. TD(1) is MC: at the end of an episode it propagates the full return back to every state visited, so it does not propagate information slowly. In Sutton (1988)’s experiments it actually did worse than TD(λ) with λ < 1 under repeated presentations, because it fits the observed returns rather than the maximum-likelihood model of the process.
True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?
False, empirically we see different curves. The best value of lambda will vary across problems. Short episodes may benefit from smaller values of lambda for example.
True or False: An update rule which is not a non-expansion will not converge without exception. Why?
False. Non-expansions and contractions are the operators whose repeated application is guaranteed to converge, because the distance between successive estimates never grows (and, for a contraction, shrinks toward 0).
Saying that an update rule is not a non-expansion only means there is no guarantee that the updates will approach each other. It is not certain that they won’t approach each other, so convergence may still happen in particular cases.
True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?
False. They are directly related: a contraction strictly shrinks the distance between estimates on every application, while a non-expansion only guarantees that the distance never grows; a non-expansion is the less restrictive version of a contraction. Both are used to prove the convergence of RL algorithms.
What is an eligibility trace?
An eligibility trace is a weight value corresponding to how many time steps have passed since that state “participated in producing an estimate value” (Sutton & Barto, p. 287).
The eligibility trace defines how much a given state estimate is updated.
It can be used to calculate TD updates more efficiently. It is used to generalize TD(0) and TD(1) methods to TD(λ) methods.
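A minimal sketch of one episode of backward-view TD(λ) with accumulating eligibility traces; the environment interface (reset/step under a fixed policy) and names are assumptions for illustration, not a real library API:

```python
import numpy as np

def td_lambda_episode(env, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of backward-view TD(lambda) on a tabular prediction task.

    V is a NumPy array of state-value estimates. The toy env is assumed to
    expose reset() -> state and step() -> (next_state, reward, done), where
    step() follows a fixed policy.
    """
    e = np.zeros_like(V)            # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step()
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]       # TD error for this transition
        e[s] += 1.0                 # accumulate the trace of the visited state
        V += alpha * delta * e      # credit every recently visited state
        e *= gamma * lam            # decay all traces
        s = s_next
    return V
```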
In Sutton 1988, what is “the maximum likelihood estimate of the underlying MDP” (quoted from the paper)?
“In the following section, we prove that in fact it is linear TD(0) that converges to what can be considered the optimal estimates for matching future experience - those consistent with the maximum-likelihood estimate of the underlying Markov process.”
In other words, it is the model estimated directly from the observed data: transition probabilities given by observed transition frequencies (and expected rewards by observed averages). Linear TD(0) converges to the value estimates that would be exactly correct if that estimated model were the true one (the certainty-equivalence estimate).