Reinforcement Learning Flashcards
Q: In a game, if all players are aware of their opponents’ strategies but no player can increase their own reward by changing only their own strategy, the game is in a state known as a “Nash equilibrium”
A: True. If no player can gain more reward by unilaterally changing their individual strategy, that is a Nash equilibrium. (Video Lectures – Lesson 12)
Q: The “Pavlov” strategy is sub-game perfect with respect to the Prisoner’s Dilemma
A: True. Starting from any combination of previous moves (i.e., in any subgame), two players who both employ the Pavlov strategy will quickly return to mutual cooperation and stay there, so the strategy remains in equilibrium in every subgame. (Video Lectures – Lesson 13)
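A hedged sketch of why this works (my own illustration, not from the lectures): Pavlov cooperates after a round in which both players made the same move and defects after they disagreed, so any starting history funnels both players back into mutual cooperation.

```python
# Pavlov ("win-stay, lose-shift"): cooperate if the two previous moves
# agreed, defect if they disagreed.
def pavlov(my_last, their_last):
    return 'C' if my_last == their_last else 'D'

# From any starting pair of moves, two Pavlov players reach mutual
# cooperation within two rounds and then stay there.
for start in [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]:
    a, b = start
    history = [start]
    for _ in range(3):
        a, b = pavlov(a, b), pavlov(b, a)
        history.append((a, b))
    print(start, '->', history)
```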
Q: The Credit Assignment Problem refers to the state that gives the most reward in an MDP.
A: False. The Credit Assignment Problem is the retrospective question, given the reward at the end of a trajectory, of which states/actions along that trajectory were most responsible for the ultimate result.
Q: Model-based reinforcement learning is the use of supervised learning models such as neural networks to solve large state space RL problems.
A: False. Model-based reinforcement learning means the learner iteratively builds a model of the environment (its transition function, reward function, and state space) and chooses actions by planning against the current model.
Q: Value iteration is one of the most important model-free reinforcement learning methods.
A: False. Value iteration is model-based: it requires the transition and reward functions in order to perform its updates (see the sketch below).
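A minimal value-iteration sketch (my own illustration, not from the lectures). Note that it needs the full model as input, which is exactly why it is model-based; the array shapes are assumptions for the example.

```python
import numpy as np

# Assumes T[s, a, s'] holds transition probabilities and R[s, a] expected rewards.
def value_iteration(T, R, gamma=0.9, tol=1e-6):
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # values and a greedy policy
        V = V_new
```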
Q: Off-policy agents learn the value of a policy different than the policy they are acting under.
A: True. Examples include Q-learning and DQN: the behavior policy (e.g., ε-greedy, possibly drawing on previously stored experience) differs from the target policy (greedy with respect to the current Q-values) whose value is being learned; see the update sketch below.
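A minimal tabular Q-learning step (illustrative sketch; the `env_step` interface and the `Q` table are assumptions, e.g. `Q = collections.defaultdict(float)`). The behavior policy is ε-greedy, but the update bootstraps from max over next actions, i.e. the greedy target policy, which is what makes it off-policy.

```python
import random

def q_learning_step(Q, s, actions, env_step, alpha=0.1, gamma=0.99, eps=0.1):
    # behavior policy: epsilon-greedy action selection
    if random.random() < eps:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q[(s, act)])
    r, s_next, done = env_step(s, a)  # assumed environment interface
    # target policy: greedy value of the next state
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return s_next, done
```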
Q: Only non-linear function approximation has been proven to converge when used with the right hyper-parameters.
A: False. It is linear function approximation that has been proven to converge (under suitable conditions); non-linear function approximation has not.
Q: POMDP are partially-observable because they are missing the MDP. An example of this is model-free reinforcement learning problems.
A: False. POMDPs are partially observable because states are not mapped one-to-one with observations; that is, the observed environment does not uniquely determine the underlying state.
Q: Grim Trigger strategy means a player will cooperate for the entire game regardless of other player’s action.
A: False. Grim Trigger cooperates until the other player defects, and then defects forever afterward.
Q: Bayesian RL is a model-based approach that relies heavily on statistical methods such as Bayes’ rule.
A: False, Bayesian RL is not necessarily model-based.
Q: DEC-POMDPs are a modeling framework for cooperative problems under uncertainty.
A: True. In a DEC-POMDP all agents share a single joint reward function, so the framework models decentralized, cooperative decision-making under uncertainty.
Q: Model-based reinforcement learning agents can solve environments with continuous state variables because they are able to learn the transition and reward function.
A: False. Learning the transition and reward functions does not by itself handle continuous state variables; that requires function approximation, which can be combined with either model-based or model-free methods.
Q: TD(1) is equivalent to a K-Step estimator with K = 1.
A: False. TD(0) is equivalent to a K-step estimator with K = 1; TD(1) is equivalent to a K-step estimator with K = ∞ (the full Monte Carlo return). See the returns sketched below.
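For reference, a sketch of the standard definitions involved (my own notation, not quoted from the lectures): the k-step return and the λ-return that TD(λ) averages over.

```latex
% k-step return: take k rewards, then bootstrap from the value estimate
G_t^{(k)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} + \gamma^{k} V(s_{t+k})

% lambda-return averaged by TD(lambda): lambda = 0 keeps only k = 1 (TD(0)),
% lambda = 1 keeps only the full Monte Carlo return (k -> infinity), i.e. TD(1)
G_t^{\lambda} = (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}
```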
Q: Potential based reward shaping is used to indirectly shape the optimal policy by modifying rewards.
A: False. Potential-based reward shaping is designed precisely so that it can speed up learning without changing the optimal policy. In Q-learning it is equivalent to starting from a good initialization of the Q function (see the shaping term below).
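A sketch of the standard potential-based shaping term (my own summary of the usual result, with Φ a potential function over states):

```latex
% Potential-based shaping adds, to the environment reward, a term driven by
% a potential function \Phi over states:
F(s, s') = \gamma \, \Phi(s') - \Phi(s)
% Standard result: the optimal Q-values of the shaped problem satisfy
% Q'_\ast(s, a) = Q_\ast(s, a) - \Phi(s); the offset does not depend on a,
% so the greedy (optimal) policy is unchanged.
```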
Q: An MDP reward function can be scaled, shifted by a constant, or augmented with non-linear potential-based rewards without changing the optimal policy.
A: False. The scale factor must be positive for it not to change the optimal policy.
Q: When exploring deterministic MDPs using the mistake bounded optimal algorithm, we assume any unknown state-action pair has a reward self loop of Rmax (equal to the largest reward seen so far) to ensure that every state-action pair is eventually explored.
A: False. The Rmax self-loop encourages exploration, but while exploring the agent can become stuck in a strongly connected component of the MDP graph that does not include all of the states (i.e., some parts of the MDP may not be reachable given prior actions), so not every state-action pair is guaranteed to be explored.
Q: Policy Search continuously updates the policy directly via a value update. This update is based on the reward which you receive.
A: False. Policy search updates the policy directly via policy updates (e.g., adjusting the policy’s parameters), not via value updates.
Q: Following a plan and constantly checking if the action was successful (and changing the plan if it was not) is called conditional planning.
A: False. That is dynamic re-planning; a conditional plan instead builds the contingencies (‘if/else’ branches) into the plan itself.
Q: V(s) can be expressed from Q(s,a) and vice versa
A: True
V(s) = max_a Q(s,a)
Q(s,a) = R(s,a) + γ ∑_{s'} T(s,a,s') V(s')
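A small illustrative sketch of these two conversions with tabular arrays (my own example; the shapes T: [S, A, S'] and R: [S, A] are assumptions):

```python
import numpy as np

def q_from_v(V, T, R, gamma=0.9):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    return R + gamma * T @ V

def v_from_q(Q):
    # V(s) = max_a Q(s, a)
    return Q.max(axis=1)
```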
Q: With potential-based learning, the agent receives higher rewards when it’s closer to the positive terminal state.
A: False. In potential-based shaping, the environment designer adds a bonus based on the change in a potential function over states, guiding the agent toward the desired terminal state: the agent gains γΦ(s') for entering a state and is charged Φ(s) when it leaves, so simply being near the goal does not by itself yield higher reward.
Q: Current eligibility traces of past events are used with current TD errors to compute updates for TD(λ) backward view.
A: True. In the backward view, the current TD error is distributed to previously visited states in proportion to their eligibility traces; the forward view instead uses future rewards and states, and the two views yield equivalent (offline) updates. See the sketch below.
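A minimal backward-view TD(λ) sketch for state-value prediction (my own illustration, not from the lectures). `V` is assumed to be a dict-like value table (e.g. `collections.defaultdict(float)`) and `episode` yields `(s, r, s_next, done)` transitions.

```python
from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.99, lam=0.9):
    e = defaultdict(float)                # eligibility traces
    for s, r, s_next, done in episode:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]             # current TD error
        e[s] += 1.0                       # accumulating trace for s
        for state in list(e):
            # every previously visited state is updated in proportion
            # to its decaying eligibility
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam
    return V
```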
Q: TD(1) gives a maximum likelihood estimate.
A: False. TD(0) gives the maximum-likelihood estimate; TD(1) gives the Monte Carlo estimate, which is the minimum mean-squared-error estimate on the observed returns.
Q: Temporal difference learning falls into the category of model-based learning.
A: False. TD is a class of model-free, value-based RL techniques: it builds up value estimates incrementally from experience, without learning a transition or reward model.
Q: Experience / sample complexity relates to how much data is needed to converge on the answer.
A: True, This is one of the criteria for evaluating an agent.
Q: Q-Learning is on-policy because it might not use the selected action a_t to update the Q-values.
A: False. That property is exactly why Q-learning is off-policy: its update target uses max_a Q(s', a) rather than the action the behavior policy actually takes next.
Q: potential-based shaping is equivalent to modifying initial Q-values. That is, the Q-values are the same.
A: False. The greedy policy (and the agent’s behavior) will be the same, but the learned Q-values themselves differ by the potential Φ(s), so they are not identical; see the sketch below.
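A sketch of the correspondence as I understand the standard result (my own summary; Φ is the shaping potential):

```latex
% If agent A runs Q-learning with shaping reward F(s, s') = \gamma\Phi(s') - \Phi(s)
% and agent B runs plain Q-learning but starts from Q-values shifted up by \Phi,
% then (given the same experience and learning rates) their tables stay related by
Q_B(s, a) = Q_A(s, a) + \Phi(s)
% The offset \Phi(s) is independent of a, so \arg\max_a agrees everywhere:
% same greedy policy, different raw Q-values.
```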
Q: Markov games are a type of MDP.
A: False. MDPs are a subset of Markov games; a Markov game with a single agent is just an MDP.