Review Questions Flashcards
True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?
False. Behaving greedily with respect to the optimal value function is an optimal policy, because the optimal values already account for the long-term cumulative reward; picking the action that looks best one step ahead under those values is therefore optimal.
True or False: The optimal policy for any MDP can be found in polynomial time. Why?
True for any finite MDP: we can reduce the MDP to a linear program and solve it in polynomial time (a sketch of the LP follows below).
False for an infinite MDP.
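A minimal sketch of that linear-programming reduction, assuming a made-up toy MDP stored as arrays T[s,a,s'] and R[s,a] and using scipy.optimize.linprog; the variable names are my own, not from the lectures.

```python
# Sketch: solving a small discounted MDP as a linear program (toy arrays, assumed shapes).
# Primal LP: minimize Sum_s V(s)  subject to  V(s) >= R(s,a) + gamma * Sum_s' T(s,a,s') * V(s')  for all s, a.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

# Rewrite each constraint as A_ub @ V <= b_ub, i.e.  gamma * T(s,a,:) @ V - V(s) <= -R(s, a).
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * T[s, a]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
V_star = res.x  # optimal state values; the greedy policy can be read off with one more backup
print(V_star)
```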
True or False: The value of the returned policy is the only way to evaluate a learner. Why?
False. You might also care about computational efficiency, memory efficiency, or sample efficiency (how much experience the learner needs).
True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?
False. V*(s) = max_a Q*(s,a), so the optimal state values follow directly from the optimal Q-values; no transition function is needed.
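A one-line illustration, assuming Q* is stored as a numpy array indexed [state, action] (the numbers are made up):

```python
# Sketch: recovering V* (and the greedy policy) from Q* without a transition model.
import numpy as np

Q = np.array([[1.0, 2.5],    # toy Q*-values: rows are states, columns are actions
              [0.3, 0.1]])
V = Q.max(axis=1)            # V*(s) = max_a Q*(s, a); no T(s,a,s') required
pi = Q.argmax(axis=1)        # the greedy (optimal) policy falls out the same way
print(V, pi)
```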
True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?
False. We can always convert a finite horizon MDP into an infinite horizon MDP by adding an absorbing terminal state with a zero-reward self-loop (folding the time step into the state if the horizon matters).
True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?
True. A fixed policy means the same action selection for each state, so the MDP collapses into a Markov chain with rewards (a Markov reward process), as sketched below.
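A minimal sketch of that collapse, assuming toy arrays T[s,a,s'] and R[s,a] and a fixed deterministic policy; the names P_pi and R_pi are my own notation.

```python
# Sketch: fixing a policy collapses an MDP into a Markov chain with rewards (toy arrays assumed).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
pi = np.array([0, 1, 0])                                          # a fixed deterministic policy

P_pi = T[np.arange(n_states), pi]     # chain transition matrix P_pi[s, s']
R_pi = R[np.arange(n_states), pi]     # reward collected in each state under pi
# Evaluating the policy is now plain linear algebra on the chain: V_pi = (I - gamma * P_pi)^(-1) R_pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```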
True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?
True. Adding 10 to every reward leaves the differential rewards across states unchanged, so the ordering of actions by value, and hence the optimal policy, stays the same.
True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?
False. How much a move influences the outcome depends on the reward structure and the discount factor, not simply on how recent the move is.
True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?
True. The Markov property means the future can be predicted from the present state alone, so the agent has no need to remember how it got there.
Does policy iteration or value iteration converge faster? If so, why?
Policy iteration is guaranteed to converge in at least as few iterations as value iteration, because each sweep performs a full policy evaluation before improving the policy; each iteration is correspondingly more expensive. A sketch of policy iteration follows below.
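A minimal sketch of policy iteration, assuming toy arrays T[s,a,s'] and R[s,a]; the function name and the exact-solve evaluation step are my own choices.

```python
# Sketch: policy iteration on a toy MDP with arrays T[s, a, s'] and R[s, a] (assumed shapes).
# Each sweep does a full, exact policy evaluation (a linear solve) followed by greedy improvement,
# so it usually needs far fewer sweeps than value iteration, at a higher cost per sweep.
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    n_states = T.shape[0]
    pi = np.zeros(n_states, dtype=int)                   # start from an arbitrary policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
        P_pi = T[np.arange(n_states), pi]                # chain induced by pi
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * T @ V                            # Q[s, a]
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V                                 # policy stable, hence optimal
        pi = new_pi
```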
Is it possible that value iteration just never converges on your formulation of MDP with a proper convergence criterion?
No. Value iteration is guaranteed to converge: for gamma < 1 the Bellman backup is a contraction mapping, so with a proper convergence criterion the iterates always settle.
What is the meaning of the Bellman equations? What is it saying? What is your way of visualizing this equation in your head?
The long-term value of a state is the immediate reward plus the discounted, expected value of the states that can follow. A way to visualize it: each state's value is defined recursively, one step ahead, in terms of its successors' values.
What are the Bellman equations? (Try to write it down without looking at notes, since this is the single most important equation to know for this class.)
U(s) = R(s) + gamma * max_a [ Sum_s' T(s,a,s') * U(s') ]
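A direct numpy transcription of that backup, using a made-up two-state, two-action toy model with a state-based reward R(s):

```python
# Sketch: the Bellman optimality backup written directly in numpy.
# The arrays are toy inputs; R is a state-based reward R(s), matching the equation above.
import numpy as np

T = np.array([[[0.8, 0.2], [0.1, 0.9]],   # T[s, a, s']: two states, two actions
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([0.0, 1.0])                  # R[s]
U = np.zeros(2)                           # current value estimates
gamma = 0.9

# U(s) <- R(s) + gamma * max_a Sum_s' T(s,a,s') * U(s')
U_new = R + gamma * (T @ U).max(axis=1)
print(U_new)
```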
What is a policy?
A mapping from states to actions (or to a distribution over actions): it tells the agent which action to choose in each state.
What is Value Iteration?
Value iteration is an iterative algorithm that computes the optimal value function by repeatedly applying the Bellman backup to a value function, from which we then read off a greedy policy. This contrasts with policy iteration, where we directly evaluate and update the policy itself. A sketch follows below.
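A minimal sketch of the loop, under the same toy assumptions as the backup sketch above (T[s,a,s'], state reward R[s]); the function name and stopping tolerance are my own choices, not the course's.

```python
# Sketch: value iteration = repeated Bellman backups until the values stop changing.
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    U = np.zeros(T.shape[0])
    while True:
        U_new = R + gamma * (T @ U).max(axis=1)        # one Bellman backup over all states
        if np.max(np.abs(U_new - U)) < tol:            # converged in max-norm
            # Read off the greedy policy (R(s) does not depend on a, so it can be dropped here).
            return U_new, (T @ U_new).argmax(axis=1)
        U = U_new
```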
What is the Markov property?
The Markov property is a memorylessness property: the future depends only on the present state, not on the history that led to it.
Is it possible to have multiple terminal states in an MDP?
Yes. Nothing prevents an MDP from having multiple terminal states, although they can always be abstracted into a single absorbing terminal state.
What is an MDP? What are the components of an MDP?
An MDP (Markov decision process) is a model of sequential decision making under uncertainty. Its components are: states, actions, a model (transition function and reward function), and a discount factor; a policy is the solution to the MDP.
True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?
False. A non-expansion is the more general case of a contraction mapping: a contraction requires |F(a) - F(b)| <= gamma * |a - b| with gamma strictly less than 1, while a non-expansion allows gamma = 1, so the distance may shrink or stay the same. The two concepts are therefore closely related.
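A small numerical check, on a made-up random MDP, that the Bellman optimality backup contracts max-norm distances by a factor of gamma (a mere non-expansion would only guarantee the ratio stays at or below 1):

```python
# Sketch: verify the gamma-contraction property of the Bellman backup on a toy random MDP.
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.normal(size=n_states)                                     # R[s]

def backup(U):
    return R + gamma * (T @ U).max(axis=1)

U1, U2 = rng.normal(size=n_states), rng.normal(size=n_states)
before = np.max(np.abs(U1 - U2))
after = np.max(np.abs(backup(U1) - backup(U2)))
print(after / before, "<=", gamma)   # the ratio never exceeds gamma
```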
True or False: An update rule which is not a non-expansion will not converge without exception. Why?
In general, True; but strictly speaking, False.
Non-expansion is a sufficient condition for convergence, not a necessary one: the lack of it does not guarantee a lack of convergence.
Coco-Q is a counterexample.
True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?
True. See Figure 4 in Sutton's 1988 TD(λ) paper.
True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?
False.
TD(1) is equivalent to Monte Carlo learning: it propagates information all the way down the chain within a single presentation, so it does not propagate slowly. It is TD(0) that moves information only one step at a time and therefore benefits most from repeated presentations.
TD(1)'s updates also have high variance, which is why it needs a lot of data to do well. A TD(λ) sketch follows below.
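A minimal sketch of tabular TD(λ) with accumulating eligibility traces; the episode format (a list of (state, reward, next_state, done) tuples) and the function name are assumptions of mine, not from the lectures. Setting lam=0 gives one-step TD(0); lam=1 credits every earlier state in the chain and behaves like Monte Carlo.

```python
# Sketch: tabular TD(lambda) with accumulating eligibility traces (episode format assumed).
import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    e = np.zeros_like(V)                              # eligibility trace per state
    for s, r, s_next, done in episode:
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]                         # one-step TD error
        e[s] += 1.0                                   # accumulating trace
        V += alpha * delta * e                        # every eligible state shares the error
        e *= gamma * lam                              # traces decay between steps
    return V

# e.g. V = td_lambda_episode(np.zeros(5), [(0, 0.0, 1, False), (1, 1.0, 1, True)])
```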
True or False: Given a model (T,R) we can also sample in, we should first try TD learning. Why?
False.
While TD learning was revolutionary and is used in many algorithms today, value or policy iteration has better convergence guarantees when the model (T, R) is known, so planning with the model should be tried first.
True or False: Offline algorithms are generally superior to online algorithms. Why?
False.
Online algorithms update values as soon as new information is presented, thereby making the most efficient use of experience; neither class is superior in every setting.