Final Review pt. 3 Flashcards
True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Why?
It is false due to the stochasticity of the problem. We must pull an arm (visit a state) several times to increase our confidence in the likelihood of each outcome; e.g., a single pull of an arm that pays out only 10% of the time tells us almost nothing about its mean reward.
On the other hand, if the problem were deterministic, one visit would be enough to confidently know the reward for choosing that option in the future.
What are the benefits of an off-policy algorithm?
Off-policy learning allows the agent to evaluate and improve a policy that is different from the policy used for action selection: target policy != behavior policy.
This allows for continuous exploration, learning from demonstration, and parallel learning.
Is it possible that Q-learning doesn’t converge? Is it possible that the point it converges to is not optimal?
Neither is possible. We can prove that Q-learning always converges, and it converges to Q*.
PROVIDED:
1. The learning rates satisfy the usual stochastic-approximation conditions: the sum of the learning rates diverges (goes to infinity), while the sum of the squared learning rates converges (is finite).
2. Every state-action pair is sampled (updated) infinitely often.
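As a standard illustration (not stated on the card): a step size alpha_t = 1/t satisfies both conditions, since sum 1/t diverges while sum 1/t^2 is finite. A constant step size fails the second condition, and a fast-decaying schedule like alpha_t = 1/t^2 fails the first.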
Why can you find an optimal policy despite finding sub-optimal Q-values (e.g. via Q-learning)?
Provided argmax_a Q(s,a) == argmax_a Q*(s,a) for all s, the greedy policy extracted from the learned Q is the same as the optimal policy, even if Q(s,a) != Q*(s,a).
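A minimal sketch (hypothetical numbers, not from the card) showing that the greedy policy depends only on the argmax, not on the exact Q-values:

    import numpy as np

    # Hypothetical 2-state, 2-action example: the learned values differ from Q*
    # everywhere, but the best action in each state is the same.
    q_star = np.array([[1.0, 2.0],   # state 0: action 1 is best
                       [3.0, 0.5]])  # state 1: action 0 is best
    q_hat  = np.array([[0.8, 1.1],   # sub-optimal estimates...
                       [2.0, 0.4]])  # ...but the same argmax in every state

    print(q_star.argmax(axis=1))  # [1 0]  <- optimal greedy policy
    print(q_hat.argmax(axis=1))   # [1 0]  <- identical policy from imperfect Q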
True or False: Policy shaping requires a completely correct oracle to give the RL agent advice. Why?
False. Policy shaping needs a certain degree of confidence, but a completely correct oracle is not necessary (it would help though).
The policy shaping routine depends on the probability that a piece of feedback labels an action correctly (good vs. bad). That probability is governed by a parameter C, ranging from 0 to 1, indicating how accurate/consistent the feedback is. If a perfect oracle were required, the C coefficient would not exist; it would simply be fixed at 1. Some imperfection is fine, but the feedback should ideally be more accurate than not (C > 0.5).
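A small sketch of the commonly cited form of this dependence (an assumed formulation, not quoted from the card): the probability that an action is good, given the net feedback count delta (number of "good" labels minus "bad" labels) and feedback consistency C.

    def prob_action_good(delta: int, C: float) -> float:
        # C = 1.0 would be a perfect oracle (a single label decides);
        # C = 0.5 means the feedback is pure noise; C > 0.5 is still informative.
        return C**delta / (C**delta + (1 - C)**delta)

    print(prob_action_good(delta=3, C=0.8))  # ~0.98
    print(prob_action_good(delta=3, C=0.5))  # 0.5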
True or False: Since Sarsa is an “online” algorithm, it will never learn the optimal policy for an environment. Why?
False. Assuming an appropriate step size is used (the sum of step sizes diverges while the sum of their squares is finite), together with GLIE exploration (every state-action pair is visited infinitely often and the policy becomes greedy in the limit), Sarsa converges to the optimal policy and action-value function.
True or False: Rmax will always find the optimal policy for a properly tuned learning function. Why?
False. R-max does not guarantee the optimal policy; its guarantee is near-optimal behavior (with high probability, within a polynomial number of steps).
True or False: Potential-based shaping will always find an optimal policy faster than an unshaped MDP. Why?
False. Potential-based shaping is not magic: it only redefines the rewards (while preserving the optimal policy) and does not always converge faster. But relative to initializing Q(s,a) to all zeros, a well-chosen potential can give you a head start.
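For reference (the standard potential-based form, not quoted on the card), a shaping reward F is potential-based when it can be written in terms of a potential function Phi over states:

    F(s, a, s') = gamma * Phi(s') - Phi(s)

Shaping rewards of this form are the ones guaranteed not to change which policies are optimal.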
True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP. Why?
False. Reward shaping functions that are not potential-based are not guaranteed to preserve the optimal policy of the original MDP.
One perspective on Sarsa is that it looks a bit like policy iteration. Can you tell us which part of Sarsa is policy evaluation and which part is policy improvement?
Policy evaluation: the TD update of Q(S,A).
Policy improvement: choosing A' with an epsilon-greedy policy, i.e. (mostly) greedily with respect to the current Q. (See the sketch below.)
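A minimal sketch of one Sarsa step with the two roles marked (assumes a tabular Q stored as a NumPy array; the helper names are illustrative, not from the card):

    import numpy as np

    def epsilon_greedy(Q, s, eps, rng):
        # Mostly act greedily w.r.t. the current Q; explore with probability eps.
        if rng.random() < eps:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    def sarsa_step(Q, s, a, r, s_next, alpha, gamma, eps, rng):
        # Policy improvement: pick A' epsilon-greedily from the current Q.
        a_next = epsilon_greedy(Q, s_next, eps, rng)
        # Policy evaluation: TD update of Q(S,A) toward R + gamma * Q(S',A').
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        return a_next

Here Q might be initialized as np.zeros((n_states, n_actions)) and rng as np.random.default_rng().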
In an assignment document, it says Sarsa uses TD to do control. What exactly does it mean? Which part of Sarsa is TD?
Sarsa still uses a TD error to update the Q function, just as Q-learning does. The TD part is the bootstrapped update: Q(S,A) is moved toward the target Rt+1 + gamma * Q(S',A'), so the TD error is delta = Rt+1 + gamma * Q(S',A') - Q(S,A). The "control" part is that this update is used to improve the policy the agent follows (epsilon-greedy on the current Q), not just to evaluate a fixed policy.
Why is Sarsa on policy? How does this compare to an off-policy method like Q-learning by following a random policy to generate data?
Sarsa is on-policy because its update uses the action A' that the (epsilon-greedy) behavior policy actually takes at S'. Q-learning is off-policy because its update bootstraps from max_a Q(S',a), regardless of which action the behavior policy, which could even be a random policy, takes next.
Sarsa: Q(S,A) = Q(S,A) + alpha * [Rt+1 + gamma * Q(S',A') - Q(S,A)]
Q-learning: Q(S,A) = Q(S,A) + alpha * [Rt+1 + gamma * max_a Q(S',a) - Q(S,A)]
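A tiny sketch (illustrative names, not from the card) contrasting the two TD targets; the only difference is which next action gets bootstrapped:

    import numpy as np

    def sarsa_target(Q, r, s_next, a_next, gamma):
        # On-policy: bootstrap from the action the behavior policy actually chose.
        return r + gamma * Q[s_next, a_next]

    def q_learning_target(Q, r, s_next, gamma):
        # Off-policy: bootstrap from the greedy action, whatever the agent does next.
        return r + gamma * np.max(Q[s_next])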
What is Sarsa? What’s the input and output of this function?
Sarsa is an on-policy TD control algorithm. The input to its update is the experienced tuple (S, A, R, S', A'), which is where the name comes from, and the output is the estimated action-value function, i.e. the Q table.