Exam Flashcards
What are the key similarities and differences between DP and TD methods?
Similarities:
- Both are used to find/estimate V(s), Q(s,a) and/or optimal policies
- Both use bootstrapping to iteratively update the value function and/or policy
Differences:
- DP requires a model of the MDP (the transition probabilities P and rewards R), whereas TD is model-free and learns from sampled traces
What are the analogous TD methods to:
1) Policy evaluation
2) Policy iteration
3) Value iteration
1) TD learning
2) SARSA
3) Q-learning
For an MDP, what are the:
1) Bellman equation for V(s)
2) Bellman optimality equation for V(s)?
1) V(s) = Sum_a pi(s,a) Sum_s' P(s,a,s') (r(s,a,s') + gamma V(s'))
2) V(s) = max_a Sum_s' P(s,a,s') (r(s,a,s') + gamma V(s'))
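As a concrete illustration, here is a minimal iterative policy evaluation sketch built on the Bellman expectation equation above. The two-state MDP (states, actions, P, r) and the uniform policy pi are made-up placeholders, not from the card:

```python
# Sketch: iterative policy evaluation via the Bellman expectation equation.
# The MDP below (states, actions, P, r) and the uniform policy pi are hypothetical.
gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]

# P[s][a] is a list of (next_state, probability); r[(s, a, s_next)] is the reward.
P = {"s0": {"a0": [("s0", 0.5), ("s1", 0.5)], "a1": [("s1", 1.0)]},
     "s1": {"a0": [("s0", 1.0)],              "a1": [("s1", 1.0)]}}
r = {("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0, ("s0", "a1", "s1"): 2.0,
     ("s1", "a0", "s0"): 0.0, ("s1", "a1", "s1"): 0.5}
pi = {s: {a: 1.0 / len(actions) for a in actions} for s in states}  # uniform random policy

V = {s: 0.0 for s in states}
for _ in range(100):  # repeated sweeps converge to V_pi
    V = {s: sum(pi[s][a] * sum(p * (r[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions)
         for s in states}
```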
What criteria does a problem need to have to be solvable by DP?
1) Optimal substructure
The solution is a combination of optimal solutions to subproblems
2) Overlapping subproblems
A recursive algorithm solving the problem should solve “the same” problem over and over again
What is the policy improvement theorem?
Let: pi, pi’ denote deterministic policies.
If Q_pi(s, pi’(s)) >= V_pi(s) for all s then the policy pi’ is as good or better than the old policy.
V_pi’(s) >= V_pi(s) for all s.
What is value iteration?
Policy iteration with only one step of policy evaluation.
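For contrast with policy evaluation, a minimal value iteration sketch using the Bellman optimality backup on a hypothetical 2-state, 2-action MDP (the arrays P, R and their numbers are made up for illustration):

```python
import numpy as np

# Sketch: value iteration = repeated Bellman optimality backups.
# P[s, a, s'] and R[s, a] describe a hypothetical 2-state, 2-action MDP.
gamma = 0.9
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])   # transition probabilities P[s, a, s']
R = np.array([[0.5, 2.0],
              [0.0, 0.5]])                 # expected immediate reward R[s, a]

V = np.zeros(P.shape[0])
for _ in range(100):
    Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V = Q.max(axis=1)         # Bellman optimality backup: V(s) = max_a Q(s, a)

greedy_policy = Q.argmax(axis=1)  # extract the greedy policy once V has converged
```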
What is the idea behind Monte Carlo methods?
Use the average of sampled returns to estimate true values
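A minimal sketch of this idea for estimating a single value, where the return sampler is a made-up stand-in for running episodes in an environment:

```python
import random

# Sketch: a Monte Carlo estimate is just the average of sampled returns.
def sample_return():
    # Hypothetical stand-in for running one episode and computing its return;
    # the true expected return here is 1.0.
    return 1.0 + random.gauss(0.0, 0.5)

returns = [sample_return() for _ in range(1000)]
V_estimate = sum(returns) / len(returns)   # converges to the true value as samples grow
```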
When can we use TD, but not MC?
TD:
1) TD can learn from incomplete episodes
2) TD can learn online after each step
3) TD works in non-terminating environments
MC:
1) MC needs a complete episode
2) MC has to wait until the end of an episode to learn
3) MC does not work in non-terminating environments
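To make the contrast concrete, a minimal sketch of the TD(0) update, which can be applied online after every single step (V, alpha, gamma and the transition variables are hypothetical placeholders):

```python
# Sketch: the TD(0) update, applied online after each observed transition (s, r, s_next).
# V is a dict of value estimates; alpha is the step size, gamma the discount factor.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Bootstraps on V[s_next], so there is no need to wait for the episode to terminate.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"s0": 0.0, "s1": 0.0}
td0_update(V, "s0", 1.0, "s1")   # update V["s0"] from one sampled step
```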
Compare the properties of TD and MC
TD:
1) Has low variance, some bias
2) Good convergence, but not necessarily for function approximation
3) Usually works best in Markov environments (it exploits the Markov property)
4) More sensitive to the initial value estimates
MC:
1) Has high variance, no bias
2) Good convergence even with function approximation
3) Usually works best in non-Markov environments
4) Easy to understand and use
Compare on-policy and off-policy methods
On-policy methods use a single policy both to select actions during sample collection and as the policy being evaluated and improved.
Off-policy methods use separate behaviour and target policies: the behaviour policy selects the actions during sample collection, while the target policy is the one we want to evaluate and improve.
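The distinction shows up directly in the TD targets of SARSA (on-policy) and Q-learning (off-policy); a minimal sketch, where Q, alpha, gamma and the transition variables are hypothetical placeholders:

```python
# Sketch: on-policy vs off-policy TD targets for one transition (s, a, r, s_next).
# Q is a dict mapping (state, action) -> value estimate.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: the target uses a_next, the action the behaviour policy actually took.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy (target-policy) action in s_next,
    # regardless of which action the behaviour policy takes next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```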
What is the GLIE (Greedy in the Limit with Infinite Exploration) condition?
Let N_k(s,a) be the number of times we have visited the state-action pair (s,a) after k iterations. Then the GLIE conditions are:
1) lim(k->inf) N_k(s,a) = inf for all (s,a), i.e. every state-action pair is visited infinitely often
2) lim(k->inf) pi_k(s,a) = the greedy policy w.r.t. Q_k
Name one choice of e_k that makes e-greedy GLIE (k denotes the number of iterations so far).
e_k = 1/k
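A minimal sketch of e-greedy action selection with e_k = 1/k (Q, state, actions are hypothetical placeholders):

```python
import random

# Sketch: epsilon-greedy selection with epsilon_k = 1/k, which satisfies GLIE:
# epsilon decays to 0 (greedy in the limit), but slowly enough (Sum 1/k diverges)
# that exploration does not die out too fast.
def epsilon_greedy(Q, state, actions, k):
    epsilon = 1.0 / k
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```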
What is the update for Monte Carlo model-free control?
Q(s,a) = Average(Returns(s,a))
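In practice this average is usually computed incrementally; a minimal sketch with hypothetical dicts Q (value estimates) and N (visit counts), where G is the return observed for (s, a):

```python
# Sketch: incremental form of Q(s,a) = average(Returns(s,a)).
def mc_control_update(Q, N, s, a, G):
    N[(s, a)] = N.get((s, a), 0) + 1
    # Running-mean update: equivalent to averaging all returns seen for (s, a).
    Q[(s, a)] = Q.get((s, a), 0.0) + (G - Q.get((s, a), 0.0)) / N[(s, a)]
```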
Under what conditions does SARSA converge?
The policy is GLIE and the step sizes alpha_t form a Robbins-Monro sequence
What is a Robbins-Monro sequence?
A sequence a_t is a Robbins-Monro sequence if:
1) Sum a_t = inf
2) Sum a_t^2 < inf
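Example (for illustration, not from the card): a_t = 1/t is a Robbins-Monro sequence, since Sum 1/t diverges (harmonic series) while Sum 1/t^2 converges; a constant step size a_t = c > 0 is not, since Sum c^2 diverges.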