TD Learning Flashcards
What are the three main categories of RL algorithms?
- Model based (most information required, but generally easy)
- Value function based / model free
- Policy search (most general/simple, but generally difficult)
Describe model based learning
Model based learning attempts to learn a model of the MDP (its transitions T and rewards R) from experience and then compute Q* and the policy pi from that model using an MDP solver (such as VI/PI).
(s,a,r)* -> model learner <-> T/R -> MDP solver -> Q* -> argmax -> policy
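As a rough illustration of this pipeline (not from the cards), the sketch below assumes a tabular MDP and experience tuples (s, a, r, s'): counts give estimates of T and R, and value iteration on the learned model recovers Q* and a greedy policy. All function names, data formats, and parameters here are assumptions.

```python
# A minimal tabular sketch of the model-based pipeline (illustrative only).
from collections import defaultdict

def learn_model(transitions):
    """Estimate T(s,a,s') and R(s,a) from a list of (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
    reward_sums = defaultdict(float)                 # (s, a) -> running reward sum
    visits = defaultdict(int)                        # (s, a) -> total visits
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    T = {sa: {sn: c / visits[sa] for sn, c in nexts.items()}
         for sa, nexts in counts.items()}            # empirical transition probabilities
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}  # average observed reward
    return T, R

def value_iteration(T, R, gamma=0.9, iters=100):
    """Solve the learned MDP for Q*; the greedy (argmax) policy follows from Q*."""
    states = {sa[0] for sa in T} | {sn for nexts in T.values() for sn in nexts}
    Q = defaultdict(float)
    for _ in range(iters):
        # V(s) is the best known action value at s (0 for states with no known actions).
        V = {s: max((Q[(s, a)] for (s2, a) in T if s2 == s), default=0.0)
             for s in states}
        Q = defaultdict(float,
                        {sa: R[sa] + gamma * sum(p * V[sn] for sn, p in T[sa].items())
                         for sa in T})
    policy = {s: max([a for (s2, a) in T if s2 == s], key=lambda a: Q[(s, a)])
              for s in states if any(s2 == s for (s2, a) in T)}
    return Q, policy
```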
Describe value function (model free) based learning
Attempts to learn the value function (Q) directly from observed states, actions, and rewards, without first building a model of T or R.
(s,a,r)* -> value update <-> Q -> argmax -> policy
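One common instance of this family is a Q-learning style update. The sketch below is illustrative only; the (s, a, r, s_next) data format, the function name, and the learning-rate and discount parameters are assumptions.

```python
# Model-free value-function learning: update Q directly from samples.
from collections import defaultdict

def q_value_update(transitions, alpha=0.1, gamma=0.9):
    """Learn Q from (s, a, r, s_next) samples; no model of T or R is ever built."""
    Q = defaultdict(float)
    actions = {a for _, a, _, _ in transitions}
    for s, a, r, s_next in transitions:
        # One-step bootstrapped target: reward plus discounted best next action value.
        target = r + gamma * max((Q[(s_next, b)] for b in actions), default=0.0)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    # The policy is recovered from Q by argmax over actions.
    policy = {s: max(actions, key=lambda b: Q[(s, b)]) for s, _, _, _ in transitions}
    return Q, policy
```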
Describe policy search learning
Attempts to find the optimal policy by updating the policy directly, without learning a value function or a model along the way.
(s,a,r)* -> policy update <-> policy
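A minimal sketch of the idea, under the assumption of a parameterized policy and a hypothetical `evaluate` callable that runs the policy and returns its average episode return; only the policy parameters are updated, never a model or a value function.

```python
# Policy search via simple random hill climbing on policy parameters (illustrative).
import random

def hill_climb_policy(evaluate, n_params, iters=200, step=0.1):
    theta = [0.0] * n_params                      # current policy parameters
    best_return = evaluate(theta)
    for _ in range(iters):
        # Propose a randomly perturbed policy and keep it only if it scores better.
        candidate = [w + random.gauss(0.0, step) for w in theta]
        candidate_return = evaluate(candidate)
        if candidate_return > best_return:
            theta, best_return = candidate, candidate_return
    return theta
```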
Under what criteria does the value function estimate V_t(s) converge to the true value V(s) as t -> inf? In general, what learning rates alpha_t satisfy these criteria?
- Sum(alpha_t) -> inf
- Sum(alpha_t^2) < inf
alpha_t = 1/t^n where n in (1/2, 1]
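In standard notation, the step-size conditions above, together with the family of learning rates that satisfies them, are:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty, \qquad
\text{e.g. } \alpha_t = \frac{1}{t^{n}}, \; n \in \left(\tfrac{1}{2}, 1\right]
```

For example, alpha_t = 1/t works: the harmonic series sum 1/t diverges while sum 1/t^2 converges. A constant learning rate fails the second condition.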
What is the TD(lambda) update rule?
Episode T:
  For all s: e(s) = 0 at the start of the episode, and V_T(s) = V_{T-1}(s)
  After each step s_{t-1} -> r_t -> s_t (step t of the episode):
    e(s_{t-1}) = e(s_{t-1}) + 1
    For all s:
      V_T(s) = V_T(s) + alpha_T * (r_t + gamma * V_{T-1}(s_t) - V_{T-1}(s_{t-1})) * e(s)
      e(s) = lambda * gamma * e(s)
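The update rule above translates fairly directly into code. The sketch below processes a single episode with accumulating eligibility traces; the (s_{t-1}, r_t, s_t) step format, the dictionary representation of V, and a constant alpha within the episode are assumptions for illustration.

```python
# TD(lambda) value update for one episode, with accumulating eligibility traces.
from collections import defaultdict

def td_lambda_episode(V_prev, episode, alpha, gamma, lam):
    """Return V_T given V_{T-1} (V_prev) and one episode of (s_prev, r, s) steps."""
    V = dict(V_prev)                 # V_T starts as a copy of V_{T-1}
    e = defaultdict(float)           # eligibility traces: e(s) = 0 at episode start
    for s_prev, r, s in episode:
        e[s_prev] += 1.0             # bump the trace of the state just left
        # One-step TD error, computed from the previous episode's value function.
        delta = r + gamma * V_prev.get(s, 0.0) - V_prev.get(s_prev, 0.0)
        for state in list(e):        # states not in e have zero trace, so skip them
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= lam * gamma  # decay every trace after the update
    return V
```

For example, with V_prev = {} and a single step ('A', 1.0, 'B') at alpha = 0.5, the TD error is 1.0 and V('A') moves from 0 to 0.5.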
What are some issues with TD(1)?
TD(1) does not make efficient use of all the available data: if it does not see certain paths, it can get stuck with bad estimates for a long time. It is data-inefficient and its estimates have high variance.
Which TD(lambda) version is equivalent to maximum likelihood? Under which criteria?
TD(0). If a finite data set is presented repeatedly (infinitely often), TD(0) converges to the maximum likelihood estimate.
Empirically, how does TD(lambda) tend to behave when varying lambda?
Error typically decreases as lambda increases from 0, reaches a minimum at an intermediate value of lambda, and then increases again as lambda approaches 1.
What are some issues with TD(0)?
We get a biased estimate because the one-step target bootstraps off the current (possibly inaccurate) value estimate.
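Setting lambda = 0 in the update rule above leaves only the most recent state with a nonzero trace, giving the one-step TD(0) update whose bootstrapped target is the source of this bias:

```latex
V(s_{t-1}) \leftarrow V(s_{t-1}) + \alpha_t \bigl( r_t + \gamma V(s_t) - V(s_{t-1}) \bigr)
```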