TD Learning Flashcards
What are the three main categories of RL algorithms?
- Model based (most information required, but generally easy)
- Value function based / model free
- Policy search (most general/simple, but generally difficult)
Describe model based learning
Model based learning attempts to learn a model of the MDP (its transitions T and rewards R) from experience and then compute Q* and the policy pi from that model using an MDP solver (such as VI/PI).
(s,a,r)* -> model learner <-> T/R -> MDP solver -> Q* -> argmax -> policy
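As a rough illustration of this pipeline (not from the cards), the sketch below assumes a tabular MDP and experience tuples (s, a, r, s'): counts give estimates of T and R, and value iteration on the learned model recovers Q* and a greedy policy. All function names, data formats, and parameters here are assumptions.

```python
# A minimal tabular sketch of the model-based pipeline (illustrative only).
from collections import defaultdict

def learn_model(transitions):
    """Estimate T(s,a,s') and R(s,a) from a list of (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
    reward_sums = defaultdict(float)                 # (s, a) -> running reward sum
    visits = defaultdict(int)                        # (s, a) -> total visits
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    T = {sa: {sn: c / visits[sa] for sn, c in nexts.items()}
         for sa, nexts in counts.items()}            # empirical transition probabilities
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}  # average observed reward
    return T, R

def value_iteration(T, R, gamma=0.9, iters=100):
    """Solve the learned MDP for Q*; the greedy (argmax) policy follows from Q*."""
    states = {sa[0] for sa in T} | {sn for nexts in T.values() for sn in nexts}
    Q = defaultdict(float)
    for _ in range(iters):
        # V(s) is the best known action value at s (0 for states with no known actions).
        V = {s: max((Q[(s, a)] for (s2, a) in T if s2 == s), default=0.0)
             for s in states}
        Q = defaultdict(float,
                        {sa: R[sa] + gamma * sum(p * V[sn] for sn, p in T[sa].items())
                         for sa in T})
    policy = {s: max([a for (s2, a) in T if s2 == s], key=lambda a: Q[(s, a)])
              for s in states if any(s2 == s for (s2, a) in T)}
    return Q, policy
```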
Describe value function (model free) based learning
Attempts to learn the value function (Q) directly from observed states, actions, and rewards, without first building a model of T or R.
(s,a,r)* -> value update <-> Q -> argmax -> policy
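One common instance of this family is a Q-learning style update. The sketch below is illustrative only; the (s, a, r, s_next) data format, the function name, and the learning-rate and discount parameters are assumptions.

```python
# Model-free value-function learning: update Q directly from samples.
from collections import defaultdict

def q_value_update(transitions, alpha=0.1, gamma=0.9):
    """Learn Q from (s, a, r, s_next) samples; no model of T or R is ever built."""
    Q = defaultdict(float)
    actions = {a for _, a, _, _ in transitions}
    for s, a, r, s_next in transitions:
        # One-step bootstrapped target: reward plus discounted best next action value.
        target = r + gamma * max((Q[(s_next, b)] for b in actions), default=0.0)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    # The policy is recovered from Q by argmax over actions.
    policy = {s: max(actions, key=lambda b: Q[(s, b)]) for s, _, _, _ in transitions}
    return Q, policy
```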
Describe policy search learning
Attempts to find the optimal policy by updating the policy directly, without learning a value function or a model along the way.
(s,a,r)* -> policy update <-> policy
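A minimal sketch of the idea, under the assumption of a parameterized policy and a hypothetical `evaluate` callable that runs the policy and returns its average episode return; only the policy parameters are updated, never a model or a value function.

```python
# Policy search via simple random hill climbing on policy parameters (illustrative).
import random

def hill_climb_policy(evaluate, n_params, iters=200, step=0.1):
    theta = [0.0] * n_params                      # current policy parameters
    best_return = evaluate(theta)
    for _ in range(iters):
        # Propose a randomly perturbed policy and keep it only if it scores better.
        candidate = [w + random.gauss(0.0, step) for w in theta]
        candidate_return = evaluate(candidate)
        if candidate_return > best_return:
            theta, best_return = candidate, candidate_return
    return theta
```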
Under what criteria does the value function estimate V_t(s) converge to the true value V(s) as t -> inf? In general, what learning rates alpha_t satisfy these criteria?
- Sum(alpha_t) -> inf
- Sum(alpha_t^2) < inf
alpha_t = 1/t^n where n in (1/2, 1]
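In standard notation, the step-size conditions above, together with the family of learning rates that satisfies them, are:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty, \qquad
\text{e.g. } \alpha_t = \frac{1}{t^{n}}, \; n \in \left(\tfrac{1}{2}, 1\right]
```

For example, alpha_t = 1/t works: the harmonic series sum 1/t diverges while sum 1/t^2 converges. A constant learning rate fails the second condition.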
What is the TD(lambda) update rule?
Episode T:
  For all s: e(s) = 0 at the start of the episode, and V_T(s) = V_{T-1}(s)
  After each step s_{t-1} -> r_t -> s_t (step t of the episode):
    e(s_{t-1}) = e(s_{t-1}) + 1
    For all s:
      V_T(s) = V_T(s) + alpha_T * (r_t + gamma * V_{T-1}(s_t) - V_{T-1}(s_{t-1})) * e(s)
      e(s) = lambda * gamma * e(s)
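The update rule above translates fairly directly into code. The sketch below processes a single episode with accumulating eligibility traces; the (s_{t-1}, r_t, s_t) step format, the dictionary representation of V, and a constant alpha within the episode are assumptions for illustration.

```python
# TD(lambda) value update for one episode, with accumulating eligibility traces.
from collections import defaultdict

def td_lambda_episode(V_prev, episode, alpha, gamma, lam):
    """Return V_T given V_{T-1} (V_prev) and one episode of (s_prev, r, s) steps."""
    V = dict(V_prev)                 # V_T starts as a copy of V_{T-1}
    e = defaultdict(float)           # eligibility traces: e(s) = 0 at episode start
    for s_prev, r, s in episode:
        e[s_prev] += 1.0             # bump the trace of the state just left
        # One-step TD error, computed from the previous episode's value function.
        delta = r + gamma * V_prev.get(s, 0.0) - V_prev.get(s_prev, 0.0)
        for state in list(e):        # states not in e have zero trace, so skip them
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= lam * gamma  # decay every trace after the update
    return V
```

For example, with V_prev = {} and a single step ('A', 1.0, 'B') at alpha = 0.5, the TD error is 1.0 and V('A') moves from 0 to 0.5.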
What are some issues with TD(1)?
TD(1) does not make efficient use of all the available data: if it does not see certain paths, it can get stuck with bad estimates for a long time. It is data-inefficient and its estimates have high variance.
Which TD(lambda) version is equivalent to maximum likelihood? Under which criteria?
TD(0). If a finite data set is presented repeatedly (infinitely often), TD(0) converges to the maximum likelihood estimate.
Empirically, how does TD(lambda) tend to behave when varying lambda?
Error typically decreases as lambda increases from 0, reaches a minimum at an intermediate value of lambda, and then increases again as lambda approaches 1.
What are some issues with TD(0)?
We get a biased estimate because the one-step target bootstraps off the current (possibly inaccurate) value estimate.
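Setting lambda = 0 in the update rule above leaves only the most recent state with a nonzero trace, giving the one-step TD(0) update whose bootstrapped target is the source of this bias:

```latex
V(s_{t-1}) \leftarrow V(s_{t-1}) + \alpha_t \bigl( r_t + \gamma V(s_t) - V(s_{t-1}) \bigr)
```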