8/9 - Approximate Methods and Deep RL Flashcards
2 Problems with tabular approaches (Q/SARSA)
The table becomes too large in large domains
Each state is learned individually, with no generalisation to other, similar states
Approximate methods instead seek to approximate the value function or the state-action value function
What is the catch?
Tabular: Updating one cell does not affect the others
Approximation: One update of the overall approximation function affects all inputs.
- Making the approximation more accurate for one state can make the approximations for other states less accurate.
How can approximation be improved?
Compare the approximation with the true value function using the Mean Squared Value Error (VE).
Sum over all states of a weighting factor μ(s) (how much we care about state s) multiplied by the squared difference between the true and approximate value:
VE(w) ≐ Σ_(s∈S) μ(s) [vπ(s) − v̂(s,w)]²
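A minimal numpy sketch of this computation, assuming the true values, the approximation and the weighting μ are given as small illustrative arrays (in practice vπ is unknown):

```python
import numpy as np

# Illustrative values for a 3-state problem; in practice v_true is not available.
v_true = np.array([1.0, 2.0, 3.0])   # v_pi(s) for each state
v_hat  = np.array([0.8, 2.5, 2.9])   # current approximation v_hat(s, w)
mu     = np.array([0.5, 0.3, 0.2])   # state weighting mu(s), sums to 1

# VE(w) = sum_s mu(s) * [v_pi(s) - v_hat(s, w)]^2
ve = np.sum(mu * (v_true - v_hat) ** 2)
print(ve)
```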
SGD weights expression
wt+1 = wt + α[vπ(St) − v̂(St,wt)] ∇v̂(St,wt)
vπ is the true value function
v̂ is the approximate value (read "v-hat", not v to a power)
vπ is unknown in practice, so use the bootstrapped TD target R + γ·v̂(S′,w) in its place (this is the semi-gradient TD approach)
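A hedged sketch of this update for a linear approximator v̂(s,w) = w·x(s), where ∇v̂(s,w) is simply the feature vector x(s); all names are illustrative:

```python
import numpy as np

def sgd_value_update(w, x_s, target, alpha):
    """One SGD update for a linear v_hat(s, w) = w @ x_s, so grad v_hat(s, w) = x_s."""
    v_hat = w @ x_s
    return w + alpha * (target - v_hat) * x_s

# If the true value were known:     target = v_pi(S_t)  (or the Monte-Carlo return G_t)
# Semi-gradient TD(0) substitutes:  target = R + gamma * (w @ x_s_next)
```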
Semi-gradient TD(0) for approximating vπ
Input: the policy π and a differentiable function v̂ mapping a state and weight vector to a real value, with v̂(terminal,·) = 0
Parameter α > 0 (step size)
Initialise value-function weights arbitrarily (eg w = 0)
For each episode:
- Init S
- Loop for each step of episode
- - Choose A ~ π(.|S)
- - Take action A, observe R and S’
- - w ← w + α[R + γ·v̂(S′,w) − v̂(S,w)] ∇v̂(S,w)
- - S <- S’
- until S′ is terminal
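A Python sketch of the loop above with linear features; `env`, `policy` and `features` are assumed Gym-style interfaces for illustration, not something defined in these notes:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, num_episodes, d, alpha=0.1, gamma=0.99):
    """Semi-gradient TD(0) with a linear v_hat(s, w) = w @ features(s).
    env, policy and features are assumed interfaces (Gym-style reset()/step())."""
    w = np.zeros(d)                               # initialise value-function weights (e.g. to 0)
    for _ in range(num_episodes):
        s, done = env.reset(), False              # Init S
        while not done:                           # loop until S' is terminal
            a = policy(s)                         # Choose A ~ pi(.|S)
            s_next, r, done = env.step(a)         # Take action A, observe R and S'
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else w @ x_next  # v_hat(terminal, .) = 0
            delta = r + gamma * v_next - w @ x    # TD error
            w = w + alpha * delta * x             # w <- w + alpha * delta * grad v_hat(S, w)
            s = s_next                            # S <- S'
    return w
```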
Policy gradient methods
Select actions based on preference
π(a|s,θ) = Pr{At = a | St = s, θt = θ}
θ is the vector of parameters that express the action preferences
θt+1 = θt + α ∇J(θt), i.e. gradient ascent on a performance measure J(θ)
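A small sketch of one way to parameterise such a policy: linear action preferences fed through a softmax, plus the gradient-ascent step on J; all names here are illustrative assumptions:

```python
import numpy as np

def softmax_policy(theta, x_s):
    """pi(.|s, theta) from linear action preferences h(s, a, theta) = theta[a] @ x_s."""
    prefs = theta @ x_s              # one preference per action
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()           # action probabilities

def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """theta_{t+1} = theta_t + alpha * grad J(theta_t); grad_J is an estimate, e.g. from REINFORCE."""
    return theta + alpha * grad_J
```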
Monte-Carlo Policy Gradient Control (REINFORCE)
Input: a differentiable policy parameterisation π(a|s,θ), e.g. a neural net
Parameter: step size α > 0
Initialise policy parameters θ ∈ ℝ^d′ (e.g. to 0); for a net, initialise with random weights
Loop forever (for each episode):
- Generate an episode S0, A0, R1, …, ST−1, AT−1, RT, following π(·|·,θ)
- Loop for each step of the episode t = 0,1,…,T-1:
- - G ← Σ_(k=t+1 to T) γ^(k−t−1) Rk
- - θ ← θ + α γ^t G ∇ln π(At|St,θ)
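A hedged Python sketch of the loop above for a discrete-action task with a linear-softmax policy; `generate_episode` and `features` are assumed helpers for illustration:

```python
import numpy as np

def grad_log_softmax_policy(theta, x_s, a):
    """grad_theta ln pi(a|s, theta) for a linear-softmax policy pi = softmax(theta @ x_s)."""
    prefs = theta @ x_s
    prefs = prefs - prefs.max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    one_hot = np.zeros(theta.shape[0])
    one_hot[a] = 1.0
    return np.outer(one_hot - pi, x_s)            # same shape as theta

def reinforce(generate_episode, features, num_actions, d, num_episodes,
              alpha=1e-3, gamma=0.99):
    """Monte-Carlo policy gradient. generate_episode(theta) is an assumed helper that
    returns (states, actions, rewards) with rewards[k-1] storing R_k."""
    theta = np.zeros((num_actions, d))            # initialise policy parameters (e.g. to 0)
    for _ in range(num_episodes):
        states, actions, rewards = generate_episode(theta)
        T = len(rewards)
        for t in range(T):
            # G <- sum_{k=t+1}^{T} gamma^(k-t-1) * R_k
            G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))
            grad = grad_log_softmax_policy(theta, features(states[t]), actions[t])
            theta = theta + alpha * (gamma ** t) * G * grad
    return theta
```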
What is the upside down triangle symbol?
The gradient (∇, "nabla"), e.g. the vector of partial derivatives of the error with respect to all parameters
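A tiny illustration of ∇ in this setting: the analytic gradient of a squared error with respect to the weights, checked against a finite-difference approximation (all numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # illustrative feature vector
w = np.array([0.1, -0.2, 0.3])           # parameters
target = 1.5

error = lambda w_: 0.5 * (target - w_ @ x) ** 2
analytic_grad = -(target - w @ x) * x    # nabla_w error: one partial derivative per parameter

# Finite-difference check of the same gradient
eps = 1e-6
numeric_grad = np.array([
    (error(w + eps * np.eye(3)[i]) - error(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(analytic_grad, numeric_grad)       # the two should agree closely
```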
Actor-Critic methods
Combine policy-gradient methods and value-based (Q-value-style) approaches.
Two networks: a policy network (π) and a value network (v̂)
The policy net gives the probability of each action given the state, whereas the value net only gives the value of a particular state, regardless of actions
Two learning rates. Initialise both networks.
Start with the initial state S and set I ← 1 (a scalar discounting factor, not an identity matrix)
Choose A ~ π(·|S,θ)
Take action A, observe the next state S′ and reward R
Compute the TD error δ ← R + γ·v̂(S′,w) − v̂(S,w)
Use δ to update both networks: w ← w + α^w·I·δ·∇v̂(S,w) and θ ← θ + α^θ·I·δ·∇ln π(A|S,θ)
Set I ← γ·I
S′ then becomes the current state: S ← S′
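A hedged sketch of this loop (one-step actor-critic) with a linear critic and a linear-softmax actor, keeping the scalar I on both updates as in the notes; `env` and `features` are assumed Gym-style interfaces:

```python
import numpy as np

def one_step_actor_critic(env, features, num_actions, d, num_episodes,
                          alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic with linear v_hat and a linear-softmax policy.
    env and features are assumed interfaces (Gym-style reset()/step())."""
    w = np.zeros(d)                                   # critic (value) weights
    theta = np.zeros((num_actions, d))                # actor (policy) weights
    for _ in range(num_episodes):
        s, done = env.reset(), False                  # start with the initial state
        I = 1.0                                       # scalar discounting factor, I <- 1
        while not done:
            x = features(s)
            prefs = theta @ x
            prefs = prefs - prefs.max()
            pi = np.exp(prefs) / np.exp(prefs).sum()
            a = np.random.choice(num_actions, p=pi)   # choose A ~ pi(.|S, theta)
            s_next, r, done = env.step(a)             # observe S' and R
            x_next = features(s_next)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x        # TD error
            w = w + alpha_w * I * delta * x           # critic update
            one_hot = np.zeros(num_actions)
            one_hot[a] = 1.0
            theta = theta + alpha_theta * I * delta * np.outer(one_hot - pi, x)  # actor update
            I = I * gamma                             # I <- gamma * I
            s = s_next                                # S <- S'
    return w, theta
```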
Who is the actor and who is the critic?
The policy network is the actor (it decides the actions)
The value network is the critic (it evaluates the states/actions the actor visits)