8/9 - Approximate Methods and Deep RL Flashcards
2 Problems with tabular approaches (Q/SARSA)
The table becomes too large in large domains
Each state is learned individually, with no generalisation to other, similar states
Approximate methods instead seek to approximate the value function or the state-action value function
What is the catch?
Tabular: Updating one cell does not affect the others
Approximation: One update of the overall approximation function affects all inputs.
- Making the approximation more accurate for one state can make the approximations for other states less accurate.
How can approximation be improved?
Compare the approximation with the true value function using the Mean Squared Value Error (VE).
Sum over all states of a weighting factor μ(s) (how much we care about state s) multiplied by the squared difference between the true and approximate value:
VE(w) ≐ Σ_(s∈S) μ(s) [vπ(s) − v̂(s,w)]²
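A minimal numpy sketch of this computation, assuming the true values, the approximation and the weighting μ are given as small illustrative arrays (in practice vπ is unknown):

```python
import numpy as np

# Illustrative values for a 3-state problem; in practice v_true is not available.
v_true = np.array([1.0, 2.0, 3.0])   # v_pi(s) for each state
v_hat  = np.array([0.8, 2.5, 2.9])   # current approximation v_hat(s, w)
mu     = np.array([0.5, 0.3, 0.2])   # state weighting mu(s), sums to 1

# VE(w) = sum_s mu(s) * [v_pi(s) - v_hat(s, w)]^2
ve = np.sum(mu * (v_true - v_hat) ** 2)
print(ve)
```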
SGD weights expression
wt+1 = wt + α[vπ(St) − v̂(St,wt)] ∇v̂(St,wt)
vπ is the true value function
v̂ is the approximate value (read "v-hat", not v to a power)
vπ is unknown in practice, so use the bootstrapped TD target R + γ·v̂(S′,w) in its place (this is the semi-gradient TD approach)
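A hedged sketch of this update for a linear approximator v̂(s,w) = w·x(s), where ∇v̂(s,w) is simply the feature vector x(s); all names are illustrative:

```python
import numpy as np

def sgd_value_update(w, x_s, target, alpha):
    """One SGD update for a linear v_hat(s, w) = w @ x_s, so grad v_hat(s, w) = x_s."""
    v_hat = w @ x_s
    return w + alpha * (target - v_hat) * x_s

# If the true value were known:     target = v_pi(S_t)  (or the Monte-Carlo return G_t)
# Semi-gradient TD(0) substitutes:  target = R + gamma * (w @ x_s_next)
```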
Semi-gradient TD(0) for approximating vπ
Input: the policy π and a differentiable function v̂ mapping a state and weight vector to a real value, with v̂(terminal,·) = 0
Parameter α > 0 (step size)
Initialise value-function weights arbitrarily (eg w = 0)
For each episode:
- Init S
- Loop for each step of episode
- - Choose A ~ π(.|S)
- - Take action A, observe R and S’
- - w ← w + α[R + γ·v̂(S′,w) − v̂(S,w)] ∇v̂(S,w)
- - S <- S’
- until S′ is terminal
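A Python sketch of the loop above with linear features; `env`, `policy` and `features` are assumed Gym-style interfaces for illustration, not something defined in these notes:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, num_episodes, d, alpha=0.1, gamma=0.99):
    """Semi-gradient TD(0) with a linear v_hat(s, w) = w @ features(s).
    env, policy and features are assumed interfaces (Gym-style reset()/step())."""
    w = np.zeros(d)                               # initialise value-function weights (e.g. to 0)
    for _ in range(num_episodes):
        s, done = env.reset(), False              # Init S
        while not done:                           # loop until S' is terminal
            a = policy(s)                         # Choose A ~ pi(.|S)
            s_next, r, done = env.step(a)         # Take action A, observe R and S'
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else w @ x_next  # v_hat(terminal, .) = 0
            delta = r + gamma * v_next - w @ x    # TD error
            w = w + alpha * delta * x             # w <- w + alpha * delta * grad v_hat(S, w)
            s = s_next                            # S <- S'
    return w
```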
Policy gradient methods
Select actions based on preference
π(a|s,θ) = Pr{At = a | St = s, θt = θ}
θ is the vector of parameters that express the action preferences
θt+1 = θt + α ∇J(θt), i.e. gradient ascent on a performance measure J(θ)
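A small sketch of one way to parameterise such a policy: linear action preferences fed through a softmax, plus the gradient-ascent step on J; all names here are illustrative assumptions:

```python
import numpy as np

def softmax_policy(theta, x_s):
    """pi(.|s, theta) from linear action preferences h(s, a, theta) = theta[a] @ x_s."""
    prefs = theta @ x_s              # one preference per action
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()           # action probabilities

def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """theta_{t+1} = theta_t + alpha * grad J(theta_t); grad_J is an estimate, e.g. from REINFORCE."""
    return theta + alpha * grad_J
```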
Monte-Carlo Policy Gradient Control (REINFORCE)
Input: a differentiable policy parameterisation π(a|s,θ), e.g. a neural net
Parameter: step size α > 0
Initialise policy parameters θ ∈ ℝ^d′ (e.g. to 0); for a net, initialise with random weights
Loop forever (for each episode):
- Generate an episode S0, A0, R1, …, ST−1, AT−1, RT, following π(·|·,θ)
- Loop for each step of the episode t = 0,1,…,T-1:
- - G ← Σ_(k=t+1 to T) γ^(k−t−1) Rk
- - θ ← θ + α γ^t G ∇ln π(At|St,θ)
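A hedged Python sketch of the loop above for a discrete-action task with a linear-softmax policy; `generate_episode` and `features` are assumed helpers for illustration:

```python
import numpy as np

def grad_log_softmax_policy(theta, x_s, a):
    """grad_theta ln pi(a|s, theta) for a linear-softmax policy pi = softmax(theta @ x_s)."""
    prefs = theta @ x_s
    prefs = prefs - prefs.max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    one_hot = np.zeros(theta.shape[0])
    one_hot[a] = 1.0
    return np.outer(one_hot - pi, x_s)            # same shape as theta

def reinforce(generate_episode, features, num_actions, d, num_episodes,
              alpha=1e-3, gamma=0.99):
    """Monte-Carlo policy gradient. generate_episode(theta) is an assumed helper that
    returns (states, actions, rewards) with rewards[k-1] storing R_k."""
    theta = np.zeros((num_actions, d))            # initialise policy parameters (e.g. to 0)
    for _ in range(num_episodes):
        states, actions, rewards = generate_episode(theta)
        T = len(rewards)
        for t in range(T):
            # G <- sum_{k=t+1}^{T} gamma^(k-t-1) * R_k
            G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))
            grad = grad_log_softmax_policy(theta, features(states[t]), actions[t])
            theta = theta + alpha * (gamma ** t) * G * grad
    return theta
```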
What is the upside down triangle symbol?
The gradient (∇, "nabla"), e.g. the vector of partial derivatives of the error with respect to all parameters
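A tiny illustration of ∇ in this setting: the analytic gradient of a squared error with respect to the weights, checked against a finite-difference approximation (all numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # illustrative feature vector
w = np.array([0.1, -0.2, 0.3])           # parameters
target = 1.5

error = lambda w_: 0.5 * (target - w_ @ x) ** 2
analytic_grad = -(target - w @ x) * x    # nabla_w error: one partial derivative per parameter

# Finite-difference check of the same gradient
eps = 1e-6
numeric_grad = np.array([
    (error(w + eps * np.eye(3)[i]) - error(w - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(analytic_grad, numeric_grad)       # the two should agree closely
```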
Actor-Critic methods
Combine policy-gradient methods and value-based (Q-value-style) approaches.
Two networks: a policy network (π) and a value network (v̂)
The policy net gives the probability of each action given the state, whereas the value net only gives the value of a particular state, regardless of actions
Two learning rates. Initialise both networks.
Start with the initial state S and set I ← 1 (a scalar discounting factor, not an identity matrix)
Choose A ~ π(·|S,θ)
Take action A, observe the next state S′ and reward R
Compute the TD error δ ← R + γ·v̂(S′,w) − v̂(S,w)
Use δ to update both networks: w ← w + α^w·I·δ·∇v̂(S,w) and θ ← θ + α^θ·I·δ·∇ln π(A|S,θ)
Set I ← γ·I
S′ then becomes the current state: S ← S′
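A hedged sketch of this loop (one-step actor-critic) with a linear critic and a linear-softmax actor, keeping the scalar I on both updates as in the notes; `env` and `features` are assumed Gym-style interfaces:

```python
import numpy as np

def one_step_actor_critic(env, features, num_actions, d, num_episodes,
                          alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic with linear v_hat and a linear-softmax policy.
    env and features are assumed interfaces (Gym-style reset()/step())."""
    w = np.zeros(d)                                   # critic (value) weights
    theta = np.zeros((num_actions, d))                # actor (policy) weights
    for _ in range(num_episodes):
        s, done = env.reset(), False                  # start with the initial state
        I = 1.0                                       # scalar discounting factor, I <- 1
        while not done:
            x = features(s)
            prefs = theta @ x
            prefs = prefs - prefs.max()
            pi = np.exp(prefs) / np.exp(prefs).sum()
            a = np.random.choice(num_actions, p=pi)   # choose A ~ pi(.|S, theta)
            s_next, r, done = env.step(a)             # observe S' and R
            x_next = features(s_next)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x        # TD error
            w = w + alpha_w * I * delta * x           # critic update
            one_hot = np.zeros(num_actions)
            one_hot[a] = 1.0
            theta = theta + alpha_theta * I * delta * np.outer(one_hot - pi, x)  # actor update
            I = I * gamma                             # I <- gamma * I
            s = s_next                                # S <- S'
    return w, theta
```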
Who is the actor and who is the critic?
The policy network is the actor (it decides the actions)
The value network is the critic (it evaluates the states/actions the actor visits)