8/9 - Approximate Methods and Deep RL Flashcards

1
Q

Two problems with tabular approaches (Q-learning/SARSA)

A

The table becomes too big in a large domain
Each state has to be learned individually, with no generalisation to other states

2
Q

Approximate methods seek to approximate the value function or the state-action value function

What is the catch?

A

Tabular: updating one cell does not affect the others.
Approximation: one update to the overall approximation function affects its output for all inputs.
- Making it more accurate for one state means the approximation for other states can get less accurate.
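Not part of the card, but a minimal sketch of why this happens with a linear approximator: one gradient step towards a better estimate for one state also moves the estimates for every state that shares features with it (the feature vectors below are made up).

```python
import numpy as np

# Hypothetical 3-state example with hand-made feature vectors.
features = np.array([
    [1.0, 0.5],   # state 0
    [0.8, 0.4],   # state 1 (shares features with state 0)
    [0.0, 1.0],   # state 2
])
w = np.zeros(2)   # weight vector of the linear approximator


def v_hat(s):
    return features[s] @ w   # v̂(s, w) = x(s)·w


before = [v_hat(s) for s in range(3)]

# One SGD step moving v̂(state 0) towards a target of 1.0
alpha, target, s = 0.1, 1.0, 0
w += alpha * (target - v_hat(s)) * features[s]   # gradient of x·w is x

after = [v_hat(s) for s in range(3)]
print(before, after)   # the estimates for states 1 and 2 have moved too
```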

3
Q

How can approximation be improved?

A

Compare the approximation with the true value according to the Mean Squared Value Error.

Sum over all states of a weighting factor μ(s) multiplied by the squared difference between the true and approximate values:

VE(w) ≐ Σ_{s∈S} μ(s)[vπ(s) − v̂(s,w)]²
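A quick sketch of computing this error for a linear v̂, with made-up true values, weighting μ and features (my own example, not from the card):

```python
import numpy as np

# Hypothetical tiny example: true values, a state distribution mu, a linear v̂.
v_pi = np.array([1.0, 2.0, 0.5])           # true values v_pi(s)
mu = np.array([0.5, 0.3, 0.2])             # state weighting, sums to 1
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])
w = np.array([0.8, 1.5])                   # current weights

v_hat = features @ w                        # v̂(s, w) for every state
msve = np.sum(mu * (v_pi - v_hat) ** 2)     # VE(w) = Σ_s μ(s)[v_pi(s) − v̂(s,w)]²
print(msve)
```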

4
Q

SGD weights expression

A

wt+1 = wt + α[vπ(St) − v̂(St,wt)] ∇v̂(St,wt)

vπ is the true value function
v̂ is the approximate value ("v hat", not v to a power)

Since vπ is not known, substitute the TD target R + γ·v̂(S′,w) for vπ(St); this gives the semi-gradient TD(0) update.
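A minimal sketch of this update, assuming a linear v̂(s,w) = x(s)·w so that ∇v̂(S,w) is just the feature vector x(S); the helper name and parameters are illustrative:

```python
import numpy as np

def semi_gradient_td0_update(w, x_s, x_s_next, reward,
                             alpha=0.1, gamma=0.99, terminal=False):
    """One semi-gradient TD(0) weight update for a linear v̂(s, w) = x(s)·w.

    x_s and x_s_next are the feature vectors for S and S'; for a linear v̂
    the gradient ∇v̂(S, w) is simply x(S).
    """
    v_s = x_s @ w
    v_next = 0.0 if terminal else x_s_next @ w     # v̂(terminal, ·) = 0
    td_target = reward + gamma * v_next            # stands in for v_pi(S_t)
    return w + alpha * (td_target - v_s) * x_s
```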

5
Q

Semi-gradient TD(0) for approximating the value function

A

Input: the policy π to be evaluated, and a differentiable function v̂ mapping states (plus weights) to value estimates, with v̂(terminal,·) = 0
Parameter α > 0 (step size)

Initialise value-function weights arbitrarily (eg w = 0)

For each episode:
- Init S
- Loop for each step of episode
- - Choose A ~ π(.|S)
- - Take action A, observe R and S’
- - w ← w + α[R + γ·v̂(S′,w) − v̂(S,w)] ∇v̂(S,w)   (the SGD weights expression with the TD target)
- - S <- S’
until S’ is terminal
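A sketch of the full loop, assuming a linear v̂, a Gym-style env with reset()/step(), and a hypothetical features(s) function (not the course's code):

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features, num_episodes=500,
                      alpha=0.05, gamma=0.99):
    """Semi-gradient TD(0) policy evaluation with a linear v̂(s, w) = features(s)·w.

    env is assumed Gym-style (reset/step); policy(s) samples an action A ~ π(·|S).
    Both, and features(s), are placeholders for whatever setup you use.
    """
    w = np.zeros(n_features)                 # initialise weights arbitrarily (here 0)
    for _ in range(num_episodes):
        s, _ = env.reset()                   # Init S
        done = False
        while not done:                      # loop for each step of the episode
            a = policy(s)                    # Choose A ~ π(·|S)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if terminated else x_next @ w   # v̂(terminal, ·) = 0
            td_error = r + gamma * v_next - x @ w
            w += alpha * td_error * x                    # ∇v̂ = x for linear v̂
            s = s_next                                   # S ← S′
    return w
```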

6
Q

Policy gradient methods

A

Learn a parameterised policy directly: actions are selected according to learned preferences rather than by consulting value estimates.

π(a|s,θ) = Pr{At = a | St = s, θt = θ}

θ is the vector of parameters that express the preferences.
Update by gradient ascent on a performance measure J(θ):
θt+1 = θt + α∇J(θt)
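One common concrete choice (my own illustration, not from the card) is a softmax over linear action preferences h(s,a,θ) = θ·x(s,a):

```python
import numpy as np

def softmax_policy_probs(theta, state_action_features):
    """π(a|s,θ) for a softmax-in-action-preferences parameterisation.

    state_action_features[a] is a feature vector x(s, a); the preference is
    h(s, a, θ) = θ·x(s, a), and π(a|s,θ) = exp(h(s,a,θ)) / Σ_b exp(h(s,b,θ)).
    """
    prefs = state_action_features @ theta
    prefs -= prefs.max()                      # numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Example: 3 actions, 2 features each (made-up numbers)
theta = np.array([0.5, -0.2])
x_sa = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
probs = softmax_policy_probs(theta, x_sa)
a = np.random.choice(len(probs), p=probs)     # At ~ π(·|St, θt)
```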

7
Q

Monte-Carlo Policy Gradient Control (REINFORCE)

A

Input: a differentiable policy parameterisation π(a|s,θ) - e.g. a neural net
Parameter: step size α > 0

Initialise the policy parameters θ ∈ R^d′ (e.g. to 0) - i.e. initialise the net with random weights

Loop forever, for each episode:
- Generate an episode S0,A0,R1,…,ST−1,AT−1,RT, following π(·|·,θ)
- Loop for each step of the episode t = 0,1,…,T-1:
- - G ← Σ_{k=t+1}^{T} γ^(k−t−1) Rk
- - θ ← θ + α γ^t G ∇ln π(At|St,θ)
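A sketch of this algorithm with a softmax linear policy; the Gym-style env and the features(s,a) function are placeholders, not the course's exact setup:

```python
import numpy as np

def reinforce(env, features, n_actions, n_params, num_episodes=1000,
              alpha=1e-3, gamma=0.99):
    """Monte-Carlo policy gradient (REINFORCE) with a softmax linear policy.

    features(s, a) -> feature vector x(s, a); preferences h(s,a,θ) = θ·x(s,a).
    env is assumed Gym-style (reset/step); both are placeholders.
    """
    theta = np.zeros(n_params)

    def policy_probs(s):
        prefs = np.array([features(s, a) @ theta for a in range(n_actions)])
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    for _ in range(num_episodes):
        # Generate an episode S0, A0, R1, ..., following π(·|·, θ)
        states, actions, rewards = [], [], []
        s, _ = env.reset()
        done = False
        while not done:
            probs = policy_probs(s)
            a = np.random.choice(n_actions, p=probs)
            s_next, r, terminated, truncated, _ = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(r)
            s, done = s_next, terminated or truncated

        # Loop over the episode, updating θ with the discounted return G
        for t in range(len(states)):
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
            probs = policy_probs(states[t])
            x_all = np.array([features(states[t], a) for a in range(n_actions)])
            # ∇ln π(At|St,θ) = x(St,At) − Σ_b π(b|St,θ) x(St,b) for softmax linear
            grad_ln_pi = x_all[actions[t]] - probs @ x_all
            theta += alpha * (gamma ** t) * G * grad_ln_pi
    return theta
```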

8
Q

What is the upside down triangle symbol?

A

∇ (nabla) denotes the gradient, e.g. the derivative of the error with respect to all of the parameters.

9
Q

Actor Critic methods

A

Combines policy-gradient methods with value-based (q/v) approaches.
Two networks: a policy network (π) and a value network (v̂).
The policy net gives the probability of each action given a state, whereas the value net gives only the value of a particular state, regardless of the action.

Two learning rates. Initialise both networks.

Initialise the state S and a scalar I ← 1 (an accumulating discount factor, not an identity matrix)
Choose an action A from the policy
Take A, observe the next state S′ and reward R
Compute the TD error: δ ← R + γ·v̂(S′,w) − v̂(S,w)
Use δ to update both networks (the policy gradient is scaled by I and δ as well as the step size): w ← w + α^w δ ∇v̂(S,w), θ ← θ + α^θ I δ ∇ln π(A|S,θ)
Set I ← γI
S′ then becomes the current state (S ← S′)
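A sketch of this one-step actor-critic loop, using linear rather than deep function approximation to keep it short; the env interface and feature functions are placeholders:

```python
import numpy as np

def one_step_actor_critic(env, v_features, pi_features, n_actions,
                          n_v, n_pi, num_episodes=500,
                          alpha_w=0.05, alpha_theta=0.01, gamma=0.99):
    """One-step actor-critic with a linear critic v̂(s,w) = x_v(s)·w and a
    softmax linear actor. v_features/pi_features and the Gym-style env are
    placeholders for whatever representation you use."""
    w = np.zeros(n_v)          # critic weights
    theta = np.zeros(n_pi)     # actor weights

    def probs(s):
        prefs = np.array([pi_features(s, a) @ theta for a in range(n_actions)])
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    for _ in range(num_episodes):
        s, _ = env.reset()
        I, done = 1.0, False                                 # I accumulates γ^t
        while not done:
            p = probs(s)
            a = np.random.choice(n_actions, p=p)             # A ~ π(·|S,θ)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            v_s = v_features(s) @ w
            v_next = 0.0 if terminated else v_features(s_next) @ w
            delta = r + gamma * v_next - v_s                 # TD error δ
            w += alpha_w * delta * v_features(s)             # critic update
            x_all = np.array([pi_features(s, b) for b in range(n_actions)])
            grad_ln_pi = x_all[a] - p @ x_all                # softmax ∇ln π
            theta += alpha_theta * I * delta * grad_ln_pi    # actor update
            I *= gamma
            s = s_next                                       # S ← S′
    return theta, w
```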

10
Q

Who is the actor and who is the critic?

A

The policy network is the actor (it decides the actions)
The value network is the critic (it criticises the chosen actions by estimating how good the resulting states are)
