7- Reinforcement Learning - SARSA/Q Flashcards
SARSA
State Action Reward State Action
Estimates the action values q_π(s,a) for all s and a under the current policy π (on-policy)
SARSA pseudocode
Params: step size α in (0,1], small ε > 0
Initialise Q(s,a) arbitrarily for all s and a, except Q(terminal, ·) = 0
For each episode:
- Init S
- Choose A from S using the policy derived from Q (e.g. ε-greedy)
- Loop for each step of episode:
- - Take action A, observe R, S'
- - Choose A' from S' using the policy derived from Q
- - Q(S,A) <- Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
- - S <- S'; A <- A'
- until S is terminal
Probably don’t have to remember all this
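A minimal Python sketch of tabular SARSA, assuming a generic environment where `env.reset()` returns a discrete state and `env.step(a)` returns `(next_state, reward, done)` (these names and the hyperparameters are placeholders, not from the cards):

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))      # Q(terminal, .) stays 0

    def eps_greedy(s):
        # policy derived from Q: explore with prob eps, otherwise act greedily
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()                      # init S
        a = eps_greedy(s)                    # choose A from S
        done = False
        while not done:
            s_next, r, done = env.step(a)    # take A, observe R, S'
            a_next = eps_greedy(s_next)      # choose A' from S' (on-policy)
            # SARSA update: the target uses the action actually chosen next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next            # S <- S'; A <- A'
    return Q
```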
Q learning
Approximates the optimal action-value function q*(s,a) (that is q subscript star, the optimal q, not q times a)
Q(St,At) <- Q(St,At) + α[Rt+1 + γ max_a Q(St+1,a) - Q(St,At)], learned independently of the behaviour policy (off-policy)
Can be a lookup table with all states and actions
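For a small problem the lookup table can literally be an array; a minimal sketch (the 48-state / 4-action sizes are just an illustrative example, not from the cards):

```python
import numpy as np
from collections import defaultdict

n_states, n_actions = 48, 4                  # e.g. a 4x12 grid world

# Dense lookup table: one row per state, one column per action
Q = np.zeros((n_states, n_actions))

# Sparse alternative when most states are never visited
Q_sparse = defaultdict(lambda: np.zeros(n_actions))
```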
Q learning Pseudocode
Params: step size α in (0,1], small ε > 0
Initialise Q(s,a) for all states and actions except Q(terminal, ·) = 0
For each episode:
- Init S
- Loop for each step of episode:
- - Choose A from S using the policy derived from Q (e.g. ε-greedy)
- - Take action A, observe R, S'
- - Q(S,A) <- Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
- - S <- S' (make it the current state)
- until S is terminal
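A matching Python sketch of tabular Q-learning, with the same assumed environment interface as the SARSA sketch above; the only substantive change is the TD target, and A is now chosen inside the step loop:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))      # Q(terminal, .) stays 0
    for _ in range(episodes):
        s = env.reset()                      # init S
        done = False
        while not done:
            # epsilon-greedy behaviour policy derived from Q
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # take A, observe R, S'
            # Q-learning update: target uses the greedy (max) next action,
            # regardless of what the behaviour policy actually does (off-policy)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                       # S <- S'
    return Q
```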
SARSA vs Q Learning: The Cliff
Each step gives R = -1, but falling off the cliff gives R = -100
Q-learning finds the shortest path right along the cliff edge, but its ε-greedy exploration occasionally steps off and falls.
SARSA takes the exploration into account (it's on-policy), so it learns a longer but safer path away from the edge.
Q-learning's values may look more optimistic because they describe the greedy policy rather than the exploring policy it actually follows.
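The difference shows up in the TD target for the same transition; a toy illustration with made-up numbers (not from the cards), using the expected target under ε-greedy for SARSA:

```python
import numpy as np

gamma, r, eps = 1.0, -1.0, 0.1

# Hypothetical action values in the next state, which sits beside the cliff:
# action 0 = step along the edge (fine), action 1 = step off the cliff (disaster)
Q_next = np.array([-10.0, -100.0])

# Q-learning target: assumes the greedy action will be taken next
q_learning_target = r + gamma * Q_next.max()                       # -11.0

# SARSA samples the actual next action A'; on average under eps-greedy the
# occasional cliff fall drags the target down
expected_next = (1 - eps) * Q_next.max() + eps * Q_next.mean()
sarsa_expected_target = r + gamma * expected_next                  # -15.5

print(q_learning_target, sarsa_expected_target)
```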