7- Reinforcement Learning - SARSA/Q Flashcards

1
Q

SARSA

A

State Action Reward State Action

Estimates the action values q_π(s,a) for all states s and actions a under the current policy π (on-policy)

2
Q

SARSA pseudocode

A

Params: step size α ∈ (0, 1], small ε > 0
Initialise Q(s,a) arbitrarily for all s, a, except Q(terminal, ·) = 0

For each episode:
- Init S
- Choose A from S using a policy derived from Q (e.g. ε-greedy)
- Loop for each step of episode:
- - Take action A, observe R, S’
- - Choose A’ from S’ using a policy derived from Q (e.g. ε-greedy)
- - Q(S,A) <- Q(S,A) + α[R + γQ(S’,A’) - Q(S,A)]
- - S <- S’; A <- A’
- until S is terminal

Probably don’t have to remember all this
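A minimal tabular SARSA sketch in Python (a sketch, not canonical code; the environment interface is assumed: env.reset() returns an integer state, env.step(a) returns (next_state, reward, done)):

import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))  # Q(terminal, .) stays 0: never updated as S

    def eps_greedy(s):
        # Policy derived from Q: explore with probability ε, else act greedily
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()                   # Init S
        a = eps_greedy(s)                 # Choose A from S
        done = False
        while not done:                   # Loop for each step of episode
            s2, r, done = env.step(a)     # Take action A, observe R, S'
            a2 = eps_greedy(s2)           # Choose A' from S'
            # On-policy TD update: target uses the action actually chosen
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
            s, a = s2, a2                 # S <- S'; A <- A'
    return Q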

3
Q

Q learning

A

Approximates q*(s,a), the optimal action-value function (q* is q subscript star, not q times a)

Q(S_t, A_t) <- Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)], independently of the policy being followed (off-policy)

Q can be stored as a lookup table with an entry for every state-action pair (tabular Q learning)
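A one-step worked example of this update (all numbers made up for illustration):

alpha, gamma = 0.1, 0.9   # step size and discount (illustrative values)
q_sa = 2.0                # current estimate Q(S_t, A_t)
r = -1.0                  # observed reward R_{t+1}
max_q_next = 3.0          # max_a Q(S_{t+1}, a)

q_sa += alpha * (r + gamma * max_q_next - q_sa)
# q_sa = 2.0 + 0.1 * (-1.0 + 2.7 - 2.0) = 2.0 - 0.03 = 1.97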

4
Q

Q learning pseudocode

A

Params: step size α ∈ (0, 1], small ε > 0
Initialise Q(s,a) arbitrarily for all states and actions, except Q(terminal, ·) = 0

For each episode:
- Init S
- Loop for each step of episode:
- - Choose A from S using a policy derived from Q (e.g. ε-greedy)
- - Take action A, observe R, S’
- - Q(S,A) <- Q(S,A) + α[R + γ max_a Q(S’,a) - Q(S,A)]
- - S <- S’ (make it the current state)
- until S is terminal
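The same sketch in Python, under the same assumed env.reset()/env.step() interface as the SARSA example above; the only structural change is that the update target maximises over next actions:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))   # Q(terminal, .) stays 0

    for _ in range(episodes):
        s = env.reset()                   # Init S
        done = False
        while not done:                   # Loop for each step of episode
            # Choose A from S, ε-greedy with respect to Q (behaviour policy)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)     # Take action A, observe R, S'
            # Off-policy TD update: bootstrap from the greedy next action
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2                        # S <- S'
    return Q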

5
Q

SARSA vs Q Learning: The Cliff

Each step gives R = -1, but a fall off the cliff gives R = -100

A

Q learning learns values for the shortest path, right along the cliff edge, but its ε-greedy exploration occasionally makes it fall.
SARSA takes the exploratory falls into account (it’s on-policy), so it learns a longer but safer path away from the edge.
Q learning’s value estimates are therefore more optimistic than its actual online performance.
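The whole difference sits in the backup target. A side-by-side sketch (assuming Q maps each state to an array of action values, and a_next is the ε-greedy action SARSA actually takes in S'):

import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=1.0):
    # On-policy: bootstraps from the action the behaviour policy takes, so
    # occasional exploratory falls (R = -100) drag down the values of
    # states near the cliff edge, pushing SARSA onto the safer path
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma=1.0):
    # Off-policy: bootstraps from the best next action, ignoring the fact
    # that exploration sometimes falls, so cliff-edge values stay optimistic
    return r + gamma * np.max(Q[s_next])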
