10 Reinforcement Learning Flashcards
what is Q-learning
create a table storing a Q-value for each state-action pair
update the table from observed rewards after each step the agent takes
what is the pseudocode for Q-learning
initialise Q(s, a) arbitrarily for all s, a
initialise Q(terminal, ·) to 0
repeat (for each episode):
    initialise state S
    repeat (for each step of episode):
        A <- select action from S using policy derived from Q (e.g. epsilon-greedy)
        take action A, then observe reward R and next state S'
        Q(S, A) <- Q(S, A) + α[R + γ max_a Q(S', a) - Q(S, A)]
        S <- S'
    until S is terminal
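A minimal runnable sketch of tabular Q-learning in Python, following the pseudocode above. The env object and its reset()/step() interface are assumed to be Gym-style with discrete states and actions; hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; assumes a Gym-style env with discrete
    states and actions (env.reset(), env.step(), env.action_space.n)."""
    Q = defaultdict(float)  # Q[(state, action)]; unseen pairs default to 0

    def select_action(s):
        # epsilon-greedy behaviour policy derived from Q
        if random.random() < epsilon:
            return random.randrange(env.action_space.n)
        return max(range(env.action_space.n), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = select_action(s)
            s_next, r, done, _ = env.step(a)
            # off-policy target: greedy value of the next state
            # (terminal states are never updated, so their value stays 0)
            best_next = max(Q[(s_next, a2)] for a2 in range(env.action_space.n))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```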
how is Q-learning different
off-policy: the update target uses the action with the max Q-value in the next state, regardless of which action the agent actually takes next
how is SARSA different
on-policy: the update target uses the Q-value of the action actually chosen by the current policy in the next state
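In code, the two algorithms differ only in the bootstrap target of the TD update (a sketch; Q is assumed to be a dict keyed by (state, action)):

```python
# Q-learning (off-policy): bootstrap from the greedy next action
target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

# SARSA (on-policy): bootstrap from the action a_next the policy actually selected
target = r + gamma * Q[(s_next, a_next)]

# both then apply the same TD update
Q[(s, a)] += alpha * (target - Q[(s, a)])
```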
what is the pseudocode for SARSA
initialise Q arbitrarily
repeat (for each episode):
    initialise s
    choose a from s using policy derived from Q (e.g. epsilon-greedy)
    repeat (for each step of episode):
        take action a, observe r, s'
        choose a' from s' using policy derived from Q
        Q(s, a) <- Q(s, a) + α[r + γ Q(s', a') - Q(s, a)]
        s <- s'; a <- a'
    until s is terminal
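The same sketch adapted to SARSA (same assumed Gym-style env as the Q-learning example); note that the action for s' is chosen before the update and then reused on the next step:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA; assumes a Gym-style env with discrete states and actions."""
    Q = defaultdict(float)

    def select_action(s):
        # epsilon-greedy policy derived from Q
        if random.random() < epsilon:
            return random.randrange(env.action_space.n)
        return max(range(env.action_space.n), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = select_action(s)                # choose a from s
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = select_action(s_next)  # choose a' from s'
            # on-policy target: Q-value of the action actually taken next
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```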
how does RL solve a task
- model-based: learn a model of the environment and plan with it
- value-based: learn values that say how good it is to reach a certain state or take a specific action (value learning)
- policy-based: directly derive a policy that maximizes rewards (policy gradient)
value iteration
- starts with a random value function
- simpler algorithm
- guaranteed to converge
- more expensive overall
- requires more iterations to converge
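A minimal value iteration sketch, assuming the MDP is given explicitly; the layout P[s][a] = list of (prob, next_state, reward) triples is hypothetical, chosen for illustration:

```python
def value_iteration(states, actions, P, gamma=0.99, theta=1e-6):
    """P[s][a] is assumed to be a list of (prob, next_state, reward) triples."""
    V = {s: 0.0 for s in states}  # arbitrary initial value function
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # value function has converged
            break
    # extract the greedy policy from the converged values
    policy = {s: max(actions,
                     key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in states}
    return V, policy
```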
policy iteration
- starts with a random policy
- more complex algorithm (alternates policy evaluation and policy improvement)
- guaranteed to converge
- cheaper to compute overall
- requires fewer iterations to converge
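And a policy iteration sketch over the same assumed MDP layout, alternating evaluation and improvement until the policy stops changing:

```python
import random

def policy_iteration(states, actions, P, gamma=0.99, theta=1e-6):
    """Same assumed layout: P[s][a] is a list of (prob, next_state, reward) triples."""
    policy = {s: random.choice(actions) for s in states}  # random initial policy
    V = {s: 0.0 for s in states}
    while True:
        # policy evaluation: iterate until V approximates V^pi
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # policy improvement: act greedily with respect to V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:  # no action changed, so the policy is optimal
            return V, policy
```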
epsilon greedy
- exploration: with probability epsilon, choose an action at random
- exploitation: with probability 1 - epsilon, choose the action with the highest estimated reward
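A small sketch of the rule (Q and the argument names are assumptions for illustration):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                 # exploration: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploitation: best-known action
```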
2 other models
- actor-critic
- imitation learning