10 Reinforcement Learning* Flashcards

1
Q

What is Q-learning?

A

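Create a table (the Q-table) that stores a value estimate for each state-action pair.

Update the table entries from the rewards observed as the agent interacts with the environment.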

2
Q

What is the pseudocode for Q-learning?

A

initialise Q(s, a) arbitrarily for all s, a
initialise Q(terminal, ·) to 0
for each episode:
    initialise state S
    repeat (for each step of episode):
        A <- select action from S using policy derived from Q (e.g. epsilon-greedy)
        take action A, then observe reward R and next state S’
        Q(S, A) <- Q(S, A) + α[R + γ max_a Q(S’, a) - Q(S, A)]
        S <- S’
    until S is terminal
end
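
The pseudocode above can be sketched in Python roughly as follows. The environment is an assumption made only for illustration: a hypothetical 1-D chain where action 0 moves left, action 1 moves right, and reaching the last state gives reward 1 and ends the episode. The function name and default hyperparameters are likewise illustrative, not part of the original card.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1."""
    rng = random.Random(seed)
    terminal = n_states - 1
    # Q-table: one row per state, one column per action; terminal row stays 0.
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != terminal:
            # Epsilon-greedy behaviour policy, breaking ties at random.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(Q[s])
                a = rng.choice([act for act in (0, 1) if Q[s][act] == best])
            # Assumed toy dynamics: 0 = left (clamped at state 0), 1 = right.
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0
            # Off-policy update: the target uses the max over next actions.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q
```

Calling `q_learning()` returns the learned table; in this toy chain the "right" column should end up larger than the "left" column in every non-terminal state.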

3
Q

How is Q-learning different from SARSA?

A

It is off-policy: the update target uses the maximum Q-value over the actions available in the next state, regardless of which action the agent actually takes next.

4
Q

How is SARSA different from Q-learning?

A

It is on-policy: it chooses the next action with the current policy and uses that action's Q-value in the update target.
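
To make the on-policy versus off-policy distinction concrete, here is a small sketch of the two update targets. The function names and the example numbers are illustrative assumptions, not taken from the cards.

```python
def q_learning_target(r, gamma, q_next):
    # Off-policy: bootstrap from the best next action, whatever is taken next.
    return r + gamma * max(q_next)

def sarsa_target(r, gamma, q_next, a_next):
    # On-policy: bootstrap from the action the policy actually selected.
    return r + gamma * q_next[a_next]

q_next = [0.5, 2.0]   # assumed Q-values of the next state's two actions
a_next = 0            # suppose the policy picked the lower-valued action
print(q_learning_target(1.0, 0.5, q_next))      # 2.0
print(sarsa_target(1.0, 0.5, q_next, a_next))   # 1.25
```

The two targets coincide whenever the policy happens to pick the greedy action, and differ only on exploratory steps.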

5
Q

What is the pseudocode for SARSA?

A

initialise Q arbitrarily
repeat (for each episode):
    initialise s
    choose a from s using policy derived from Q (e.g. epsilon-greedy)
    repeat (for each step of episode):
        take action a, observe r, s’
        choose a’ from s’ using policy derived from Q
        Q(s, a) <- Q(s, a) + α[r + γ Q(s’, a’) - Q(s, a)]
        update s <- s’; a <- a’
    until s is terminal
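
A minimal Python sketch of the SARSA loop, including the standard SARSA update Q(s, a) <- Q(s, a) + α[r + γ Q(s’, a’) - Q(s, a)]. The chain environment (action 0 = left, action 1 = right, reward 1 on reaching the last state) and the helper names `epsilon_greedy` and `step` are assumptions made for illustration.

```python
import random

def epsilon_greedy(Q, s, epsilon, rng):
    # Explore with probability epsilon, otherwise act greedily (random ties).
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    best = max(Q[s])
    return rng.choice([a for a, q in enumerate(Q[s]) if q == best])

def step(s, a, terminal):
    # Assumed toy dynamics: 0 = left (clamped at state 0), 1 = right.
    s_next = max(0, s - 1) if a == 0 else s + 1
    return s_next, (1.0 if s_next == terminal else 0.0)

def sarsa(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    terminal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]  # terminal row is never updated
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != terminal:
            s_next, r = step(s, a, terminal)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # On-policy update: the target uses the action actually chosen.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q
```

The only structural difference from a Q-learning loop is that the next action is chosen before the update and then reused as the action actually taken.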

6
Q

How does RL solve a task?

A
  1. Model-based approach: use (or learn) a model of the environment and plan with it.
  2. Value learning: estimate how good it is to reach a certain state or take a specific action.
  3. Policy learning: directly derive a policy that maximizes reward (e.g. policy gradient).
7
Q

value iteration

A
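  1. Starts with a random (arbitrary) value function.
  2. The algorithm is simpler: one repeated Bellman optimality update.
  3. Guaranteed to converge (for a finite MDP with a discount factor below 1).
  4. Typically more expensive to compute overall.
  5. Typically requires more iterations.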
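
A minimal sketch of value iteration on a tiny hand-written MDP. The MDP layout (`P[state][action]` is a list of `(probability, next_state, reward)` tuples, with an empty dict marking a terminal state) and all names are assumptions for illustration. The value function starts at zero here for reproducibility; any starting values would do.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}

    def q(s, a):
        # Expected return of taking action a in state s, then following V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            if not P[s]:          # terminal state: value stays 0
                continue
            best = max(q(s, a) for a in P[s])   # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Read off a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P if P[s]}
    return V, policy

# Hypothetical 3-state chain: state 2 is terminal, entering it pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}
V, policy = value_iteration(P)
print(V, policy)
```

On this chain the result is a value of 1 for state 1, 0.9 for state 0, and "right" in both non-terminal states.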
8
Q

policy iteration

A
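  1. Starts with a random (arbitrary) policy.
  2. The algorithm is more complex: it alternates policy evaluation and policy improvement.
  3. Guaranteed to converge (for a finite MDP).
  4. Typically cheaper to compute overall.
  5. Typically requires fewer iterations.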
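
A minimal sketch of policy iteration, alternating evaluation and improvement. The MDP layout (`P[state][action]` is a list of `(probability, next_state, reward)` tuples, with an empty dict marking a terminal state) and all names are assumptions for illustration. The starting policy is simply the first listed action rather than a random one, to keep the run reproducible.

```python
def policy_iteration(P, gamma=0.9, tol=1e-8):
    policy = {s: next(iter(P[s])) for s in P if P[s]}  # arbitrary start
    V = {s: 0.0 for s in P}

    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: iterate until V matches the current policy.
        while True:
            delta = 0.0
            for s in policy:
                v = q(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action if one exists.
        stable = True
        for s in policy:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return V, policy

# Hypothetical 3-state chain: state 2 is terminal, entering it pays reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}
V, policy = policy_iteration(P)
print(V, policy)
```

Starting from "left" everywhere, the policy changes over two improvement rounds and a third round confirms it is stable, ending with "right" in both non-terminal states.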
9
Q

epsilon greedy

A
  1. Exploration
    With probability ε, choose an action at random.
  2. Exploitation
    With probability 1 - ε, choose the action with the highest estimated value (Q-value).
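
A minimal sketch of epsilon-greedy action selection over a list of estimated action values. The function name and signature are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))  # 1 (always greedy)
```

With `epsilon=0.0` the choice is purely greedy, and with `epsilon=1.0` it is purely random; values in between trade off the two.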
10
Q

Name 2 other RL approaches

A
  1. actor critic
  2. imitation learning