Reinforcement Learning Flashcards
Sequential Decision Making
Temporal credit assignment problems that:
- require sequential action selection to be completed
- have state transitions triggered by actions
- involve delayed rewards based on success / failure
Markov Decision Process
states, actions, transition function, reward function
the goal is to maximize long-term (cumulative) reward
continuing tasks => the undiscounted return can be infinite => need a discount factor (gamma < 1)
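A minimal sketch of a finite MDP and the discounted return; the 2-state example (states 0/1, actions "stay"/"go"), GAMMA and the P dictionary format are illustrative assumptions, not part of the flashcards.
```python
GAMMA = 0.9  # discount factor; keeps the return finite for continuing tasks

# P[state][action] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def discounted_return(rewards, gamma=GAMMA):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```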
Markov assumption
- the effect of an action depends only on the current state
- the current state contains everything needed to predict what happens next; no history has to be remembered
Sutton & Barto V2
- combine transition (probabilistic) and reward function => dynamic function => probabilistic function
- can be transformed into deterministic function by taking the expect value of function above
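In symbols, the Sutton & Barto (2nd edition) notation for the two functions just described:
```latex
% Four-argument dynamics function: transitions and rewards in one probabilistic function
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}

% Expected (deterministic) reward, obtained by taking the expectation over the dynamics
r(s, a) \doteq \mathbb{E}\!\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right]
        = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
```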
Policy
- a feedback (closed-loop) plan, not an open-loop plan
- alternate between policy evaluation and policy improvement (together called policy iteration) to converge to the optimal policy
Policy evaluation: - do some sweeps (no need to run this until convergence) of the Bellman backup for every state according to a deterministic policy
- when computing the value of a state, we take into account all possible actions (weighted by the policy if it is stochastic) and all possible successor states
Policy improvement: - update the policy to the action with the highest action-value in every state (compute the one-step lookahead value of all possible actions in a state and choose the max, do this for all states); see the sketch below
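A sketch of policy evaluation and greedy policy improvement on a tiny hypothetical MDP; P, GAMMA and the action names are the same illustrative assumptions as in the MDP sketch above, not from the flashcards.
```python
GAMMA = 0.9

# P[state][action] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
STATES = list(P)
ACTIONS = ["stay", "go"]

def lookahead(s, a, V):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, V, sweeps=10):
    """Policy evaluation: a few Bellman backups under a deterministic policy
    (no need to run until full convergence)."""
    for _ in range(sweeps):
        for s in STATES:
            V[s] = lookahead(s, policy[s], V)
    return V

def improve(V):
    """Policy improvement: in every state pick the action with the highest
    one-step lookahead value."""
    return {s: max(ACTIONS, key=lambda a: lookahead(s, a, V)) for s in STATES}

policy = {s: "stay" for s in STATES}
V = {s: 0.0 for s in STATES}
for _ in range(10):                 # policy iteration: alternate the two steps
    V = evaluate(policy, V)
    new_policy = improve(V)
    if new_policy == policy:        # policy is stable => stop
        break
    policy = new_policy
print(policy, V)
```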
Value iteration (the extreme case of policy iteration): - truncate policy evaluation to a single backup per state and combine it with policy improvement in one step (back up each state with the max over actions)
- repeatedly sweep through the state space until convergence
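A minimal value-iteration sketch; P is assumed to use the same dictionary format as in the policy-iteration sketch above.
```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # one combined evaluation/improvement backup: max over actions
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:           # keep sweeping the state space until convergence
            return V
```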
Model-based RL
learn the model from experience: keep track of observed state transitions, count them, and compute probabilities for all state transitions; do the same for rewards (see the sketch below)
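A sketch of estimating the model from counts; all names here are illustrative assumptions.
```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = N(s, a, s')
reward_sums = defaultdict(float)                            # summed reward per (s, a)
visits = defaultdict(int)                                   # N(s, a)

def record(s, a, r, s_next):
    """Log one observed transition."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visits[(s, a)] += 1

def estimated_model(s, a):
    """Return (transition probabilities, expected reward) estimated from the counts."""
    n = visits[(s, a)]
    probs = {s2: c / n for s2, c in transition_counts[(s, a)].items()}
    return probs, reward_sums[(s, a)] / n
```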
Model-free RL
estimate the value function / Q-function / policy DIRECTLY, without learning a transition or reward model
On-policy RL
learn the value function of the policy that selects the action
behavior generation is coupled to learning V or Q
Off-policy RL
convergence of the value function is not influenced by the actions actually selected for behavior
behavior generation is decoupled from learning V or Q
Temporal difference (model-free, on-policy)
- do not use any reward/transition function, only policy
- update value function based on SAMPLED EXPERIENCE
- updated value = old value + learning rate * (reward + discounted value of new state - value of old state)
- the term in parentheses is the temporal-difference (TD) error
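A sketch of one TD(0) update from a sampled transition; V is assumed to be a dict of state values, and alpha/gamma are illustrative defaults.
```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # reward + discounted new value - old value
    V[s] += alpha * td_error
    return V
```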
SARSA (model-free, on-policy)
- fix an exploration (behavior) policy
- update the Q function based on this exploration policy (the update equation is very similar to the TD update)
- after convergence we have the final Q function, which we can use to improve the exploration policy
- another approach is to improve the exploration policy WHILE updating the Q function => epsilon-greedy (see the sketch below)
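A sketch of one SARSA step with an epsilon-greedy exploration policy; Q is assumed to be a dict keyed by (state, action), and epsilon/alpha/gamma are illustrative defaults.
```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit the current Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next that the exploration
    policy actually selected in the new state."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q
```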
Q-learning (model-free, off-policy)
the update of the Q function is not influenced by the action actually selected next
updated value = old value + learning rate * (reward + discount * highest Q-value possible from the new state - old value)
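A sketch of one Q-learning update; off-policy because the target bootstraps from the max over actions in the new state, regardless of which action the behavior policy (e.g. epsilon-greedy) picks next. Q is assumed to be a dict keyed by (state, action).
```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # highest value possible from new state
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```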