Reinforcement Learning Flashcards
Sequential Decision Making
Temporal credit assignment problems that:
- require sequential action selection to be completed
- have state transitions triggered by actions
- involve delayed rewards based on success / failure
Markov Decision Process
states, actions, transition function, reward function
the goal is to maximize long-term (cumulative) reward
continuing tasks => the undiscounted return can be infinite => need a discount factor (gamma < 1)
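A minimal sketch of a finite MDP and the discounted return; the 2-state example (states 0/1, actions "stay"/"go"), GAMMA and the P dictionary format are illustrative assumptions, not part of the flashcards.
```python
GAMMA = 0.9  # discount factor; keeps the return finite for continuing tasks

# P[state][action] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def discounted_return(rewards, gamma=GAMMA):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..., accumulated back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```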
Markov assumption
- the effect of an action depends only on the current state
- the current state contains everything needed to predict what happens next; no history has to be remembered
Sutton & Barto V2
- combine transition (probabilistic) and reward function => dynamic function => probabilistic function
- can be transformed into deterministic function by taking the expect value of function above
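In symbols, the Sutton & Barto (2nd edition) notation for the two functions just described:
```latex
% Four-argument dynamics function: transitions and rewards in one probabilistic function
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}

% Expected (deterministic) reward, obtained by taking the expectation over the dynamics
r(s, a) \doteq \mathbb{E}\!\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right]
        = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
```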
Policy
- a feedback (closed-loop) plan, not an open-loop plan
- alternate between policy evaluation and policy improvement (together called policy iteration) to converge to the optimal policy
Policy evaluation: - do some sweeps (no need to run this until convergence) of the Bellman backup for every state according to a deterministic policy
- when computing the value of a state, we take into account all possible actions (weighted by the policy if it is stochastic) and all possible successor states
Policy improvement: - update the policy to the action with the highest action-value in every state (compute the one-step lookahead value of all possible actions in a state and choose the max, do this for all states); see the sketch below
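A sketch of policy evaluation and greedy policy improvement on a tiny hypothetical MDP; P, GAMMA and the action names are the same illustrative assumptions as in the MDP sketch above, not from the flashcards.
```python
GAMMA = 0.9

# P[state][action] = list of (probability, next_state, reward)
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
STATES = list(P)
ACTIONS = ["stay", "go"]

def lookahead(s, a, V):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, V, sweeps=10):
    """Policy evaluation: a few Bellman backups under a deterministic policy
    (no need to run until full convergence)."""
    for _ in range(sweeps):
        for s in STATES:
            V[s] = lookahead(s, policy[s], V)
    return V

def improve(V):
    """Policy improvement: in every state pick the action with the highest
    one-step lookahead value."""
    return {s: max(ACTIONS, key=lambda a: lookahead(s, a, V)) for s in STATES}

policy = {s: "stay" for s in STATES}
V = {s: 0.0 for s in STATES}
for _ in range(10):                 # policy iteration: alternate the two steps
    V = evaluate(policy, V)
    new_policy = improve(V)
    if new_policy == policy:        # policy is stable => stop
        break
    policy = new_policy
print(policy, V)
```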
Value iteration (the extreme case of policy iteration): - truncate policy evaluation to a single backup per state and combine it with policy improvement in one step (back up each state with the max over actions)
- repeatedly sweep through the state space until convergence
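A minimal value-iteration sketch; P is assumed to use the same dictionary format as in the policy-iteration sketch above.
```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # one combined evaluation/improvement backup: max over actions
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:           # keep sweeping the state space until convergence
            return V
```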
Model-based RL
learn the model from experience: keep track of observed state transitions, count them, and compute probabilities for all state transitions; do the same for rewards (see the sketch below)
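A sketch of estimating the model from counts; all names here are illustrative assumptions.
```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = N(s, a, s')
reward_sums = defaultdict(float)                            # summed reward per (s, a)
visits = defaultdict(int)                                   # N(s, a)

def record(s, a, r, s_next):
    """Log one observed transition."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visits[(s, a)] += 1

def estimated_model(s, a):
    """Return (transition probabilities, expected reward) estimated from the counts."""
    n = visits[(s, a)]
    probs = {s2: c / n for s2, c in transition_counts[(s, a)].items()}
    return probs, reward_sums[(s, a)] / n
```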
Model-free RL
estimate the value function / Q-function / policy DIRECTLY, without learning a transition or reward model
On-policy RL
learn the value function of the policy that selects the action
behavior generation is coupled to learning V or Q
Off-policy RL
convergence of the value function is not influenced by the actions actually selected for behavior
behavior generation is decoupled from learning V or Q
Temporal difference (model-free, on-policy)
- do not use any reward/transition function, only policy
- update value function based on SAMPLED EXPERIENCE
- updated value = old value + learning rate * (reward + discounted value of new state - value of old state)
- the term in parentheses is the temporal-difference (TD) error
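A sketch of one TD(0) update from a sampled transition; V is assumed to be a dict of state values, and alpha/gamma are illustrative defaults.
```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # reward + discounted new value - old value
    V[s] += alpha * td_error
    return V
```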
SARSA (model-free, on-policy)
- fix an exploration (behavior) policy
- update the Q function based on this exploration policy (the update equation is very similar to the TD update)
- after convergence we have the final Q function, which we can use to improve the exploration policy
- another approach is to improve the exploration policy WHILE updating the Q function => epsilon-greedy (see the sketch below)
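A sketch of one SARSA step with an epsilon-greedy exploration policy; Q is assumed to be a dict keyed by (state, action), and epsilon/alpha/gamma are illustrative defaults.
```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit the current Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next that the exploration
    policy actually selected in the new state."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q
```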
Q-learning (model-free, off-policy)
the update of the Q function is not influenced by the action actually selected next
updated value = old value + learning rate * (reward + discount * highest Q-value possible from the new state - old value)
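A sketch of one Q-learning update; off-policy because the target bootstraps from the max over actions in the new state, regardless of which action the behavior policy (e.g. epsilon-greedy) picks next. Q is assumed to be a dict keyed by (state, action).
```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # highest value possible from new state
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```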