reinforcement learning Flashcards

1
Q

Sequential Decision Making

A
2
Q

Temporal Credit Assignment problems

A
  • the task is only completed through a sequence of action selections
  • state transitions are triggered by actions
  • rewards are delayed and depend on eventual success / failure
3
Q

Markov Decision Process

A

states, actions, transition function, reward function
goal is to maximize long-term (cumulative) reward
continuing tasks => return can become infinite => need a discount factor
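As a reminder of why the discount is needed (standard notation, which the card itself does not spell out), the discounted return is

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},  with 0 \le \gamma < 1

With bounded rewards and \gamma < 1 this sum stays finite even for continuing tasks.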

4
Q

Markov assumption

A
  • the effect of an action depends only on the current state, not on the history
  • means that everything I need to know is contained in the current observation; there is no need to remember anything
5
Q

Sutton & Barto V2

A
  • combine the (probabilistic) transition function and the reward function into a single dynamics function, which is itself probabilistic
  • it can be turned into a deterministic expected-reward function by taking the expected value of the dynamics function above
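In Sutton & Barto's notation (the symbols are my assumption; the card does not give them), the dynamics function and its expected-reward form are

p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)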
6
Q

Policy

A
  • a feedback plan, not an open-loop plan
  • alternate between policy evaluation and policy improvement (together called policy iteration) to converge to the optimal policy (see the sketch after this list)
    Policy evaluation:
  • do some sweeps (no need to run this until convergence) of the Bellman expectation backup for every state under the current deterministic policy
  • when computing the value of a state, we take the expectation over all possible successor states, weighted by the transition probabilities
    Policy improvement:
  • update the policy to the action with the highest action-value in every state (compute the action-values of all actions in a state, choose the max, and do this for all states)
    Value iteration (the extreme case):
  • do a single policy-evaluation backup and combine it with policy improvement in one step
  • repeatedly sweep through the state space until convergence
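A minimal sketch of both procedures, assuming a tabular MDP stored as P[s][a] = list of (prob, next_state, reward) tuples (this layout and the function names are illustrative, not from the card):

import numpy as np

def policy_evaluation(P, policy, V, gamma=0.9, sweeps=10):
    # A few sweeps of the Bellman expectation backup for a deterministic policy;
    # no need to run this all the way to convergence.
    for _ in range(sweeps):
        for s in range(len(P)):
            a = policy[s]
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return V

def policy_improvement(P, V, gamma=0.9):
    # Make the policy greedy with respect to the current value function.
    policy = []
    for s in range(len(P)):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        policy.append(int(np.argmax(q)))
    return policy

def value_iteration(P, gamma=0.9, theta=1e-6):
    # Extreme case: a single evaluation backup fused with improvement,
    # swept over the whole state space until the values stop changing.
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(len(P[s]))]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V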
7
Q

Model-based RL

A

learn a model from experience: keep track of observed state transitions, count them, compute empirical probabilities for all state transitions, and do the same for rewards
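A minimal sketch of the counting idea, assuming discrete states and actions (variable and function names are illustrative, not from the card):

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = times s' followed (s, a)
reward_sums = defaultdict(float)                # total reward observed for (s, a)
visits = defaultdict(int)                       # number of times (s, a) was tried

def record(s, a, r, s_next):
    # Update the empirical model from one observed transition.
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visits[(s, a)] += 1

def transition_prob(s, a, s_next):
    # Empirical estimate of P(s' | s, a).
    return counts[(s, a)][s_next] / visits[(s, a)]

def expected_reward(s, a):
    # Empirical estimate of r(s, a).
    return reward_sums[(s, a)] / visits[(s, a)]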

8
Q

Model-free RL

A

estimate the value function / Q-value function / policy DIRECTLY, without learning a model of the environment

9
Q

On-policy RL

A

learn the value function of the policy that selects the action
behavior generation is coupled to learning V or Q

10
Q

Off-policy RL

A

convergence of the value function is not influenced by which actions are actually selected
behavior generation is decoupled from learning V or Q

11
Q

Temporal difference (model-free, on-policy)

A
  • does not use any reward or transition model, only a policy
  • update the value function based on SAMPLED EXPERIENCE
  • updated value = old value + learning rate * TD error
  • TD error = reward + discounted value of new state - value of old state
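Written as a formula (standard TD(0) notation; the bracketed term is the TD error from the card):

V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]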
12
Q

SARSA (model-free, on-policy)

A
  • fix an exploration policy
  • update the Q function based on this exploration policy (the update equation is very similar to TD)
  • after convergence we have the final Q function, which we can use to improve the exploration policy
  • another approach is to improve the exploration policy WHILE updating the Q function => epsilon-greedy (see the sketch below)
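A minimal sketch of one SARSA step with an epsilon-greedy exploration policy, assuming a tabular Q stored as a dict of dicts (names and layout are illustrative, not from the card):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise act greedily with respect to Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action the exploration policy actually picks next.
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error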
13
Q

Q-learning (model-free, off-policy)

A

the update of the Q function is not influenced by the action actually selected next (it bootstraps from the max)
updated value = old value + learning rate * (reward + discount * highest Q-value achievable from the new state - old value)
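The same update written as a formula (standard Q-learning notation):

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]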
