Rewards Flashcards
Why would we want to change the reward function for an MDP?
To make the MDP easier to solve (faster, less memory, more tractable) while still learning something close to what would have been learned with the original rewards
How can we change the reward function without changing the optimal policy?
- Multiplying by a (positive) scalar
- Shifting by a scalar (adding)
- Potential-based reward shaping (a non-linear transformation)
What is the new Q function equal to if we multiply the reward function by a positive constant c?
Q'(s,a) = c * Q(s,a)
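A one-line derivation sketch, using the standard discounted-return definition of Q (the constant must satisfy c > 0 so the argmax over actions is preserved):

```latex
Q'(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c\, r_t \,\middle|\, s_0 = s,\ a_0 = a\right]
        = c\,\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t \,\middle|\, s_0 = s,\ a_0 = a\right]
        = c\, Q(s,a)
```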
What is the new Q function equal to if we add a constant c to the reward function?
Q'(s,a) = Q(s,a) + c/(1-gamma)
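Derivation sketch: adding c to every reward adds a geometric series of discounted constants, which is the same for every state-action pair and therefore does not change which action is best:

```latex
Q'(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,(r_t + c)\right]
        = Q(s,a) + c\sum_{t=0}^{\infty} \gamma^{t}
        = Q(s,a) + \frac{c}{1-\gamma}
```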
What is potential based reward shaping? What is the purpose?
Adding a bonus for entering a state and subtracting it (discounted) when that state is exited, i.e., shaping the reward by F(s, s') = gamma*Phi(s') - Phi(s) for some potential function Phi. It is intended to encourage specific behavior (e.g., moving toward a goal) and speed up learning without changing the optimal policy or creating an infinite reward pump.
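A minimal sketch of potential-based shaping in a tabular setting; the corridor example and the potential function `phi` are illustrative assumptions, not part of the cards:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).

    The shaped reward r + F preserves the optimal policy because the
    potentials telescope along any trajectory, so no cycle of states can
    be exploited as an infinite reward pump.
    """
    # Common convention: terminal states get zero potential so the shaping
    # bonus cannot change the value of actually reaching the goal.
    phi_next = 0.0 if done else phi(s_next)
    return r + gamma * phi_next - phi(s)


# Example: 1-D corridor with the goal at state 10; "negative distance to
# goal" as the potential gives a positive bonus for moving toward the goal.
phi = lambda s: -abs(10 - s)
print(shaped_reward(r=0.0, s=3, s_next=4, phi=phi, gamma=0.99))  # ~1.06
```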
What is doing Q-learning with potential-based shaping equivalent to?
Q-learning with the Q-table initialized to the potential function, i.e., Q0(s,a) = Phi(s)
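A sketch of that initialization; the function name and interface are hypothetical, and the equivalence assumes a tabular learner seeing the same experience as the shaped one:

```python
import numpy as np

def init_q_with_potential(n_states, n_actions, phi):
    """Initialize a tabular Q function to the potential: Q0(s, a) = phi(s).

    Under the same experience, tabular Q-learning started from this table
    behaves like Q-learning on the potential-shaped reward, so shaping can
    be replaced by an informed initialization.
    """
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q[s, :] = phi(s)
    return q


q_table = init_q_with_potential(n_states=11, n_actions=2, phi=lambda s: -abs(10 - s))
```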