Generalization Flashcards
What is generalization?
The idea of leveraging information about states we’ve already seen to predict values for states we haven’t yet visited
What are some mappings/functions where you can apply generalization?
Mapping states to actions (a policy)
Value function/Q function
An RL model (transitions/rewards)
How do you represent Q-learning as an update rule?
\Delta w^a_i = \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right) \frac{\partial Q(s,a)}{\partial w^a_i}
The r + \gamma \max_{a'} Q(s',a') portion is the one-step Bellman (TD) target
The (r + \gamma \max_{a'} Q(s',a') - Q(s,a)) term is the TD error
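A minimal sketch of one such update step (Python/numpy style; q, grad_q, and the per-action weight layout are illustrative assumptions, not from the lecture):

```python
def q_update(w, s, a, r, s_next, q, grad_q, actions, alpha=0.1, gamma=0.99):
    """One gradient Q-learning step: w^a_i += alpha * delta * dQ(s,a)/dw^a_i."""
    td_target = r + gamma * max(q(w, s_next, ap) for ap in actions)  # Bellman/TD target
    td_error = td_target - q(w, s, a)                                # delta
    w[a] = w[a] + alpha * td_error * grad_q(w, s, a)                 # step along the gradient
    return w
```

Note this is a semi-gradient update: the TD target is treated as a constant, so no gradient flows through Q(s’,a’).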
What is the setup for a linear value function approximator?
Q(s,a) = \sum_{i=1}^{n} f_i(s) \, w^a_i
Each action gets its own set of n weights. The dot product of the state’s feature vector with an action’s weights gives that action’s Q value; each weight captures the “importance” of its feature in contributing to the action’s value.
Looks like a 1-layer neural net w/o the max
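A minimal sketch of the linear case, reusing q_update from the sketch above (the feature encoding here is made up). Since \partial Q(s,a) / \partial w^a_i = f_i(s) for a linear approximator, the update rule reduces to w^a += \alpha \delta f(s):

```python
import numpy as np

def features(s):
    # A made-up 3-feature encoding of a scalar state, purely for illustration.
    return np.array([1.0, s, s * s])

def q(w, s, a):
    # Linear Q: Q(s,a) = f(s) . w[a], one weight vector per action.
    return features(s) @ w[a]

def grad_q(w, s, a):
    # For a linear approximator, dQ(s,a)/dw^a_i is just f_i(s).
    return features(s)

actions = [0, 1]
w = np.zeros((len(actions), 3))  # n = 3 weights per action
w = q_update(w, s=0.5, a=0, r=1.0, s_next=0.7,
             q=q, grad_q=grad_q, actions=actions)
print(q(w, 0.5, 0))  # Q(s,0) has moved toward the TD target
```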
Does generalization of the value function work in practice?
Kinda; there are some very successful examples:
3-layer backprop (Tesauro; Wang/Dietterich)
CMAC (Sutton)
Linear nets (Singh, Parr)
DQN for Atari (Mnih et al.)
What are the requirements for features when doing value function generalization?
1. They are computationally efficient
2. They are value informative
What is the interplay between the generalization papers by Tesauro, Boyan/Moore, and Sutton?
Tesauro got TD-Gammon to work really well
Boyan/Moore showed simple examples where it seems like it should work well but doesn’t (the approximate values can diverge)
Sutton then showed that those same examples can work if you do the training the right way (e.g., with sparse coarse coding and training on sampled trajectories)
Bonus: Pollack showed that genetic algorithms also work really well for backgammon. So is backgammon just a domain that’s easy to learn, or is TD value function approximation/generalization actually good?
Briefly describe Baird’s counterexample for generalization.
The problem is constructed in a simple, near-tabular form where the true value function is the zero mapping (there are no rewards). Given the feature/weight setup, that value function is exactly representable, in infinitely many ways, so it seems like it should be easy to learn. However, with a bad update sequence the weights blow up to infinity and we learn the wrong value function.
If properly initialized, however (i.e., to 0), the exact answer is sticky: since there is no reward and the problem is deterministic, every TD error is zero and the weights never move.
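A sketch of this kind of divergence, following the version of Baird’s example in Sutton & Barto (2nd ed., Section 11.2); the feature matrix, policies, and constants below come from that presentation rather than the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# 7 states, 8 weights. Upper states i=0..5: V(s_i) = 2*w[i] + w[7].
# Lower state 6: V(s_6) = w[6] + 2*w[7]. All rewards are 0, so V* = 0,
# and w = 0 (among infinitely many settings) represents it exactly.
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

gamma, alpha = 0.99, 0.01
w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0])

s = 0
for step in range(5000):
    # Behavior policy: "dashed" -> a uniformly random upper state (prob 6/7),
    # "solid" -> the lower state (prob 1/7). Target policy: always "solid",
    # so the importance ratio rho = pi(a|s)/b(a|s) is 7 for solid, 0 for dashed.
    solid = rng.random() < 1 / 7
    s_next = 6 if solid else rng.integers(6)
    rho = 7.0 if solid else 0.0
    # Off-policy semi-gradient TD(0); reward is always 0, grad of V(s) is X[s].
    delta = gamma * X[s_next] @ w - X[s] @ w
    w += alpha * rho * delta * X[s]
    s = s_next

print(w)  # the weights grow without bound, even though V* = 0 is representable
```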
What is an averager (a kind of function approximator)?
Function approximators that estimate the value at a new point as a convex combination of the values at known (seen) points, i.e., they interpolate and can never produce a value outside the range of the known values.
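A minimal sketch of one averager (the inverse-distance weighting is one arbitrary choice; the names are made up):

```python
import numpy as np

def averager_value(s, anchor_states, anchor_values, eps=1e-8):
    """V(s) = sum_i beta_i(s) * V(s_i), with beta_i >= 0 and sum_i beta_i = 1."""
    dists = np.linalg.norm(anchor_states - s, axis=1)
    weights = 1.0 / (dists + eps)      # closer anchors count more
    beta = weights / weights.sum()     # normalize -> convex combination
    return beta @ anchor_values

# Anchors at x=0 and x=1 with values 0 and 10: a query halfway between them
# interpolates to 5, and no query can leave the [0, 10] range of the anchors.
anchors = np.array([[0.0], [1.0]])
values = np.array([0.0, 10.0])
print(averager_value(np.array([0.5]), anchors, values))  # -> 5.0
```

Because the output is always a convex combination, averagers are non-expansions, which is why they combine stably with dynamic-programming-style updates (Gordon, 1995).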