Convergence Flashcards
What is the difference between RL with control and without?
RL with control involves the agent selecting actions in order to learn a policy; without control, the agent only predicts values (it evaluates a fixed policy).
What is the Bellman equation with actions? How does this relate to a Bellman operator?
Q(s,a) = R(s,a) + gamma * Sum_s' [ T(s,a,s') * max_a' Q(s',a') ]
[BQ](s,a) = the right-hand side above; B is the operator that maps any Q-function to a new Q-function of this form.
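A minimal sketch of applying this operator to a tabular Q-function, assuming numpy arrays R of shape [S, A] and T of shape [S, A, S] for a small hypothetical MDP (names and shapes are illustrative, not from the cards):

```python
import numpy as np

def bellman_operator(Q, R, T, gamma):
    """Return BQ, where [BQ](s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')."""
    V = Q.max(axis=1)                                 # max_a' Q(s', a') for every next state s'
    return R + gamma * np.einsum('sap,p->sa', T, V)   # expectation over s', then discount
```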
What do Q* = BQ* and Q_t = BQ_{t-1} describe in terms of the Bellman operator B?
Q* = BQ* is the Bellman equation (Q* is a fixed point of B); Q_t = BQ_{t-1} is value iteration.
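A sketch of the value-iteration reading Q_t = BQ_{t-1}, iterated until the sup-norm change falls below a tolerance (the tolerance and the zero initialization are arbitrary choices, not from the cards):

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-8):
    """Iterate Q_t = B Q_{t-1} until ||Q_t - Q_{t-1}||_inf < tol."""
    Q = np.zeros_like(R)                              # any bounded initial Q works
    while True:
        # One application of the Bellman operator B.
        Q_next = R + gamma * np.einsum('sap,p->sa', T, Q.max(axis=1))
        if np.max(np.abs(Q_next - Q)) < tol:          # sup-norm stopping test
            return Q_next
        Q = Q_next
```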
What is a contraction mapping?
B is an operator (it maps value functions to value functions).
If, for all Q-functions F, G and some 0 <= gamma < 1,
||BF - BG||_inf <= gamma * ||F - G||_inf
then B is a contraction mapping (spot-checked numerically below).
Note: ||Q||_inf = max_{s,a} |Q(s,a)| (the sup norm).
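A quick numeric spot-check of this definition for the Bellman operator on a randomly generated MDP (the sizes, seed, and gamma are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

R = rng.normal(size=(S, A))
T = rng.random(size=(S, A, S))
T /= T.sum(axis=2, keepdims=True)            # normalize rows into transition probabilities

def B(Q):
    # The Bellman operator from the earlier card.
    return R + gamma * np.einsum('sap,p->sa', T, Q.max(axis=1))

def sup_norm(X):
    # ||X||_inf = max_{s,a} |X(s,a)|
    return np.max(np.abs(X))

F, G = rng.normal(size=(S, A)), rng.normal(size=(S, A))
print(sup_norm(B(F) - B(G)) <= gamma * sup_norm(F - G))   # expected: True
```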
What are the properties of a contraction mapping?
If B is a contraction mapping:
- F* = BF* has a unique solution (a short argument is sketched below)
- F_t = BF_{t-1} => F_t -> F* from any starting F_0 (i.e. value iteration converges)
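The uniqueness argument, written in LaTeX (this derivation is not on the original card; it is the standard fixed-point reasoning): if F* and G* are both fixed points, the contraction inequality forces them to coincide.

```latex
\|F^* - G^*\|_\infty = \|BF^* - BG^*\|_\infty \le \gamma\,\|F^* - G^*\|_\infty
\;\Rightarrow\; (1-\gamma)\,\|F^* - G^*\|_\infty \le 0
\;\Rightarrow\; F^* = G^* \quad (\text{since } \gamma < 1).
```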
T/F: Is the max operator a non-expansion?
True. Proof sketch: assume WLOG that max_a f(a) >= max_a g(a). Then
|max_a f(a) - max_a g(a)| = max_a f(a) - max_a g(a)
= f(a1) - g(a2)      (where a1 maximizes f and a2 maximizes g)
<= f(a1) - g(a1)     (since g(a2) = max_a g(a) >= g(a1))
<= |f(a1) - g(a1)|
<= max_a |f(a) - g(a)|
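A quick randomized spot-check of the same claim (purely illustrative; the vector size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    f, g = rng.normal(size=5), rng.normal(size=5)
    # |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)|
    assert abs(f.max() - g.max()) <= np.abs(f - g).max()
print("max acted as a non-expansion on every sampled pair")
```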
How can we generalize MDPs? Do they converge? If so, to what?
Using the generalized Bellman equation, we can replace the operators that summarize over next states (the probability-weighted sum) and over actions (the max) with other operators.
Value iteration on the generalized equation still converges as long as the replacement operators are non-expansions.
It converges to some fixed point determined by the chosen operators, not necessarily Q*.
e.g. we can take the minimum (instead of the expectation) over the outcomes of the maximizing action to get a risk-averse MDP model (sketched below).
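A sketch of what a generalized Bellman operator might look like with a pluggable next-state summary; passing the worst-case (min) summary instead of the expectation gives the risk-averse variant mentioned above. Function names and shapes are assumptions for illustration:

```python
import numpy as np

def expected_next_value(T, V):
    # Standard summary: sum_s' T(s,a,s') * V(s')
    return np.einsum('sap,p->sa', T, V)

def worst_case_next_value(T, V):
    # Risk-averse summary: min of V(s') over next states reachable under (s, a)
    reachable = np.where(T > 0, V[None, None, :], np.inf)
    return reachable.min(axis=2)

def generalized_bellman(Q, R, T, gamma, summarize):
    """[BQ](s,a) = R(s,a) + gamma * summarize_s'(V), with V(s') = max_a' Q(s',a')."""
    V = Q.max(axis=1)                      # still maximize over actions
    return R + gamma * summarize(T, V)     # pluggable non-expansion summary over s'
```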
What are some examples of non-expansion operators?
Order statistics: max, min, etc.
A fixed convex combination (weights fixed in advance and summing to 1; see the spot-check below)
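A matching spot-check that a fixed convex combination is a non-expansion (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.random(5)
w /= w.sum()                               # fixed convex weights: w_i >= 0, sum(w) = 1
for _ in range(1000):
    f, g = rng.normal(size=5), rng.normal(size=5)
    assert abs(w @ f - w @ g) <= np.abs(f - g).max()
print("fixed convex combination acted as a non-expansion on every sampled pair")
```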