Model-free control Flashcards
In what MDP scenarios would we typically use model-free control?
1) The MDP is unknown.
2) The MDP is too large to solve exactly, so we can only sample from it.
How does the model-free control problem differ from the previous problems?
We no longer aim to learn V(s); instead we learn the action-value function Q(s,a), because greedy improvement over V(s) would require a model of the MDP.
How do we perform policy improvement in the MC model-free control setting?
We greedily choose actions with respect to Q(s,a), so pi'(s) = argmax_a Q(s,a). This differs from the model-based approach, which needs the transition and reward model: pi'(s) = argmax_a (R_s^a + sum_s' P_ss'^a V(s')).
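A minimal sketch of greedy policy improvement from a tabular Q, assuming Q is a dict mapping (state, action) to a value and actions(s) is a hypothetical helper returning the actions available in s:

```python
def greedy_policy(Q, actions):
    """Return pi'(s) = argmax_a Q(s, a); no model of the MDP is needed."""
    def pi(s):
        # Greedy improvement: pick the action with the highest estimated value.
        return max(actions(s), key=lambda a: Q[(s, a)])
    return pi
```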
What assumptions do we need for MC methods to accurately predict V_pi and Q_pi?
1) Infinitely many episodes.
2) Exploring starts: every possible starting state-action pair is used infinitely many times.
Prove that the policy improvement theorem holds for MC policy improvement
Q_{pi_k}(s, pi_{k+1}(s)) = Q_{pi_k}(s, argmax_a Q_{pi_k}(s, a))
= max_a Q_{pi_k}(s, a)
>= Q_{pi_k}(s, pi_k(s)) = V_{pi_k}(s),
which is exactly the condition of the policy improvement theorem, so pi_{k+1} is at least as good as pi_k.
What are on-policy and off-policy methods?
On-policy: improve pi by sampling from pi itself.
Off-policy: improve pi while sampling from another policy pi'.
Name one on-policy method:
Epsilon-greedy (a soft policy method). With probability epsilon it selects an action uniformly at random, so each non-greedy action is chosen with probability epsilon/(number of actions), and the greedy action with probability 1 - epsilon + epsilon/(number of actions).
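A minimal sketch of epsilon-greedy action selection, under the same assumptions (tabular Q as a dict, hypothetical actions(s) helper returning a list):

```python
import random

def epsilon_greedy(Q, actions, s, epsilon):
    # With probability epsilon pick uniformly at random (exploration);
    # otherwise pick the greedy action w.r.t. Q (exploitation).
    acts = actions(s)
    if random.random() < epsilon:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(s, a)])
```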
What is the GLIE property?
GLIE (Greedy in the Limit with Infinite Exploration) tells us when a schedule for updating the exploration parameter is sufficient to ensure convergence: all state-action pairs must be explored infinitely often, and the policy must converge to a greedy policy.
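For example, epsilon-greedy with epsilon_k = 1/k satisfies GLIE: exploration never stops entirely, yet the policy becomes greedy in the limit. A minimal sketch of such a schedule (the episode index k >= 1 is an assumption):

```python
def glie_epsilon(k):
    # Decaying exploration: epsilon_k = 1/k -> 0, so the policy becomes
    # greedy in the limit while every action keeps a nonzero probability.
    return 1.0 / k
```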
What is MC control batch learning?
Run a batch of episodes (traces) before updating the policy.
How is Q updated in SARSA?
Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
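A minimal sketch of a single SARSA update, assuming a tabular Q (dict from (state, action) to value); alpha and gamma are the step size and discount factor:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy TD target: uses the action A' actually chosen by the
    # current (e.g. epsilon-greedy) policy in the successor state S'.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```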
Under which conditions does SARSA converge?
1) The sequence of policies is GLIE.
2) The step sizes alpha_t satisfy the Robbins-Monro conditions:
sum_t(alpha_t) = inf,
sum_t(alpha_t^2) < inf
What is the target policy and behaviour policy in off-policy methods?
The target policy is the policy we "intend to use": we want to learn the value of the target policy. The behaviour policy is the policy that generates the experience we learn from.
What is the assumption of coverage in off-policy methods?
We require that each action taken under the target policy is also taken, at least sometimes, under the behaviour policy:
pi(s,a) > 0 => pi'(s,a) > 0
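A minimal sketch of a coverage check for tabular policies, assuming both policies are dicts mapping (state, action) to probabilities:

```python
def has_coverage(target_pi, behaviour_pi):
    # pi(s,a) > 0 must imply pi'(s,a) > 0 for every state-action pair.
    return all(behaviour_pi.get(sa, 0.0) > 0
               for sa, p in target_pi.items() if p > 0)
```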
What is the Q-learning algorithm?
We choose the next action from the behaviour policy, e.g. epsilon-greedy, but we bootstrap with the successor action given by the (greedy) target policy:
Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))
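A minimal sketch of a single Q-learning update, under the same tabular assumptions (dict Q, hypothetical actions(s) helper):

```python
def q_learning_update(Q, actions, s, a, r, s_next, alpha, gamma):
    # Off-policy TD target: bootstrap with the greedy (target-policy) action
    # in S', regardless of what the behaviour policy actually does next.
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions(s_next))
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```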