Model-free control Flashcards
In what MDP scenarios would we typically use model-free control?
1) The MDP is unknown.
2) The MDP is too large to solve exactly, so we can only sample from it.
How does the model-free control problem differ from the previous problems?
We no longer aim to learn V(s); instead we learn the action-value function Q(s,a), because greedy improvement over V(s) would require a model of the MDP.
How do we perform policy improvement in the MC model-free control setting?
We greedily choose actions with respect to Q(s,a), so pi'(s) = argmax_a Q(s,a). This differs from the model-based approach, which needs the transition and reward model: pi'(s) = argmax_a (R_s^a + sum_s' P_ss'^a V(s')).
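A minimal sketch of greedy policy improvement from a tabular Q, assuming Q is a dict mapping (state, action) to a value and actions(s) is a hypothetical helper returning the actions available in s:

```python
def greedy_policy(Q, actions):
    """Return pi'(s) = argmax_a Q(s, a); no model of the MDP is needed."""
    def pi(s):
        # Greedy improvement: pick the action with the highest estimated value.
        return max(actions(s), key=lambda a: Q[(s, a)])
    return pi
```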
What assumptions do we need for MC methods to accurately predict V_pi and Q_pi?
1) Infinitely many episodes.
2) Exploring starts: every possible starting state-action pair is used infinitely many times.
Prove that the policy improvement theorem holds for MC policy improvement
Q_{pi_k}(s, pi_{k+1}(s)) = Q_{pi_k}(s, argmax_a Q_{pi_k}(s, a))
= max_a Q_{pi_k}(s, a)
>= Q_{pi_k}(s, pi_k(s)) = V_{pi_k}(s),
which is exactly the condition of the policy improvement theorem, so pi_{k+1} is at least as good as pi_k.
What are on-policy and off-policy methods?
On-policy: improve pi by sampling from pi itself.
Off-policy: improve pi while sampling from another policy pi'.
Name one on-policy method:
Epsilon-greedy (a soft policy method). With probability epsilon it selects an action uniformly at random, so each non-greedy action is chosen with probability epsilon/(number of actions), and the greedy action with probability 1 - epsilon + epsilon/(number of actions).
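A minimal sketch of epsilon-greedy action selection, under the same assumptions (tabular Q as a dict, hypothetical actions(s) helper returning a list):

```python
import random

def epsilon_greedy(Q, actions, s, epsilon):
    # With probability epsilon pick uniformly at random (exploration);
    # otherwise pick the greedy action w.r.t. Q (exploitation).
    acts = actions(s)
    if random.random() < epsilon:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(s, a)])
```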
What is the GLIE property?
GLIE (Greedy in the Limit with Infinite Exploration) tells us when a schedule for updating the exploration parameter is sufficient to ensure convergence: all state-action pairs must be explored infinitely often, and the policy must converge to a greedy policy.
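For example, epsilon-greedy with epsilon_k = 1/k satisfies GLIE: exploration never stops entirely, yet the policy becomes greedy in the limit. A minimal sketch of such a schedule (the episode index k >= 1 is an assumption):

```python
def glie_epsilon(k):
    # Decaying exploration: epsilon_k = 1/k -> 0, so the policy becomes
    # greedy in the limit while every action keeps a nonzero probability.
    return 1.0 / k
```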
What is MC control batch learning?
Run a batch of episodes (traces) before updating the policy.
How is Q updated in SARSA?
Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
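A minimal sketch of a single SARSA update, assuming a tabular Q (dict from (state, action) to value); alpha and gamma are the step size and discount factor:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy TD target: uses the action A' actually chosen by the
    # current (e.g. epsilon-greedy) policy in the successor state S'.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```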
Under which conditions does SARSA converge?
1) The sequence of policies is GLIE.
2) The step sizes alpha_t satisfy the Robbins-Monro conditions:
sum_t(alpha_t) = inf,
sum_t(alpha_t^2) < inf
What is the target policy and behaviour policy in off-policy methods?
The target policy is the policy we "intend to use": we want to learn the value of the target policy. The behaviour policy is the policy that generates the experience we learn from.
What is the assumption of coverage in off-policy methods?
We require that each action taken under the target policy is also taken, at least sometimes, under the behaviour policy:
pi(s,a) > 0 => pi'(s,a) > 0
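A minimal sketch of a coverage check for tabular policies, assuming both policies are dicts mapping (state, action) to probabilities:

```python
def has_coverage(target_pi, behaviour_pi):
    # pi(s,a) > 0 must imply pi'(s,a) > 0 for every state-action pair.
    return all(behaviour_pi.get(sa, 0.0) > 0
               for sa, p in target_pi.items() if p > 0)
```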
What is the Q-learning algorithm?
We choose the next action from the behaviour policy, e.g. epsilon-greedy, but we bootstrap with the successor action given by the (greedy) target policy:
Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))
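A minimal sketch of a single Q-learning update, under the same tabular assumptions (dict Q, hypothetical actions(s) helper):

```python
def q_learning_update(Q, actions, s, a, r, s_next, alpha, gamma):
    # Off-policy TD target: bootstrap with the greedy (target-policy) action
    # in S', regardless of what the behaviour policy actually does next.
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions(s_next))
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```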