Model free control Flashcards

1
Q

In what MDP scenarios would we typically use model free control?

A

1) The MDP is unknown, but experience can be sampled from it.

2) The MDP is known, but it is too large to use directly, except by sampling.

2
Q

How does the model free control problem differ from the previous problems?

A

We no longer aim to learn V(s); instead we learn Q(s,a), because greedy policy improvement over V(s) requires a model of the MDP, whereas greedy improvement over Q(s,a) is model-free.

3
Q

How do we perform policy improvement in the MC model-free control setting?

A

We greedily choose actions with respect to Q(s,a): pi'(s) = argmax_a Q(s,a). This differs from the model-based approach, where pi'(s) = argmax_a (R_s^a + sum_s' P_ss'^a V(s')), which requires knowing the dynamics P and rewards R.
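
As a rough illustration (not part of the original card), greedy improvement from a tabular Q might look like the sketch below; Q, states and actions are hypothetical names for a dict keyed by (state, action), the state set, and the action set.

```python
def greedy_policy(Q, states, actions):
    """Greedy policy improvement: pi'(s) = argmax_a Q(s, a) for every state."""
    pi = {}
    for s in states:
        # pick the action with the highest estimated action-value in s
        pi[s] = max(actions, key=lambda a: Q[(s, a)])
    return pi
```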

4
Q

What assumptions do we need for MC methods to accurately predict V_pi and Q_pi?

A

1) Infinitely many episodes.

2) Exploring starts: every state-action pair is used as a starting point infinitely many times.

5
Q

Prove that the policy improvement theorem holds for MC policy improvement

A

Q_{pi_k}(s, pi_{k+1}(s)) = Q_{pi_k}(s, argmax_a Q_{pi_k}(s, a))
= max_a Q_{pi_k}(s, a)
>= Q_{pi_k}(s, pi_k(s))
= V_{pi_k}(s)

Hence, by the policy improvement theorem, pi_{k+1} >= pi_k.

6
Q

What are on-policy and off-policy methods?

A

On-policy: improve pi using episodes sampled from pi itself.

Off-policy: improve pi using episodes sampled from a different policy pi'.

7
Q

Name one on-policy method.

A

ε-greedy (a soft policy). Each non-greedy action is selected with probability ε/(number of actions), and the greedy action with probability 1 - ε + ε/(number of actions).
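
A minimal sketch of ε-greedy action selection, assuming a tabular Q stored as a dict keyed by (state, action) and a list of actions (hypothetical names):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore: with probability epsilon, pick any action uniformly,
    # so every action keeps at least epsilon / len(actions) probability.
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit: otherwise pick the greedy action.
    return max(actions, key=lambda a: Q[(state, a)])
```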

8
Q

What is the GLIE property?

A

The GLIE (Greedy in the Limit with Infinite Exploration) criterion tells us when a schedule for updating the exploration parameter is sufficient to ensure convergence: every state-action pair is explored infinitely often, and the policy converges to a greedy policy. A common example is ε-greedy with ε_k = 1/k.
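
A small sketch of the ε_k = 1/k schedule mentioned above (the episode counter k is a hypothetical variable):

```python
def glie_epsilon(k):
    """Epsilon for episode k under the 1/k GLIE schedule: exploration never
    stops entirely, but the policy becomes greedy in the limit."""
    return 1.0 / k
```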

9
Q

What is MC control batch learning?

A

Run a batch of traces (episodes) before updating the policy, rather than updating after every episode.

10
Q

How is Q updated in SARSA?

A

Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
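
A sketch of this one-step SARSA update, assuming a tabular Q stored as a dict keyed by (state, action); alpha and gamma are the step size and discount (hypothetical names):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Move Q(S,A) towards the SARSA target R + gamma * Q(S', A')."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```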

11
Q

Under which conditions does SARSA converge?

A

1) A GLIE sequence of policies.
2) A Robbins-Monro sequence of step sizes alpha_t:
sum_t(alpha_t) = inf,
sum_t(alpha_t^2) < inf

12
Q

What are the target policy and behaviour policy in off-policy methods?

A

The target policy is the policy we intend to use and whose value we want to learn. The behaviour policy is the policy that generates the experience we learn from.

13
Q

What is the assumption of coverage in off-policy methods?

A

We require that every action taken under the target policy pi is also taken, at least occasionally, under the behaviour policy pi':
pi(s,a) > 0 => pi'(s,a) > 0

14
Q

What is the Q-learning algorithm?

A

We choose the next action from the behaviour policy, e.g. ε-greedy, but we bootstrap the successor action from the (greedy) target policy:
Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
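
A sketch of this one-step Q-learning update, under the same tabular-Q assumptions as the SARSA sketch above (dict keyed by (state, action); hypothetical names):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Bootstrap from the greedy (target-policy) action in S',
    regardless of which action the behaviour policy takes next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```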
