Model-free learning Flashcards

1
Q

Does MC use bootstrapping?

A

No, MC learns from complete episodes, no bootstrapping.

2
Q

What is the idea behind the Monte Carlo approach for policy evaluation?

A

Our goal is to estimate the value function for each state. The idea of MC is to estimate that value function (the expected return) by averaging the returns of sampled traces (empirical means).

3
Q

Describe the first-visit Monte Carlo algorithm for policy evaluation.

A

For each trace t:
  for all s in t:
    1) append the return from the first appearance of s in t to Returns(s)
    2) set V(s) = average(Returns(s))
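A minimal Python sketch of this procedure (the function name and the (state, reward) trace format are assumptions for illustration, not part of the card):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    # episodes: list of traces, each trace a list of (state, reward) pairs
    returns = defaultdict(list)              # Returns(s)
    for episode in episodes:
        # returns-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}
        G = 0.0
        returns_to_go = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_to_go[t] = G
        # 1) append the return only from the FIRST appearance of each state
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(returns_to_go[t])
    # 2) V(s) = average(Returns(s))
    return {s: sum(g) / len(g) for s, g in returns.items()}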

4
Q

What is the difference between the first-visit and every-visit methods of Monte Carlo?

A

Every-visit calculates a return each time s appears in a trace and averages all of them. First-visit only uses the return from the first time s is visited in the trace.
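The only change to the first-visit sketch above would be dropping the 'seen' check, so a return is recorded for every occurrence of the state:

        # every-visit variant: record a return for EVERY occurrence of the state
        for t, (state, _) in enumerate(episode):
            returns[state].append(returns_to_go[t])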

5
Q

What is the formula for a running mean?

A

mu_new = mu + alpha*(x - mu)
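As a sketch in Python (illustrative names):

def running_mean_update(mu, x, alpha):
    # move the current estimate a step of size alpha toward the new sample x
    return mu + alpha * (x - mu)

With alpha = 1/k (k = number of samples seen so far) this reproduces the exact sample mean; with a constant alpha it becomes an exponentially weighted mean that favours recent samples.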

6
Q

When should we use the running mean instead of the actual mean?

A

If the world is non-stationary, the running mean incorporates the effect that “old” episodes count less than new episodes. Also, with the running mean we don’t have to store k_s, the number of times we have “seen” s.

7
Q

What is the main advantage of TD (Temporal Difference) learning?

A

It combines sampling (from Monte Carlo) with bootstrapping (from dynamic programming).

8
Q

What is the value update method for TD?

A

V(S_t) = V(S_t) + alpha*(r_t + gamma*V(S_t+1) - V(S_t))
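A minimal sketch of this update in Python, assuming a tabular value function stored in a dict (names are illustrative):

def td0_update(V, s, r, s_next, alpha, gamma):
    # TD target: r + gamma * V(s'); TD error: target - V(s)
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return V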

9
Q

What is the:

1) Temporal difference error?
2) Temporal difference target?

A

1) r_t + gamma*V(S_t+1) - V(S_t)

2) r_t + gamma*V(S_t+1)
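For example (values assumed purely for illustration): with r_t = 1, gamma = 0.9, V(S_t+1) = 2 and V(S_t) = 1.5, the TD target is 1 + 0.9*2 = 2.8 and the TD error is 2.8 - 1.5 = 1.3.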

10
Q

What are the main advantages of TD?

A

1) TD can learn before knowing the final outcome
2) TD can learn without a final outcome
3) TD can learn from incomplete episodes

11
Q

How do the bias and variance of TD and MC compare?

A

MC high variance, no bias

TD low variance, some bias

12
Q

Give a comparison of MC and TD.

A

MC:
1) Good convergence
2) Good convergence with function approximation
3) Not very sensitive to initial values
4) Simple
5) Usually more efficient in non-Markov environments
TD:
1) Usually more efficient than MC
2) Converges to V_pi(S)
3) Convergence not guaranteed with function approximation
4) More sensitive to initial values
5) Usually more efficient in Markov environments
