Model-free learning Flashcards

1
Q

Does MC use bootstrapping?

A

No, MC learns from complete episodes, no bootstrapping.

2
Q

What is the idea behind the Monte Carlo approach for policy evaluation?

A

Our goal is to estimate the value function for each state. The idea of MC is to estimate that value function (the expected return) by averaging the returns of sampled traces (empirical means).

3
Q

Describe the first-visit Monte Carlo algorithm for policy evaluation.

A

For each trace t:
  for all s in t:
    1) append the return from the first appearance of s in t to Returns(s)
    2) set V(s) = average(Returns(s))
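A minimal Python sketch of this procedure (the function name and the (state, reward) trace format are assumptions for illustration, not part of the card):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    # episodes: list of traces, each trace a list of (state, reward) pairs
    returns = defaultdict(list)              # Returns(s)
    for episode in episodes:
        # returns-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}
        G = 0.0
        returns_to_go = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_to_go[t] = G
        # 1) append the return only from the FIRST appearance of each state
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(returns_to_go[t])
    # 2) V(s) = average(Returns(s))
    return {s: sum(g) / len(g) for s, g in returns.items()}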

4
Q

What is the difference between the first-visit and every-visit methods of Monte Carlo?

A

Every-visit calculates a return each time s appears in a trace and averages all of them. First-visit only uses the return from the first time s is visited in the trace.
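The only change to the first-visit sketch above would be dropping the 'seen' check, so a return is recorded for every occurrence of the state:

        # every-visit variant: record a return for EVERY occurrence of the state
        for t, (state, _) in enumerate(episode):
            returns[state].append(returns_to_go[t])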

5
Q

What is the formula for a running mean?

A

mu_new = mu + alpha*(x - mu)
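As a sketch in Python (illustrative names):

def running_mean_update(mu, x, alpha):
    # move the current estimate a step of size alpha toward the new sample x
    return mu + alpha * (x - mu)

With alpha = 1/k (k = number of samples seen so far) this reproduces the exact sample mean; with a constant alpha it becomes an exponentially weighted mean that favours recent samples.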

6
Q

When should we use the running mean instead of the actual mean?

A

If the world is non-stationary, the running mean incorporates the effect that “old” episodes count less than new episodes. Also, with the running mean we don’t have to store k_s, the number of times we have “seen” s.

7
Q

What is the main advantage of TD (Temporal Difference) learning?

A

It combines sampling (from Monte Carlo) with bootstrapping (from dynamic programming).

8
Q

What is the value update method for TD?

A

V(S_t) = V(S_t) + alpha*(r_t + gamma*V(S_t+1) - V(S_t))
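A minimal sketch of this update in Python, assuming a tabular value function stored in a dict (names are illustrative):

def td0_update(V, s, r, s_next, alpha, gamma):
    # TD target: r + gamma * V(s'); TD error: target - V(s)
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return V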

9
Q

What is the:

1) Temporal difference error?
2) Temporal difference target?

A

1) r_t + gamma*V(S_t+1) - V(S_t)

2) r_t + gamma*V(S_t+1)
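For example (values assumed purely for illustration): with r_t = 1, gamma = 0.9, V(S_t+1) = 2 and V(S_t) = 1.5, the TD target is 1 + 0.9*2 = 2.8 and the TD error is 2.8 - 1.5 = 1.3.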

10
Q

What are the main advantages of TD?

A

1) TD can learn before knowing the final outcome
2) TD can learn without a final outcome
3) TD can learn from incomplete episodes

11
Q

How do the bias and variance of TD and MC compare?

A

MC high variance, no bias

TD low variance, some bias

12
Q

Give a comparison of MC and TD.

A

MC:
1) Good convergence
2) Good convergence with function approximation
3) Not very sensitive to initial values
4) Simple
5) Usually more efficient in non-Markov environments
TD:
1) Usually more efficient than MC
2) Converges to V_pi(S)
3) Convergence not guaranteed with function approximation
4) More sensitive to initial values
5) Usually more efficient in Markov environments
