Exam Flashcards

1
Q

What are the key similarities and differences between DP and TD methods?

A

Similarities:

  1. Both are used to find/estimate V(s), Q(s,a) and/or optimal policies
  2. Both use bootstrapping to update the value function and/or policy iteratively

Differences:

  1. DP requires a model of the MDP (P and R), whereas TD is model-free and learns from sampled experience
2
Q

What are the analogous TD methods to:

1) Policy evaluation
2) Policy iteration
3) Value iteration

A

1) TD learning
2) SARSA
3) Q-learning

3
Q

For an MDP, what are the:

1) Bellman equation for V(s)?
2) Bellman optimality equation for V(s)?

A

1) V(s) = Sum_a pi(s,a) Sum_s' P(s,a,s') (r(s,a,s') + y V(s'))
2) V(s) = max_a Sum_s' P(s,a,s') (r(s,a,s') + y V(s'))
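
A minimal sketch of both backups in Python, assuming a hypothetical model format P[s][a] = [(prob, next_state, reward), ...] and a stochastic policy table pi[s][a]; these names are illustrative, not from the course material:

def bellman_expectation_backup(V, P, pi, s, gamma=0.9):
    """One application of the Bellman equation for V_pi at state s."""
    return sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])

def bellman_optimality_backup(V, P, s, gamma=0.9):
    """One application of the Bellman optimality equation at state s."""
    return max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])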

4
Q

What criteria does a problem need to have to be solvable by DP?

A

1) Optimal substructure
The solution is a combination of optimal solutions to subproblems
2) Overlapping subproblems
A recursive algorithm for the problem ends up solving “the same” subproblems over and over again

5
Q

What is the policy improvement theorem?

A

Let pi, pi' denote deterministic policies.
If Q_pi(s, pi'(s)) >= V_pi(s) for all s, then pi' is as good as or better than pi:
V_pi'(s) >= V_pi(s) for all s.

6
Q

What is value iteration?

A

Policy iteration in which policy evaluation is truncated to a single sweep (one Bellman optimality backup per state) before each greedy improvement step.
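
A minimal sketch in Python, reusing the assumed model format P[s][a] = [(prob, next_state, reward), ...] from the Bellman card and assuming every successor state appears as a key of P (illustrative names only):

def value_iteration(P, gamma=0.9, theta=1e-6):
    """Sweep the Bellman optimality backup until the values stop changing, then act greedily."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the converged values
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in P}
    return V, policy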

7
Q

What is the idea behind Monte Carlo methods?

A

Estimate true values (expected returns) by averaging sampled returns.

8
Q

When can we use TD, but not MC?

A

TD:

1) TD can learn from incomplete episodes
2) TD can learn online after each step
3) TD works in non-terminating environments

MC:

1) MC needs a complete episode
2) MC has to wait until the end of an episode to learn
3) MC does not work in non-terminating environments

9
Q

Compare the properties of TD and MC

A

TD:

1) Has low variance, some bias
2) Good convergence, but not necessarily with function approximation
3) Usually works best in Markov environments
4) More sensitive to the initial values

MC:

1) Has high variance, no bias
2) Good convergence even with function approximation
3) Usually works best in non-Markov environments
4) Easy to understand and use

10
Q

Compare on-policy and off-policy methods

A

On-policy methods use a single policy both to select actions during sample collection and as the policy being improved.

Off-policy methods use separate target and behaviour policies. The behaviour policy determines which actions to take during sample collection, while the target policy is the policy we want to improve.

11
Q

What is the GLIE (Greedy in the Limit with Infinite Exploration) condition?

A

Let N_k(s,a) be the number of times the state-action pair (s,a) has been visited after k iterations. Then the GLIE conditions are:

1) lim(k->inf) N_k(s,a) = inf for all s,a
2) lim(k->inf) pi_k(s,a) = the greedy policy

12
Q

Name one choice of e_k that makes e-greedy GLIE (k denotes the number of iterations done so far).

A

e_k = 1/k

13
Q

What is the update for Monte Carlo model free control?

A

Q(s,a) = Average(Returns(s,a))
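
A hedged sketch of the incremental (running-average) version in Python, assuming an episode is a list of (state, action, reward) tuples; the names are illustrative only:

from collections import defaultdict

def mc_control_update(Q, N, episode, gamma=1.0):
    """Every-visit Monte Carlo update: Q(s,a) becomes the running average of observed returns."""
    G = 0.0
    for s, a, r in reversed(episode):  # walk the episode backwards, accumulating the return G
        G = r + gamma * G
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # incremental average
    return Q, N

# Usage: Q = defaultdict(float); N = defaultdict(int); call once after each finished episode.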

14
Q

Under what conditions does SARSA converge?

A

The policy sequence is GLIE and the step sizes alpha_t form a Robbins-Monro sequence.

15
Q

What is a Robbins-Monro sequence?

A

A sequence a_t is a Robbins-Monro sequence if:

1) Sum a_t = inf
2) Sum a_t^2 < inf
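
For example, a_t = 1/t satisfies both conditions: Sum 1/t diverges, while Sum 1/t^2 converges (to pi^2/6).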

16
Q

What is the assumption of coverage in off-policy methods?

A

Every action that has non-zero probability under the target policy must also have non-zero probability under the behaviour policy (in the same state).

17
Q

What is the SARSA update?

A

Q(s,a) = Q(s,a) + alpha (r + y Q(s', a') - Q(s,a))
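
A minimal sketch of this update in Python, assuming Q is a defaultdict(float) keyed by (state, action); the names are illustrative assumptions, not a fixed interface:

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD(0) control: bootstrap on the action a_next actually chosen by the behaviour policy."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q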

18
Q

What is the Q-learning update?

A

Q(s,a) = Q(s,a) + alpha (r + y max_a' Q(s', a') - Q(s,a))
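
And the corresponding off-policy sketch, bootstrapping on the greedy action in s' regardless of what the behaviour policy actually does (the argument actions is an assumed list of available actions, Q an assumed defaultdict(float)):

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy TD(0) control: bootstrap on the best action in s_next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q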

19
Q

What properties do we want from a function approximator?

A

1) The approximation should generalize to unseen states/actions
2) We should be able to update our approximation using MC or TD
3) Our approximation should be close to the true values

20
Q

Name some function approximators

A

1) Linear combinations of features
2) Neural networks
3) Decision trees
4) Fourier/wavelet bases
5) Nearest neighbour

21
Q

How do we train our function approximator?

A

We minimize the MSE between our approximation and the target value (e.g. the MC return or TD target) using (stochastic) gradient descent on the approximator's parameters.
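
A sketch for the linear case v_hat(s, w) = w . x(s), where x(s) is an assumed feature function and the target comes from MC or TD (illustrative names only):

import numpy as np

def sgd_value_update(w, x_s, target, alpha=0.01):
    """One (semi-)gradient step on 0.5 * (target - w.x(s))^2 for a linear approximator."""
    v_hat = np.dot(w, x_s)
    return w + alpha * (target - v_hat) * x_s  # gradient of v_hat w.r.t. w is just x(s)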

22
Q

What is:

1) Coarse coding
2) Tile coding
3) RBF

A

They are all different ways of constructing feature vectors.

1) Coarse coding: overlapping circles (a state activates every circle it falls in)
2) Tile coding: multiple overlapping, exhaustive grids (tilings)
3) RBF: exponential (Gaussian) functions of the distance to centre points
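
As an illustration of 3), a Gaussian RBF feature vector for a continuous state; the centres and width below are assumed/hypothetical parameters:

import numpy as np

def rbf_features(s, centres, sigma=1.0):
    """Each feature measures how close the state is to one centre (values near 1 for nearby centres)."""
    s = np.asarray(s, dtype=float)
    return np.array([np.exp(-np.linalg.norm(s - c) ** 2 / (2 * sigma ** 2)) for c in centres])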

23
Q

What non-standard techniques were used in the Atari problem (DQN)?

A

1) Experience replay
2) Clipping rewards
3) Target network
4) Frame skipping
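
A minimal sketch of 1), experience replay: store transitions in a buffer and train on random minibatches to break correlations between consecutive samples (capacity and batch size below are illustrative):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions, sampled uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)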

24
Q

In an “open” question where you are supposed to design an RL agent, what is it important to write about?

A

1) State space
2) Action space
3) Reward
4) Model-based vs. model-free
5) Dynamic programming / Monte Carlo / TD
6) Specify the choice in 5, for example policy vs. value iteration, SARSA vs. Q-learning, batch vs. non-batch
If time:
7) Function approximation (neural networks…)
8) Feature encoding (coarse coding…)