Exam Flashcards
What are the key similarities and differences between DP and TD methods?
Similarities:
- Both are used to find/estimate V(s), Q(s,a) and/or optimal policies
- Both use bootstrapping to iteratively update the value function and/or policy
Differences:
- DP requires a model of the MDP (the transition probabilities P and rewards R), whereas TD is model-free and learns from sampled traces
What are the analogous TD methods to:
1) Policy evaluation
2) Policy iteration
3) Value iteration
1) TD learning
2) SARSA
3) Q-learning
For an MDP, what are the:
1) Bellman equation for V(s)
2) Bellman optimality equation for V(s)?
1) V(s) = Sum_a pi(s,a) Sum_s' P(s,a,s') (r(s,a,s') + gamma V(s'))
2) V(s) = max_a Sum_s' P(s,a,s') (r(s,a,s') + gamma V(s'))
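As a concrete illustration, here is a minimal iterative policy evaluation sketch built on the Bellman expectation equation above. The two-state MDP (states, actions, P, r) and the uniform policy pi are made-up placeholders, not from the card:

```python
# Sketch: iterative policy evaluation via the Bellman expectation equation.
# The MDP below (states, actions, P, r) and the uniform policy pi are hypothetical.
gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]

# P[s][a] is a list of (next_state, probability); r[(s, a, s_next)] is the reward.
P = {"s0": {"a0": [("s0", 0.5), ("s1", 0.5)], "a1": [("s1", 1.0)]},
     "s1": {"a0": [("s0", 1.0)],              "a1": [("s1", 1.0)]}}
r = {("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0, ("s0", "a1", "s1"): 2.0,
     ("s1", "a0", "s0"): 0.0, ("s1", "a1", "s1"): 0.5}
pi = {s: {a: 1.0 / len(actions) for a in actions} for s in states}  # uniform random policy

V = {s: 0.0 for s in states}
for _ in range(100):  # repeated sweeps converge to V_pi
    V = {s: sum(pi[s][a] * sum(p * (r[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions)
         for s in states}
```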
What criteria does a problem need to have to be solvable by DP?
1) Optimal substructure
The solution is a combination of optimal solutions to subproblems
2) Overlapping subproblems
A recursive algorithm solving the problem should solve “the same” problem over and over again
What is the policy improvement theorem?
Let: pi, pi’ denote deterministic policies.
If Q_pi(s, pi’(s)) >= V_pi(s) for all s then the policy pi’ is as good or better than the old policy.
V_pi’(s) >= V_pi(s) for all s.
What is value iteration?
Policy iteration with only one step of policy evaluation.
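For contrast with policy evaluation, a minimal value iteration sketch using the Bellman optimality backup on a hypothetical 2-state, 2-action MDP (the arrays P, R and their numbers are made up for illustration):

```python
import numpy as np

# Sketch: value iteration = repeated Bellman optimality backups.
# P[s, a, s'] and R[s, a] describe a hypothetical 2-state, 2-action MDP.
gamma = 0.9
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])   # transition probabilities P[s, a, s']
R = np.array([[0.5, 2.0],
              [0.0, 0.5]])                 # expected immediate reward R[s, a]

V = np.zeros(P.shape[0])
for _ in range(100):
    Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V = Q.max(axis=1)         # Bellman optimality backup: V(s) = max_a Q(s, a)

greedy_policy = Q.argmax(axis=1)  # extract the greedy policy once V has converged
```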
What is the idea behind Monte Carlo methods?
Use the average of sampled returns to estimate true values
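A minimal sketch of this idea for estimating a single value, where the return sampler is a made-up stand-in for running episodes in an environment:

```python
import random

# Sketch: a Monte Carlo estimate is just the average of sampled returns.
def sample_return():
    # Hypothetical stand-in for running one episode and computing its return;
    # the true expected return here is 1.0.
    return 1.0 + random.gauss(0.0, 0.5)

returns = [sample_return() for _ in range(1000)]
V_estimate = sum(returns) / len(returns)   # converges to the true value as samples grow
```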
When can we use TD, but not MC?
TD:
1) TD can learn from incomplete episodes
2) TD can learn online after each step
3) TD works in non-terminating environments
MC:
1) MC needs a complete episode
2) MC has to wait until the end of an episode to learn
3) MC does not work in non-terminating environments
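To make the contrast concrete, a minimal sketch of the TD(0) update, which can be applied online after every single step (V, alpha, gamma and the transition variables are hypothetical placeholders):

```python
# Sketch: the TD(0) update, applied online after each observed transition (s, r, s_next).
# V is a dict of value estimates; alpha is the step size, gamma the discount factor.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Bootstraps on V[s_next], so there is no need to wait for the episode to terminate.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"s0": 0.0, "s1": 0.0}
td0_update(V, "s0", 1.0, "s1")   # update V["s0"] from one sampled step
```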
Compare the properties of TD and MC
TD:
1) Has low variance, some bias
2) Good convergence, but not necessarily for function approximation
3) Usually works best in Markov environments (it exploits the Markov property)
4) More sensitive to the initial value estimates
MC:
1) Has high variance, no bias
2) Good convergence even with function approximation
3) Usually works best in non-Markov environments
4) Easy to understand and use
Compare on-policy and off-policy methods
On-policy methods use a single policy both to select actions during sample collection and as the policy being evaluated and improved.
Off-policy methods use separate behaviour and target policies: the behaviour policy selects the actions during sample collection, while the target policy is the one we want to evaluate and improve.
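The distinction shows up directly in the TD targets of SARSA (on-policy) and Q-learning (off-policy); a minimal sketch, where Q, alpha, gamma and the transition variables are hypothetical placeholders:

```python
# Sketch: on-policy vs off-policy TD targets for one transition (s, a, r, s_next).
# Q is a dict mapping (state, action) -> value estimate.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: the target uses a_next, the action the behaviour policy actually took.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy (target-policy) action in s_next,
    # regardless of which action the behaviour policy takes next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```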
What is the GLIE (Greedy in the Limit with Infinite Exploration) condition?
Let N_k(s,a) be the number of times we have visited the state-action pair (s,a) after k iterations. Then the GLIE conditions are:
1) lim(k->inf) N_k(s,a) = inf for all (s,a), i.e. every state-action pair is visited infinitely often
2) lim(k->inf) pi_k(s,a) = the greedy policy w.r.t. Q_k
Name one choice of e_k that makes e-greedy GLIE (k denotes the number of iterations so far).
e_k = 1/k
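A minimal sketch of e-greedy action selection with e_k = 1/k (Q, state, actions are hypothetical placeholders):

```python
import random

# Sketch: epsilon-greedy selection with epsilon_k = 1/k, which satisfies GLIE:
# epsilon decays to 0 (greedy in the limit), but slowly enough (Sum 1/k diverges)
# that exploration does not die out too fast.
def epsilon_greedy(Q, state, actions, k):
    epsilon = 1.0 / k
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```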
What is the update for Monte Carlo model-free control?
Q(s,a) = Average(Returns(s,a))
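In practice this average is usually computed incrementally; a minimal sketch with hypothetical dicts Q (value estimates) and N (visit counts), where G is the return observed for (s, a):

```python
# Sketch: incremental form of Q(s,a) = average(Returns(s,a)).
def mc_control_update(Q, N, s, a, G):
    N[(s, a)] = N.get((s, a), 0) + 1
    # Running-mean update: equivalent to averaging all returns seen for (s, a).
    Q[(s, a)] = Q.get((s, a), 0.0) + (G - Q.get((s, a), 0.0)) / N[(s, a)]
```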
Under what conditions does SARSA converge?
The policy is GLIE and the step sizes alpha_t form a Robbins-Monro sequence
What is a Robbins-Monro sequence?
A sequence a_t is a Robbins-Monro sequence if:
1) Sum a_t = inf
2) Sum a_t^2 < inf
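Example (for illustration, not from the card): a_t = 1/t is a Robbins-Monro sequence, since Sum 1/t diverges (harmonic series) while Sum 1/t^2 converges; a constant step size a_t = c > 0 is not, since Sum c^2 diverges.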