C3 Flashcards
why was the DQN that was able to play the Atari Breakout game such a big achievement?
- true eye-hand coordination of this complexity had not been achieved by a computer before
- the end-to-end learning from pixels to joystick actions meant the agent learned to play in a way close to how humans do: directly from raw visual input to motor output
- the instability of the deep learning process had been overcome
why is there a qualitative difference between small and large problems?
for small problems, all states fit in memory, so the policy can be learned exactly: each state is identified individually and has its own best action that we try to find. Large problems do not fit in memory, so we cannot memorize a policy per state; instead, states are grouped together based on their features and their values are approximated (see the sketch below)
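To make the contrast concrete, here is a minimal Python sketch (illustrative only, with hypothetical names and sizes): a tabular Q-function with one entry per state versus a feature-based approximation that groups states by their features.

```python
# Illustrative sketch: exact per-state storage vs. feature-based approximation.
from collections import defaultdict
import numpy as np

n_actions = 4

# Small problem: every state gets its own row in memory, so the best action
# per state can be looked up and updated exactly.
Q_table = defaultdict(lambda: np.zeros(n_actions))
Q_table["state_42"][2] = 1.0

# Large problem: the table no longer fits in memory, so a parameterized
# function over state *features* replaces it; states that share features
# share (parts of) their value estimates.
n_features = 8                          # hypothetical feature count
weights = np.zeros((n_features, n_actions))

def q_approx(features):                 # features: length-8 vector describing a state
    return features @ weights           # generalizes across similar states
```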
what are the three problems with a naive deep Q-learner?
- convergence to the optimal Q-function depends on full coverage of the state space, but the state space is too large to fit in memory => no guarantee of convergence
- there is a strong correlation between subsequent training samples, with a risk of local optima
- the loss function of gradient descent has a moving target: the Q-value estimate and the update target both depend on the same parameters that are being optimized => the optimization process can become unstable (see the sketch below)
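The third problem can be made concrete with a small PyTorch-style sketch (illustrative names and toy sizes): both the prediction and the update target are computed from the same parameters, so every gradient step also moves the target.

```python
# Sketch of the moving-target problem in a naive deep Q-learner.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # toy sizes

def naive_loss(s, a, r, s_next, gamma=0.99):
    q_pred = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)      # Q(s, a; theta)
    with torch.no_grad():                                      # no gradient through the target,
        target = r + gamma * q_net(s_next).max(dim=1).values   # but it still uses theta...
    return nn.functional.mse_loss(q_pred, target)              # ...so it shifts every update
```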
why is correlation between subsequent training samples a problem?
this may result in biased training: training may cover only a small part of the state space, with too much exploitation and too little exploration, which leads to poor generalization
what is the deadly triad?
three elements that, in combination, can cause divergent training: function approximation, bootstrapping, and off-policy learning
the problem with function approximation
it may attribute values to states inaccurately: instead of identifying individual states exactly, neural networks generalize over features of states, and these features can be shared by different states => this can cause mis-identification of states
the problem with bootstrapping
because bootstrapping builds new values on the basis of older values, errors or biases in initial values may persist and spill over to other states when values are propagated incorrectly by the function approximator
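For reference, the standard bootstrapped Q-learning update (a textbook formula, not specific to this card) shows how an error in Q(s', a') feeds directly into the new Q(s, a):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]$$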
the problem with off-policy learning
it uses a behaviour policy that is different from the target policy we are optimizing for, so when the behaviour policy is improved, the off-policy values may not improve with it
why were experience replay and infrequent weight updates introduced in DQN?
to break correlations between subsequent states and to slow down the changes to parameters in the training process to improve stability
experience replay
it introduces a replay buffer, a cache of previously explored states, from which training samples are drawn at random. Training now uses a more diverse set of states instead of only the most recent ones, so the next training sample is no longer a direct successor of the current state. This makes subsequent samples more independent and spreads learning out over more previously seen states (see the sketch below)
- improves coverage
- reduces correlation
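A minimal sketch of a replay buffer (illustrative, not the original DQN code): transitions are stored and later sampled uniformly at random, so consecutive training samples are no longer consecutive environment steps.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between subsequent samples
        return random.sample(self.buffer, batch_size)
```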
infrequent weight updates
every n updates, the network Q is cloned to obtain a target network, which is used for generating the targets for the following n updates to Q. The weights of the target network therefore change much more slowly than those of the network being trained, which improves the stability of the Q-targets (see the sketch below)
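A minimal PyTorch-style sketch of infrequent weight updates (illustrative names; the training step itself is omitted): the target network is a clone of the online network that is only refreshed every n steps.

```python
import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # online network
target_net = copy.deepcopy(q_net)          # frozen clone used to compute Q-targets

UPDATE_EVERY = 1000                        # n: how often the clone is refreshed
for step in range(10_000):
    # ... sample a batch, compute targets with target_net, take a gradient step on q_net ...
    if step % UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # slow-moving Q-targets
```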
what is the Rainbow paper?
the paper describing a large experiment that combined 7 important DQN enhancements (including DDQN, noisy DQN, and Dueling DDQN) and tested them on 57 Atari games
what is ALE?
Arcade Learning Environment, a test-bed to stimulate research on challenging high-dimensional reinforcement learning tasks
what is end-to-end learning for Atari?
Learning from pixels to joystick actions: the network trains a behavior policy directly from raw pixel frame input
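A sketch of such a pixel-to-joystick network (layer sizes follow the published DQN architecture, but treat the code itself as illustrative): input is a stack of 4 preprocessed 84x84 grayscale frames, output is one Q-value per joystick action.

```python
import torch.nn as nn

def make_dqn(n_actions):
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial map after the conv stack
        nn.Linear(512, n_actions),               # one Q-value per joystick action
    )
```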
what is the biggest challenge in end-to-end learning?
Learning actions directly from high-dimensional sound and vision inputs