C3 Flashcards
why was the DQN that was able to play the Atari Breakout game such a big achievement?
- true eye-hand coordination of this complexity had not been achieved by a computer before
- the end-to-end learning from pixels to joystick actions meant the agent learned to play in a way close to how humans do: directly from raw visual input to motor output
- the instability of the deep learning process had been overcome
why is there a qualitative difference between small and large problems?
for small problems, all states fit in memory, so the policy can be learned exactly: each state is identified individually and has its own best action that we try to find. Large problems do not fit in memory, so we cannot memorize a policy per state; instead, states are grouped together based on their features and their values are approximated (see the sketch below)
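To make the contrast concrete, here is a minimal Python sketch (illustrative only, with hypothetical names and sizes): a tabular Q-function with one entry per state versus a feature-based approximation that groups states by their features.

```python
# Illustrative sketch: exact per-state storage vs. feature-based approximation.
from collections import defaultdict
import numpy as np

n_actions = 4

# Small problem: every state gets its own row in memory, so the best action
# per state can be looked up and updated exactly.
Q_table = defaultdict(lambda: np.zeros(n_actions))
Q_table["state_42"][2] = 1.0

# Large problem: the table no longer fits in memory, so a parameterized
# function over state *features* replaces it; states that share features
# share (parts of) their value estimates.
n_features = 8                          # hypothetical feature count
weights = np.zeros((n_features, n_actions))

def q_approx(features):                 # features: length-8 vector describing a state
    return features @ weights           # generalizes across similar states
```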
what are the three problems with a naive deep Q-learner?
- convergence to the optimal Q-function depends on full coverage of the state space, but the state space is too large to fit in memory => no guarantee of convergence
- there is a strong correlation between subsequent training samples, with a risk of local optima
- the loss function of gradient descent has a moving target: the Q-value estimate and the update target both depend on the same parameters that are being optimized => the optimization process can become unstable (see the sketch below)
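The third problem can be made concrete with a small PyTorch-style sketch (illustrative names and toy sizes): both the prediction and the update target are computed from the same parameters, so every gradient step also moves the target.

```python
# Sketch of the moving-target problem in a naive deep Q-learner.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # toy sizes

def naive_loss(s, a, r, s_next, gamma=0.99):
    q_pred = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)      # Q(s, a; theta)
    with torch.no_grad():                                      # no gradient through the target,
        target = r + gamma * q_net(s_next).max(dim=1).values   # but it still uses theta...
    return nn.functional.mse_loss(q_pred, target)              # ...so it shifts every update
```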
why is correlation between subsequent training samples a problem?
this may result in biased training: training may cover only a small part of the state space, with too much exploitation and too little exploration, which leads to poor generalization
what is the deadly triad?
three elements that, in combination, can cause divergent training: function approximation, bootstrapping, and off-policy learning
the problem with function approximation
it may attribute values to states inaccurately: instead of identifying individual states exactly, neural networks generalize over features of states, and these features can be shared by different states => this can cause mis-identification of states
the problem with bootstrapping
because bootstrapping builds new values on the basis of older values, errors or biases in initial values may persist and spill over to other states when values are propagated incorrectly by the function approximator
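For reference, the standard bootstrapped Q-learning update (a textbook formula, not specific to this card) shows how an error in Q(s', a') feeds directly into the new Q(s, a):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]$$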
the problem with off-policy learning
it uses a behaviour policy that is different from the target policy we are optimizing for, so when the behaviour policy is improved, the off-policy values may not improve with it
why were experience replay and infrequent weight updates introduced in DQN?
to break correlations between subsequent states and to slow down the changes to parameters in the training process to improve stability
experience replay
it introduces a replay buffer, a cache of previously explored states, from which training samples are drawn at random. Training now uses a more diverse set of states instead of only the most recent ones, so the next training sample is no longer a direct successor of the current state. This makes subsequent samples more independent and spreads learning out over more previously seen states (see the sketch below)
- improves coverage
- reduces correlation
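A minimal sketch of a replay buffer (illustrative, not the original DQN code): transitions are stored and later sampled uniformly at random, so consecutive training samples are no longer consecutive environment steps.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between subsequent samples
        return random.sample(self.buffer, batch_size)
```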
infrequent weight updates
every n updates, the network Q is cloned to obtain a target network, which is used for generating the targets for the following n updates to Q. The weights of the target network therefore change much more slowly than those of the network being trained, which improves the stability of the Q-targets (see the sketch below)
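A minimal PyTorch-style sketch of infrequent weight updates (illustrative names; the training step itself is omitted): the target network is a clone of the online network that is only refreshed every n steps.

```python
import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # online network
target_net = copy.deepcopy(q_net)          # frozen clone used to compute Q-targets

UPDATE_EVERY = 1000                        # n: how often the clone is refreshed
for step in range(10_000):
    # ... sample a batch, compute targets with target_net, take a gradient step on q_net ...
    if step % UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # slow-moving Q-targets
```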
what is the Rainbow paper?
the paper describing a large experiment that combined 7 important DQN enhancements (including DDQN, noisy DQN, and Dueling DDQN) and tested them on 57 Atari games
what is ALE?
Arcade Learning Environment, a test-bed to stimulate research on challenging high-dimensional reinforcement learning tasks
what is end-to-end learning for Atari?
Learning from pixels to joystick actions: the network trains a behavior policy directly from raw pixel frame input
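A sketch of such a pixel-to-joystick network (layer sizes follow the published DQN architecture, but treat the code itself as illustrative): input is a stack of 4 preprocessed 84x84 grayscale frames, output is one Q-value per joystick action.

```python
import torch.nn as nn

def make_dqn(n_actions):
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial map after the conv stack
        nn.Linear(512, n_actions),               # one Q-value per joystick action
    )
```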
what is the biggest challenge in end-to-end learning?
Learning actions directly from high-dimensional sound and vision inputs