W3 Deep Value-based Flashcards
What is Gym?
A collection of reinforcement learning environments with a common Python interface.
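A minimal sketch of the interaction loop Gym provides (CartPole-v1 is used here as an example; the reset/step signatures below follow Gym ≥ 0.26 and vary between versions):

```python
# Sketch: one episode with a random policy in a Gym environment.
import gym

env = gym.make("CartPole-v1")
obs, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random action, just to show the loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(total_reward)
```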
What are the Stable Baselines?
A collection of reinforcement learning algorithm implementations; like Gym, but for the algorithms rather than the environments.
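A short sketch of how Gym and Stable Baselines fit together (the stable_baselines3 package is assumed; the hyperparameters are illustrative):

```python
# Sketch: train a DQN agent on CartPole with Stable Baselines3.
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1", learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=10_000)
```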
The loss function of DQN uses the Q-function as target. What is a consequence?
The Q-function keeps being updated during training, so using it as the target means the target itself moves, which makes convergence hard.
the loss function of deep Q-learning
minimizes a moving target, a target that depends on the network being optimized.
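A sketch in symbols (the notation θ for the network parameters is an assumption here, not given on the card): without a target network, deep Q-learning minimizes

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta)\right)^{2}\right]
```

The target r + γ max_{a′} Q(s′,a′;θ) contains θ itself, so every gradient step also moves the target. DQN therefore evaluates the target with a frozen copy θ⁻ of the parameters that is only updated periodically.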
Why is the exploration/exploitation trade-off central in reinforcement learning?
We want the agent to exploit the knowledge it already has, but also to explore the environment so it won't get stuck in local optima.
Name one simple exploration/exploitation method
Epsilon-greedy, softmax (Boltzmann) exploration, or linear annealing of epsilon.
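A small sketch of epsilon-greedy with linear annealing (function names and schedule values are illustrative, not from the card):

```python
# Sketch: epsilon-greedy action selection plus a linear annealing schedule.
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best known action

def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Anneal epsilon linearly from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```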
What is bootstrapping?
Bootstrapping in RL can be read as “using one or more estimated values in the update step for the same kind of estimated value”.
In most TD update rules, you will see something like this SARSA(0) update:
Q(s,a) ← Q(s,a) + α(R_{t+1} + γ Q(s′,a′) − Q(s,a))
The value R_{t+1} + γ Q(s′,a′) is an estimate for the true value of Q(s,a), also called the TD target. It is a bootstrap method because we are, in part, using a Q-value to update another Q-value. There is a small amount of real observed data in the form of R_{t+1}, the immediate reward for the step, and also in the state transition s → s′.
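The same update as a tabular sketch (Q is assumed to be a mapping from (state, action) pairs to floats, e.g. a defaultdict):

```python
# Sketch: tabular SARSA(0) update corresponding to the rule above.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[(s_next, a_next)]  # bootstrap: uses an estimated Q-value
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```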
Describe the architecture of DQN with a target network (TN)
The DQN architecture has two neural networks, the Q-network and the target network, plus an experience replay buffer. The Q-network is the agent that is trained to approximate the optimal state-action values; the target network is a periodically updated copy of the Q-network that provides stable targets.
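A minimal sketch of one training step with these components (PyTorch is assumed; q_net, target_net, optimizer and the batch layout are hypothetical names, not from the card):

```python
# Sketch: one DQN gradient step using a frozen target network.
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken in the batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the frozen target network
        max_q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * max_q_next
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the target network is refreshed every C steps, e.g. with target_net.load_state_dict(q_net.state_dict()).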
Why is deep reinforcement learning more susceptible to unstable learning than deep supervised learning?
Because the target values keep changing, which makes the learning process high-variance; converging to a moving target is hard.
What is the deadly triad?
Function approximation, bootstrapping, and off-policy learning. Together, they are called the deadly triad.
How does function approximation reduce stability of Q-learning?
Function approximation may attribute values to states inaccurately. It can thus cause mis-identification of states, and rewards and Q-values that are assigned to the wrong states.
What is the role of the replay buffer?
The replay buffer serves as the agent's memory of past transitions, and sampling from it decreases the correlation between training samples.
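A minimal replay-buffer sketch with uniform sampling (class and method names are illustrative):

```python
# Sketch: a simple replay buffer backed by a bounded deque.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```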
How can correlation between states lead to local minima?
Correlation between successive states can bias the training. This bias can result in the so-called specialization trap (too much exploitation and too little exploration), leaving the agent stuck in a local optimum.
Why should the coverage of the state space be sufficient?
Because otherwise the optimal solution may lie in a part of the state space that the agent never visits.
What happens when deep reinforcement learning algorithms do not converge?
Algorithms can have trouble with the moving target, since the target also depends on the parameters that are being optimized.
How large is the state space of chess, Go, StarCraft estimated to be? 10^47, 10^170 or 10^1685?
10^47, 10^170, and 10^1685, respectively.
What does the rainbow in the Rainbow paper stand for, and what is the main message?
It refers to the combination of seven different improvements on the original DQN. The main message is that these improvements are largely complementary: by combining their strengths, the Rainbow agent outperforms each of them by a large margin.
Which statement about the benefit of DQN compared to tabular Q-learning is True? (pick the most convincing reason)
A. DQN is faster.
B. DQN outperforms tabular Q-learning.
C. DQN can better deal with high-dimensional input.
D. DQN is more data-efficient.
C
Zhao is implementing a replay buffer for DQN and was wondering whether you
had some tips regarding sampling methods. Your recommendation is:
A. Prioritized Experience Replay.
B. Uniform sampling.
C. Compare both to find out which one works best on his problem, as their performance varies per application.
C
Why is diversity important in learning?
A. Through de-correlation it improves stability in reinforcement learning.
B. Through correlation it prevents over-generalization in supervised learning.
C. Through correlation it prevents over-generalization in reinforcement learning.
D. Through de-correlation it improves stability in supervised learning.
A
Which of the following DQN Extensions addresses overestimated action values?
Double DQN
Dueling DQN
Prioritized Experience Replay
Distributional DQN
Double DQN
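A sketch of why Double DQN reduces overestimation (PyTorch assumed; q_net and target_net are hypothetical names as before): the online network selects the greedy next action and the target network evaluates it, instead of letting one network both select and evaluate.

```python
# Sketch: Double DQN target computation.
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Online network selects the greedy next action ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it, which reduces the overestimation
        # bias of the standard DQN target max_a' Q_target(s', a').
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * q_eval
```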