W3 Deep Value-based Flashcards
What is Gym?
A collection of reinforcement learning environments
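A minimal sketch of the standard Gym interaction loop, assuming the classic gym API (reset returns an observation, step returns a 4-tuple; newer gymnasium versions differ slightly). The environment id "CartPole-v1" is only an example:
```python
import gym

env = gym.make("CartPole-v1")      # any registered environment id
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()           # random policy, for illustration only
    obs, reward, done, info = env.step(action)   # one environment transition
    total_reward += reward
env.close()
print(total_reward)
```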
What is Stable Baselines?
A collection of reinforcement learning algorithm implementations; what Gym is for environments, Stable Baselines is for algorithms.
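A minimal sketch of how the two fit together, assuming Stable Baselines3; the environment id and hyperparameters are illustrative, not prescribed:
```python
from stable_baselines3 import DQN

# The library implements the algorithm; Gym supplies the environment.
model = DQN("MlpPolicy", "CartPole-v1", learning_rate=1e-3, verbose=0)
model.learn(total_timesteps=10_000)   # runs the interaction loop internally
model.save("dqn_cartpole")
```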
The loss function of DQN uses the Q-function as its target. What is a consequence?
The Q-function is itself updated continually during training, so the target keeps moving, which makes convergence hard.
The loss function of deep Q-learning…
…minimizes a moving target: a target that depends on the network being optimized.
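Written out (without a separate target network), the deep Q-learning loss looks roughly like this, where θ are the network weights; the same θ appears in both the target and the prediction, which is exactly what makes the target move:
L(θ) = ( R_{t+1} + γ max_{a′} Q(s′,a′;θ) − Q(s,a;θ) )²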
Why is the exploration/exploitation trade-off central in reinforcement learning?
We want the agent to exploit the knowledge it already has, but also to keep exploring the environment so it does not get stuck in local optima.
Name one simple exploration/exploitation method
ε-greedy, softmax (Boltzmann) exploration, or linear annealing of ε.
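A minimal sketch of ε-greedy selection combined with linear annealing of ε; the function names, the q_values lookup, and the schedule constants are placeholders chosen here for illustration:
```python
import random

def epsilon_greedy(q_values, num_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                         # explore
    return max(range(num_actions), key=lambda a: q_values[a])        # exploit

def linear_anneal(step, eps_start=1.0, eps_end=0.05, anneal_steps=10_000):
    """Decay epsilon linearly from eps_start to eps_end over anneal_steps."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```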
What is bootstrapping?
Bootstrapping in RL can be read as “using one or more estimated values in the update step for the same kind of estimated value”.
In most TD update rules, you will see something like this SARSA(0) update:
Q(s,a) ← Q(s,a) + α(R_{t+1} + γQ(s′,a′) − Q(s,a))
The value R_{t+1} + γQ(s′,a′)
is an estimate for the true value of Q(s,a), also called the TD target. It is a bootstrap method because we are in part using a Q-value to update another Q-value. There is a small amount of real observed data in the form of R_{t+1}, the immediate reward for the step, and also in the state transition s→s′.
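A small sketch of that tabular SARSA(0) update; Q is a lookup table, and alpha and gamma are the step size and discount factor (values here are illustrative):
```python
from collections import defaultdict

Q = defaultdict(float)       # tabular Q-values, default 0.0
alpha, gamma = 0.1, 0.99

def sarsa_update(s, a, reward, s_next, a_next):
    # Bootstrap: the TD target uses the estimated Q(s', a') plus the observed reward.
    td_target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])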
Describe the architecture of the neural networks in DQN with a target network (TN)
The DQN architecture has two neural networks, the Q-network and the target network, plus a component called experience replay. The Q-network is the agent's network, trained to approximate the optimal state-action value; the target network is a periodically updated copy of it, used to compute the TD target so the target stays fixed between updates.
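A minimal sketch of the two-network setup in PyTorch; the layer sizes (4 observations, 2 actions, as in CartPole), the function name dqn_loss, and the update interval are assumptions made here for illustration:
```python
import copy
import torch
import torch.nn.functional as F

q_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # frozen copy used only to compute the TD target

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    with torch.no_grad():            # gradients do not flow into the target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, target)

# Every N gradient steps the target network is synchronized with the Q-network:
# target_net.load_state_dict(q_net.state_dict())
```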
Why is deep reinforcement learning more susceptible to unstable learning than deep supervised learning?
Because the target value keeps changing, which makes the learning process high-variance; converging to a moving target is hard.
What is the deadly triad?
Function approximation, bootstrapping, and off-policy learning; together, they are called the deadly triad.
How does function approximation reduce stability of Q-learning?
Function approximation may attribute values to states inaccurately. It can thus cause mis-identification of states, and lead to reward values and Q-values that are not assigned correctly.
What is the role of the replay buffer?
The replay buffer serves as the memory of the agent: it stores past transitions, and sampling training batches from it at random decreases the correlation between consecutive samples.
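A minimal sketch of such a buffer; the class name, capacity, and transition layout are illustrative choices, not a fixed interface:
```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples them uniformly at random,
    so consecutive (correlated) steps do not end up in the same batch."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```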
How can correlation between states lead to local minima?
Correlation between successive states can bias training. This bias can lead to the so-called specialization trap: too much exploitation and too little exploration, so the agent settles into a local minimum.
Why should the coverage of the state space be sufficient?
Because otherwise the optimal solution may lie in a part of the state space that the agent never visits.
What happens when deep reinforcement learning algorithms do not converge?
The learning process may keep chasing the moving target without settling, since the target is itself based on the parameters being optimized.
How large is the state space of chess, Go, StarCraft estimated to be? 10^47, 10^170 or 10^1685?
10^47 for chess, 10^170 for Go, and 10^1685 for StarCraft, respectively.