W3 Deep Value-based Flashcards
What is Gym?
A collection of reinforcement learning environments
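A minimal sketch of the standard Gym interaction loop, assuming the classic gym API (reset returns an observation, step returns a 4-tuple; newer gymnasium versions differ slightly). The environment id "CartPole-v1" is only an example:
```python
import gym

env = gym.make("CartPole-v1")      # any registered environment id
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()           # random policy, for illustration only
    obs, reward, done, info = env.step(action)   # one environment transition
    total_reward += reward
env.close()
print(total_reward)
```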
What is Stable Baselines?
A collection of reinforcement learning algorithm implementations; what Gym is for environments, Stable Baselines is for algorithms.
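A minimal sketch of how the two fit together, assuming Stable Baselines3; the environment id and hyperparameters are illustrative, not prescribed:
```python
from stable_baselines3 import DQN

# The library implements the algorithm; Gym supplies the environment.
model = DQN("MlpPolicy", "CartPole-v1", learning_rate=1e-3, verbose=0)
model.learn(total_timesteps=10_000)   # runs the interaction loop internally
model.save("dqn_cartpole")
```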
The loss function of DQN uses the Q-function as its target. What is a consequence?
The Q-function is itself updated continually during training, so the target keeps moving, which makes convergence hard.
The loss function of deep Q-learning…
…minimizes a moving target: a target that depends on the network being optimized.
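Written out (without a separate target network), the deep Q-learning loss looks roughly like this, where θ are the network weights; the same θ appears in both the target and the prediction, which is exactly what makes the target move:
L(θ) = ( R_{t+1} + γ max_{a′} Q(s′,a′;θ) − Q(s,a;θ) )²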
Why is the exploration/exploitation trade-off central in reinforcement learning?
We want the agent to exploit the knowledge it already has, but also to keep exploring the environment so it does not get stuck in local optima.
Name one simple exploration/exploitation method
ε-greedy, softmax (Boltzmann) exploration, or linear annealing of ε.
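A minimal sketch of ε-greedy selection combined with linear annealing of ε; the function names, the q_values lookup, and the schedule constants are placeholders chosen here for illustration:
```python
import random

def epsilon_greedy(q_values, num_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                         # explore
    return max(range(num_actions), key=lambda a: q_values[a])        # exploit

def linear_anneal(step, eps_start=1.0, eps_end=0.05, anneal_steps=10_000):
    """Decay epsilon linearly from eps_start to eps_end over anneal_steps."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```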
What is bootstrapping?
Bootstrapping in RL can be read as “using one or more estimated values in the update step for the same kind of estimated value”.
In most TD update rules, you will see something like this SARSA(0) update:
Q(s,a) ← Q(s,a) + α(R_{t+1} + γQ(s′,a′) − Q(s,a))
The value R_{t+1} + γQ(s′,a′)
is an estimate for the true value of Q(s,a), also called the TD target. It is a bootstrap method because we are in part using a Q-value to update another Q-value. There is a small amount of real observed data in the form of R_{t+1}, the immediate reward for the step, and also in the state transition s→s′.
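A small sketch of that tabular SARSA(0) update; Q is a lookup table, and alpha and gamma are the step size and discount factor (values here are illustrative):
```python
from collections import defaultdict

Q = defaultdict(float)       # tabular Q-values, default 0.0
alpha, gamma = 0.1, 0.99

def sarsa_update(s, a, reward, s_next, a_next):
    # Bootstrap: the TD target uses the estimated Q(s', a') plus the observed reward.
    td_target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])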
Describe the architecture of the neural networks in DQN with a target network (TN)
The DQN architecture has two neural networks, the Q-network and the target network, plus a component called experience replay. The Q-network is the agent's network, trained to approximate the optimal state-action value; the target network is a periodically updated copy of it, used to compute the TD target so the target stays fixed between updates.
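A minimal sketch of the two-network setup in PyTorch; the layer sizes (4 observations, 2 actions, as in CartPole), the function name dqn_loss, and the update interval are assumptions made here for illustration:
```python
import copy
import torch
import torch.nn.functional as F

q_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # frozen copy used only to compute the TD target

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    with torch.no_grad():            # gradients do not flow into the target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, target)

# Every N gradient steps the target network is synchronized with the Q-network:
# target_net.load_state_dict(q_net.state_dict())
```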
Why is deep reinforcement learning more susceptible to unstable learning than deep supervised learning?
Because the target value keeps changing, which makes the learning process high-variance; converging to a moving target is hard.
What is the deadly triad?
Function approximation, bootstrapping, and off-policy learning; together, they are called the deadly triad.
How does function approximation reduce stability of Q-learning?
Function approximation may attribute values to states inaccurately. It can thus cause mis-identification of states, and lead to reward values and Q-values that are not assigned correctly.
What is the role of the replay buffer?
The replay buffer serves as the memory of the agent: it stores past transitions, and sampling training batches from it at random decreases the correlation between consecutive samples.
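A minimal sketch of such a buffer; the class name, capacity, and transition layout are illustrative choices, not a fixed interface:
```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples them uniformly at random,
    so consecutive (correlated) steps do not end up in the same batch."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```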
How can correlation between states lead to local minima?
Correlation between successive states can bias training. This bias can lead to the so-called specialization trap: too much exploitation and too little exploration, so the agent settles into a local minimum.
Why should the coverage of the state space be sufficient?
Because otherwise the optimal solution may lie in a part of the state space that the agent never visits.
What happens when deep reinforcement learning algorithms do not converge?
The learning process may keep chasing the moving target without settling, since the target is itself based on the parameters being optimized.
How large is the state space of chess, Go, StarCraft estimated to be? 10^47, 10^170 or 10^1685?
10^47 for chess, 10^170 for Go, and 10^1685 for StarCraft, respectively.