Playing Atari with Deep RL Flashcards
Test paper knowledge
What is the e-greedy policy?
The e-greedy (epsilon-greedy) policy chooses a random action with probability epsilon and the greedy action (the one with the highest estimated Q-value) with probability 1 − epsilon.
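A minimal sketch of this action-selection rule in Python (the q_values array, the action count, and the epsilon value are illustrative assumptions, not values from the paper):

```python
import random

import numpy as np


def epsilon_greedy_action(q_values: np.ndarray, epsilon: float) -> int:
    """Pick a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        # Explore: uniformly random action
        return random.randrange(len(q_values))
    # Exploit: action with the highest estimated Q-value (probability 1 - epsilon)
    return int(np.argmax(q_values))


# Example: 4 actions, 10% exploration
action = epsilon_greedy_action(np.array([0.1, 0.5, -0.2, 0.3]), epsilon=0.1)
```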
What is a Markov decision process (MDP)?
An MDP is a process with elements of randomness whose evolution is partly influenced by an agent.
Time advances in discrete steps, so there is a discrete set of states the agent observes and a discrete set of actions it can take, and the next state depends only on the current state and action (the Markov property).
How is an MDP used to model agent-environment interaction?
The agent takes actions that feed into the environment: the action a and the current state s are passed to the environment, which then samples the next state s' according to its transition probabilities. The agent also receives a reward reflecting how the outcome affects it (positive for good outcomes, negative for bad ones).
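A sketch of that interaction loop, assuming a hypothetical environment with reset/step methods and an agent with act/observe methods (these names are illustrative, not an API from the paper):

```python
def run_episode(env, agent, max_steps=1000):
    """Run one agent-environment episode and return the total reward."""
    total_reward = 0.0
    s = env.reset()                      # initial state s
    for t in range(max_steps):
        a = agent.act(s)                 # agent picks action a in state s
        s_next, r, done = env.step(a)    # environment samples next state and reward
        agent.observe(s, a, r, s_next)   # positive r reinforces, negative r penalises
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```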
What is the equation for the discounted return
It is the sum of the rewards at each step, with each reward multiplied by the discount factor γ (between 0 and 1) raised to the power of how far that step lies in the future: R_t = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}, i.e. R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …
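A small sketch that computes this return for a list of rewards (the reward values and gamma are made-up examples):

```python
def discounted_return(rewards, gamma):
    """R_t = sum over t' >= t of gamma**(t' - t) * r_t', computed from step t = 0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Example: rewards [1, 2, 3] with gamma = 0.9 -> 1 + 0.9*2 + 0.81*3 = 5.23
print(discounted_return([1.0, 2.0, 3.0], gamma=0.9))
```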
What is a Q-network?
It is a neural network used as a function approximator: given a state, it approximates the Q-value of each action that can be taken from that state.
Why use a Q-network instead of a Q-table?
For a large number of states and actions, a table runs into both a memory problem and a computation-time problem.
First, the amount of memory required to save and update that table would increase as the number of states increases.
Second, the amount of time required to explore each state to create the required Q-table would be unrealistic.
A network instead receives a state and produces a Q-value for each action available in that state.
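A minimal sketch of such a network, assuming PyTorch and a small fully connected architecture; the state dimension, layer sizes, and action count are illustrative, and the paper itself uses a convolutional network over stacked Atari frames:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))  # shape (1, 2): one Q-value per action
```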
How does the Q network learn the Q estimation?
By minimising the difference between predicted Q-values and the target Q-values.
The target Q-value is the sum of the observed reward and the discounted maximum predicted Q-value of the next state.
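A sketch of one such learning step, assuming PyTorch; the variable names, batch layout, and use of MSE loss are illustrative assumptions (the target network used here for the next-state values is described further below):

```python
import torch
import torch.nn.functional as F


def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One update on a batch of transitions (s, a, r, s', done)."""
    # Predicted Q-values for the actions actually taken
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target: observed reward + discounted max Q-value of the next state
    with torch.no_grad():
        q_next_max = target_net(s_next).max(dim=1).values
        q_target = r + gamma * q_next_max * (1.0 - done)  # no bootstrap at terminal states

    loss = F.mse_loss(q_pred, q_target)  # minimise predicted-vs-target gap
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```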
What does gamma (the discount factor) stand for?
A factor between 0 and 1 that is raised to the power of the time step and multiplies the reward at each step.
This factor makes the agent prefer early rewards (low gamma) or also value later rewards (high gamma). It can also be interpreted as the probability that the episode continues at each step.
What will happen if gamma is low
The agent will heavily discount future rewards and will favour actions that yield immediate reward.
What will happen if gamma is high
The agent will value future rewards more highly and will consider long-term consequences.
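A quick numeric illustration of the effect of gamma on a reward that arrives 10 steps in the future (the numbers are made up):

```python
# How gamma weighs a reward that arrives 10 steps in the future
late_reward = 1.0
for gamma in (0.5, 0.99):
    print(gamma, gamma ** 10 * late_reward)
# gamma = 0.5  -> ~0.001  (future reward nearly ignored)
# gamma = 0.99 -> ~0.904  (future reward still counts strongly)
```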
What is the equation for (tabular, non-deep) Q-learning?
At each step where the agent takes an action, the Q-value is updated as:
Q(s, a) ← (1 − alpha) * Q(s, a) + alpha * (r_{t+1} + gamma * max_{a'} Q(s', a'))
alpha = learning rate
gamma = discount factor
r = reward received after taking action a in state s
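A direct translation of this update rule into Python, with an illustrative table size, alpha, and gamma:

```python
import numpy as np


def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Tabular Q-learning update for one transition (s, a, r, s')."""
    target = r + gamma * np.max(Q[s_next])           # r_{t+1} + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target


Q = np.zeros((5, 2))                                 # 5 states, 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```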
What is the target network
The target network is a network with the same architecture as DQN. It is used in the training phase to give the target Q values for the DQN to learn.
How is the target network trained?
The target network is not trained directly; instead, its weights are copied from the DQN at fixed intervals.
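A sketch of that periodic copy, assuming PyTorch; the copy period and function name are illustrative:

```python
COPY_PERIOD = 1_000  # illustrative number of training steps between copies


def maybe_sync_target(step: int, q_net, target_net) -> None:
    """Periodically copy the DQN's weights into the target network."""
    if step % COPY_PERIOD == 0:
        # The target network itself is never updated by gradient descent;
        # it only receives the DQN's weights at fixed intervals.
        target_net.load_state_dict(q_net.state_dict())
```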
Write the equation for the optimal action-value function
Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π]
where R_t = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}
Explain in words the optimal action-value function
It is the maximum expected return achievable by any policy from the state (s) after taking action (a), considering all possible future states.