Playing Atari with Deep RL Flashcards

Test paper knowledge

1
Q

What is the e-greedy policy?

A

The e-greedy (ε-greedy) policy is parameterised by epsilon: it chooses a random action with probability ε and the greedy (highest Q-value) action with probability 1 - ε.
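
A minimal sketch of ε-greedy action selection (the function and variable names are illustrative, not from the paper):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: argmax over Q-values
```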

2
Q

What is a Markov decision process?

A

An MDP is a process that involves randomness and is partially controlled by an agent.
Time advances in discrete steps, so there is a discrete set of states the agent observes and a discrete set of actions it can take; the next state depends only on the current state and the chosen action (the Markov property).
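
As an illustration, a toy MDP (made up for this card, not from the paper) can be written down as transition probabilities and rewards over discrete states and actions:

```python
# Toy 2-state MDP: states {0, 1}, actions {"stay", "move"}.
# transitions[(s, a)] is a list of (next_state, probability) pairs.
transitions = {
    (0, "stay"): [(0, 0.9), (1, 0.1)],
    (0, "move"): [(1, 0.8), (0, 0.2)],
    (1, "stay"): [(1, 1.0)],
    (1, "move"): [(0, 1.0)],
}
rewards = {(0, "stay"): 0.0, (0, "move"): 1.0, (1, "stay"): 0.5, (1, "move"): 0.0}
```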

3
Q

How is an MDP used for agent-environment simulation?

A

In the MDP framing, the agent takes actions that are fed into the environment: the action 'a' and the current state 's' go in, and the environment transitions to the next state s_{t+1} according to its transition probabilities. The agent then receives a reward that reflects how the outcome affects it (positive for a good outcome, negative for a bad one).
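
A rough sketch of that interaction loop, assuming a simplified environment object whose reset() returns the initial state and whose step(action) returns (next_state, reward, done):

```python
def run_episode(env, policy):
    """One episode of agent-environment interaction."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                   # agent chooses action a given state s
        state, reward, done = env.step(action)   # environment returns next state s' and reward r
        total_reward += reward                   # reward signals how the outcome affects the agent
    return total_reward
```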

4
Q

What is the equation for the discounted return?

A

The discounted return is the sum of the rewards at each future time step, each multiplied by the discount factor (between 0 and 1) raised to the number of steps into the future: R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}.
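
A small sketch that computes this quantity from a list of rewards (illustrative helper, not from the paper):

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum over k of gamma^k * r_{t+k} for the reward sequence starting at t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```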

5
Q

What is a Q-network?

A

It is a neural network used as a function approximator: given a state, it approximates the Q-value of each action that can be taken in that state.
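
A minimal PyTorch sketch of such a network; the fully connected layers and sizes are placeholder assumptions, not the paper's architecture (which uses convolutions over stacked frames):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),   # one output per action
        )

    def forward(self, state):
        return self.net(state)             # shape: (batch, num_actions)
```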

6
Q

Why use a Q-network and not a Q-table?

A

For a large number of states and actions there is both a memory problem and a compute-time problem.

First, the amount of memory required to save and update that table would increase as the number of states increases.

Second, the amount of time required to explore each state to create the required Q-table would be unrealistic.

A network will receive a state and will produce the Q value for each action for that state.

7
Q

How does the Q network learn the Q estimation?

A

By minimising the difference between the predicted Q-values and the target Q-values.
The target Q-value is the observed reward plus the discounted maximum predicted Q-value over the next state's actions.
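
A hedged sketch of that loss in PyTorch, using a separate target network for the next-state maximum as in the later cards (variable names and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """Loss between predicted Q(s, a) and target r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q-value of the action taken
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values              # max over next-state actions
        q_target = rewards + gamma * q_next * (1 - dones)               # no bootstrap after terminal states
    return F.mse_loss(q_pred, q_target)
```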

8
Q

What does gamma (the discount factor) stand for?

A

A factor that is raised to the power of the time step and multiplied by the reward at each step.
It makes the agent prefer early rewards (low gamma) or also value later rewards (high gamma). It can also be interpreted as the per-step probability that the process continues.
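As a worked example: with gamma = 0.9, a reward received 10 steps in the future is weighted by 0.9^10 ≈ 0.35, whereas with gamma = 0.5 it is weighted by 0.5^10 ≈ 0.001.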

9
Q

What will happen if gamma is low?

A

The agent will heavily discount future rewards and act to obtain immediate rewards.

10
Q

What will happen if gamma is high?

A

The agent will value future rewards more highly and will consider long-term consequences.

11
Q

What is the update equation for (non-deep, tabular) Q-learning?

A

Each time the agent takes an action, the Q-value is updated as
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s', a') \right)
alpha = learning rate
gamma = discount factor
r = reward from taking action a in state s
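
A minimal tabular sketch of that update, using a dictionary-backed Q-table (names are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q-table: (state, action) -> estimated value

def q_learning_update(s, a, reward, s_next, actions, alpha, gamma):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
```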

12
Q

What is the target network?

A

The target network is a network with the same architecture as the DQN. It is used during training to provide the target Q-values that the DQN learns towards.

13
Q

How is the target network trained?

A

The target network is not trained directly; instead, its weights are periodically copied from the DQN after each fixed number of updates.
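
In a PyTorch-style implementation this periodic copy is typically a state-dict copy; a sketch, assuming q_net and target_net share the same architecture:

```python
def sync_target_network(q_net, target_net):
    """Copy the online DQN's weights into the target network; called every fixed number of steps."""
    target_net.load_state_dict(q_net.state_dict())
```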

14
Q

Write the equation for the optimal action-value function

A

Q^*(s, a) = \max_\pi \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right]
where R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}

15
Q

Explain in words the optimal action-value function

A

It is the maximum expected return achievable by any policy from the state (s) after taking action (a), considering all possible future states.

16
Q

Write the Bellman equation

A

Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]
equation 1 in the paper.

17
Q

Write a verbal definition of the Bellman equation

A

It is a recursive formula used to find the optimal policy in a decision process. It expresses the value of a decision as the immediate reward plus the discounted value of the best possible subsequent decision.

18
Q

What are the key ideas of the DQN algorithm?

A
  1. Experience replay: learn from a random batch of stored experience (the samples need not come from the current state); see the replay-memory sketch after this list.
  2. Record each transition (experience) as it happens.
  3. With probability epsilon explore (random action), otherwise take the best action.
  4. Execute the action and observe the reward and the next image.
  5. Preprocessing stage: downscaling, cropping, and stacking 4 frames.
  6. Target network for training stability.
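
A small sketch of the replay memory behind ideas 1-2, using a deque-backed buffer (the class name and capacity are illustrative assumptions):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns random minibatches for learning."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```
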
19
Q

Why is the total reward averaged over a few games a noisy metric?

A

Because small changes in the weights of the network can lead to streaks of wins or losses, which makes the average reward vary massively.

20
Q

Write the steps for the DQN learning algorithm

A

Initialise replay memory D for experience.
Initialise the Q-network with random weights.
for episode = 1 to M:
    Initialise state s and preprocess it to phi.
    for t = 1 until the end of the episode:
        With probability epsilon take a random action, otherwise take the best (highest-Q) action.
        Execute action a, observe reward r and the next image x.
        Preprocess the new state (downscale, crop, stack frames).
        Store the transition (phi, a, r, phi_new) in D.
        Sample a random batch of experience from D and set the target:
            r if the episode terminates,
            r + gamma * max_a' Q(phi_new, a') otherwise.
        Perform a gradient descent step on the error between the target and the predicted Q-value.
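
Putting the steps together, a hedged Python sketch of the loop. It reuses the ReplayMemory and dqn_loss sketches from the earlier cards, assumes a simplified environment whose reset()/step() return raw images (with step() returning (next_image, reward, done)), assumes preprocess() returns a torch tensor, and uses placeholder hyperparameters rather than the paper's values:

```python
import random
import torch

def train_dqn(env, q_net, target_net, memory, optimizer, num_actions,
              num_episodes=100, batch_size=32, gamma=0.99,
              epsilon=0.1, copy_period=1000, preprocess=lambda x: x):
    """Minimal DQN-style training loop (see Algorithm 1 in the paper for the original)."""
    step = 0
    for episode in range(num_episodes):
        state = preprocess(env.reset())                  # init state s and preprocess to phi
        done = False
        while not done:
            # With probability epsilon take a random action, otherwise the best action.
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                with torch.no_grad():
                    action = int(q_net(state.unsqueeze(0)).argmax(dim=1))
            next_image, reward, done = env.step(action)  # observe reward r and next image x
            next_state = preprocess(next_image)          # downscale, crop, stack frames
            memory.push(state, action, reward, next_state, done)
            state = next_state

            if len(memory.buffer) >= batch_size:
                batch = memory.sample(batch_size)        # random batch of experience
                # (stacking the sampled transitions into batched tensors is omitted here)
                loss = dqn_loss(q_net, target_net, batch, gamma)  # target: r, or r + gamma * max Q'
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                         # gradient descent step

            step += 1
            if step % copy_period == 0:                  # periodically refresh the target network
                target_net.load_state_dict(q_net.state_dict())
```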