Playing Atari with Deep RL Flashcards
Test paper knowledge
What is the e-greedy policy?
The e-greedy (epsilon-greedy) policy chooses a random action with probability epsilon and the greedy action (the one with the highest estimated Q-value) with probability 1 − epsilon.
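A minimal sketch of this action-selection rule in Python (the q_values array, the action count, and the epsilon value are illustrative assumptions, not values from the paper):

```python
import random

import numpy as np


def epsilon_greedy_action(q_values: np.ndarray, epsilon: float) -> int:
    """Pick a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        # Explore: uniformly random action
        return random.randrange(len(q_values))
    # Exploit: action with the highest estimated Q-value (probability 1 - epsilon)
    return int(np.argmax(q_values))


# Example: 4 actions, 10% exploration
action = epsilon_greedy_action(np.array([0.1, 0.5, -0.2, 0.3]), epsilon=0.1)
```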
What is a Markov decision process (MDP)?
An MDP is a process with elements of randomness whose evolution is partly influenced by an agent.
Time advances in discrete steps, so there is a discrete set of states the agent observes and a discrete set of actions it can take, and the next state depends only on the current state and action (the Markov property).
How is an MDP used to model agent-environment interaction?
The agent takes actions that feed into the environment: the action a and the current state s are passed to the environment, which then samples the next state s' according to its transition probabilities. The agent also receives a reward reflecting how the outcome affects it (positive for good outcomes, negative for bad ones).
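A sketch of that interaction loop, assuming a hypothetical environment with reset/step methods and an agent with act/observe methods (these names are illustrative, not an API from the paper):

```python
def run_episode(env, agent, max_steps=1000):
    """Run one agent-environment episode and return the total reward."""
    total_reward = 0.0
    s = env.reset()                      # initial state s
    for t in range(max_steps):
        a = agent.act(s)                 # agent picks action a in state s
        s_next, r, done = env.step(a)    # environment samples next state and reward
        agent.observe(s, a, r, s_next)   # positive r reinforces, negative r penalises
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```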
What is the equation for the discounted return
It is the sum of the rewards at each step, with each reward multiplied by the discount factor γ (between 0 and 1) raised to the power of how far that step lies in the future: R_t = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}, i.e. R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …
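A small sketch that computes this return for a list of rewards (the reward values and gamma are made-up examples):

```python
def discounted_return(rewards, gamma):
    """R_t = sum over t' >= t of gamma**(t' - t) * r_t', computed from step t = 0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Example: rewards [1, 2, 3] with gamma = 0.9 -> 1 + 0.9*2 + 0.81*3 = 5.23
print(discounted_return([1.0, 2.0, 3.0], gamma=0.9))
```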
What is a Q-network?
It is a neural network used as a function approximator: given a state, it approximates the Q-value of each action that can be taken from that state.
Why use a Q-network instead of a Q-table?
For a large number of states and actions, a table runs into both a memory problem and a computation-time problem.
First, the amount of memory required to save and update that table would increase as the number of states increases.
Second, the amount of time required to explore each state to create the required Q-table would be unrealistic.
A network instead receives a state and produces a Q-value for each action available in that state.
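A minimal sketch of such a network, assuming PyTorch and a small fully connected architecture; the state dimension, layer sizes, and action count are illustrative, and the paper itself uses a convolutional network over stacked Atari frames:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))  # shape (1, 2): one Q-value per action
```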
How does the Q network learn the Q estimation?
By minimising the difference between predicted Q-values and the target Q-values.
The target Q-value is the sum of the observed reward and the discounted maximum predicted Q-value of the next state.
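A sketch of one such learning step, assuming PyTorch; the variable names, batch layout, and use of MSE loss are illustrative assumptions (the target network used here for the next-state values is described further below):

```python
import torch
import torch.nn.functional as F


def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One update on a batch of transitions (s, a, r, s', done)."""
    # Predicted Q-values for the actions actually taken
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target: observed reward + discounted max Q-value of the next state
    with torch.no_grad():
        q_next_max = target_net(s_next).max(dim=1).values
        q_target = r + gamma * q_next_max * (1.0 - done)  # no bootstrap at terminal states

    loss = F.mse_loss(q_pred, q_target)  # minimise predicted-vs-target gap
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```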
What does gamma (the discount factor) stand for?
A factor between 0 and 1 that is raised to the power of the time step and multiplies the reward at each step.
This factor makes the agent prefer early rewards (low gamma) or also value later rewards (high gamma). It can also be interpreted as the probability that the episode continues at each step.
What will happen if gamma is low
The agent will heavily discount future rewards and will favour actions that yield immediate reward.
What will happen if gamma is high
The agent will value future rewards more highly and will consider long-term consequences.
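A quick numeric illustration of the effect of gamma on a reward that arrives 10 steps in the future (the numbers are made up):

```python
# How gamma weighs a reward that arrives 10 steps in the future
late_reward = 1.0
for gamma in (0.5, 0.99):
    print(gamma, gamma ** 10 * late_reward)
# gamma = 0.5  -> ~0.001  (future reward nearly ignored)
# gamma = 0.99 -> ~0.904  (future reward still counts strongly)
```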
What is the equation for (tabular, non-deep) Q-learning?
At each step where the agent takes an action, the Q-value is updated as:
Q(s, a) ← (1 − alpha) * Q(s, a) + alpha * (r_{t+1} + gamma * max_{a'} Q(s', a'))
alpha = learning rate
gamma = discount factor
r = reward received after taking action a in state s
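A direct translation of this update rule into Python, with an illustrative table size, alpha, and gamma:

```python
import numpy as np


def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Tabular Q-learning update for one transition (s, a, r, s')."""
    target = r + gamma * np.max(Q[s_next])           # r_{t+1} + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target


Q = np.zeros((5, 2))                                 # 5 states, 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```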
What is the target network
The target network is a network with the same architecture as DQN. It is used in the training phase to give the target Q values for the DQN to learn.
How is the target network trained?
The target network is not trained directly; instead, its weights are copied from the DQN at fixed intervals.
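A sketch of that periodic copy, assuming PyTorch; the copy period and function name are illustrative:

```python
COPY_PERIOD = 1_000  # illustrative number of training steps between copies


def maybe_sync_target(step: int, q_net, target_net) -> None:
    """Periodically copy the DQN's weights into the target network."""
    if step % COPY_PERIOD == 0:
        # The target network itself is never updated by gradient descent;
        # it only receives the DQN's weights at fixed intervals.
        target_net.load_state_dict(q_net.state_dict())
```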
Write the equation for the optimal action-value function
Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π]
where R_t = Σ_{t'=t}^{T} γ^(t'−t) r_{t'}
Explain in words the optimal action-value function
It is the maximum expected return achievable by any policy from the state (s) after taking action (a), considering all possible future states.