8 - Deep Reinforcement Learning Flashcards
Reinforcement Learning
An agent learns by trial and error, acting in an environment to maximise cumulative reward.
Deep Q-Learning
Use a neural network (a CNN) to predict the Q-value of each action, then select the action with the largest Q-value.
Deep Q-Learning Network Architecture
3 convolutional layers followed by 2 fully connected layers.
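A minimal PyTorch sketch of that layout, assuming the 84x84, 4-frame stacked input and the filter sizes from the Nature DQN paper (assumptions; the card only states the layer counts):

    import torch.nn as nn

    class DQN(nn.Module):
        # 3 convolutional layers followed by 2 fully connected layers.
        def __init__(self, n_actions, in_channels=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.fc = nn.Sequential(
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),  # one Q-value per action
            )

        def forward(self, x):                 # x: (N, 4, 84, 84)
            x = self.conv(x)                  # -> (N, 64, 7, 7)
            return self.fc(x.flatten(1))      # -> (N, n_actions)

The greedy choice from the previous card is then q_net(state).argmax(dim=1).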
How to fix: consecutive samples may be correlated (Deep Q-Learning)
Experience replay: store the agent's experiences in a replay memory and draw random mini-batches from the pool of stored samples.
(Gives a varied, decorrelated batch of data.)
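A minimal replay-memory sketch (class name and capacity are illustrative assumptions):

    import random
    from collections import deque

    class ReplayMemory:
        # Fixed-size pool of past transitions; uniform random sampling
        # breaks the correlation between consecutive samples.
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)  # oldest samples drop off

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            return random.sample(self.buffer, batch_size)  # uniform, without replacement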
How to fix: small updates to the Q-values may significantly change the policy
Use a separate target network: copy the online network's weights into it every 10,000 steps rather than after every individual step, so the targets stay stable between updates.
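One common way to implement this, sketched with the DQN class from above (the function name is illustrative):

    import copy

    policy_net = DQN(n_actions=4)           # trained every step
    target_net = copy.deepcopy(policy_net)  # frozen copy used for targets

    TARGET_UPDATE = 10_000  # steps between syncs, per the card above

    def maybe_sync_target(step):
        # Copy the online weights into the target network every 10k steps,
        # so the bootstrapped targets stay fixed between syncs.
        if step % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())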
Epsilon value
Starts at 1 and typically decays over training.
The probability of choosing to explore (i.e. select a random action).
Exploration
E.g. select a random action.
Allows the agent to improve its current knowledge of the environment.
Exploitation
Act on what the agent has learnt so far.
Choose the greedy action, exploiting the agent's current action-value estimates to get the most reward. May be sub-optimal if those estimates are inaccurate.
Epsilon-Greedy Action Selection
With probability ε, select a random action a_t;
otherwise select a_t = argmax_a Q(Φ(s_t), a; θ).
Balances exploration and exploitation.
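A sketch of the selection rule, assuming a PyTorch Q-network like the DQN sketch above (names are illustrative):

    import random
    import torch

    def epsilon_greedy(q_net, phi_s, n_actions, epsilon):
        if random.random() < epsilon:          # explore: random action
            return random.randrange(n_actions)
        with torch.no_grad():                  # exploit:
            # a_t = argmax_a Q(Φ(s_t), a; θ)
            return int(q_net(phi_s.unsqueeze(0)).argmax(dim=1).item())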
Store transition
Store the transition (Φ(s_t), a_t, r_t, Φ(s_{t+1})) in the replay memory, so training can draw on stored states rather than only the continuous stream of incoming frames.
Sample random minibatch of transitions
Use a random sampling function to select stored transitions uniformly from the replay memory.
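Putting the last few cards together, a sketch of one training step that samples a minibatch and regresses Q towards the target-network estimate (assumes the DQN, ReplayMemory and target_net sketches above; γ = 0.99 is an assumed discount factor not stated on the cards):

    import torch
    import torch.nn.functional as F

    GAMMA = 0.99  # assumed discount factor

    def train_step(policy_net, target_net, memory, optimizer, batch_size=32):
        batch = memory.sample(batch_size)  # random minibatch from replay memory
        states, actions, rewards, next_states, dones = (
            torch.stack([torch.as_tensor(x) for x in col]) for col in zip(*batch))
        # Q(Φ(s_t), a_t; θ) for the actions actually taken
        q = policy_net(states.float()).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # TD target from the frozen target network
            best_next = target_net(next_states.float()).max(dim=1).values
            target = rewards.float() + GAMMA * best_next * (1 - dones.float())
        loss = F.smooth_l1_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()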