Model Free RL Flashcards
What is model free RL? What’s its goal and how does it get there?
Agent learns to make decisions solely from experience, without a model of the environment. Goal is to maximise reward, done by learning an optimal policy (a way to decide actions) using the learnt value function (a description of the subjective value of states in the world).
What is the reward hypothesis of model free RL? Why is it a problem?
Model free RL assumes the goal of organisms/intelligent behaviour is to maximise reward. The problem is that it treats reward as something inherent to the environment when reward is actually subjective, and people don’t always act to maximise reward.
What is a Markov Decision Process? What are its components?
Way of formalising an RL environment.
States = observations of variables in the world; the possible states a variable can be in
Reward function = positive feedback from being in a given state, represented numerically
Actions = legal operations the agent can take to move from one state to another; the things the agent can do
Transition function = description of how taking an action in a given state results in a change to a different state; how actions and states interact (see the sketch below)
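A minimal sketch of how these four components could be bundled in code (the names and types here are illustrative assumptions, not a standard API):

```python
# A possible way to represent the four MDP components (illustrative only).
from typing import Callable, List, NamedTuple

class MDP(NamedTuple):
    states: List                                      # possible states of the world
    actions: List                                     # legal operations the agent can do
    reward: Callable[[object], float]                 # state -> numerical feedback
    transition: Callable[[object, object], object]    # (state, action) -> next state
```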
Give an example of RL using MDP in a grid world
State = position in the grid
Action = moving through grid e.g. up, down, left, right
Reward = fruit in a grid square
Transition function = action takes you to neighbouring grid square in that direction
Agent would move through the grid world randomly until it bumps into the reward. The value of the states leading to the reward propagates back, and the agent uses this value function to learn the optimal policy (sketched below).
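A minimal Python sketch of this grid world (grid size, fruit position and reward value are assumptions for illustration): states are grid positions, actions are moves, the reward function marks the fruit square, the transition function moves the agent to the neighbouring square, and the agent wanders randomly until it hits the reward.

```python
# Minimal grid-world MDP sketch (sizes, positions and values are assumed).
import random

SIZE = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
FRUIT = (3, 3)                                    # assumed location of the reward

def reward(state):
    """Reward function: fruit in one grid square."""
    return 1.0 if state == FRUIT else 0.0

def transition(state, action):
    """Deterministic transition: move to the neighbouring square, staying in bounds."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    return (row, col)

# Random exploration until the agent bumps into the reward.
state = (0, 0)
while reward(state) == 0.0:
    state = transition(state, random.choice(list(ACTIONS)))
```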
Explain the difference between the reward function and the value function
Reward function is in the world: it describes the state(s) in which there is reward. Value function is in the agent: it describes how useful states are in getting the agent to the reward.
What is operant conditioning? What’s the law of effect and shaping?
Learning through trial and error: learning the consequences of actions. Law of effect = actions that are rewarded are repeated more often; actions that are punished are repeated less often. Shaping = rewarding successive approximations to the target behaviour
Give the delta rule and explain
The value of the current action is equal to the value of doing that action previously (did it get the agent closer to reward?) plus the reward prediction error (was the reward from the current action expected, based on the previous value of that action?) scaled by the learning rate (how quickly the new reward updates the value)
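A common way to write the delta rule (the symbols here are assumed notation, not from the card; V is the value of action a, r the reward received, \alpha the learning rate):

V_{new}(a) = V_{old}(a) + \alpha \, (r - V_{old}(a))

The bracketed term is the reward prediction error; \alpha scales how far the value moves towards it.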
Give three ways that model free RL/ the delta rule is a model for learning in the brain
Dopaminergic cells in the midbrain signal reward prediction error (Hollerman and Schultz 1998). Cells in the striatum code for action values (Samejima et al. 2005). Dopamine gates connections between sensory input and action (Reynolds et al. 2011)
Explain Hollerman and Schultz 1998
Dopaminergic cells in the midbrain (ventral tegmental area) signal reward prediction error. In macaque monkeys, cells showed little response when a familiar image known to give reward was followed by reward, peak activity when a new image was followed by reward, and activity that declined across operant learning of the new image.
Explain Samejima et al. 2005
Cells in the striatum code for action values. Macaque monkeys made voluntary saccades to the left or right while reward probability was varied. Cells responded as if coding for the value of a particular saccade direction, e.g. an example cell was most active when the reward probability for a rightward saccade was high and least active when it was low, but did not change its activity when the reward probability for a leftward saccade was varied.
Explain Reynolds et al. 2011
Dopamine gating of the connection between sensory input and action. Measured synaptic potentiation following intracranial self-stimulation in animals (pressing a lever stimulated the reward system). Hebbian learning strengthening the connection between a sensory input neuron and an action neuron was modulated by dopamine from the reward system: the connection was only enhanced in the presence of reward.
What does solving the Bellman equation give?
Optimal policy for maximising reward in an MDP
Give and explain the Bellman equation
Value of the current state is equal to the reward from taking a certain action at the current state plus the discounted value of the next state. Computed recursively, working back from the final state. The discount factor gamma is raised to the power of n, where n = number of states away from the current state.
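A common way to write the simplified equation (notation assumed: s' is the next state reached from s, \gamma the discount factor):

V(s) = \max_a \, [ \, R(s, a) + \gamma \, V(s') \, ]

Unrolling this recursion is what gives a reward n states away a weight of \gamma^n.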
Why is there a discount function in the Bellman equation? What does its value do?
Without discounting, the value of all states would equal the value of the state with the reward, so the agent would be unable to navigate towards it. The value of gamma determines the importance of long-term rewards compared to immediate rewards.
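A quick worked example with assumed numbers: a single reward of 1 at the goal and \gamma = 0.9 gives states one, two and three steps away the values

\gamma^1 = 0.9, \quad \gamma^2 = 0.81, \quad \gamma^3 = 0.729

so value rises as the agent gets closer and it can follow the gradient; with no discounting (\gamma = 1) all three states would have value 1 and give no direction.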
What’s the assumption of the simplified Bellman equation? How does it work in deterministic versus stochastic environments?
Assumes transitions are deterministic (a certain action in a given state always leads to a certain state). Works fine in deterministic environments but not in stochastic ones: the value of a state is based on the expected return from that state onwards, and a deterministic expectation is wrong when transitions are stochastic, so the agent cannot accurately evaluate value.
How is the full Bellman equation different to the simplified Bellman equation? How does it work in deterministic and stochastic environments?
The full Bellman equation accounts for transition probabilities: the probability of transitioning to a certain state when an action is taken from the current state. In a deterministic environment that probability is always 1, so it cancels out and the equation works like the simplified Bellman equation. In stochastic environments it requires knowing the transition probabilities, i.e. having a model of the world, to learn.
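The full equation is commonly written with the transition probability P(s' \mid s, a) (notation assumed):

V(s) = \max_a \sum_{s'} P(s' \mid s, a) \, [ \, R(s, a, s') + \gamma \, V(s') \, ]

In a deterministic environment P(s' \mid s, a) is 1 for exactly one next state and 0 for the rest, so the sum collapses back to the simplified form.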
Why is it important to have model free RL that works in stochastic environments? If the Bellman equation doesn’t work, what can be used?
Model free learning is less computationally expensive as there is no need to store a model. Learning organisms don’t always have a model, e.g. in new environments, and real-world environments often aren’t deterministic. Temporal difference learning can be used instead.
What is Temporal difference learning?
Method of calculating value function to make value based decisions (e.g., in an MDP) without needing to directly solve the Bellman equation. Uses prediction errors over time, updating the value of the previous state based on the reward and value of the current state.
Explain the TD learning equation
The updated value of the previous state comes from the current value of the previous state, plus the learning rate multiplied by the temporal difference reward prediction error. This error is how much better the current state turned out to be (reward received plus value of the current state) than the previous estimate (the current value of the previous state). The learning rate controls how much of the old value is kept and how much of the new information is used in the update.
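One common way to write this update (the subscripted notation is an assumption, not from the card): s_{t-1} is the previous state, s_t the current state, r_t the reward just received, \alpha the learning rate and \gamma the discount factor:

V(s_{t-1}) \leftarrow V(s_{t-1}) + \alpha \, [ \, r_t + \gamma \, V(s_t) - V(s_{t-1}) \, ]

The term in square brackets is the temporal difference error described above.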
Does TD learning solve the Bellman equation?
In a way, yes, but not directly. It approximates the Bellman equation’s solution, converging towards an optimal policy through iterative updates whilst remaining model free in stochastic environments.
What are the benefits of TD learning compared to the Bellman equation?
Computational efficiency; can handle incomplete data (no need to reach the end state/end of an episode to learn); sample efficiency (less experience required); can adapt to stochastic environments; good for online learning (can adapt to changing/non-stationary environments)
What is Q learning?
Type of TD learning whose value function is a Q value, which estimates the value of state-action pairs (taking a certain action in a certain state) rather than just the value of the state. Uses a Q table of states x actions. Can differentiate the potential outcome of each action in each state, so the agent can directly select the best action for the current state.
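The Q learning update is usually written as (notation assumed; s' is the state reached after taking action a in state s):

Q(s, a) \leftarrow Q(s, a) + \alpha \, [ \, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \, ]

The \max over the next action a' is what makes Q learning off policy (next card).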
What does it mean that Q learning is off policy?
Calculates Q values for the optimal reward regardless of the behaviour policy. This allows behaviour policies that don’t always follow the optimal Q values, i.e. ones that balance exploration and exploitation. It means that when the agent exploits/follows the Q values it gets the optimal reward, whilst still allowing the agent to explore.
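A minimal Python sketch of off-policy Q learning with an epsilon-greedy behaviour policy (all names and parameter values here are illustrative assumptions): the behaviour policy sometimes explores, but the update always bootstraps from the best next action.

```python
# Off-policy Q learning sketch with an epsilon-greedy behaviour policy (assumed values).
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate
Q = defaultdict(float)                  # Q table: (state, action) -> value

def choose_action(state, actions):
    """Behaviour policy: mostly exploit the Q values, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore
    return max(actions, key=lambda a: Q[(state, a)])    # exploit

def update(state, action, reward, next_state, actions):
    """Off-policy target: uses the best next action, not the one actually taken."""
    td_target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```

Calling update after each step fills in the Q table towards the optimal values regardless of whether choose_action explored or exploited on that step.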
Why is exploration (not just exploitation) important?
Makes sure the agent can find the best rewards, e.g. it doesn’t settle on the first intermediate reward it finds, because exploration lets it keep looking for the end goal. Also makes sure the agent doesn’t get stuck, e.g. repeatedly exploiting an action that isn’t producing a transition to a new state, such as when it has hit an obstacle.