Model Free RL Flashcards

1
Q

What is model free RL? What’s its goal and how does it get there?

A

Agent learns to make decisions solely from experience. The goal is to maximise reward, done by learning an optimal policy (a way to decide actions) using the learnt value function (a description of the subjective value of states in the world).

2
Q

What is the reward hypothesis of model free RL? Why is it a problem?

A

Model free RL assumes the goal of organisms/intelligent behaviour is to maximise reward. The problem is that it treats reward as something inherent to the environment when reward is subjective, and people don’t always act to maximise reward.

3
Q

What is a Markov Decision Process? What are its components?

A

Way of formalising an RL environment.
States = observations of variables in the world, the possible states a variable can be in
Reward function = positive feedback from being in a given state, represented numerically
Actions = legal operations the agent can take from one state to another, the things the agent can do
Transition function = description of how taking an action in a given state results in a change to a different state, i.e. how actions and states interact

4
Q

Give an example of RL using MDP in a grid world

A

State = position in the grid
Action = moving through the grid e.g. up, down, left, right
Reward = fruit in a grid square
Transition function = an action takes you to the neighbouring grid square in that direction
The agent moves through the grid world randomly until it bumps into the reward. The value of the states leading to the reward propagates back, and the agent uses this value function to learn the optimal policy.
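
A minimal sketch of how such a grid world could be written down as an MDP, assuming a hypothetical 2x2 grid with the fruit (reward of 1) in one corner and deterministic moves; the names and layout are illustrative only, not from the course material.

```python
# Hypothetical 2x2 grid world as an MDP (illustrative names and layout).
states = [(0, 0), (0, 1), (1, 0), (1, 1)]                                  # grid positions
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def transition(state, action):
    # Deterministic transition function: move one square in the chosen
    # direction, staying put if the move would leave the grid.
    row, col = state
    d_row, d_col = actions[action]
    next_state = (row + d_row, col + d_col)
    return next_state if next_state in states else state

def reward(state):
    # Reward function: the fruit sits in square (1, 1).
    return 1.0 if state == (1, 1) else 0.0
```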

5
Q

Explain the difference between the reward function and the value function

A

The reward function is in the world: it describes the state(s) in which there is reward. The value function is in the agent: it describes how useful states are in getting the agent to the reward.

6
Q

What is operant conditioning? What’s the law of effect and shaping?

A

Learning through trial and error, i.e. learning the consequences of actions. Law of effect = actions that are rewarded are repeated more often, actions that are punished are repeated less often. Shaping = rewarding successive approximations to the target behaviour.

7
Q

Give the delta rule and explain

A

The value of the current action is equal to the value of doing that action previously (did it get the agent closer to reward?) plus the reward prediction error (how much the reward from the current action differed from what was expected based on that previous value), scaled by the learning rate (how quickly the new reward updates the value).
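
Written as an equation (not given on the card, but a standard way to express the delta rule described above), where V is the value of action a, r is the reward received, and alpha is the learning rate:

$$ V_{new}(a) = V_{old}(a) + \alpha \big[\, r - V_{old}(a) \,\big] $$

The bracketed term is the reward prediction error.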

8
Q

Give three ways that model free RL/ the delta rule is a model for learning in the brain

A

Dopaminergic cells in midbrain signal reward prediction error (Hollerman and Schultz 1998). Cells in striatum code for action values (Samejima et al. 2005). Dopamine gating of connections between sensory input and action (Reynolds 2011)

9
Q

Explain Hollerman and Schultz 1998

A

Dopaminergic cells in the midbrain (ventral tegmental area) signal reward prediction error. In macaque monkeys there was little response when an image already known to give reward delivered the reward, peak activity when a new image gave reward, and that activity declined over operant learning for the new image.

10
Q

Explain Samejima et al. 2005

A

Cells in the striatum code for action values. Macaque monkeys made voluntary saccades to the left or right while reward probability was varied. Cells responded as if coding for the value of a particular saccade direction, e.g. an example cell showed most activity when the reward probability for rightward saccades was high and least when it was low, but no change in activity when the reward probability for leftward saccades was varied.

11
Q

Explain Reynolds et al. 2011

A

Dopamine gating of the connection between sensory input and action. Measured synaptic potentiation following intracranial self-stimulation in animals (pressing a lever stimulated the reward system). Hebbian learning strengthening the connection between a sensory input neuron and an action neuron is modulated by dopamine from the reward system: the connection is only enhanced in the presence of reward.

12
Q

What does solving the Bellman equation give?

A

Optimal policy for maximising reward in an MDP

13
Q

Give and explain the Bellman equation

A

The value of the current state is equal to the reward from taking a certain action in the current state plus the discounted value of the next state. It is computed recursively, working back from the final state. The discount is gamma to the power of n, where n is the number of states away from the current state.
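
In symbols (a standard deterministic form consistent with the description above), where R(s, a) is the reward for taking action a in state s, s' is the state that action leads to, and gamma is the discount factor:

$$ V(s) = \max_a \big[\, R(s, a) + \gamma \, V(s') \,\big] $$

Unrolling the recursion gives the "gamma to the power of n" form: $V(s) = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$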

14
Q

Why is there a discount function in the Bellman equation? What does its value do?

A

Without discounting, the value of every state would equal the value of the state with reward, so the agent would be unable to navigate towards the reward. The value of gamma determines the importance of long-term rewards compared to immediate rewards.

15
Q

What’s the assumption of the simplified Bellman equation? How does it work in deterministic versus stochastic environments?

A

Assumes transition probabilities are deterministic (a certain action in a given state always leads to a certain state). This works fine in a deterministic environment but not in stochastic environments: the value of a state is calculated from the expected return from that state onwards, and a deterministic expectation of this is wrong in a stochastic environment, so the agent cannot accurately evaluate value.

16
Q

How is the full Bellman equation different to the simplified Bellman equation? How does it work in deterministic and stochastic environments?

A

The full Bellman equation accounts for transition probabilities: the probability of transitioning to a certain state when an action is taken from the current state. In a deterministic environment this probability is always 1, so it cancels out and the equation works like the simplified Bellman equation. In stochastic environments it requires knowing the transition probabilities, i.e. having a model of the world, in order to learn.
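
In symbols (the standard form, consistent with the card), where P(s' | s, a) is the probability of reaching state s' when taking action a in state s:

$$ V(s) = \max_a \sum_{s'} P(s' \mid s, a)\, \big[\, R(s, a, s') + \gamma \, V(s') \,\big] $$

When the environment is deterministic, P(s' | s, a) is 1 for a single next state and 0 otherwise, which is why the equation reduces to the simplified form.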

17
Q

Why is it important to have model free RL that works in stochastic environments? If the Bellman equation doesn’t work, what can be used?

A

Model free RL is less computationally expensive as there is no need to store a model. Learning organisms don’t always have a model, e.g. in new environments. Real world environments often aren’t deterministic. Temporal difference learning can be used.

18
Q

What is Temporal difference learning?

A

Method of calculating value function to make value based decisions (e.g., in an MDP) without needing to directly solve the Bellman equation. Uses prediction errors over time, updating the value of the previous state based on the reward and value of the current state.

19
Q

Explain the TD learning equation

A

The updated value of the previous state comes from the current value of the previous state, plus the learning rate multiplied by the temporal difference reward prediction error: how much better the current state (reward plus value) is than the previous state’s current value. The learning rate controls how much of the existing value is kept and how much of the update is used in the new value.
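
As an equation (the standard TD(0) form, matching the description above), updating the value of the previous state s_{t-1} using the reward r_t and the value of the current state s_t, with learning rate alpha and discount gamma:

$$ V(s_{t-1}) \leftarrow V(s_{t-1}) + \alpha \big[\, r_t + \gamma V(s_t) - V(s_{t-1}) \,\big] $$

The bracketed term is the temporal difference reward prediction error.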

20
Q

Does TD learning solve the Bellman equation?

A

In a way, yes, but not directly. It approximates its principles of converging to an optimal policy through iterative updates whilst remaining model free in stochastic environments.

21
Q

What are the benefits of TD learning compared to the Bellman equation?

A

Computational efficiency, can handle incomplete data (no need to reach the end state/episode to learn), sample efficiency/less experience required, can adapt to stochastic environments, good for online learning (can adapt to changing/non-stationary environments)

22
Q

What is Q learning?

A

A type of TD learning whose value function uses a Q value, which estimates the value of state-action pairs (taking a certain action in a certain state) rather than just the value of the state. Uses a Q table of states x actions. This lets the agent differentiate the potential outcome of each action in each state and directly select the best action for the current state.
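
The standard Q learning update (consistent with the card’s description), which replaces the state value with the state-action value Q(s, a) and bootstraps from the best action in the next state:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big] $$

The max over a' in the next state is what makes Q learning off policy (see the next card).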

23
Q

What does it mean that Q learning is off policy?

A

It calculates Q values for optimal reward regardless of the behaviour policy. This allows for behaviour policies that don’t always follow the optimal Q values, i.e. ones that balance exploration and exploitation. It means that when the agent exploits/follows the Q values it gets the optimal reward, whilst still allowing the agent to explore.

24
Q

Why is exploration (not just exploitation) important?

A

Makes sure the agent can find the best rewards, e.g. it doesn’t settle on the first intermediate reward it finds, because exploration means it can keep looking for the end goal. Also makes sure the agent doesn’t get stuck, e.g. repeatedly exploiting an action that isn’t producing a transition to a new state, as when it has encountered an obstacle.

25
Q

What are some examples of explore/exploit strategies in Q learning?

A

Epsilon greedy strategy, softmax function

26
Q

What is an epsilon greedy strategy? How does it work?

A

Epsilon represents the probability of exploring (random choice) versus exploiting (following the optimal Q values): with probability epsilon the agent explores, and with probability 1-epsilon it exploits. Epsilon starts at 1 and decays over time, so the agent explores more when it knows the least and exploits more as it learns.
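
A minimal runnable sketch of epsilon greedy action selection with decay, assuming a toy set of Q values for a single state; the numbers and decay schedule are illustrative, not from a specific implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest Q value

q_values = np.array([0.1, 0.5, 0.2, 0.0])   # toy Q values for one state, four actions

epsilon = 1.0                                # start fully exploratory
actions_taken = []
for _ in range(100):
    actions_taken.append(epsilon_greedy(q_values, epsilon))
    epsilon = max(0.05, epsilon * 0.95)      # decay: exploit more as learning proceeds
```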

27
Q

How is the softmax function a strategy to balance exploration and exploitation?

A

Each action is chosen with a probability that scales with its value relative to the values of all the other actions (formally, via exponentials of the action values scaled by a temperature parameter). This means the frequency of choosing actions is in proportion to how good they are.
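
The standard softmax (Boltzmann) choice rule, where tau is a temperature parameter controlling how strongly choices follow the values (higher tau means more exploration):

$$ P(a) = \frac{e^{Q(a)/\tau}}{\sum_{a'} e^{Q(a')/\tau}} $$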

28
Q

How does the softmax function make sense with biological systems?

A

Softmax function is sigmoidal, biological systems have choice functions that are approximately sigmoidal. Reflects how biological systems can make choices that are intrinsically variable but broadly sensible.

29
Q

What is an eligibility trace in Q learning?

A

When a state-action pair is visited it is marked with an eligibility trace, e, initially set to a high value, which then decays as the agent moves through more transitions. The rate of decay is controlled by lambda. Value is updated back along the path taken, in proportion to how closely each state-action pair contributed to obtaining the reward.
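
One standard way to write the updates (a sketch of the usual Q(lambda) formulation rather than a specific course equation): after each step, every state-action pair is updated in proportion to its trace using the TD error delta, and all traces decay by gamma-lambda,

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta \, e(s, a), \qquad e(s, a) \leftarrow \gamma \lambda \, e(s, a) $$

with e(s, a) set to a high value (e.g. 1) when the pair is visited.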

30
Q

What did Fiorillo et al. 2003 and Flagel et al. 2011 find? What does this show about TD learning in the brain?

A

Fiorillo et al. 2003 - activity of dopamine neurons in macaque VTA signals reward prediction error; it shifts from the time of reward to the cue indicating reward as the probability of the cue leading to reward increases/as the animal learns the cue indicates reward.
Flagel et al. 2011 - dopamine signalling in the ventral striatum of rodents; measured dopamine concentrations, which at first responded to the onset of reward but with training shifted to the onset of the cue indicating reward.
Both show reward prediction errors, and therefore value updates, shifting/propagating back to previous states (cues).

31
Q

What are actor critic models? Why do they align with the idea of TD learning in the brain.

A

Split the agent into an actor, which learns the value of actions in a given state, and a critic, which learns the value of states. This aligns with the separation between learning the value of stimuli (states) in classical conditioning and learning the value of actions in certain states in operant conditioning. Evidence for this separation in the brain comes from O’Doherty et al. 2003 - fMRI shows reward prediction errors in the dorsal striatum during operant conditioning (actor) and in the ventral striatum during classical conditioning (critic).

32
Q

What might align with eligibility traces at the neural level?

A

Synaptic tagging. A Gerstner 2018 review discusses evidence that synapses become tagged, making them temporarily more receptive to subsequent stimulation, which can lead to greater strengthening of the connection when reward is received; but it is not entirely clear what might align.

33
Q

Explain the limitations of RL using the four rooms problem as an example

A

It scales poorly to large environments with sparse rewards (like those in the real world). The four rooms problem requires the agent to pass through specific bottleneck states (doorways), which is difficult to do under the initial random policy.

34
Q

What is Temporal abstraction? How does that help in environments such as the four rooms?

A

A method to make decisions over longer timescales than individual timesteps through clusters of actions. In the four rooms problem this means that instead of just primitive actions like up, down, left and right, there are clustered action sequences like “go to the doorway”.

35
Q

What is hierarchical reinforcement learning/the options framework?

A

Botvinick et al. 2009. A special set of states that are predefined sub-goals, where an option (a series of actions executed until completion) can be initiated or terminated. Value from pseudo-rewards is backed up to the primitive actions leading to a termination state, and when the end goal is reached this value is backed up to the initiation state, so the whole sequence is learnt as an “option”.
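
A minimal sketch of what an “option” could look like as a data structure, matching the card’s description (initiation states, an internal policy, and termination/sub-goal states); all names are illustrative, not from a specific library.

```python
from dataclasses import dataclass

@dataclass
class Option:
    initiation_states: set   # states where the option can be selected
    policy: dict             # state -> primitive action followed while the option runs
    termination_states: set  # sub-goal states where the option ends (pseudo-reward delivered)

    def can_start(self, state):
        return state in self.initiation_states

    def terminates_at(self, state):
        return state in self.termination_states
```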

36
Q

What’s the benefit of HRL? What’s the issue?

A

It accelerates learning in environments that regular RL struggles in, e.g. the four rooms problem. The problem is that this only holds if the sub-goals are in the right place; agents do worse if given inappropriately specified sub-goals, e.g. windows rather than doorways in the four rooms. How to discover/learn sub-goals (rather than just being given them) is unsolved, which limits its utility in real world settings, as it depends on knowing how to split the environment into sub-goals. It displaces the problem of inefficient learning into one of how best to represent the environment.

37
Q

What are the aspects of HRL that can be seen in the brain?

A

Mechanism to ensure option followed once initiated rather than other primitive actions. Monitoring for presence of termination state at which point can select new action. Prediction errors in relation to sub goals/pseudo rewards.

38
Q

Explain Kennerley et al. 2006

A

Monkeys learn set of actions to receive a reward and to switch to different set of actions at a certain point. Those with lesions to dACC would wrongly switch away from current action sequence more often but no difference in switching at correct time compared to controls. Shows dACC has crucial role in making sure an initiated action sequence (option) is continued.

39
Q

Explain Shidara and Richmond 2002

A

dACC neuron activity increases with proximity to reward in extended behaviours requiring sequences of actions to receive reward. This could be signalling proximity to termination points. Aligns with other findings of activity signalling proximity to a switch point in tasks.

40
Q

Explain Ribas-Fernandes et al. 2012

A

A package-delivery task in which participants drive to pick up a package (unrewarded sub-goal) and then deliver it (rewarded end goal). The package could randomly change location, changing the distance to the package whilst maintaining the overall distance to the end goal. dACC activity occurred when the sub-goal changed location, aligning with a reward prediction error for the pseudo-reward at the sub-goal.

41
Q

Explain the circuits for model free RL in the brain

A

The striatum codes the value of actions in line with TD learning, via a gating signal from midbrain dopamine neurons. The dACC facilitates extended action selection in a way that resembles HRL. Botvinick 2008 suggests a circuit mechanism for model free RL in the brain involving the striatum and ACC.