Model Free RL Flashcards
What is model free RL? What’s its goal and how does it get there?
The agent learns to make decisions solely from experience. The goal is to maximise reward, achieved by learning an optimal policy (a way to decide actions) using a learnt value function (a description of the subjective value of states in the world).
What is the reward hypothesis of model free RL? Why is it a problem?
Model free RL assumes the goal of organisms/intelligent behaviour is to maximise reward. The problem is that it treats reward as something inherent to the environment when reward is actually subjective, and people don’t always act to maximise reward.
What is a Markov Decision Process? What are its components?
Way of formalising an RL environment.
States = observations of variables in the world; the possible states a variable can be in
Reward function = positive feedback from being in a given state, represented numerically
Actions = legal operations that take the agent from one state to another; things the agent can do
Transition function = description of how taking an action in a given state results in a change to a different state; how actions and states interact
Give an example of RL using MDP in a grid world
State = position in the grid
Action = moving through grid e.g. up, down, left, right
Reward = fruit in a grid square
Transition function = action takes you to neighbouring grid square in that direction
The agent moves through the grid world randomly until it bumps into the reward. The value of the states leading to the reward then propagates back, and the agent uses this value function to learn the optimal policy.
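A minimal Python sketch of this grid world as an MDP, with random exploration and a TD-style value update so that value propagates back from the rewarded square (the grid size, fruit location, learning rate, and discount are illustrative assumptions, not from the flashcards):

```python
import random

# Illustrative 4x4 grid world: states are (row, col) positions,
# actions move to the neighbouring square, reward is fruit in one square.
SIZE = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
FRUIT = (3, 3)  # assumed location of the reward

def transition(state, action):
    """Deterministic transition function: move one square, stay put at walls."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (max(0, min(SIZE - 1, r + dr)), max(0, min(SIZE - 1, c + dc)))

def reward(state):
    """Reward function: 1 for the fruit square, 0 elsewhere."""
    return 1.0 if state == FRUIT else 0.0

# Random exploration: the agent wanders until it bumps into the reward,
# updating state values so that value propagates back along visited states.
values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
alpha, gamma = 0.1, 0.9  # assumed learning rate and discount

for episode in range(500):
    state = (0, 0)
    while state != FRUIT:
        action = random.choice(list(ACTIONS))
        next_state = transition(state, action)
        # TD(0)-style update: value of the next state propagates back
        values[state] += alpha * (reward(next_state) + gamma * values[next_state] - values[state])
        state = next_state
```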
Explain the difference between the reward function and the value function
The reward function is in the world: it describes the state(s) in which there is reward. The value function is in the agent: it describes how useful states are for getting the agent to the reward.
What is operant conditioning? What’s the law of effect and shaping?
Learning through trial and error, i.e. learning the consequences of actions. Law of effect = actions that are rewarded are repeated more often, actions that are punished are repeated less often. Shaping = rewarding successive approximations to the target behaviour.
Give the delta rule and explain
V_new = V_old + α × (reward − V_old). The value of the current action is equal to the value of doing that action previously (did it get the agent closer to reward) plus the reward prediction error (was the reward from the current action expected, given the previous value of that action), scaled by the learning rate α (how quickly a new reward updates the value).
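A minimal Python sketch of this update rule (the starting value, reward, and learning rate are assumptions for illustration):

```python
def delta_rule_update(old_value, reward, learning_rate=0.1):
    """Update an action's value by the reward prediction error, scaled by the learning rate."""
    prediction_error = reward - old_value  # was the reward better or worse than expected?
    return old_value + learning_rate * prediction_error

# Example: an action previously valued at 0.5 repeatedly yields a reward of 1.0
value = 0.5
for _ in range(5):
    value = delta_rule_update(value, reward=1.0)
    print(round(value, 3))  # value creeps towards 1.0 at a rate set by the learning rate
```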
Give three ways that model free RL/ the delta rule is a model for learning in the brain
Dopaminergic cells in the midbrain signal reward prediction error (Hollerman and Schultz 1998). Cells in the striatum code for action values (Samejima et al. 2005). Dopamine gates connections between sensory input and action (Reynolds et al. 2011).
Explain Hollerman and Schultz 1998
Dopaminergic cells in the midbrain (ventral tegmental area) signal reward prediction error. In macaque monkeys, cells showed little response when a familiar image known to give reward was rewarded, peak activity when a new image gave reward, and a response to the new image that declined over the course of operant learning.
Explain Samejima et al. 2005
Cells in the striatum code for action values. Macaque monkeys made voluntary saccades to the left or right while reward probability was varied. Cells responded as if coding for the value of a particular saccade direction, e.g. one example cell was most active when the reward probability for rightward saccades was high and least active when it was low, but did not change its activity when the reward probability for leftward saccades was varied.
Explain Reynolds et al. 2011
Dopamine gates the connection between sensory input and action. Measured synaptic potentiation following intracranial self-stimulation in animals (pressing a lever stimulated the reward system). Hebbian strengthening of the connection between a sensory input neuron and an action neuron was modulated by dopamine from the reward system: the connection was only enhanced in the presence of reward.
What does solving the Bellman equation give?
Optimal policy for maximising reward in an MDP
Give and explain the Bellman equation
V(s) = R(s, a) + γ·V(s′): the value of the current state is equal to the reward from taking a certain action at the current state plus the discounted value of the next state. Computed recursively, working back from the final state. The discount factor γ is raised to the power of n, where n = the number of states away from the current state, so distant rewards contribute less.
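A minimal sketch of this (simplified, deterministic) Bellman backup on a short chain of states ending in a reward, computed backwards from the final state (the chain length, reward of 1, and γ = 0.9 are assumptions for illustration):

```python
# Chain of states s0 -> s1 -> s2 -> s3, with reward 1 only on entering s3.
gamma = 0.9
rewards = [0.0, 0.0, 1.0]        # reward for the transition into s1, s2, s3
values = [0.0, 0.0, 0.0, 0.0]    # one value per state, terminal state stays 0

# Simplified Bellman equation, solved backwards from the final state:
# V(s) = R(s, a) + gamma * V(s')
for s in reversed(range(3)):
    values[s] = rewards[s] + gamma * values[s + 1]

print(values)  # [0.81, 0.9, 1.0, 0.0] - value falls off by a factor of gamma per extra step from the reward
```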
Why is there a discount function in the Bellman equation? What does its value do?
Without discounting, the value of every state would equal the value of the state with the reward, so the agent would be unable to navigate towards it. The value of γ determines the importance of long-term rewards relative to immediate rewards.
What’s the assumption of the simplified Bellman equation? How does it work in deterministic versus stochastic environments?
Assumes transition probabilities are deterministic (a certain action in a given state always leads to a certain next state). This works fine in deterministic environments but not in stochastic ones: the value of a state is calculated from the expected return from that state onwards, and a deterministic expectation is wrong when transitions are stochastic, so the agent cannot accurately evaluate state values.
How is the full Bellman equation different to the simplified Bellman equation? How does it work in deterministic and stochastic environments?
The full Bellman equation accounts for transition probabilities: the probability of ending up in a given next state when an action is taken from the current state, with the value of each possible next state weighted by that probability (an expectation over next states). In a deterministic environment the probability is always 1, so it cancels out and the equation works like the simplified Bellman equation. In stochastic environments it requires knowing the transition probabilities, i.e. having a model of the world, to learn.
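A rough sketch of the full Bellman backup with transition probabilities, run as value iteration on a tiny two-state stochastic example (the states, actions, probabilities, rewards, and γ are all illustrative assumptions):

```python
# States: "start" and "goal". From "start", action "go" reaches "goal" with
# probability 0.8 and slips back to "start" with probability 0.2.
gamma = 0.9
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "start": {"go": [(0.8, "goal", 1.0), (0.2, "start", 0.0)],
              "stay": [(1.0, "start", 0.0)]},
    "goal": {"stay": [(1.0, "goal", 0.0)]},
}

values = {s: 0.0 for s in transitions}

# Full Bellman equation: V(s) = max_a sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
for _ in range(100):  # iterate until the values converge
    values = {
        s: max(
            sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in transitions.items()
    }

print(values)  # "start" settles near 0.8 / (1 - 0.2 * 0.9) ~= 0.98; "goal" stays 0
```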