Model Free RL Flashcards
What is model free RL? What’s its goal and how does it get there?
Agent learns to make decisions solely from experience, without a model of the environment. Goal is to maximise reward, done by learning an optimal policy (a way to decide actions) using the learnt value function (a description of the subjective value of states in the world).
What is the reward hypothesis of model free RL? Why is it a problem?
Model free RL assumes the goal of organisms/intelligent behaviour is to maximise reward. The problem is that it treats reward as something inherent to the environment when reward is actually subjective, and people don’t always act to maximise reward.
What is a Markov Decision Process? What are its components?
Way of formalising an RL environment.
States = observations of variables in the world; the possible states a variable can be in
Reward function = positive feedback from being in a given state, represented numerically
Actions = legal operations the agent can take to move from one state to another; the things the agent can do
Transition function = description of how taking an action in a given state results in a change to a different state; how actions and states interact (see the sketch below)
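A minimal sketch of how these four components could be bundled in code (the names and types here are illustrative assumptions, not a standard API):

```python
# A possible way to represent the four MDP components (illustrative only).
from typing import Callable, List, NamedTuple

class MDP(NamedTuple):
    states: List                                      # possible states of the world
    actions: List                                     # legal operations the agent can do
    reward: Callable[[object], float]                 # state -> numerical feedback
    transition: Callable[[object, object], object]    # (state, action) -> next state
```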
Give an example of RL using MDP in a grid world
State = position in the grid
Action = moving through grid e.g. up, down, left, right
Reward = fruit in a grid square
Transition function = action takes you to neighbouring grid square in that direction
Agent would move through the grid world randomly until it bumps into the reward. The value of the states leading to the reward propagates back, and the agent uses this value function to learn the optimal policy (sketched below).
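A minimal Python sketch of this grid world (grid size, fruit position and reward value are assumptions for illustration): states are grid positions, actions are moves, the reward function marks the fruit square, the transition function moves the agent to the neighbouring square, and the agent wanders randomly until it hits the reward.

```python
# Minimal grid-world MDP sketch (sizes, positions and values are assumed).
import random

SIZE = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
FRUIT = (3, 3)                                    # assumed location of the reward

def reward(state):
    """Reward function: fruit in one grid square."""
    return 1.0 if state == FRUIT else 0.0

def transition(state, action):
    """Deterministic transition: move to the neighbouring square, staying in bounds."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    return (row, col)

# Random exploration until the agent bumps into the reward.
state = (0, 0)
while reward(state) == 0.0:
    state = transition(state, random.choice(list(ACTIONS)))
```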
Explain the difference between the reward function and the value function
Reward function is in the world: it describes the state(s) in which there is reward. Value function is in the agent: it describes how useful states are in getting the agent to the reward.
What is operant conditioning? What’s the law of effect and shaping?
Learning through trial and error: learning the consequences of actions. Law of effect = actions that are rewarded are repeated more often; actions that are punished are repeated less often. Shaping = rewarding successive approximations to the target behaviour
Give the delta rule and explain
The value of the current action is equal to the value of doing that action previously (did it get the agent closer to reward?) plus the reward prediction error (was the reward from the current action expected, based on the previous value of that action?) scaled by the learning rate (how quickly the new reward updates the value)
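A common way to write the delta rule (the symbols here are assumed notation, not from the card; V is the value of action a, r the reward received, \alpha the learning rate):

V_{new}(a) = V_{old}(a) + \alpha \, (r - V_{old}(a))

The bracketed term is the reward prediction error; \alpha scales how far the value moves towards it.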
Give three ways that model free RL/ the delta rule is a model for learning in the brain
Dopaminergic cells in the midbrain signal reward prediction error (Hollerman and Schultz 1998). Cells in the striatum code for action values (Samejima et al. 2005). Dopamine gates connections between sensory input and action (Reynolds et al. 2011)
Explain Hollerman and Schultz 1998
Dopaminergic cells in the midbrain (ventral tegmental area) signal reward prediction error. In macaque monkeys, cells showed little response when a familiar image known to give reward was followed by reward, peak activity when a new image was followed by reward, and activity that declined across operant learning of the new image.
Explain Samejima et al. 2005
Cells in the striatum code for action values. Macaque monkeys made voluntary saccades to the left or right while reward probability was varied. Cells responded as if coding for the value of a particular saccade direction, e.g. an example cell was most active when the reward probability for a rightward saccade was high and least active when it was low, but did not change its activity when the reward probability for a leftward saccade was varied.
Explain Reynolds et al. 2011
Dopamine gating of the connection between sensory input and action. Measured synaptic potentiation following intracranial self-stimulation in animals (pressing a lever stimulated the reward system). Hebbian learning strengthening the connection between a sensory input neuron and an action neuron was modulated by dopamine from the reward system: the connection was only enhanced in the presence of reward.
What does solving the Bellman equation give?
Optimal policy for maximising reward in an MDP
Give and explain the Bellman equation
Value of the current state is equal to the reward from taking a certain action at the current state plus the discounted value of the next state. Computed recursively, working back from the final state. The discount factor gamma is raised to the power of n, where n = number of states away from the current state.
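A common way to write the simplified equation (notation assumed: s' is the next state reached from s, \gamma the discount factor):

V(s) = \max_a \, [ \, R(s, a) + \gamma \, V(s') \, ]

Unrolling this recursion is what gives a reward n states away a weight of \gamma^n.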
Why is there a discount function in the Bellman equation? What does its value do?
Without discounting, the value of all states would equal the value of the state with the reward, so the agent would be unable to navigate towards it. The value of gamma determines the importance of long-term rewards compared to immediate rewards.
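A quick worked example with assumed numbers: a single reward of 1 at the goal and \gamma = 0.9 gives states one, two and three steps away the values

\gamma^1 = 0.9, \quad \gamma^2 = 0.81, \quad \gamma^3 = 0.729

so value rises as the agent gets closer and it can follow the gradient; with no discounting (\gamma = 1) all three states would have value 1 and give no direction.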
What’s the assumption of the simplified Bellman equation? How does it work in deterministic versus stochastic environments?
Assumes transitions are deterministic (a certain action in a given state always leads to a certain state). Works fine in deterministic environments but not in stochastic ones: the value of a state is based on the expected return from that state onwards, and a deterministic expectation is wrong when transitions are stochastic, so the agent cannot accurately evaluate value.
How is the full Bellman equation different to the simplified Bellman equation? How does it work in deterministic and stochastic environments?
The full Bellman equation accounts for transition probabilities: the probability of transitioning to a certain state when an action is taken from the current state. In a deterministic environment that probability is always 1, so it cancels out and the equation works like the simplified Bellman equation. In stochastic environments it requires knowing the transition probabilities, i.e. having a model of the world, to learn.
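The full equation is commonly written with the transition probability P(s' \mid s, a) (notation assumed):

V(s) = \max_a \sum_{s'} P(s' \mid s, a) \, [ \, R(s, a, s') + \gamma \, V(s') \, ]

In a deterministic environment P(s' \mid s, a) is 1 for exactly one next state and 0 for the rest, so the sum collapses back to the simplified form.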
Why is it important to have model free RL that works in stochastic environments? If the Bellman equation doesn’t work, what can be used?
Model free learning is less computationally expensive as there is no need to store a model. Learning organisms don’t always have a model, e.g. in new environments, and real-world environments often aren’t deterministic. Temporal difference learning can be used instead.
What is Temporal difference learning?
Method of calculating value function to make value based decisions (e.g., in an MDP) without needing to directly solve the Bellman equation. Uses prediction errors over time, updating the value of the previous state based on the reward and value of the current state.
Explain the TD learning equation
The updated value of the previous state comes from the current value of the previous state, plus the learning rate multiplied by the temporal difference reward prediction error. This error is how much better the current state turned out to be (reward received plus value of the current state) than the previous estimate (the current value of the previous state). The learning rate controls how much of the old value is kept and how much of the new information is used in the update.
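One common way to write this update (the subscripted notation is an assumption, not from the card): s_{t-1} is the previous state, s_t the current state, r_t the reward just received, \alpha the learning rate and \gamma the discount factor:

V(s_{t-1}) \leftarrow V(s_{t-1}) + \alpha \, [ \, r_t + \gamma \, V(s_t) - V(s_{t-1}) \, ]

The term in square brackets is the temporal difference error described above.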
Does TD learning solve the Bellman equation?
In a way, yes, but not directly. It approximates the Bellman equation’s solution, converging towards an optimal policy through iterative updates whilst remaining model free in stochastic environments.
What are the benefits of TD learning compared to the Bellman equation?
Computational efficiency; can handle incomplete data (no need to reach the end state/end of an episode to learn); sample efficiency (less experience required); can adapt to stochastic environments; good for online learning (can adapt to changing/non-stationary environments)
What is Q learning?
Type of TD learning whose value function is a Q value, which estimates the value of state-action pairs (taking a certain action in a certain state) rather than just the value of the state. Uses a Q table of states x actions. Can differentiate the potential outcome of each action in each state, so the agent can directly select the best action for the current state.
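The Q learning update is usually written as (notation assumed; s' is the state reached after taking action a in state s):

Q(s, a) \leftarrow Q(s, a) + \alpha \, [ \, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \, ]

The \max over the next action a' is what makes Q learning off policy (next card).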
What does it mean that Q learning is off policy?
Calculates Q values for the optimal reward regardless of the behaviour policy. This allows behaviour policies that don’t always follow the optimal Q values, i.e. ones that balance exploration and exploitation. It means that when the agent exploits/follows the Q values it gets the optimal reward, whilst still allowing the agent to explore.
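A minimal Python sketch of off-policy Q learning with an epsilon-greedy behaviour policy (all names and parameter values here are illustrative assumptions): the behaviour policy sometimes explores, but the update always bootstraps from the best next action.

```python
# Off-policy Q learning sketch with an epsilon-greedy behaviour policy (assumed values).
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate
Q = defaultdict(float)                  # Q table: (state, action) -> value

def choose_action(state, actions):
    """Behaviour policy: mostly exploit the Q values, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore
    return max(actions, key=lambda a: Q[(state, a)])    # exploit

def update(state, action, reward, next_state, actions):
    """Off-policy target: uses the best next action, not the one actually taken."""
    td_target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```

Calling update after each step fills in the Q table towards the optimal values regardless of whether choose_action explored or exploited on that step.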
Why is exploration (not just exploitation) important?
Makes sure the agent can find the best rewards, e.g. it doesn’t settle on the first intermediate reward it finds, because exploration lets it keep looking for the end goal. Also makes sure the agent doesn’t get stuck, e.g. repeatedly exploiting an action that isn’t producing a transition to a new state, such as when it has hit an obstacle.