Reinforcement Learning Flashcards
What is a value function?
An estimate of the expected cumulative reward associated with a state, or with taking an action in a given state of the world.
What is the goal of reinforcement learning?
Maximizing a cumulative reward.
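As a hedged illustration, the cumulative reward is often formalized as a discounted return, where later rewards count less. The discount factor `gamma`, the function name, and the reward list below are assumptions made for this sketch, not part of the card.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma**t (t = time step index)."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical reward sequence: 1.0 now, nothing next, 2.0 two steps later.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.5
```

With `gamma=1.0` this reduces to a plain sum of rewards.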
What is a policy?
A mapping from states to actions, often derived from value functions.
What is the VTA?
Ventral Tegmental Area: source of dopaminergic neurons.
Which neurotransmitter is produced at the VTA?
Dopamine.
Describe instrumental conditioning.
An association between an action and rewards (or punishments).
Also called ‘operant’ conditioning; it builds on Thorndike’s ‘law of effect’.
What does dopamine signal?
A reward prediction error: the difference between the received and the expected reward (not simply the reward amount).
What is the difference between model-based and model-free learning?
Model-based learning makes predictions on the basis of a model of the world; model-free learning does not build such a model.
What is the divergence of VTA connections?
500,000 connections per neuron (about 50x more than the “average” cortical neuron).
What’s the Markov property (as in a Markov Decision Process)?
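The next state and reward depend only on the present state and action, not on the history before it, so only the present matters for a decision about an action.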
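In symbols, this is commonly written as follows, with s_t the state and a_t the action at time t (notation assumed here, not taken from the card):

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```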
What is the ‘exploit x explore’ dilemma?
An agent cannot simultaneously learn about the world (explore) and maximize reward using what it already knows (exploit). Organisms need to balance the two.
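One simple way to strike this balance is an epsilon-greedy rule, sketched below. It is only one of several strategies; the function name, the `epsilon` parameter, and the example value estimates are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        # Explore: try any action, regardless of its current estimate.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical value estimates for three actions.
estimates = [0.1, 0.9, 0.3]
print(epsilon_greedy(estimates, epsilon=0.0))  # always exploits: prints 1
print(epsilon_greedy(estimates, epsilon=0.1))  # usually 1, occasionally random
```

A larger `epsilon` means more exploration; `epsilon=0` is pure exploitation.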
Why is reinforcement learning a ‘normative framework’?
It doesn’t specify what agents will do, but what they should do.
What is “classical conditioning”?
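Pairing a neutral stimulus with an unconditioned stimulus until the neutral stimulus alone evokes the response (Pavlovian conditioning).
E.g.: After a bell (neutral stimulus) is repeatedly paired with food (unconditioned stimulus), the bell alone evokes salivation (the conditioned response).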
What is the signal thought to be produced in the nucleus accumbens?
A critic signal: feedback on how well the system predicted rewards.
What is “TD” Learning?
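Temporal Difference learning. Because rewards often come after a delay, predictions of future reward are updated using the difference between successive predictions (the TD error) rather than waiting for the final outcome.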
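A minimal sketch of a tabular TD(0) update may make this concrete. The state names, the learning rate `alpha`, and the discount factor `gamma` are assumptions chosen for illustration, not values from the card.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to `values` in place and return the TD error."""
    # TD error: (reward + discounted next prediction) minus current prediction.
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical episode: a cue is followed, after a delay, by a reward.
values = {"cue": 0.0, "reward_site": 0.0, "end": 0.0}
for _ in range(200):
    td0_update(values, "cue", 0.0, "reward_site")   # no reward yet
    td0_update(values, "reward_site", 1.0, "end")   # delayed reward arrives
print(values)
```

Over repeated episodes the value of the cue rises toward `gamma` times the value of the rewarded state, so the prediction moves backward in time to the earliest reliable predictor of reward.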
What is the difference between model-free and model-based reinforcement learning?
Model-free learning is simply an association between recent actions and a reward, without a model of the world.
In model-based learning, an “agent” builds a model of the world and uses it as a source of predictions about potential rewards.