week 8 - reinforcement learning Flashcards
what is the difference between supervised learning, unsupervised learning and reinforcement learning?
supervised = learns a mapping between data and labels
unsupervised = discovers patterns in the data
reinforcement = learns which actions to take from a reward signal, rather than from explicit labels
what is a pro of reinforcement learning?
it can succeed in solving very complex problems that other ML models can't
e.g. it can be trained to play games like Go
it can also be used to explain human learning. Lots of evidence suggests that components of reinforcement learning algorithms are represented in the brain
how do we represent a reinforcement learning problem?
As a Markov decision process (MDP)
An MDP has 4 components:
State (any concrete position in some sort of physical or abstract space)
Action (any actions we can take)
Transition probabilities (the probability of transitioning to another state given a particular action)
Reward (the reward received when taking each action in each state)
the goal of the MDP is to find the best way to act (optimal policy)
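As a concrete illustration, below is a minimal sketch of how a toy MDP could be written down in Python. The states, actions, transition probabilities and rewards are all made up for illustration.

```python
# A tiny made-up MDP with three states and two actions.
states = ["start", "middle", "goal"]
actions = ["left", "right"]

# transitions[state][action] = list of (next_state, probability) pairs
transitions = {
    "start":  {"left": [("start", 1.0)],
               "right": [("middle", 1.0)]},
    "middle": {"left": [("start", 0.9), ("middle", 0.1)],
               "right": [("goal", 0.8), ("middle", 0.2)]},
    "goal":   {"left": [("goal", 1.0)],
               "right": [("goal", 1.0)]},
}

# rewards[state][action] = reward received when taking that action in that state
rewards = {
    "start":  {"left": 0.0, "right": 0.0},
    "middle": {"left": 0.0, "right": 1.0},   # moving towards the goal pays off
    "goal":   {"left": 0.0, "right": 0.0},
}

# A policy maps each state to an action; solving the MDP means finding the
# policy that collects the most reward.
example_policy = {"start": "right", "middle": "right", "goal": "right"}
```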
what can be represented as a markov decision process?
basically anything
it can be used for motion control problems, for social interactions, and for games like Go
what is model-free reinforcement learning, and what are its pros and cons?
the process of learning which actions produce the most reward through trial and error, without needing a model of the dynamics of the environment.
The learning is driven by consequences alone.
This is very computationally efficient; however it is slow because it relies on trial and error, and inflexible because if the world changes then the learnt values no longer apply
what is the sarsa algorithm?
SARSA: State-Action-Reward-State-Action, an on-policy reinforcement learning algorithm.
Model-free: Does not require knowledge of the environment’s dynamics.
It's a type of temporal difference learning
SARSA learns the value of actions based on the rewards it receives. These action values are referred to as Q values
At each step (t), it updates the Q value of the action just taken in the current state. It compares the outcome it actually received with the value it had predicted (the prediction error). As the model learns, it updates the Q values
The prediction error is weighted by a learning rate
So essentially, we update the value of a particular action in a particular state by adding learning rate x prediction error to its current Q value, where the prediction error is the reward received plus the Q value of the next action, minus the current Q value
As the Q values become more accurate, the prediction error decreases and the updating converges on the optimal policy
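Below is a minimal sketch of the SARSA update described above, assuming a table of Q values indexed by (state, action); the parameter values are illustrative, not from the lecture.

```python
from collections import defaultdict

alpha = 0.1   # learning rate (illustrative value)
gamma = 0.9   # weight on the Q value of the next action (illustrative value)

# Q values start at 0 for every (state, action) pair
Q = defaultdict(float)

def sarsa_update(state, action, reward, next_state, next_action):
    """One SARSA step: State, Action, Reward, next State, next Action."""
    # prediction error: what we got (reward plus the value of the next action)
    # minus what we expected (the current Q value)
    prediction_error = (reward
                        + gamma * Q[(next_state, next_action)]
                        - Q[(state, action)])
    # update the Q value by learning rate x prediction error
    Q[(state, action)] += alpha * prediction_error
```

Because the update uses the Q value of the next action, reward can propagate backwards to earlier actions (this is the temporal difference part described in the next card).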
What is a con of Markovian ML with immediate rewards? how does SARSA mitigate this?
It only tells us about immediate rewards
It doesn’t model actions based on long term rewards.
To mitigate this, SARSA incorporates future rewards into the algorithm. This means we don't just learn from reward, but also from the expected Q values of the next actions we take. This means that future rewards can propagate backwards
This is referred to as the temporal difference prediction error. It takes into account temporal differences as well as reward-based prediction error.
This allows us to learn action values whilst accounting for long-term rewards. The algorithm is recursive because the action values that we learn feed back into the updating of other action values
What makes a decision process markovian?
Any action that we decide to take is entirely dependent on where we are now, not on any previous state
what is bootstrapping in RL?
learning values based on other learnt values
In RL, the learnt action values feed back into other learnt action values. This allows us to learn which actions lead to the best rewards
what does SARSA tell us about how humans learn?
Dopamine has been proposed to signal the temporal difference prediction error
Dopamine neurons fire more when there's an unexpected reward
what is the explore/exploit problem? how can it be mitigated?
Should an agent exploit what it has learnt to keep getting a reasonable amount of reward
Or should it explore other options in the hope of learning how to gain more reward
You can solve this by implementing a decision rule that allows for exploration. This ensures that sometimes we will choose other options, rather than always the option with the highest learnt value so far
what is the epsilon greedy rule?
A decision rule that ensures we will sometimes explore rather than exploit what we already know
The rule says you choose the best option most of the time but sometimes you choose randomly
The ε (epsilon) parameter determines what percentage of the time you choose randomly
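A minimal sketch of the epsilon-greedy rule, assuming the same kind of Q table as in the SARSA sketch above; the value of epsilon is illustrative.

```python
import random

epsilon = 0.1  # explore on roughly 10% of choices (illustrative value)

def epsilon_greedy(Q, state, actions):
    """Choose randomly with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit
```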
what is model-based reinforcement learning?
Using a model of the environment to determine which actions are best
This usually uses a simulation process
This removes the need for trial and error, as the model has knowledge of the environment and knows where rewards lie
These algorithms are called ‘planning algorithms’ because they plan which action to take before actually doing anything
One example is value iteration
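A minimal sketch of value iteration, assuming the toy MDP dictionaries (states, actions, transitions, rewards) from the sketch after the MDP card above; the discount factor and convergence threshold are illustrative.

```python
gamma = 0.9        # discount on future rewards (illustrative value)
threshold = 1e-6   # stop when values change by less than this

# Start every state at value 0, then repeatedly back values up through the
# model (no trial and error needed) until they stop changing.
V = {s: 0.0 for s in states}
while True:
    max_change = 0.0
    for s in states:
        best = max(
            rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
            for a in actions
        )
        max_change = max(max_change, abs(best - V[s]))
        V[s] = best
    if max_change < threshold:
        break

# The plan: in each state, take the action with the highest backed-up value.
policy = {
    s: max(actions,
           key=lambda a: rewards[s][a]
           + gamma * sum(p * V[s2] for s2, p in transitions[s][a]))
    for s in states
}
```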
model free vs. model based RL
Model based is less computationally efficient
Model based is faster
Model based relies on having an accurate model of the world, which model free doesn’t
Model based can easily adapt if rewards/transition probabilities change. Model free cannot adapt without learning through trial and error
How do you combine model free and model based RL?
The two systems model
Evidence shows that humans combine the two approaches