week 8 - reinforcement learning Flashcards
what is the difference between supervised learning, unsupervised learning and reinforcement learning?
supervised = learns a mapping between data and labels
unsupervised = discovers patterns in the data
reinforcement = learns which actions to take from a reward signal, rather than from explicit labels
what is a pro of reinforcement learning?
it can succeed in solving very complex problems that other ML models can't
e.g. it can be trained to play games like Go
it can also be used to explain human learning. Lots of evidence suggests that components of reinforcement learning algorithms are represented in the brain
how do we represent a reinforcement learning problem?
As a Markov decision process (MDP)
An MDP has 4 components:
State (any concrete position in some sort of physical or abstract space)
Action (any actions we can take)
Transition probabilities (the probability of transitioning to another state given a particular action)
Reward (the reward received when taking each action in each state)
the goal of the MDP is to find the best way to act (optimal policy)
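As a concrete illustration, below is a minimal sketch of how a toy MDP could be written down in Python. The states, actions, transition probabilities and rewards are all made up for illustration.

```python
# A tiny made-up MDP with three states and two actions.
states = ["start", "middle", "goal"]
actions = ["left", "right"]

# transitions[state][action] = list of (next_state, probability) pairs
transitions = {
    "start":  {"left": [("start", 1.0)],
               "right": [("middle", 1.0)]},
    "middle": {"left": [("start", 0.9), ("middle", 0.1)],
               "right": [("goal", 0.8), ("middle", 0.2)]},
    "goal":   {"left": [("goal", 1.0)],
               "right": [("goal", 1.0)]},
}

# rewards[state][action] = reward received when taking that action in that state
rewards = {
    "start":  {"left": 0.0, "right": 0.0},
    "middle": {"left": 0.0, "right": 1.0},   # moving towards the goal pays off
    "goal":   {"left": 0.0, "right": 0.0},
}

# A policy maps each state to an action; solving the MDP means finding the
# policy that collects the most reward.
example_policy = {"start": "right", "middle": "right", "goal": "right"}
```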
what can be represented as a markov decision process?
basically anything
it can be used for motion control problems, for social interactions, and for games like Go
what is model-free reinforcement learning, and what are its pros and cons?
the process of learning which actions produce the most reward through trial and error, without needing a model of the dynamics of the environment.
The learning is driven by consequences alone.
This is very computationally efficient; however it is slow because it relies on trial and error, and inflexible because if the world changes then the learnt values no longer apply
what is the sarsa algorithm?
SARSA: State-Action-Reward-State-Action, an on-policy reinforcement learning algorithm.
Model-free: Does not require knowledge of the environment’s dynamics.
It's a type of temporal difference learning
SARSA learns the value of actions based on the rewards it receives. These action values are referred to as Q values
At each step (t), it updates the Q value of the action just taken in the current state. It compares the outcome it actually received with the value it had predicted (the prediction error). As the model learns, it updates the Q values
The prediction error is weighted by a learning rate
So essentially, we update the value of a particular action in a particular state by adding learning rate x prediction error to its current Q value, where the prediction error is the reward received plus the Q value of the next action, minus the current Q value
As the Q values become more accurate, the prediction error decreases and the updating converges on the optimal policy
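Below is a minimal sketch of the SARSA update described above, assuming a table of Q values indexed by (state, action); the parameter values are illustrative, not from the lecture.

```python
from collections import defaultdict

alpha = 0.1   # learning rate (illustrative value)
gamma = 0.9   # weight on the Q value of the next action (illustrative value)

# Q values start at 0 for every (state, action) pair
Q = defaultdict(float)

def sarsa_update(state, action, reward, next_state, next_action):
    """One SARSA step: State, Action, Reward, next State, next Action."""
    # prediction error: what we got (reward plus the value of the next action)
    # minus what we expected (the current Q value)
    prediction_error = (reward
                        + gamma * Q[(next_state, next_action)]
                        - Q[(state, action)])
    # update the Q value by learning rate x prediction error
    Q[(state, action)] += alpha * prediction_error
```

Because the update uses the Q value of the next action, reward can propagate backwards to earlier actions (this is the temporal difference part described in the next card).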
What is a con of Markovian ML with immediate rewards? how does SARSA mitigate this?
It only tells us about immediate rewards
It doesn’t model actions based on long term rewards.
To mitigate this, SARSA incorporates future rewards into the algorithm. This means we don't just learn from reward, but also from the expected Q values of the next actions we take. This means that future rewards can propagate backwards
This is referred to as the temporal difference prediction error. It takes into account temporal differences as well as reward-based prediction error.
This allows us to learn action values whilst accounting for long-term rewards. The algorithm is recursive because the action values that we learn feed back into the updating of other action values
What makes a decision process markovian?
Any action that we decide to take is entirely dependent on where we are now, not on any previous state
what is bootstrapping in RL?
learning values based on other learnt values
In RL, the learnt action values feed back into other learnt action values. This allows us to learn which actions lead to the best rewards
what does SARSA tell us about how humans learn?
Dopamine has been proposed to signal the temporal difference prediction error
Dopamine neurons fire more when there's an unexpected reward
what is the explore/exploit problem? how can it be mitigated?
Should an agent exploit what it has learnt to keep getting a reasonable amount of reward
Or should it explore other options in the hope of learning how to gain more reward
You can solve this by implementing a decision rule that allows for exploration. This ensures that sometimes we will choose other options, rather than always the option with the highest learnt value so far
what is the epsilon greedy rule?
A decision rule that ensures we will sometimes explore rather than exploit what we already know
The rule says you choose the best option most of the time but sometimes you choose randomly
The ε (epsilon) parameter determines what percentage of the time you choose randomly
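A minimal sketch of the epsilon-greedy rule, assuming the same kind of Q table as in the SARSA sketch above; the value of epsilon is illustrative.

```python
import random

epsilon = 0.1  # explore on roughly 10% of choices (illustrative value)

def epsilon_greedy(Q, state, actions):
    """Choose randomly with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit
```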
what is model-based reinforcement learning?
Using a model of the environment to determine which actions are best
This usually uses a simulation process
This removes the need for trial and error, as the model has knowledge of the environment and knows where rewards lie
These algorithms are called ‘planning algorithms’ because they plan which action to take before actually doing anything
One example is value iteration
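A minimal sketch of value iteration, assuming the toy MDP dictionaries (states, actions, transitions, rewards) from the sketch after the MDP card above; the discount factor and convergence threshold are illustrative.

```python
gamma = 0.9        # discount on future rewards (illustrative value)
threshold = 1e-6   # stop when values change by less than this

# Start every state at value 0, then repeatedly back values up through the
# model (no trial and error needed) until they stop changing.
V = {s: 0.0 for s in states}
while True:
    max_change = 0.0
    for s in states:
        best = max(
            rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])
            for a in actions
        )
        max_change = max(max_change, abs(best - V[s]))
        V[s] = best
    if max_change < threshold:
        break

# The plan: in each state, take the action with the highest backed-up value.
policy = {
    s: max(actions,
           key=lambda a: rewards[s][a]
           + gamma * sum(p * V[s2] for s2, p in transitions[s][a]))
    for s in states
}
```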
model free vs. model based RL
Model based is less computationally efficient
Model based is faster
Model based relies on having an accurate model of the world, which model free doesn’t
Model based can easily adapt if rewards/transition probabilities change. Model free cannot adapt without learning through trial and error
How do you combine model free and model based RL?
The two systems model
Evidence shows that humans combine the two approaches