Week 8: Introduction to Reinforcement Learning Flashcards

1
Q

Reinforcement learning deals with the question of how

A

to make a decision in sequential problems like chess or solving a maze

2
Q

Learning to make decisions in sequential problems is difficult because of

A

the temporal credit assignment problem: you don’t know which of your past actions was pivotal for a good outcome (e.g., winning a game of chess)

3
Q

General setup of model-free reinforcement learning (5)

A

An agent in an environment

Receives reward intermittently

Rewards are positive or negative, but otherwise not specific, unlike the teaching signal in supervised learning

But the agent doesn’t know the rules of the environment. It has no ‘model’ of the environment or task.

A model of the environment could be: the rules of chess or the strategy of the opponent

4
Q

Model-free reinforcement learning is also called

A

trial and error learning

5
Q

General setup for model-free RL: the agent in the environment could be a robot. Three things make up this conceptual model of how an artificial agent may learn by model-free reinforcement learning:

A
  1. Policy P
  2. Reward
  3. State S
6
Q

Diagram of the model-free RL setup: agent (robot), policy, reward and state

A
7
Q

The agent (robot) has a policy P

A

A set of rules that determine which actions to take for different states of myself in the environment … such that I (hopefully) maximize my future reward

8
Q

Reward R

A

The environment provides the reward, which affects the state of the agent

9
Q

For a given policy, a value V can be associated with

A

being in a given state

10
Q

At any given moment

A

the agent is in a given state

11
Q

Value V of a given state is usually defined as

A

the expected sum of future rewards, defined for each possible state

12
Q

We would write V like this:

A
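
One common way to write V, given here as a sketch rather than the card's own notation: it assumes a policy \pi, a discount factor \gamma between 0 and 1, and a reward r_k received k steps after starting in state s.

```latex
V_{\pi}(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{k} \;\middle|\; s_{0} = s \right]
```
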
13
Q

The discount factor in the formula for V:
the further in the future you expect the reward,

A

then the less you value it
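
The effect of discounting can be sketched in a few lines. The function name `discounted_return` and the value gamma = 0.8 are assumptions chosen for illustration; the cards do not fix a particular discount factor.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma**k for its delay k in steps."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# The same reward of 10, arriving immediately, one step later, or two steps later.
now = discounted_return([10], gamma=0.8)
one_step = discounted_return([0, 10], gamma=0.8)
two_steps = discounted_return([0, 0, 10], gamma=0.8)
```

The later the reward arrives, the smaller its contribution: `now > one_step > two_steps`.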

14
Q

For a simple game like tic tac toe we might be able to write down

A

all the states

15
Q

For chess, there is an astronomical number of possible

A

states; we can’t write them all out

16
Q

The goal of all reinforcement learning (esp. model-free) is to (2)

A

optimise the policy to maximise the future reward!

This is a hard problem! Compare chess with tic-tac-toe

17
Q

Toy example (3)

A

A robot is supposed to learn to kick a football

When the football is in the line of sight, it can correctly estimate the distance and direction to the ball => 2 numbers (distance and direction to ball = labels for our state)

We call this number pair the ‘state’ the robot is in. It completely characterises its ‘mental repertoire’.

18
Q

In a real organism, the ‘state’ might comprise all or a subset of the information that is momentarily coded in the brain, such as

A

The emotional state, hunger, thirst, current thoughts, how tired it is …

19
Q

A robot has its state and a set of possible

A

actions it can take (move up/move down/move left/move right/kick ball)

20
Q

When the robot kicks the ball it (2)

A

gets a reward! (operant conditioning style)

Up until then it does not get anything (temporal credit assignment = which of the actions leading up to kicking the ball were good ones?)

21
Q

A reward for the robot is

A

A number in memory it ‘wants’ to maximise

22
Q

The learning algorithm says (3)

A

We want to maximize reward.

Of course, the robot has no intrinsic desire here.

Our algorithm guides the robot’s learning

23
Q

The learning algorithm guides the robot’s learning (4)

A

A learning episode is where the robot evaluates its own state, takes an action and observes the reward (maybe the robot gets a reward, maybe it does not)

If a reward is received, something about the environment has been learned! (these are intermittent signals)

If there is no reward, take another action

Say a learning episode lasts until the ball is kicked; then a reward is received.

24
Q

All the actions taken together until we receive some reward are called a

A

learning episode

25
Q

The football-kicking robot (5)

A
  • Robot is in a given state
  • Robot does not know which actions lead to reward
  • Does not know what state leads to which other state
  • Performs random actions initially
  • Only learns at the end of each learning episode
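
A first learning episode of this kind can be sketched as a loop of random actions that ends when the ball is kicked. The grid size, start field, ball position and the reward of 10 are assumptions for illustration; `run_random_episode` is a hypothetical helper, not part of the course material.

```python
import random

ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT", "KICK"]
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def run_random_episode(start, ball, size, rng, max_steps=100_000):
    """Take random actions until the ball is kicked; return (history, reward)."""
    state, history = start, []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)
        history.append((state, action))
        if action == "KICK":
            if state == ball:
                return history, 10  # the only reward, and it ends the episode
            continue  # kicking where there is no ball: no reward, no movement
        d_row, d_col = MOVES[action]
        row = min(max(state[0] + d_row, 0), size - 1)  # stay inside the grid
        col = min(max(state[1] + d_col, 0), size - 1)
        state = (row, col)
    return history, 0  # step limit reached without kicking the ball


history, reward = run_random_episode(
    start=(0, 0), ball=(2, 2), size=4, rng=random.Random(0)
)
```

Every entry in `history` except the last received no reward, which is why only the final state-action pair can be credited after this episode.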
26
Q

First Learning Episode of Robot (7)

A

For all of these actions it receives no reward

Because the ball has not been kicked

Only when it kicks the ball does it receive a reward

At some random point in the future, the robot manages to kick the ball

Finally gets a reward

The robot has now learned: when you are zero steps from the ball => kick!

Does not realise it has been here before (orange arrow)

27
Q

At the end of the first learning episode, the robot has not learned anything else (4)

A

E.g., what if it’s in front of the ball and takes a step back?

For all it knows, that could lead to a bigger reward

The robot also doesn’t know that, when it’s 1 step away from the field with the ball, stepping into that field is a good idea

That can only be learned in the next learning episode!

28
Q

A key concept in RL is the value (V)

Value V can be thought of as: (2)

A

the value of performing an action (A) in a given state (S)

Value can be thought of as the current prediction of how much reward the agent will eventually obtain if, in state S, it performs action A and subsequently takes high-value actions

29
Q

The goal of RL is to learn values (2)

A

that are good predictions of upcoming reward

Learning these values takes many steps of trial and error

30
Q

The robot has only learned about the last

A

state-action pair: kicking the ball

31
Q

We couldn’t have assigned reward to the whole bunch of states before that

A

because of superstition

32
Q

Superstition is (2)

A

Maybe I did some useless motions before (e.g., going back and forth 10 times).

If we assigned value to those actions, we would repeat them in the future, even though they are useless and have no causal impact on the outcome. This is called superstition.

33
Q

Superstition can be observed in animals

A

e.g., in a Skinner box

34
Q

We keep track of the progress of the robot learning to kick the ball in its environment by using the

A

Q-table

35
Q

For each action we make a Q-table, which (2)

A

mimics the outline of the environment and assigns a value to that action in each field

e.g., KICK

36
Q

For the KICK table (4)

A

When the robot is in a certain field of the environment, we assign the value 10 in the KICK Q-table, since it is the field where the ball is

Later on, when the robot is in this field, we can look up the number in the KICK table

If it is non-zero, it predicts reward for this action (here KICK)

So the robot will kick the ball and get its reward, with no random action needed

37
Q

The Q-table summarises the

A

Value of an action (here KICK) in a given state (red square)
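
One way to picture the per-action Q-tables is as one grid of numbers per action, all starting at zero. This is a minimal sketch: the grid dimensions (rows A to D, columns 1 to 8) are an assumption, since the cards only name the fields B7 and C7, and `predicts_reward` is a hypothetical helper.

```python
ROWS, COLS = "ABCD", range(1, 9)  # assumed grid size
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT", "KICK"]

# One table per action, each with one zero-initialised entry per field.
q_tables = {a: {f"{r}{c}": 0 for r in ROWS for c in COLS} for a in ACTIONS}

q_tables["KICK"]["C7"] = 10  # learned at the end of the first episode


def predicts_reward(action, field):
    """A non-zero entry means this action is predicted to lead to reward here."""
    return q_tables[action][field] != 0
```

With this layout, `predicts_reward("KICK", "C7")` is true while every other entry still reads zero.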

38
Q

Q-learning is a

A

model-free, off-policy reinforcement learning algorithm that finds the best course of action, given the current state of the agent.

39
Q

Second learning episode (4)

A
  • This time we reached the ball from above
  • We find ourselves in the field with the ball and know to kick it because we have written that down in the Q-table for KICK
  • Actions at the beginning are still random (e.g., kicking when there is no ball); the Q-table only helps once the robot reaches the field with the ball
  • We can now assign a value to the state-action pair that led us into the field with the ball: a value of 8 (smaller than the 10 in the KICK table, since it is one time step away from the reward) in B7 of the MOVE DOWN Q-table
40
Q

Q-learning episode 3 (2)

A

A partial value is assigned to the immediately preceding state-action pair if one steps into a field with a previously learned value

So we assign a smaller value of 6 to turn right

41
Q

Across Q-learning episodes, the robot’s trial-and-error phase gets

A

shorter and shorter

42
Q

We have a Q-table

A

for each possible action

43
Q

There will be some actions that the robot only does once and never does again, as

A

through trial and error in the next episodes the robot discovers a more direct path to the ball, and the Q-table is updated accordingly

44
Q

We basically assign values to all these states for the different actions (each has its own Q-table) across Q-learning episodes, which leads to

A

a route to the ball to kick it

45
Q

After the robot has learned the route to the ball and to kick it, the robot can

A

just look up the actions in its Q-table

46
Q

A robot that can look up actions in its Q-table can perform a

A

learned action sequence; it does not need to rely on trial and error anymore
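
The whole progression, from random episodes to a looked-up action sequence, can be sketched end to end on a much smaller world. Everything here is an assumption for illustration: a 5-field corridor instead of the grid, three actions, reward 10, gamma = 0.8, a learning rate of 1 (each update overwrites the old entry) and 20% random exploration.

```python
import random

N = 5  # corridor fields 0..4, ball in field 4 (assumed toy layout)
ACTIONS = ["LEFT", "RIGHT", "KICK"]


def step(state, action):
    """Return (next_state, reward, done) for one action in the corridor."""
    if action == "KICK":
        hit = state == N - 1
        return state, (10.0 if hit else 0.0), hit
    move = 1 if action == "RIGHT" else -1
    return min(max(state + move, 0), N - 1), 0.0, False


def train(episodes=200, gamma=0.8, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            untried = all(q[(state, a)] == 0.0 for a in ACTIONS)
            if untried or rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # trial and error
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # look up
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] = reward + gamma * best_next  # learning rate 1
            state = nxt
    return q


def greedy_route(q, start=0, limit=20):
    """Follow the highest-valued action in each state, with no randomness."""
    state, route = start, []
    for _ in range(limit):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        route.append(action)
        state, _, done = step(state, action)
        if done:
            break
    return route


q = train()
route = greedy_route(q)
```

After training, `route` is read straight from the table: four steps to the right, then a kick.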

47
Q

We can take all the values we assigned to the state-action pairs in the Q-table of each possible action and, to keep track of the ‘quality’ of state-action pairs, (2)

A

write each of them as a single column

E.g., when in field B7 (state), MOVE DOWN (action) had the highest value
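
Flattened this way, the table becomes a lookup keyed by state and action, and choosing what to do is a search for the largest entry in the current state. A minimal sketch, using the values 10 and 8 from the cards; the dictionary layout and the helper `best_action` are assumptions.

```python
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT", "KICK"]

# Q values keyed by (state, action); pairs not listed are treated as 0.
q = {("C7", "KICK"): 10, ("B7", "DOWN"): 8}


def best_action(state):
    """Return the action with the highest Q value in this state."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0))
```

Here `best_action("B7")` gives `"DOWN"` and `best_action("C7")` gives `"KICK"`. In a state with no learned values all entries tie at 0 and the first listed action is returned, so a real agent would fall back to random actions there.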

48
Q

What did we do when we entered values in the Q-table, for example in learning episode 2? (7)

A

In episode 1 we learned to kick in field C7. That gave us reward 10

State S = being in field C7

Action A = KICK

In the field above (B7), the best action was MOVE DOWN (found after trial and error)

We then assigned a fraction of the reward value from C7 (not 10 but 8),

which is a state where we know what to do if we find ourselves there.

The value for Q(S,A) = Q(B7,MOVE DOWN) was zero prior to learning

49
Q

What we did in episode 2 to update the Q-table, changing the value of B7 in the MOVE DOWN table to 8, uses this change-in-Q formula:

A
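
A standard form of the Q-learning update, given as a sketch: the subscript k for the current step follows the r_k used in the next card, while the learning rate \alpha is an assumption (with \alpha = 1 the whole bracket is added to the table entry).

```latex
\Delta Q(s_k, a_k) = \alpha \left[ r_k + \gamma \max_{a} Q(s_{k+1}, a) - Q(s_k, a_k) \right]
```
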
50
Q

The change-in-Q formula for updating values in the Q-table column, in words (4)

A

We take the maximum Q (belonging to the best state-action pair) in the next state, and multiply it by a number (gamma) between 0 and 1 (the discount factor: future reward counts less than instant reward).

The difference between this value and the current Q value (for the state I am in) is added to the reward I already know I receive in my current state (r_k).

The result is added to the table value for my current state-action pair (here B7, DOWN)

If I got more reward than expected, I should increase my Q for the current state-action pair!
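
These four steps translate almost line by line into code. This is a sketch under assumptions: gamma = 0.8 and a learning rate alpha = 1 are chosen so that the B7 entry comes out as the 8 used in the cards, and the field name B6 and the function `q_update` are hypothetical.

```python
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT", "KICK"]


def q_update(q, state, action, reward, next_state, gamma=0.8, alpha=1.0):
    """Apply one Q-learning update in place and return the change in Q."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)  # max Q in next state
    current = q.get((state, action), 0.0)
    delta = alpha * (reward + gamma * best_next - current)
    q[(state, action)] = current + delta
    return delta


q = {("C7", "KICK"): 10.0}  # learned in episode 1

# Episode 2: moving down from B7 leads into C7; no reward is received yet.
delta_b7 = q_update(q, "B7", "DOWN", reward=0.0, next_state="C7")

# One step further back (hypothetical field B6, moving right into B7).
delta_b6 = q_update(q, "B6", "RIGHT", reward=0.0, next_state="B7")
```

Under these assumptions the B7 entry becomes 8 and the entry one step further back becomes 6.4, where the cards use the rounder figure 6.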

51
Q

Q is the Quality function where

A

= the expected future reward given my state S and action A = current reward + discounted future reward expected for being in the next state

52
Q

Q tells us the

A

‘joint quality’ of taking an action A, given a state S

53
Q

Q is slightly different from the value V, because V is

A

the value of the state you are in, assuming you take the best action from there on out

54
Q

The Q formula simply means

A

the probability of entering state S’ (given S, A) times the total reward. So: a probability of state transition times reward
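
Written out, one standard form of this definition is the following. The symbols are assumed notation rather than the card's own: P(s' \mid s, a) is the transition probability, R(s', s, a) the reward for that transition, and V(s') the value of the next state.

```latex
Q(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s', s, a) + \gamma V(s') \right]
```
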

55
Q

In our toy robot example, we did not know the probability of state transition (2)

A

= we did not know how likely it is to go from one state to the next

If we do know it, we have a model of the environment (i.e., the rules of the game) = model-based reinforcement learning

56
Q

The formula for Q means (2)

A

transitioning from state S to S’ is not completely deterministic, as there is an element of randomness

In Q-learning we don’t know the probabilities of state transitions

57
Q

Summary of Q-learning (4)

A

The agent has learned based on intermittent rewards

Equipped with the Q-table it can now navigate to the ball and kick it.

Note, this may apply to an agent moving in an environment (like a robot or a rat), but the ideas can equally be applied to games (e.g., a state would be a given position in, say, chess or checkers, and the actions would be the permissible moves).

The path is not optimal, but if we introduce some randomness, it can discover the optimal path: add some “exploration”, and do not always 100% “exploit” the first thing you learned to be positive (i.e., do not exploit your old Q-table 100%)
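
A common way to add that randomness is an epsilon-greedy rule: with a small probability the agent picks a random action, otherwise it follows the Q-table. The cards do not name this rule, so treat it as one possible sketch; `epsilon_greedy` and the example values are assumptions.

```python
import random

ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT", "KICK"]
q = {("C7", "KICK"): 10.0, ("B7", "DOWN"): 8.0}


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: try something at random
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))  # exploit


rng = random.Random(0)
always_exploit = epsilon_greedy(q, "B7", epsilon=0.0, rng=rng)
sometimes_explore = epsilon_greedy(q, "B7", epsilon=0.1, rng=rng)
```

With `epsilon=0.0` the agent always repeats what it already knows; a small positive value lets it occasionally try a different move and find a shorter path.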

58
Q

Exploration vs Exploitation (2)

A

Do your parents always go to the same place to vacation? vs
Do you always order the same food at the restaurant?

59
Q

Benefits of Q-learning (2)

A

Can learn complex behaviours
No need for an explicit teaching signal as in supervised learning and only intermittent reward

60
Q

Downsides of Q-learning/RL (3)

A

Takes time and lots of trial and error, especially if the state space (number of possible state-action pairs) is large (e.g., as in chess)

Cannot apply in all types of situations: Can’t fall randomly off a million cliffs to learn optimal behaviour. Sounds funny until you consider self-driving cars!

Training in the real-world (e.g., a robot) would take a long time

61
Q

Q-learning takes a lot of time if the state space is large, but… (2)

A

we can approximate the Q-table with deep learning

Deep Q-learning: e.g., DeepMind’s Atari game-playing AI uses a DQN (Deep Q-Network)

62
Q

Change in Q formula is like an (2)

A

update equation

We update Q based on experience and change value in Q-table

63
Q

S’ is

A

the next state the agent enters