Week 8: Introduction to Reinforcement Learning Flashcards
Reinforcement learning deals with the question of how
to make decisions in sequential problems such as chess or solving a maze
Learning to make decisions in sequential problems, as reinforcement learning aims to do, is difficult because of
the temporal credit assignment problem: you don’t know which of your past actions was pivotal for a good outcome (e.g., winning a game of chess)
General setup of model-free reinforcement learning (5)
An agent in an environment
Receives reward intermittently
Rewards are positive or negative, but otherwise carry no specific instruction (unlike the labels in supervised learning)
But the agent doesn’t know the rules of the environment. It has no ’model’ of the environment or task.
A model of the environment could be, e.g., the rules of chess or the strategy of the opponent
Model-free reinforcement learning is also called
trial and error learning
General setup for model-free RL: the agent in the environment could be a robot, and there are 3 things in this conceptual model of how an artificial agent may learn model-free reinforcement learning
- Policy P
- Reward
- State S
Diagram of model-free RL setup: agent (robot), policy, reward and state
The agent (robot) has a policy P
set of rules to determine which actions to take for different states of myself in the environment … such that I (hopefully) maximize my future reward
Reward R
the environment provides the reward, which affects the state of the agent
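As a rough illustration of this policy/state/reward loop, here is a minimal Python sketch (the `Environment`-style interface, the `run_one_episode` and `random_policy` names, and the action list are placeholder assumptions, not part of the lecture material):

```python
import random

# Hypothetical sketch of the agent-environment loop in model-free RL.
# The agent only sees states and intermittent rewards; it has no model
# of how the environment works internally.

def run_one_episode(env, policy, max_steps=100):
    state = env.reset()                          # environment tells the agent its starting state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # policy P: state -> action
        state, reward, done = env.step(action)   # environment returns new state and reward R
        total_reward += reward
        if done:
            break
    return total_reward

# A purely random policy: the agent starts out knowing nothing.
def random_policy(state, actions=("up", "down", "left", "right", "kick")):
    return random.choice(actions)
```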
For a given policy, a value V can be associated
with being in a given state
At any given moment
the agent is in a given state
Value V of a given state is usually defined as
the expected sum of future rewards, defined for each possible state
We would write V like this:
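In standard notation (a reconstruction of the usual discounted-return definition, assuming a discount factor gamma; the lecture's exact notation may differ):

```latex
V(s) \;=\; \mathbb{E}\!\left[\, r_{t} + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \dots \,\middle|\, s_t = s \right]
\;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s \right]
```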
The discount factor (gamma) in the formula for V
The further in the future you expect a reward,
the less you value it
For a simple game like tic tac toe we might be able to write down
all the states
For chess, there is an astronomical number of possible
states, so we can’t write them all out
The goal of all reinforcement learning (esp model-free) is to (2)
optimise the policy to maximise the future reward!
This is a hard problem! (compare chess vs tic tac toe)
Toy example (3)
A robot is supposed to learn to kick a football
When the football is in the line of sight, it can correctly estimate the distance and direction to the ball => 2 numbers (distance and direction to ball = labels for our state)
We call this number pair the ‘state’ the robot is in. It completely characterises its ‘mental repertoire’.
In a real organism, the ‘state’ might comprise all or a subset of the information that is momentarily coded in the brain, such as
The emotional state, hunger, thirst, current thoughts, how tired it is …
A robot has its state and a set of possible actions
actions it can take (move up/move down/move left/move right/kick ball)
When the robot kicks the ball it (2)
gets a reward! (operant conditioning style)
Up until then it does not get anything (temporal credit assignment = which of the actions leading up to kicking the ball were the good ones?)
A reward for the robot is
A number in memory it ‘wants’ to maximise
The learning algorithm says (3)
we want to maximize reward.
Of course the robot has no intrinsic desire here.
Our algorithm guides the robot’s learning
The learning algorithm guides the robot’s learning (4)
In a learning episode the robot evaluates its own state, takes an action and observes the reward (maybe the robot gets a reward, maybe it does not)
If a reward is received, something about the environment has been learned! (these are intermittent signals)
If no reward is received, take another action
Say a learning episode lasts until the ball is kicked, at which point a reward is received.
All the actions taken together until a reward is received are called a
learning episode
The football-kicking robot (5)
- Robot is in a given state
- Robot does not know which actions lead to reward
- Does not know which state leads to which other state
- Performs random actions initially
- Only learns at the end of each learning episode
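To make the toy example concrete, here is a minimal sketch of such a grid-world environment in Python. The grid size, the (row, column) mapping of field labels (C7 ≈ (2, 6)), and the reward of 10 are assumptions based on the example, not an exact copy of the lecture's setup:

```python
# Hypothetical grid-world for the football-kicking robot.
# States are grid cells; the reward is only given when the robot
# kicks while standing on the cell that contains the ball.

ACTIONS = ["up", "down", "left", "right", "kick"]

class BallWorld:
    def __init__(self, n_rows=5, n_cols=8, ball_cell=(2, 6)):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.ball_cell = ball_cell          # roughly "field C7" in the slides (assumed mapping)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        r, c = self.pos
        if action == "up":
            r = max(r - 1, 0)
        elif action == "down":
            r = min(r + 1, self.n_rows - 1)
        elif action == "left":
            c = max(c - 1, 0)
        elif action == "right":
            c = min(c + 1, self.n_cols - 1)
        elif action == "kick":
            if self.pos == self.ball_cell:
                return self.pos, 10.0, True   # reward only here; the episode ends
            return self.pos, 0.0, False
        self.pos = (r, c)
        return self.pos, 0.0, False           # no reward for mere movement
```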
First Learning Episode of Robot (7)
For all of these actions it receives no reward
Because the ball has not been kicked
Only when it kicks the ball does it receive a reward
At some random point in the future
Robot manages to kick the ball
Finally gets a reward
The robot has now learned:
When you are zero steps from the ball
=> kick!
It does not realise it has been in this state before (orange arrow in the slide diagram)
At the end of the first learning episode, the robot has not learned anything else (4)
E.g., what if it’s in front of the ball and takes a step back?
For all it knows, that could lead to a bigger reward
The robot also doesn’t know that, when it’s
1 step away from the field with the ball, stepping into that field is a good idea
That can only be learned in the next learning episode!
A key concept in RL is the value (V)
Value V can be thought of as: (2)
the value of performing action (A) in a given state (S)
the current prediction of how much reward the agent will eventually obtain if, in state S, it performs action A and then subsequent high-value actions
The goal of RL is to learn values (2)
that are good predictions of upcoming reward
Learning values takes many steps of trial and error
The robot has only learned about the last
state-action pair (kicking the ball)
We couldn’t have assigned value to the bunch of states before that
because of the risk of superstition
Superstition is (2)
Maybe I did some useless motions before (e.g. going back and forth 10 times).
If we assigned value to those actions, we would repeat them in the future, even though they are useless and have no causal impact on the outcome. This is called superstition.
Superstition can be observed in animals
e.g., in a Skinner box
We keep track of the progress of the robot learning to kick the ball in the environment by using the
Q-table
For each action we make a Q-table which (2)
mimics the layout of the environment and assigns a value to taking that action in each state
e.g., KICK
For the KICK table, (4)
When in a certain field of the environment, we assign a value of 10 in the KICK Q-table, since it is the field where the ball is
Later on, when the robot is in this field, we can look up the number in the KICK table
If it is non-zero, it predicts a reward for this action (here KICK)
So the robot will kick the ball and get its reward, rather than taking a random action
The Q-table summarises the
value of an action (here KICK) in a given state (the red square in the slide diagram)
Q-learning is a
model-free, off-policy reinforcement learning algorithm that will find the best course of action, given the current state of the agent.
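One simple way to represent the Q-tables from the example is one array per action, with one entry per grid cell. A minimal sketch (numpy, the 5x8 grid size, and the cell coordinates are my assumptions):

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right", "kick"]
N_ROWS, N_COLS = 5, 8

# One table per action; Q[action][row, col] is the value of taking
# that action in that grid cell. Everything starts at zero.
Q = {a: np.zeros((N_ROWS, N_COLS)) for a in ACTIONS}

# After the first episode: kicking on the ball cell (assumed to be C7 = (2, 6)) is worth 10.
Q["kick"][2, 6] = 10.0

# Looking up the value of an action in a state:
print(Q["kick"][2, 6])   # -> 10.0, predicts reward for KICK here
```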
Second learning episode (4)
- This time we reached the ball from above
- We find ourselves in the field with the ball and know to kick it, because we wrote that down in the KICK Q-table
- At the beginning the robot still takes random actions (e.g., kicking when there is no ball), since the Q-table gives it no guidance until it reaches the field with the ball
- We can now assign a value to the state-action pair that led us into the field with the ball: a value of 8 (smaller than the 10 in the KICK table, since it is one time step away from the reward) for field B7 in the MOVE DOWN Q-table
Q-Learning episode 3 (2)
A partial value is assigned to the immediately preceding state-action pair whenever one steps into a field with a previously learned value
So we assign a smaller value of 6 to turning right
The trial-and-error part of the robot’s behaviour, across Q-learning episodes, gets
shorter and shorter
We have a Q-table
for each possible action
There will be some actions the robot performs only once and never does again, because
through trial and error in later episodes the robot discovers a more direct path to the ball and the Q-table is updated accordingly
We basically assign values to all these states for the different actions (each action has its own Q-table) across Q-learning episodes, and together these values encode
a route to the ball so it can kick it
After the robot has learned the route to the ball and how to kick it, the robot can
just look up the actions in its Q-table
By looking up actions in its Q-tables the robot can perform a
learned action sequence; it does not need to rely on trial and error anymore
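Once the Q-tables are filled in, "looking up the actions" just means picking the action with the highest value in the current state. A minimal sketch reusing the hypothetical `Q` dictionary from above:

```python
# Greedy action selection from the learned Q-tables (pure exploitation).
def best_action(Q, state):
    row, col = state
    return max(Q, key=lambda action: Q[action][row, col])

# e.g. best_action(Q, (2, 6)) -> "kick", once the KICK table holds the value 10 there
```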
We can take all the values we assigned to the state-action pairs (one Q-table per possible action) and, to keep track of the ‘quality’ of state-action pairs, (2)
write each of them as a single column
When in field B7 (state), MOVE DOWN (action) had the highest value
What did we do when we entered values into the Q-table, for example in learning episode 2? (7)
In episode 1 we learned to kick in field C7; that gave us a reward of 10
State S = being in field C7
Action A = KICK
In the field above (B7), the best action was MOVE DOWN (found by trial and error)
We then assigned a fraction of the reward value from C7 (not 10 but 8),
because C7 is a state in which we already know what to do if we find ourselves there.
The value for Q(S,A) = Q(B7,MOVE DOWN) was zero prior to learning
To change the values in the Q-table in episode 2 (updating the value of MOVE DOWN in B7 to 8 in the MOVE DOWN table), we used this change-in-Q formula:
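The formula itself is not reproduced in these notes; the standard Q-learning update that the wording below describes (with a learning rate alpha, which may simply be 1 in the lecture's version) is:

```latex
\Delta Q(s_k, a_k) \;=\; \alpha \Big[\, r_k \;+\; \gamma \max_{a'} Q(s_{k+1}, a') \;-\; Q(s_k, a_k) \,\Big],
\qquad Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \Delta Q(s_k, a_k)
```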
The change-in-Q formula in words (how values in a Q-table column are updated) (4)
We take the maximum Q (belonging to the best state-action pair) in the next state, and multiply it by a number (gamma) between 0 and 1 (discount factor, future reward counts less than instant reward).
The difference between this value and the current Q value (for the state I am in) is added to the reward I already know I receive in my current state (r_k).
The result is added to the table value for my current state-action pair (here B7,DOWN)
If I got more reward than expected I should increase my Q for the current state-action pair!
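A minimal Python sketch of this update, applied to the hypothetical `Q` dictionary sketched above. The learning rate of 1.0, the gamma of 0.8, and the (row, column) mapping B7 ≈ (1, 6), C7 ≈ (2, 6) are assumptions; with these values an empty entry one step before the reward of 10 is pulled towards 8, matching the B7/MOVE DOWN example:

```python
GAMMA = 0.8   # discount factor: future reward counts less than instant reward
ALPHA = 1.0   # learning rate (assumed; smaller values give more gradual updates)

def q_update(Q, state, action, reward, next_state):
    r, c = state
    nr, nc = next_state
    # Best value obtainable from the next state, over all actions.
    best_next = max(Q[a][nr, nc] for a in Q)
    # Prediction error: reward received plus discounted future value,
    # minus what we currently predict for this state-action pair.
    delta = reward + GAMMA * best_next - Q[action][r, c]
    Q[action][r, c] += ALPHA * delta

# Episode 2 in code: stepping down from B7 into the ball cell gives no reward yet,
# but the next state's KICK value is 10, so Q["down"] at B7 moves from 0 to 0.8 * 10 = 8.
q_update(Q, state=(1, 6), action="down", reward=0.0, next_state=(2, 6))
```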
Q is the Quality function where
Q(S, A) = the expected future reward given my state S and action A = current reward + discounted future reward expected for being in the next state
Q tells us the
’joint quality’ of taking an action A, given a state S
Q is slightly different from the value V, as it is
the value of taking a specific action A in the state you are in, assuming you take the best actions from there on out
Q formula simply means
the probability of entering state S’ (given S, A) times the total reward, summed over the possible next states. So: probability of a state transition times reward
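Written out in its standard form (a reconstruction; I'm assuming this matches the slide), with the sum running over possible next states S':

```latex
Q(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\Big[\, R(s, a, s') \;+\; \gamma \max_{a'} Q(s', a') \,\Big]
```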
In our toy robot example, we did not know the probabilities of state transitions (2)
= how likely it is to go from one state to the next
If we do know them, we have a model of the environment (i.e., the rules of the game) = model-based reinforcement learning
Formula of Q means (2)
transitioning from state S to S’ is not completely deterministic as there is an element of randomness
In Q-learning we don’t know the probabilities of state transitions
Summary of Q-learning (4)
The agent has learned based on intermittent rewards
Equipped with the Q-table it can now navigate to the ball and kick it.
Note, this may apply to an agent moving in an environment (like a robot or a rat), but the ideas can equally be applied to games (e.g., a state would be a given position in, say, chess or checkers, and the actions would be the permissible moves).
The path is not optimal, but if we introduce some randomness, the robot will discover the optimal path: add some “exploration”, do not always 100% “exploit” the first thing you learned to be positive (i.e., do not rely 100% on your old Q-table)
Exploration vs Exploitation (2)
Do your parents always go to the same place to vacation? vs
Do you always order the same food at the restaurant?
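A common way to mix exploration and exploitation is an epsilon-greedy rule: with a small probability, ignore the Q-table and try something random. A minimal sketch (the `epsilon` value of 0.1 and the `best_action` helper from above are my assumptions):

```python
import random

def epsilon_greedy(Q, state, epsilon=0.1):
    # With probability epsilon: explore (take a random action).
    if random.random() < epsilon:
        return random.choice(list(Q))
    # Otherwise: exploit what the Q-table already predicts to be best.
    return best_action(Q, state)
```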
Benefits of Q-learning (2)
Can learn complex behaviours
No need for an explicit teaching signal as in supervised learning; only intermittent reward is required
Downsides of Q-learning/RL (3)
Takes time, lots of trial and error, especially if the state space (number of possible state-action pairs) is large (e.g., as in chess)
Cannot apply in all types of situations: Can’t fall randomly off a million cliffs to learn optimal behaviour. Sounds funny until you consider self-driving cars!
Training in the real-world (e.g., a robot) would take a long time
Q-learning takes a lot of time if the state space is large, but… (2)
We can approximate the Q-table with deep learning
Deep Q-learning: e.g., DeepMind’s Atari game-playing AI uses a DQN (Deep Q-Network)
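The rough idea of a DQN is to replace the table lookup Q[state][action] with a neural network that maps a state to one Q-value per action. A minimal PyTorch sketch of that idea only (the layer sizes are arbitrary assumptions; DeepMind's actual DQN uses convolutional layers on game frames plus experience replay and a target network for training):

```python
import torch
import torch.nn as nn

# A tiny network that approximates the Q-table: input is a state vector,
# output is one estimated Q-value per possible action.
class TinyQNetwork(nn.Module):
    def __init__(self, state_dim=2, n_actions=5, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Usage sketch: q_values = TinyQNetwork()(torch.tensor([[1.0, 6.0]]))
# and the greedy action index is q_values.argmax(dim=1).
```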
The change-in-Q formula is like an (2)
update equation
We update Q based on experience and change the value in the Q-table
S’ is
the next state we enter