Week 8: Introduction to Reinforcement Learning Flashcards
Reinforcement learning deals with the question of how
to make decisions in sequential problems, such as playing chess or solving a maze
Learning to make decisions in sequential problems is difficult because of
the temporal credit assignment problem: you don't know which of your past actions was pivotal for a good outcome (e.g., winning a game of chess)
General setup of model-free reinforcement learning (5)
An agent acts in an environment
It receives rewards intermittently
Rewards are positive or negative, but carry no further detail (unlike the labelled targets in supervised learning)
But the agent doesn't know the rules of the environment. It has no 'model' of the environment or task.
A model of the environment could be: the rules of chess / the strategy of the opponent
Model-free reinforcement learning is also called
trial and error learning
General setup for model-free RL: the agent in the environment could be a robot, and three things make up this conceptual model of how an artificial agent may learn in a model-free way
- Policy P
- Reward R
- State S
Diagram of model-free RL setup: agent (robot), policy, reward, and state
The agent (robot) has a policy P
a set of rules that determines which actions to take in different states of myself in the environment … such that I (hopefully) maximize my future reward
Reward R
the environment provides the reward, which in turn affects the state of the agent
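A minimal sketch of this loop in code (the toy environment, the 'goal' state, and the action names below are all made up for illustration):

import random

def policy(state):                        # policy P: maps a state to an action
    return random.choice(["left", "right"])

def env_step(state, action):              # environment: returns the next state and a reward R
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state == 3 else 0.0   # reward arrives only at one 'goal' state
    return next_state, reward

state = 0                                  # state S: the agent's current situation
for t in range(20):
    action = policy(state)
    state, reward = env_step(state, action)
    # over many such steps, the agent adjusts its policy to collect more reward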
For a given policy, a value V can be associated
with a given state
At any given moment
the agent is in a given state
Value V of a given state is usually defined as
the expected sum of (discounted) future rewards, starting from that state and following the policy
We would write V like this:
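A standard way of writing it, assuming the usual discounted-return definition (discount factor γ, reward r at each future time step, current state s):

V^{\pi}(s) = \mathbb{E}\left[\, r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots \,\middle|\, s_t = s \right] = \mathbb{E}\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s \right]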
The discount factor γ in the formula for V means that
the further in the future you expect the reward,
the less you value it (a reward k steps ahead is weighted by γ^k)
For a simple game like tic-tac-toe we might be able to write down
all the states
For chess, the number of possible states is astronomical,
so we can't write them all out
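A rough sense of the scale (these counts are rough estimates, added for illustration): tic-tac-toe has at most 3^9 = 19,683 board fillings (each of the 9 squares is X, O, or empty, and far fewer are legal), while chess is commonly estimated to have on the order of 10^43 or more legal positions.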
The goal of all reinforcement learning (esp model-free) is to (2)
optimise the policy so as to maximise the future reward!
This is a hard problem! (compare chess vs tic-tac-toe)
Toy example (3)
A robot is supposed to learn to kick a football
When the football is in its line of sight, the robot can correctly estimate the distance and direction to the ball => 2 numbers (distance and direction to the ball = the labels for our state)
We call this number pair the 'state' the robot is in. It completely characterises its 'mental repertoire'.
In a real organism, the 'state' might comprise all or a subset of the information that is momentarily coded in the brain, such as
The emotional state, hunger, thirst, current thoughts, how tired it is …
A robot has its state and a set of possible actions
actions it can take (move up/move down/move left/move right/kick ball)
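A minimal sketch of how this state and action set might be represented in code (a sketch only; the names and numbers are illustrative, not from the cards):

from dataclasses import dataclass

ACTIONS = ["up", "down", "left", "right", "kick"]    # the robot's possible actions

@dataclass
class State:
    distance: float    # estimated distance to the ball
    direction: float   # estimated direction (angle) to the ball

s = State(distance=3.0, direction=45.0)              # e.g. the ball is 3 m away, 45 degrees to the right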
When the robot kicks the ball it (2)
gets a reward! (operant conditioning style)
up until then it does not get anything (temporal credit assignment: which of the actions leading up to kicking the ball were the good ones?)
A reward for the robot is
A number in memory it ‘wants’ to maximise
The learning algorithm says (3)
maximize the reward.
Of course, the robot has no intrinsic desire here.
Our algorithm guides the robot's learning
The learning algorithm guides the robot's learning (4)
In a learning episode the robot: evaluates its own state, takes an action, and observes the reward (maybe the robot gets a reward, maybe it does not)
If a reward is received, something about the environment has been learned! (these are intermittent signals)
If no reward is received, take another action
Say a learning episode lasts until the ball is kicked, at which point a reward is received.
All the actions taken together until a reward is received are called a
learning episode
The football-kicking robot (5)
- Robot is in a given state
- Robot does not know which actions lead to reward
- Does not know what state leads to which other state
- Performs random actions initially
- Only learns at the end of each learning episode (see the sketch below)
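A toy sketch of the whole trial-and-error loop, under assumptions the cards do not spell out (a made-up one-dimensional environment, a purely random policy, and reward only for a successful kick); the end-of-episode learning update is left as a comment because the cards do not name a particular algorithm:

import random

ACTIONS = ["left", "right", "kick"]                   # reduced action set for a 1-D toy world

class ToyFootballEnv:
    """Made-up environment: the robot moves along a line and must kick when it reaches the ball."""
    def __init__(self, ball_pos=3):
        self.ball_pos = ball_pos
    def reset(self):
        self.robot_pos = 0
        return abs(self.ball_pos - self.robot_pos)    # state = distance to the ball
    def step(self, action):
        if action == "right":
            self.robot_pos += 1
        elif action == "left":
            self.robot_pos -= 1
        kicked = (action == "kick" and self.robot_pos == self.ball_pos)
        reward = 1.0 if kicked else 0.0                # reward only when the ball is kicked
        return abs(self.ball_pos - self.robot_pos), reward, kicked

def random_policy(state):                              # initially: purely random actions
    return random.choice(ACTIONS)

def run_episode(env, policy, max_steps=200):
    """One learning episode: act until the ball is kicked (reward) or we give up."""
    trajectory = []                                    # (state, action) pairs taken this episode
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action))
        state = next_state
        if done:
            return trajectory, reward                  # episode ends when the reward arrives
    return trajectory, 0.0                             # no reward this episode

trajectory, reward = run_episode(ToyFootballEnv(), random_policy)
# Only now, at the end of the episode, would a learning rule use (trajectory, reward)
# to decide which of the past actions deserve credit -- the temporal credit assignment problem.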