7 - Reinforcement Learning Flashcards
4 things RL is built on
- A policy
- A reward
- A value function
- A model of the environment
Policy
Defines agent’s way of behaving.
Maps from states to probabilities of selecting each action
If the agent follows policy π at time t, then π(a|s) is the probability that At = a given St = s
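A minimal sketch of a tabular stochastic policy (state names and probabilities are made up for illustration): a lookup table of π(a|s), sampled to pick an action.

```python
import random

# Hypothetical tabular policy: pi[s][a] = probability of selecting action a in state s.
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Sample an action a with probability pi(a|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))  # "left" roughly 80% of the time
```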
Reward signal
Defines the goal
Value function
Specifies what is good in the long term.
The value of a state s under policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter.
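One rough way to see what vπ(s) means in code (the toy dynamics below are invented for illustration): run many episodes starting from s under π and average the discounted returns, i.e. a Monte Carlo sample average of the return.

```python
import random

# Toy chain (illustrative): from "s0" the episode moves to "s1", then terminates
# with reward 1 half the time and reward 0 otherwise.
def step(state):
    """Hypothetical one-step dynamics under a fixed policy pi: returns (next_state, reward, done)."""
    if state == "s0":
        return ("s1", 0.0, False)
    return (None, 1.0 if random.random() < 0.5 else 0.0, True)

def estimate_v(start, episodes=10000, gamma=0.9):
    """Monte Carlo estimate of v_pi(start): average discounted return over many episodes."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount, done = start, 0.0, 1.0, False
        while not done:
            s, r, done = step(s)
            g += discount * r
            discount *= gamma
        total += g
    return total / episodes

print(estimate_v("s0"))  # roughly 0.9 * 0.5 = 0.45 for this toy chain
```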
Rewards
Immediate desirability of a state
Values
Long term desirability of a state
Model
Predicts or simulates environment.
Model-based RL: uses the model for planning
Model-free RL: trial-and-error learning
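A minimal sketch of what a learned one-step model might look like in the tabular case (all names invented): a table of predicted next state and reward for each (state, action) pair, which planning can query instead of the real environment.

```python
# Hypothetical one-step model: model[(state, action)] -> (predicted next state, predicted reward)
model = {
    ("s0", "right"): ("s1", 0.0),
    ("s1", "right"): ("s2", 1.0),
}

def simulate(state, action):
    """Planning queries the model rather than acting in the real environment."""
    return model[(state, action)]

print(simulate("s0", "right"))  # ("s1", 0.0)
```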
Animal Learning
Behaviours that lead to reward are reinforced; behaviours that do not lead to reward are abandoned/reduced
Dynamic Programming
Always remember the answers to subproblems you have already solved
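A tiny illustration of that idea outside RL: memoised Fibonacci, where each subproblem is solved once and its answer is reused.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each fib(k) is computed once; later calls reuse the cached answer.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))
```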
Temporal Difference
One stimulus, the secondary reinforcer, predicts arrival of a primary reinforcer.
Eg time.
Multi Armed bandit Problems
- Choose among k options
- After each choice, you receive a numerical reward (based on the choice)
- Maximise the reward over some time period (eg 1000 actions)
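A minimal sketch of such a testbed (arm count and reward distributions are illustrative): k arms, each with a fixed but unknown mean reward; pulling an arm returns a noisy reward drawn around that mean.

```python
import random

class KArmedBandit:
    """k arms with fixed but unknown mean rewards; pulls return noisy rewards."""
    def __init__(self, k=10):
        self.means = [random.gauss(0, 1) for _ in range(k)]  # true q*(a), hidden from the agent

    def pull(self, a):
        return random.gauss(self.means[a], 1)  # numerical reward for choosing arm a

bandit = KArmedBandit(k=10)
print(bandit.pull(3))
```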
N-armed bandit problem
Each of the n actions has an expected reward, called its value q
The value of an action a is the expected reward for selecting a
…. (if we know/don’t know)
q*(a) (q with a subscript star, not q times a) is the expected value of the reward Rt given that action At = a is selected: q*(a) = E[Rt | At = a]
If we knew q*(a) we could simply always pick the action with the highest value; since we don't, the task is to estimate it
Greedy actions
Go for the greatest Qt(a).
At = argmax_a Qt(a)
Exploitation
Non-Greedy Actions
Choose an action other than the one with the greatest Qt(a).
Exploration
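One common way to mix exploitation and exploration is epsilon-greedy (a sketch; the epsilon value is illustrative): take the greedy action most of the time, but pick a random action with small probability epsilon.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Q is a list of current estimates Qt(a), indexed by action."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore: a non-greedy action
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit: argmax_a Qt(a)

print(epsilon_greedy([0.1, 0.5, 0.2]))  # usually 1, occasionally a random arm
```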
A natural way to estimate q*(a) is to average the rewards actually received (think of sample averages)
Qt(a) = (sum of rewards when a taken prior to t)/(number of times a taken prior to t)
Sample average
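A self-contained sketch putting the pieces together (all parameters are illustrative): maintain Qt(a) as the sample average of rewards received for each action, updated incrementally so past rewards don't need to be stored, with epsilon-greedy action selection.

```python
import random

k = 10
true_means = [random.gauss(0, 1) for _ in range(k)]  # hidden q*(a), used only to simulate rewards
Q = [0.0] * k   # value estimates Qt(a)
N = [0] * k     # number of times each action has been taken
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(k)                # explore
    else:
        a = max(range(k), key=lambda i: Q[i])  # exploit: argmax_a Qt(a)
    r = random.gauss(true_means[a], 1)         # reward for action a
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                  # incremental form of the sample average

print(Q)  # compare the learned estimates with the hidden true values
```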