Lecture 17 - Reinforcement Learning Flashcards
Why is reinforcement learning an important part of AI?
Almost all “natural learning” is done by reinforcement
e.g. learning to read, playing chess, etc.
What are the properties of reinforcement learning?
Agent is learning to choose a sequence of actions
The ultimate consequences of an action may not be apparent until the end of the sequence
When a reward is achieved it may not be due to the most recent action.
No predefined set of training samples/examples
What is the credit assignment problem?
When a reward is achieved it may not be due to the most recent action, but one performed earlier in the sequence.
Describe the components of a Markov Decision Process
Agent operates in a domain represented as a set of distinct states, S
Agent has a set of actions it can perform, A
Time advances in discrete steps
At time t the agent knows the current state s_t and must select an action to perform
When action a_t is performed the agent receives a reward r_t, which may be positive, negative or zero. The reward depends on the current state and action, so it can be determined by a reward function R: r_t = R(s_t, a_t)
The new state s_(t+1) depends on the last state and action, so it can be determined by a transition function T: s_(t+1) = T(s_t, a_t)
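To make these components concrete, here is a minimal sketch in Python. The three-state chain, the reward values and the move table are invented for illustration; they are not from the lecture.

```python
S = ["s0", "s1", "s2"]   # the set of distinct states S
A = ["left", "right"]    # the set of actions A

def R(s, a):
    """Reward function: r_t = R(s_t, a_t)."""
    return 100 if (s == "s1" and a == "right") else 0

def T(s, a):
    """Deterministic transition function: s_(t+1) = T(s_t, a_t)."""
    table = {("s0", "right"): "s1", ("s1", "right"): "s2",
             ("s1", "left"): "s0", ("s2", "left"): "s1"}
    return table.get((s, a), s)   # undefined moves leave the state unchanged

# One discrete time step:
s_t = "s0"            # the agent observes the current state,
a_t = "right"         # selects an action,
r_t = R(s_t, a_t)     # receives a reward (here 0),
s_t1 = T(s_t, a_t)    # and moves to the new state ("s1")
```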
What does an agent in a Markov Decision Process acquire?
A control policy, i.e. a function that determines the best action to take in the current state
Describe the “immediate reward” strategy for determining the best action in a Markov Decision Process, and why it is/isn’t usually used
Choosing the action with the highest immediate reward
Produces a good short term payoff but might not be optimal in the long run
Describe the “total payoff” strategy for determining the best action in a Markov Decision Process, and why it is/isn’t usually used
Maximise the total payoff by choosing the sequence of actions whose rewards sum to the largest total
Not realistic, because it treats a reward in the very distant future as just as valuable as one received immediately, which is not usually the case
Describe the “discounted cumulative reward” strategy for determining the best action in a Markov Decision Process, and why it is/isn’t usually used
Same as total payoff, except each reward is scaled down by a discount factor for every step into the future, so distant rewards are worth less than more immediate ones
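A small sketch of the difference, assuming a discount factor gamma in [0, 1); the value 0.9 and the reward sequence are illustrative choices, not from the lecture:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward: r_t + gamma*r_(t+1) + gamma^2*r_(t+2) + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [0, 0, 100]               # illustrative reward sequence
print(sum(rewards))                 # total payoff: 100
print(discounted_return(rewards))   # discounted: 0.9**2 * 100 = 81.0
```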
What is the learning task in Markov Decision Processes?
To discover the optimal control policy, i.e. the best action for each state
If the agent in a Markov Decision Process knows the transition function, the reward function and the discounted value V* of each state, then V* can be used as
an evaluation function for actions
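A sketch of what that evaluation looks like, assuming V_star is an already-computed table mapping each state to its discounted value (the function name and GAMMA value are hypothetical). The agent looks one step ahead through T and R:

```python
GAMMA = 0.9   # assumed discount factor

def best_action(s, actions, T, R, V_star):
    """One-step lookahead: argmax over a of [ R(s, a) + gamma * V*(T(s, a)) ]."""
    return max(actions, key=lambda a: R(s, a) + GAMMA * V_star[T(s, a)])
```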
If an agent in a Markov decision process does not know T or R, no form of evaluation function that requires _____________ is possible
looking ahead
What is the Q function?
An evaluation function of both state and action that estimates the total discounted payoff from choosing a particular action in a particular state
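A sketch of the idea, assuming a deterministic MDP and a Q table keyed by (state, action) pairs. The one-step update shown is the standard deterministic Q-learning rule, stated here as an assumption rather than as this lecture's exact notation; note that choosing an action needs no lookahead through T or R, only the stored Q values:

```python
GAMMA = 0.9

def choose(Q, s, actions):
    """Action selection needs no lookahead: compare stored Q values only."""
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions):
    """Deterministic one-step update: Q(s, a) <- r + gamma * max over a' of Q(s', a')."""
    Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
```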
What are two possible Action Selection strategies in Markov Decision Processes?
Uniform Random Selection
Select Highest Expected Cumulative Reward
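Minimal sketches of the two strategies, assuming the same Q table as in the previous snippet; the function names are illustrative:

```python
import random

def uniform_random_selection(Q, s, actions):
    """Explore: every action is equally likely, so every transition gets tried."""
    return random.choice(actions)

def highest_expected_reward(Q, s, actions):
    """Exploit: take the action with the largest current Q estimate."""
    return max(actions, key=lambda a: Q[(s, a)])
```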
What is the advantage and disadvantage of using uniform random selection in Markov Decision Processes?
Advantage: Will explore the entire state space, and hence satisfies the conditions of the convergence theorem
Disadvantage: May spend a great deal of time learning the value of transitions that are not optimal
What is the advantage and disadvantage of selecting the highest expected cumulative reward as the action selection strategy in Markov Decision Processes?
Advantage: Concentrates resources on apparently useful transitions
Disadvantage: May ignore even better pathways that haven’t been explored, and does not satisfy the conditions of the convergence theorem