Lecture 22 Flashcards
How do reinforcement learning problems mainly differ from sequential decision problems?
The transition model and reward function are not known in advance; the agent has to learn an optimal policy from the rewards it observes, via trial and error.
What are the two key things an agent must do when interacting with its environment in reinforcement learning, and how is this balanced?
The agent needs to exploit the best actions it currently knows (as dictated by its policy) but also needs to explore, i.e. occasionally take a non-optimal action, to discover better strategies.
In practice the best-known action gets the largest chance of being picked, but other actions can still be chosen, with more randomness early in training than later (see the sketch below).
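A minimal sketch of one common way to implement this (softmax / Boltzmann action selection with a temperature that decays over training); the Q-values and the decay schedule below are made-up assumptions, not values from the lecture:

```python
import numpy as np

def softmax_action(q_values, temperature):
    """Pick an action stochastically: higher Q-values are more likely,
    and a higher temperature means more randomness."""
    prefs = np.asarray(q_values) / temperature
    prefs -= prefs.max()                         # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

# Illustrative schedule: the temperature starts high (lots of exploration)
# and decays, so late in training the best action is picked almost every time.
q = [1.0, 0.5, -0.2]                             # hypothetical Q-values for one state
for episode in range(100):
    temperature = max(0.1, 1.0 * 0.95 ** episode)
    action = softmax_action(q, temperature)
```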
What do we compute instead of state utility for reinforcement learning?
The action-utility (Q-value) representation: for each state there is an array of values, one per action, and the policy picks the action with the highest value.
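A tiny sketch of this representation, assuming a small discrete problem (the state and action counts are illustrative, not from the lecture):

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2            # illustrative sizes
Q = np.zeros((N_STATES, N_ACTIONS))   # one value per (state, action) pair

def greedy_policy(state):
    """The policy simply picks the action with the largest Q-value in this state."""
    return int(np.argmax(Q[state]))
```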
What is temporal difference learning?
The Q-value for a state and action is updated towards the reward received in the current state plus the (discounted) maximum Q-value of the state the action leads to: Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_a' Q(s', a') - Q(s, a) ].
The update must be gradual (a small learning rate alpha) so that the stochastic behaviour of the environment averages out over many samples instead of each sample overwriting the estimate.
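A sketch of the update just described, written as the standard Q-learning rule: the "gradual" part is the learning rate alpha, which blends each new sample into the old estimate. The learning rate, discount factor and table sizes are illustrative assumptions:

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.9               # learning rate and discount factor (illustrative)

def td_update(s, a, reward, s_next):
    """Move Q(s, a) a small step towards reward + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```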
What can we do if there are too many states in a reinforcement learning problem?
Use a function approximator, such as a neural network, that maps a state (or state features) to the Q-values of the actions, instead of storing an explicit table entry for every state.
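A minimal sketch of the idea: a tiny one-hidden-layer network that maps a state feature vector to one Q-value per action. The layer sizes and random weights are assumptions for illustration; a real implementation would also train the weights (e.g. by gradient descent on the TD error):

```python
import numpy as np

STATE_DIM, N_ACTIONS, HIDDEN = 4, 2, 16          # illustrative sizes
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def q_values(state_features):
    """State features in, one approximate Q-value per action out."""
    hidden = np.maximum(0.0, state_features @ W1)   # ReLU hidden layer
    return hidden @ W2

q = q_values(np.array([0.2, -1.0, 0.5, 0.0]))       # array of N_ACTIONS Q-values
```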
What is the difference between on-policy and off-policy learning?
Off-policy (e.g. Q-learning): updates use the value of the best action from the next state, regardless of which action the agent actually takes next.
On-policy (e.g. SARSA): updates use the value of the action actually taken by the current policy, rather than the best possible one.
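A side-by-side sketch of the two update rules, with Q-learning as the usual off-policy example and SARSA as the usual on-policy example; the table size, alpha and gamma are illustrative assumptions:

```python
import numpy as np

Q = np.zeros((4, 2))                  # illustrative Q-table
alpha, gamma = 0.1, 0.9

def q_learning_update(s, a, reward, s_next):
    """Off-policy: the target uses the BEST action in the next state,
    whatever action the behaviour policy actually takes next."""
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(s, a, reward, s_next, a_next):
    """On-policy: the target uses the action the current policy actually took next."""
    target = reward + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```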