13 Reinforcement Learning Flashcards
Reinforcement learning
Reinforcement learning (RL) is a form of unsupervised learning: the agent receives no examples and starts with no model of the environment and no utility information. It must learn by trial and error, using the rewards (reinforcement) it receives to guide learning.
Examples:
• Learning a game by making moves until the game is won or lost: the reward arrives only at the end
• Learning to ride a bicycle without any assistance: rewards are received more frequently
RL can be seen as encompassing all of AI: an agent must learn to behave in an unknown environment.
Variations of RL
Accessible environment (agent can use percepts) vs. inaccessible (must have some model).
Agent may have some initial knowledge, or not have any domain model.
Rewards can be received only in terminal states, or in any state.
Rewards can be part of the actual utility, or just hint at the actual utility.
The agent can be a passive (watching) or an active (exploring) learner.
RL uses results from sequential decision processes.
Sequential decision processes
In a sequential decision process, the agent’s utility depends on a sequence of decisions. Such problems involve utility (of states) and uncertainty (of action outcomes).
The agent needs a policy that tells it what to do in any state it might reach: a = π(s). An optimal policy π*(s) is a policy that gives the highest expected utility.
Example
- 4x3 environment
- Agent starts in (1, 1)
- Probabilistic transition model
- The sequence [Up, Up, Right, Right, Right] reaches the +1 state at (4, 3) with probability only 0.8^5 ≈ 0.33
- Terminal rewards are +1 at (4, 3) and -1 at (4, 2)
- All other states have reward -0.04
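To make the example concrete, here is a minimal Python sketch of how the 4x3 world could be encoded (names such as REWARD, STATES and transitions are illustrative, not from the notes). It assumes the usual transition model in which an action moves in the intended direction with probability 0.8 and perpendicular to it with probability 0.1 each way:

```python
REWARD = {(4, 3): +1.0, (4, 2): -1.0}   # terminal rewards
STEP_REWARD = -0.04                     # reward in every non-terminal state
TERMINALS = {(4, 3), (4, 2)}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}  # (2, 2) is a wall

MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                 'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def R(s):
    """Reward received in state s."""
    return REWARD.get(s, STEP_REWARD)

def transitions(s, a):
    """Return a list of (probability, next_state) pairs for P(s' | s, a)."""
    if s in TERMINALS:
        return []
    def go(direction):
        dx, dy = MOVES[direction]
        s2 = (s[0] + dx, s[1] + dy)
        return s2 if s2 in STATES else s   # bumping into a wall or the edge: stay put
    left, right = PERPENDICULAR[a]
    return [(0.8, go(a)), (0.1, go(left)), (0.1, go(right))]

# Probability that [Up, Up, Right, Right, Right] goes exactly as intended:
print(0.8 ** 5)   # 0.32768, i.e. roughly 0.33
```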
Markov Decision Processes (MDP)
An MDP is a sequential decision process with the following characteristics:
- Fully observable environment (agent knows where it is at any time)
- State transitions are Markovian, i.e. P(s’ | s,a) (probability of reaching s’ from s by action a) depends only on s and a, not earlier state history
- Agent receives a reward R(s) in each state s
- The total utility U(s) of s is the sum of the rewards received, from s until a terminal state is reached
Optimal policy and utility of states
The utility of a state s depends on rewards received but also on the policy π followed:
U^π(s) = E[ Σ_t R(S_t) ],  where S_0 = s and the expectation is taken over the state sequences generated by following π
Of all the possible policies the agent could follow, one gives the highest expected utility:
π* = argmax_π U^π(s)
This is the optimal policy. Under certain assumptions, it is independent of the starting state. For an MDP with a known transition model and reward function, and assuming the agent follows the optimal policy, we can calculate the utility U(s) of each state s.
Knowing U(s) allows the agent to select the optimal action: the one that maximizes the expected utility of the next state, π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s').
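As an illustration, assuming utilities U(s) and a transitions(s, a) function like the one in the grid-world sketch above, optimal action selection is a maximum-expected-utility computation (the function names are illustrative):

```python
def expected_utility(s, a, U, transitions):
    """Expected utility of doing action a in state s, given state utilities U."""
    return sum(p * U[s2] for p, s2 in transitions(s, a))

def best_action(s, U, actions, transitions):
    """pi*(s) = argmax_a  sum_s' P(s' | s, a) U(s')."""
    return max(actions, key=lambda a: expected_utility(s, a, U, transitions))
```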
Bellman equations
We need to be able to calculate the utilities U(s) in order to define the optimal policy. We can exploit the dependence between states: the utility of a state s is the reward R(s) plus the maximum expected utility of the next state:
U(s) = R(s) + max_a Σ_{s'} P(s' | s, a) U(s')
This is the Bellman equation. There are n equations for n states, containing utilities U(s) as n unknowns. Solving the equations yields the utilities U(s).
Value iteration
The Bellman equations are nonlinear (because of the max operator) and cannot be solved with linear algebra. An iterative approach can be used instead:
• Start with arbitrary initial values
• Calculate the right-hand sides of the equations
• Plug the resulting values into the left-hand sides U(s)
• Iterate until the values stabilize (within a margin)
This VALUE-ITERATION algorithm is guaranteed to converge to a unique solution.
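A minimal value-iteration sketch under the same assumptions as the grid-world code above (states, actions, R and transitions are passed in; epsilon is the stabilization margin; all names are illustrative):

```python
def value_iteration(states, actions, R, transitions, terminals, epsilon=1e-4):
    """Iterate Bellman updates until all utilities change by less than epsilon."""
    U = {s: 0.0 for s in states}               # arbitrary initial values
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            if s in terminals:
                U_new[s] = R(s)                # a terminal's utility is just its reward
            else:
                U_new[s] = R(s) + max(
                    sum(p * U[s2] for p, s2 in transitions(s, a))
                    for a in actions)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon:
            return U
```

With the grid-world definitions above, value_iteration(STATES, list(MOVES), R, transitions, TERMINALS) should converge to the utilities of the 4x3 world.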
Policy iteration
Instead of iterating on utilities U(s) and then deriving an optimal policy, we can iterate directly over policies:
Policy evaluation: given a policy π_i, calculate U_i, the utility of each state if π_i is followed.
Policy improvement: calculate a new policy π_{i+1} that in each state selects the action maximizing the expected utility of the successor states (MEU).
Repeat until the policy no longer changes.
This POLICY-ITERATION algorithm is guaranteed to converge and to produce an optimal policy.
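A sketch of policy iteration under the same assumptions; for brevity it uses simplified (iterative) policy evaluation with a fixed number of sweeps rather than an exact linear solve:

```python
import random

def policy_iteration(states, actions, R, transitions, terminals):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    pi = {s: random.choice(actions) for s in states if s not in terminals}
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the (linear) Bellman equations for pi.
        for _ in range(50):
            for s in states:
                if s in terminals:
                    U[s] = R(s)
                else:
                    U[s] = R(s) + sum(p * U[s2] for p, s2 in transitions(s, pi[s]))
        # Policy improvement: pick the MEU action in each state.
        unchanged = True
        for s in pi:
            def eu(a):
                return sum(p * U[s2] for p, s2 in transitions(s, a))
            best = max(actions, key=eu)
            if eu(best) > eu(pi[s]):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```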
Reinforcement learning of MDP
We could find an optimal policy for an MDP if we knew the transition model P(s' | s, a). However, an agent in an unknown environment does not know the transition model, nor does it know in advance what rewards it will receive in new states.
We want the agent to learn to behave rationally in an unsupervised process. The purpose of RL is to learn the optimal policy based only on received rewards.
Different RL agent designs
• Utility-based agents learn a utility function on states and use it to select actions that maximize expected utility. This also requires a model of where actions lead.
• Q-learning agents learn an action-utility function (Q-function) giving the expected utility of taking a given action in a given state. They can select actions without knowing where the actions lead, at the expense of not being able to look ahead (see the update sketch after this list).
• Reflex agents learn a policy that maps directly from states to actions, i.e. π*.
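As an illustration of the Q-learning design, here is a sketch of the standard Q-learning update rule (alpha is a learning rate and gamma a discount factor; both values are illustrative, and the notes' own formulation is undiscounted):

```python
from collections import defaultdict

Q = defaultdict(float)              # Q[(state, action)], initialised to 0
alpha, gamma = 0.1, 1.0             # learning rate and discount (illustrative values)

def q_update(s, a, r, s2, actions):
    """One update after observing reward r in s and the transition s -> s2:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s2,a') - Q(s,a))."""
    best_next = max(Q[(s2, a2)] for a2 in actions) if actions else 0.0  # 0 if s2 is terminal
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```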
Direct utility estimation
In passive learning, the agent's policy is fixed; it only needs to learn how good the policy is. The agent runs a number of trials, starting in (1, 1) and continuing until it reaches a terminal state. The utility of a state is the expected total remaining reward (reward-to-go). Each trial provides a sample of the reward-to-go for each visited state. The agent keeps a running average for each state, which will converge to the true value. This is called direct utility estimation.
Direct utility estimation converges slowly.
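A minimal sketch of direct utility estimation, assuming each completed trial is recorded as a list of (state, reward) pairs from the start state through to a terminal state:

```python
from collections import defaultdict

def direct_utility_estimation(trials):
    """Estimate U(s) as the running average of the observed reward-to-go
    (total remaining reward) over every visit to s in every trial."""
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:
        rewards = [r for _, r in trial]
        for t, (s, _) in enumerate(trial):
            reward_to_go = sum(rewards[t:])
            totals[s] += reward_to_go
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```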
Exploiting state dependencies
Direct utility estimation fails to exploit the fact that the utilities of states are dependent, as expressed by the Bellman equations.
Learning can be sped up by using these dependencies.
Direct utility estimation can be seen as searching a hypothesis space that is too large, one containing many hypotheses that violate the Bellman equations.
Adaptive Dynamic Programming (ADP)
An ADP agent uses dependencies between states to speed up value estimation. It follows a policy π and can use observed transitions to incrementally build the transition model P(s’ | s, π(s)). It can then plug the learned transition model and observed rewards R(s) into the Bellman equations to get U(s). The equations are linear because there is no max operator, and therefore easier to solve. The result is U(s) for the given policy π.
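A sketch of how a passive ADP agent might be organised under these assumptions: transition probabilities are estimated from counts of observed transitions under the fixed policy, and the resulting (linear) Bellman equations are then solved, here simply by iterating them (an exact linear solve would also work):

```python
from collections import defaultdict

class PassiveADPAgent:
    """Learns P(s' | s, pi(s)) from observed transitions and evaluates the fixed policy."""

    def __init__(self, states, terminals, R):
        self.states, self.terminals, self.R = states, terminals, R
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[s][s2]
        self.U = {s: 0.0 for s in states}

    def observe(self, s, s2):
        """Record one observed transition s -> s2 under the fixed policy."""
        self.counts[s][s2] += 1

    def update_utilities(self, sweeps=100):
        """Iterate the linear Bellman equations for the estimated model."""
        for _ in range(sweeps):
            for s in self.states:
                if s in self.terminals or not self.counts[s]:
                    self.U[s] = self.R(s)
                    continue
                n = sum(self.counts[s].values())
                expected = sum(c / n * self.U[s2] for s2, c in self.counts[s].items())
                self.U[s] = self.R(s) + expected
        return self.U
```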
Temporal Difference (TD) learning
TD is another passive utility value learning algorithm using Bellman equations. Instead of solving the equations, TD uses the observed transitions to adjust the utilities of the observed states to agree with Bellman. TD uses a learning rate parameter α to select the rate of change of utility adjustment. TD does not need a transition model to perform its updates, only the observed transitions.
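A one-function sketch of the TD update after an observed transition s -> s' with reward r received in s (undiscounted, matching the formulation in these notes; alpha is the learning rate):

```python
def td_update(U, s, r, s2, alpha=0.1):
    """U(s) <- U(s) + alpha * (r + U(s') - U(s))."""
    u_s, u_s2 = U.get(s, 0.0), U.get(s2, 0.0)
    U[s] = u_s + alpha * (r + u_s2 - u_s)
    return U
```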
Passive learning
• Direct utility estimation
• Adaptive Dynamic Programming
• Temporal Difference learning
Active learning
While a passive RL agent executes a fixed policy π, an active RL agent has to decide which actions to take. An active RL agent extends a passive one, e.g. the passive ADP agent, with the following:
• It needs to learn a complete transition model for all actions (not just for π), using passive ADP learning
• Utilities need to reflect the optimal policy π*, as expressed by the Bellman equations
• The equations can be solved with the VALUE-ITERATION or POLICY-ITERATION methods described earlier
• The action to execute is selected as the optimal (maximizing) one
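A sketch of what the active extension might look like: estimate P(s' | s, a) for all actions from observed counts, recompute utilities with the value_iteration sketch given earlier, and pick the maximizing action (a real agent would also need an exploration strategy, which is omitted here; all names are illustrative):

```python
from collections import defaultdict

class ActiveADPAgent:
    """Learns a full transition model P(s' | s, a) and acts greedily on the
    utilities obtained by value iteration on that learned model."""

    def __init__(self, states, actions, terminals, R):
        self.states, self.actions = states, actions
        self.terminals, self.R = terminals, R
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s2]
        self.U = {s: 0.0 for s in states}

    def observe(self, s, a, s2):
        """Record one observed outcome of taking action a in state s."""
        self.counts[(s, a)][s2] += 1

    def learned_transitions(self, s, a):
        """Estimated (probability, next_state) pairs for P(s' | s, a)."""
        n = sum(self.counts[(s, a)].values())
        return [(c / n, s2) for s2, c in self.counts[(s, a)].items()] if n else []

    def choose_action(self, s):
        # Re-solve for U on the learned model, then act greedily (MEU).
        self.U = value_iteration(self.states, self.actions, self.R,
                                 self.learned_transitions, self.terminals)
        return max(self.actions,
                   key=lambda a: sum(p * self.U[s2]
                                     for p, s2 in self.learned_transitions(s, a)))
```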