Markov Decision Process Flashcards
What are Markov Decision Processes?
A dynamic system whose future probabilistic behaviour depends on (1) the present state and (2) the decision taken, and which is subject to some uncertainty.
- a Markov Chain lets us predict how a process is going to evolve, but we have no control over that evolution
- dynamic system
- evolving probabilistically, according to the present state
- future behaviour depends only on the present state
- unlike a Markov Chain, we need to make decisions, so we have an action description for making the decision
- the reward we get depends on the state and the decision (action) chosen
- transition probabilities: depend on both the state and the action
We only consider the stationary MDP.
In a Markov Chain there is no maximisation over the actions k in K_i, because in a Markov Chain we don't need to make decisions.
Elements of a Markov Decision Process
- State Description: similar to Markov Chain
- Action Description: records the decisions (actions) available in each state
- Immediate Reward: r_i^k, the reward received when the process is in state i and action k is chosen
- State Transitions: transition probabilities p_ij^k, the probability of the process making a transition to state j when the process is in state i and action k is chosen
- Optimality Equation: unlike a Markov Chain, it depends on the action; its form depends on 1. the decision criterion and 2. the planning horizon
Value Function:
V_i^n enables us to express and maximise the expected total reward.
- Expected total reward over finite horizon
The Optimality Equation gives the maximum expected total reward over a finite horizon (choosing the decision in each state to maximise the value).
V_i^n = maximum expected total reward over n steps when the process is in state i initially.
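As a sketch, the recursion this definition implies can be written in the r_i^k, p_ij^k notation above (the exact typesetting is an assumption, not taken from the notes):

```latex
V_i^n = \max_{k \in K_i} \Big\{ r_i^k + \sum_{j} p_{ij}^k \, V_j^{n-1} \Big\}, \qquad V_i^0 = 0
```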
- Expected discounted reward over finite horizon (+ beta)
We should consider the discount factor beta: one pound received in period 1 is worth only beta pounds today.
V_i^n = maximum expected discounted reward over n steps when the process is in state i initially
- V_i^0 = 0
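Under this criterion the future term in the recursion picks up a factor of beta; a sketch in the same notation:

```latex
V_i^n = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j} p_{ij}^k \, V_j^{n-1} \Big\}, \qquad V_i^0 = 0, \quad 0 < \beta < 1
```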
- Expected discounted reward over infinite horizon
In many cases there will be a stationary optimal policy, and so even with a finite planning horizon the policy that is recommended is often a stationary one.
Imagine you are at the beginning of the planning horizon and the process is in some state; you essentially have to make a decision that balances the immediate reward against the expected future rewards. Since the future rewards are still earned over an infinite planning horizon, it seems sensible for the optimal policy to be a stationary policy.
We are guaranteed to find a stationary optimal policy if we are trying to maximise the expected discounted reward over an infinite planning horizon.
V_i^n: maximum expected total reward over the next n steps when the process is in state i initially.
Find the limit of V_i^n / n as n tends to infinity.
You have to do a lot of calculations to see that the average reward per step does not depend on the starting state.
The initial changes are large, and then the changes become progressively smaller.
Policy for MDP
A policy for a MDP is a rule that prescribes the action to take in each state at each step.
A rule that will prescribe the action for all possible states at each step.
A decision maker can use such a rule to determine which action to take based on where they currently are in the planning horizon and which state the process is in.
Aim of the Policy: (Criterion)
- maximum expected total reward
- maximum expected discounted reward
- maximum long-run average reward per step.
Once we fix a policy, we essentially get a Markov Chain, so we can then find the long-run distribution of the states and hence the long-run average reward per step.
Stationary Policy
In a stationary Markov Decision Process, the states, actions, immediate rewards and transition probabilities are the same at every stage of the process.
Although the MDP is stationary, the state the process occupies changes over time; the rewards, actions and transition probabilities do not depend on time.
- Is a policy that prescribes a single action whenever the process is in state i.
- take the same action at any time in the planning horizon.
If we implement a stationary policy, the way the process evolves is then determined by a Markov Chain: by specifying the decision taken in each state, we specify the transition probabilities for each state.
We can then see how the process moves between the states over time, and we can say something about the properties of how the process will evolve using any of the methods we used for Markov Chains.
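As a sketch of this point, if a stationary policy delta picks action delta(i) in state i (the delta(i) notation is introduced here for illustration), the induced Markov Chain has transition probabilities and rewards

```latex
P_{ij} = p_{ij}^{\delta(i)}, \qquad r_i = r_i^{\delta(i)}
```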
Infinite Planning Horizon
We argue that infinite-horizon criteria might reasonably provide a good policy for a planning horizon which is uncertain in length or which is very long.
It is not certain that the policy we obtain will be applied forever. If there is a change to the underlying situation then we would recalculate the optimal policy.
So very often we solve the problem assuming an infinite planning horizon. We may get a stationary policy that is quite convenient to apply, but we only apply it for the first few decisions and then re-examine the situation and re-optimise the problem.
In general, expected total reward is not suitable as a decision criterion over an infinite planning horizon.
Suppose that r_i^k > 0 for all i and k. In each step we get a positive reward, so if we add all the rewards up we get an infinite total reward. In such a situation, under any policy over an infinite period the total reward will be infinite. As a result, good and bad policies look the same, which makes it impossible to say which policy is better. The problem is that expected total reward is not suitable for an infinite planning horizon.
This problem can be overcome through
- Long run average reward per step
- Expected discounted reward
- Long run average reward per step
V_i^n = maximum expected total reward over n steps when the process is in state i initially.
Maximum long-run average reward per step when the process is in state i initially
= limit of V_i^n / n as n tends to infinity.
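Written out as a formula (a sketch; g_i is introduced here simply as a label for the long-run average reward):

```latex
g_i = \lim_{n \to \infty} \frac{V_i^n}{n}
```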
In many cases the long-run average reward per step is independent of the starting state. This is similar to the long-run distribution of the state of a Markov Chain.
The decisions taken influence the transition probabilities, so it is trickier to specify conditions under which the long-run average reward does not depend on the starting state.
With this criterion the policy is driven by long-run behaviour. Since the emphasis is on the distant future, the policy may take decisions which are relatively more expensive in the short run in order to get into a good position that can be exploited in the long term.
If a decision maker does want such a policy then the long-run average reward per step should be used. This criterion is based on future behaviour, so the early rewards do not matter at all.
If the decision maker only wants to implement the policy for a few periods, this may not be a sensible criterion to use, as the early decisions can be very expensive.
- Expected discounted reward over an infinite horizon
Highest reward is C
Worst reward is -C
Rewards are bounded between -C and C
- This means that the maximum infinite-horizon expected discounted reward is finite
- because we apply the discount factor at every step
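A quick sketch of why bounded rewards plus discounting give a finite value (assuming 0 < beta < 1):

```latex
|v_i| \le \sum_{n=0}^{\infty} \beta^n C = \frac{C}{1-\beta} < \infty
```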
v_i = maximum expected discounted reward over an infinite horizon when the process is in state i initially
Let vi be the expected discounted reward over an infinite horizon using the stationary policy (delta) and starting in state i
Find a policy to maximise this; the optimal values satisfy the Optimality Equation.
Optimality Equation: the same as in the finite case but with the n and n-1 superscripts removed, so it is no longer an iterative recursion. In the finite case we can specify V^0 and use it to find V^1, but here we are solving directly for the v_i.
(It gets more difficult to solve.)
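A sketch of that equation in the notation used above:

```latex
v_i = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j} p_{ij}^k \, v_j \Big\} \quad \text{for all } i
```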
In contrast with the long-run average reward per step, with this criterion (the expected discounted reward over an infinite horizon) it is clear that the early rewards are the most important ones.
How do you define 'early'?
It depends on the value of beta: as beta gets closer to 1, rewards remain significant for longer into the future; as beta gets closer to 0, only the rewards over a shorter time matter.
The maximising policy delta is called the optimal policy.
2 possible methods to find the optimal values: 1. Value Iteration, 2. Policy Iteration
Value Iteration - Successive Approximations
Coursework
Simplest
Solve a finite planning horizon problem, but solve it repeatedly for longer and longer planning horizons: solve the 1-step problem, then extend it to a 2-step problem, then a 3-step problem, and keep going.
Step 1: Initialisation: set V_i^0 = 0 for all states i.
Step 2: Iteration: calculate V_i^n from the optimality equation for the finite stationary discounted MDP, keeping a note of the action that maximises the value of V_i^n.
Step 3: Stopping Criterion: if the stopping criterion is not satisfied, set n = n+1 and repeat from Step 2. As n increases, V_i^n gets very close to the optimal value, and at that point the policy we obtain is very close to the stationary optimal policy.
Issue: how many iterations do we have to do before we can be satisfied with the policy, or at least with a good approximation of the optimal policy?
Solution: bounds on value iteration. The values calculated at each iteration of the value iteration algorithm can be used to obtain bounds on the optimal values of the infinite-horizon policy.
- the key quantity is the maximum absolute difference between the optimal values and the values calculated at iteration n of the process
- this tells us how close we are to the optimal value; E_n is the error at the last iteration
- the gap between the bounds is 2 E_n
- once E_n is effectively 0 (e.g. within 0.0005 of the optimal value) we can stop
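A minimal Python sketch of this value iteration scheme for a stationary discounted MDP. The data layout (rewards r[i][k], transition probabilities p[i][k][j]) and the simple stopping rule based on the largest change between successive value functions are assumptions for illustration, not taken from the notes:

```python
def value_iteration(r, p, beta, tol=0.0005):
    """Successive approximations for a stationary discounted MDP.

    r[i][k]    : immediate reward when action k is taken in state i
    p[i][k][j] : probability of moving from state i to state j under action k
    beta       : discount factor, 0 < beta < 1
    tol        : stop once successive value functions differ by less than tol
    """
    n_states = len(r)
    v = [0.0] * n_states  # Step 1: V_i^0 = 0 for every state i
    while True:
        v_new, policy = [], []
        for i in range(n_states):
            # Step 2: optimality equation, noting the maximising action
            values = [
                r[i][k] + beta * sum(p[i][k][j] * v[j] for j in range(n_states))
                for k in range(len(r[i]))
            ]
            best_k = max(range(len(values)), key=lambda k: values[k])
            policy.append(best_k)
            v_new.append(values[best_k])
        # Step 3: stopping criterion based on the largest change E_n
        e_n = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if e_n < tol:
            return v, policy
```

The tolerance of 0.0005 here simply echoes the accuracy mentioned above; the exact bound used in the notes may differ.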
Policy Iteration
- Find a stationary policy
- Try to improve the policy
- Evaluate and if you can improve, then repeat
Step 1: Initialisation: start the process with a stationary policy delta_1 and let n = 1.
Step 2: Policy Evaluation: for all i, calculate v_i, the expected discounted infinite-horizon reward starting in state i and following policy delta_n.
Step 3: Policy Improvement: look for a stationary policy that is better than delta_n.
For each state i, calculate delta_{n+1}(i), the action that maximises the expected discounted reward (in a two-action example this means calculating two expected rewards and keeping the better action); the resulting policy is the one we choose next.
Step 4: Stopping Criterion:
If delta_{n+1} = delta_n, then delta_n is an optimal policy; otherwise let n = n+1 and repeat from Step 2.
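A minimal Python sketch of these four steps, assuming the same r[i][k] / p[i][k][j] layout as the value iteration sketch above. Step 2 is done here by simple fixed-point iteration rather than by solving the linear equations exactly; that choice is an assumption for illustration:

```python
def policy_iteration(r, p, beta):
    """Policy iteration for a stationary discounted MDP (same data layout as above)."""
    n_states = len(r)
    policy = [0] * n_states  # Step 1: start from an arbitrary stationary policy

    def evaluate(pol):
        # Step 2: policy evaluation by fixed-point iteration on v = r_delta + beta * P_delta v
        v = [0.0] * n_states
        while True:
            v_new = [
                r[i][pol[i]] + beta * sum(p[i][pol[i]][j] * v[j] for j in range(n_states))
                for i in range(n_states)
            ]
            if max(abs(a - b) for a, b in zip(v_new, v)) < 1e-10:
                return v_new
            v = v_new

    while True:
        v = evaluate(policy)
        # Step 3: policy improvement - pick the maximising action in each state
        new_policy = []
        for i in range(n_states):
            values = [
                r[i][k] + beta * sum(p[i][k][j] * v[j] for j in range(n_states))
                for k in range(len(r[i]))
            ]
            new_policy.append(max(range(len(values)), key=lambda k: values[k]))
        # Step 4: stop when the policy no longer changes, otherwise repeat
        if new_policy == policy:
            return v, policy
        policy = new_policy
```

Because the improvement step can only change the policy finitely many times, the outer loop terminates after a finite number of iterations, which matches the advantage noted below.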
Advantage:
The algorithm is guaranteed to terminate in a finite number of iterations.
Disadvantage:
Each iteration involves a lot of calculation: evaluating the current policy and then improving it. Compared with value iteration, each individual iteration is much more work.
Value Iteration and Policy Iteration
Both methods find the unique values v_i for all i, or good approximations to them.
- value iteration takes many iterations to converge compared with policy iteration
- each iteration of policy iteration takes a lot of calculation, as you have to evaluate the policy and then improve it
- in reality policy iteration also only gives a good approximation of the optimal values, because numerical errors such as rounding errors mean that you get an approximation
- value iteration only aims to find a good approximation in the first place
- Both methods find a good approximation of the stationary optimal policy.