Markov Decision Process Flashcards
What are Markov Decision Processes?
A dynamic system whose future probabilistic behaviour depends on (1) the present state and (2) the decision taken, and which is subject to some uncertainty.
- a Markov Chain lets us predict how a process is going to evolve, but we have no control over that evolution
- dynamic system
- evolving probabilistically, according to the present state
- future behaviour depends only on the present state
- unlike a Markov Chain, we need to make decisions, so we have an action description for making the decision
- the reward we get depends on the state and the decision (action) chosen
- transition probabilities: depend on both the state and the action
We only consider the stationary MDP.
In a Markov Chain there is no maximisation over the actions k in K_i, because in a Markov Chain we don't need to make decisions.
Elements of a Markov Decision Process
- State Description: similar to Markov Chain
- Action Description: records the decisions (actions) available in each state
- Immediate Reward: r_i^k, the reward received when the process is in state i and action k is chosen
- State Transitions: transition probabilities p_ij^k, the probability of the process making a transition to state j when the process is in state i and action k is chosen
- Optimality Equation: unlike a Markov Chain, it depends on the action; its form depends on 1. the decision criterion and 2. the planning horizon
Value Function:
V_i^n enables us to express and maximise the expected total reward.
- Expected total reward over finite horizon
The Optimality Equation gives the maximum expected total reward over a finite horizon (choosing the decision in each state to maximise the value).
V_i^n = maximum expected total reward over n steps when the process is in state i initially.
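As a sketch, the recursion this definition implies can be written in the r_i^k, p_ij^k notation above (the exact typesetting is an assumption, not taken from the notes):

```latex
V_i^n = \max_{k \in K_i} \Big\{ r_i^k + \sum_{j} p_{ij}^k \, V_j^{n-1} \Big\}, \qquad V_i^0 = 0
```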
- Expected discounted reward over finite horizon (+ beta)
We should consider the discount factor beta: one pound received in period 1 is worth only beta pounds today.
V_i^n = maximum expected discounted reward over n steps when the process is in state i initially
- V_i^0 = 0
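Under this criterion the future term in the recursion picks up a factor of beta; a sketch in the same notation:

```latex
V_i^n = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j} p_{ij}^k \, V_j^{n-1} \Big\}, \qquad V_i^0 = 0, \quad 0 < \beta < 1
```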
- Expected discounted reward over infinite horizon
In many cases there will be a stationary optimal policy, and so even with a finite planning horizon the policy that is recommended is often a stationary one.
Imagine you are at the beginning of the planning horizon and the process is in some state; you essentially have to make a decision that balances the immediate reward against the expected future rewards. Since the future rewards are still earned over an infinite planning horizon, it seems sensible for the optimal policy to be a stationary policy.
We are guaranteed to find a stationary optimal policy if we are trying to maximise the expected discounted reward over an infinite planning horizon.
V_i^n: maximum expected total reward over the next n steps when the process is in state i initially.
Find the limit of V_i^n / n as n tends to infinity.
You have to do a lot of calculations to see that the average reward per step does not depend on the starting state.
The initial changes are large, and then the changes become progressively smaller.
Policy for MDP
A policy for a MDP is a rule that prescribes the action to take in each state at each step.
A rule that will prescribe the action for all possible states at each step.
A decision maker can use such a rule to determine which action to take based on where they currently are in the planning horizon and which state the process is in.
Aim of the Policy: (Criterion)
- maximum expected total reward
- maximum expected discounted reward
- maximum long-run average reward per step.
Once we fix a policy, we essentially get a Markov Chain, so we can then find the long-run distribution of the states and hence the long-run average reward per step.
Stationary Policy
In a stationary Markov Decision Process, the states, actions, immediate rewards and transition probabilities are the same at every stage of the process.
Although the MDP is stationary, the state the process occupies changes over time; the rewards, actions and transition probabilities do not depend on time.
- Is a policy that prescribes a single action whenever the process is in state i.
- take the same action at any time in the planning horizon.
If we implement a stationary policy, the way the process evolves is then determined by a Markov Chain: by specifying the decision taken in each state, we specify the transition probabilities for each state.
We can then see how the process moves between the states over time, and we can say something about the properties of how the process will evolve using any of the methods we used for Markov Chains.
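As a sketch of this point, if a stationary policy delta picks action delta(i) in state i (the delta(i) notation is introduced here for illustration), the induced Markov Chain has transition probabilities and rewards

```latex
P_{ij} = p_{ij}^{\delta(i)}, \qquad r_i = r_i^{\delta(i)}
```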
Infinite Planning Horizon
We argue that infinite-horizon criteria might reasonably provide a good policy for a planning horizon which is uncertain in length or which is very long.
It is not certain that the policy we obtain will be applied forever. If there is a change to the underlying situation then we would recalculate the optimal policy.
So very often we solve the problem assuming an infinite planning horizon. We may get a stationary policy that is quite convenient to apply, but we only apply it for the first few decisions and then re-examine the situation and re-optimise the problem.
In general, expected total reward is not suitable as a decision criterion over an infinite planning horizon.
Suppose that r_i^k > 0 for all i and k. In each step we get a positive reward, so if we add all the rewards up we get an infinite total reward. In such a situation, under any policy over an infinite period the total reward will be infinite. As a result, good and bad policies look the same, which makes it impossible to say which policy is better. The problem is that expected total reward is not suitable for an infinite planning horizon.
This problem can be overcome through
- Long run average reward per step
- Expected discounted reward
- Long run average reward per step
V_i^n = maximum expected total reward over n steps when the process is in state i initially.
Maximum long-run average reward per step when the process is in state i initially
= limit of V_i^n / n as n tends to infinity.
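Written out as a formula (a sketch; g_i is introduced here simply as a label for the long-run average reward):

```latex
g_i = \lim_{n \to \infty} \frac{V_i^n}{n}
```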
In many cases the long-run average reward per step is independent of the starting state. This is similar to the long-run distribution of the state of a Markov Chain.
The decisions taken influence the transition probabilities, so it is trickier to specify conditions under which the long-run average reward does not depend on the starting state.
With this criterion the policy is driven by long-run behaviour. Since the emphasis is on the distant future, the policy may take decisions which are relatively more expensive in the short run in order to get into a good position that can be exploited in the long term.
If a decision maker does want such a policy then the long-run average reward per step should be used. This criterion is based on future behaviour, so the early rewards do not matter at all.
If the decision maker only wants to implement the policy for a few periods, this may not be a sensible criterion to use, as the early decisions can be very expensive.
- Expected discounted reward over an infinite horizon
Highest reward is C
Worst reward is -C
Rewards are bounded between -C and C
- This means that the maximum infinite-horizon expected discounted reward is finite
- because we apply the discount factor at every step
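A quick sketch of why bounded rewards plus discounting give a finite value (assuming 0 < beta < 1):

```latex
|v_i| \le \sum_{n=0}^{\infty} \beta^n C = \frac{C}{1-\beta} < \infty
```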
v_i = maximum expected discounted reward over an infinite horizon when the process is in state i initially
Let vi be the expected discounted reward over an infinite horizon using the stationary policy (delta) and starting in state i
Find a policy to maximise this; the optimal values satisfy the Optimality Equation.
Optimality Equation: the same as in the finite case but with the n and n-1 superscripts removed, so it is no longer an iterative recursion. In the finite case we can specify V^0 and use it to find V^1, but here we are solving directly for the v_i.
(It gets more difficult to solve.)
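A sketch of that equation in the notation used above:

```latex
v_i = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j} p_{ij}^k \, v_j \Big\} \quad \text{for all } i
```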
In contrast with the long-run average reward per step, with this criterion (the expected discounted reward over an infinite horizon) it is clear that the early rewards are the most important ones.
How do you define 'early'?
It depends on the value of beta: as beta gets closer to 1, rewards remain significant for longer into the future; as beta gets closer to 0, only the rewards over a shorter time matter.
The maximising policy delta is called the optimal policy.
2 possible methods to find the optimal values: 1. Value Iteration, 2. Policy Iteration
Value Iteration - Successive Approximations
Coursework
Simplest
Solve a finite planning horizon problem, but solve it repeatedly for longer and longer planning horizons: solve the 1-step problem, then extend it to a 2-step problem, then a 3-step problem, and keep going.
Step 1: Initialisation: set V_i^0 = 0 for all states i.
Step 2: Iteration: calculate V_i^n from the optimality equation for the finite stationary discounted MDP, keeping a note of the action that maximises the value of V_i^n.
Step 3: Stopping Criterion: if the stopping criterion is not satisfied, set n = n+1 and repeat from Step 2. As n increases, V_i^n gets very close to the optimal value, and at that point the policy we obtain is very close to the stationary optimal policy.
Issue: how many iterations do we have to do before we can be satisfied with the policy, or at least with a good approximation of the optimal policy?
Solution: bounds on value iteration. The values calculated at each iteration of the value iteration algorithm can be used to obtain bounds on the optimal values of the infinite-horizon policy.
- the key quantity is the maximum absolute difference between the optimal values and the values calculated at iteration n of the process
- this tells us how close we are to the optimal value; E_n is the error at the last iteration
- the gap between the bounds is 2 E_n
- once E_n is effectively 0 (e.g. within 0.0005 of the optimal value) we can stop
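A minimal Python sketch of this value iteration scheme for a stationary discounted MDP. The data layout (rewards r[i][k], transition probabilities p[i][k][j]) and the simple stopping rule based on the largest change between successive value functions are assumptions for illustration, not taken from the notes:

```python
def value_iteration(r, p, beta, tol=0.0005):
    """Successive approximations for a stationary discounted MDP.

    r[i][k]    : immediate reward when action k is taken in state i
    p[i][k][j] : probability of moving from state i to state j under action k
    beta       : discount factor, 0 < beta < 1
    tol        : stop once successive value functions differ by less than tol
    """
    n_states = len(r)
    v = [0.0] * n_states  # Step 1: V_i^0 = 0 for every state i
    while True:
        v_new, policy = [], []
        for i in range(n_states):
            # Step 2: optimality equation, noting the maximising action
            values = [
                r[i][k] + beta * sum(p[i][k][j] * v[j] for j in range(n_states))
                for k in range(len(r[i]))
            ]
            best_k = max(range(len(values)), key=lambda k: values[k])
            policy.append(best_k)
            v_new.append(values[best_k])
        # Step 3: stopping criterion based on the largest change E_n
        e_n = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if e_n < tol:
            return v, policy
```

The tolerance of 0.0005 here simply echoes the accuracy mentioned above; the exact bound used in the notes may differ.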
Policy Iteration
- Find a stationary policy
- Try to improve the policy
- Evaluate and if you can improve, then repeat
Step 1: Initialisation: start the process with a stationary policy delta_1 and let n = 1.
Step 2: Policy Evaluation: for all i, calculate v_i, the expected discounted infinite-horizon reward starting in state i and following policy delta_n.
Step 3: Policy Improvement: look for a stationary policy that is better than delta_n.
For each state i, calculate delta_{n+1}(i), the action that maximises the expected discounted reward (in a two-action example this means calculating two expected rewards and keeping the better action); the resulting policy is the one we choose next.
Step 4: Stopping Criterion:
If delta_{n+1} = delta_n, then delta_n is an optimal policy; otherwise let n = n+1 and repeat from Step 2.
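A minimal Python sketch of these four steps, assuming the same r[i][k] / p[i][k][j] layout as the value iteration sketch above. Step 2 is done here by simple fixed-point iteration rather than by solving the linear equations exactly; that choice is an assumption for illustration:

```python
def policy_iteration(r, p, beta):
    """Policy iteration for a stationary discounted MDP (same data layout as above)."""
    n_states = len(r)
    policy = [0] * n_states  # Step 1: start from an arbitrary stationary policy

    def evaluate(pol):
        # Step 2: policy evaluation by fixed-point iteration on v = r_delta + beta * P_delta v
        v = [0.0] * n_states
        while True:
            v_new = [
                r[i][pol[i]] + beta * sum(p[i][pol[i]][j] * v[j] for j in range(n_states))
                for i in range(n_states)
            ]
            if max(abs(a - b) for a, b in zip(v_new, v)) < 1e-10:
                return v_new
            v = v_new

    while True:
        v = evaluate(policy)
        # Step 3: policy improvement - pick the maximising action in each state
        new_policy = []
        for i in range(n_states):
            values = [
                r[i][k] + beta * sum(p[i][k][j] * v[j] for j in range(n_states))
                for k in range(len(r[i]))
            ]
            new_policy.append(max(range(len(values)), key=lambda k: values[k]))
        # Step 4: stop when the policy no longer changes, otherwise repeat
        if new_policy == policy:
            return v, policy
        policy = new_policy
```

Because the improvement step can only change the policy finitely many times, the outer loop terminates after a finite number of iterations, which matches the advantage noted below.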
Advantage:
The algorithm is guaranteed to terminate in a finite number of iterations.
Disadvantage:
Each iteration involves a lot of calculation: evaluating the current policy and then improving it. Compared with value iteration, each individual iteration is much more work.
Value Iteration and Policy Iteration
Both methods find the unique values v_i for all i, or good approximations to them.
- value iteration takes many iterations to converge compared with policy iteration
- each iteration of policy iteration takes a lot of calculation, as you have to evaluate the policy and then improve it
- in reality policy iteration also only gives a good approximation of the optimal values, because numerical errors such as rounding errors mean that you get an approximation
- value iteration only aims to find a good approximation in the first place
- Both methods find a good approximation of the stationary optimal policy.