MDP Flashcards
What’s the formal definition of a Markov Decision Process?
A tuple of:
- S, State space
- A, Action space
- P^a_ss’, the transition probability of going from state s to s’ when taking action a
- gamma, the discount factor
- R^a_ss’, the immediate reward received when transitioning from s to s’ under action a.
- pi(a|s), the policy: the probability of taking action a in state s; it can also be deterministic.
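As a concrete illustration, here is a minimal Python sketch of how such a tuple might be represented; the two-state toy MDP, its action names, and its reward values are invented for the example, not part of the definition.

```python
from dataclasses import dataclass

# Hypothetical container for the MDP tuple (S, A, P, R, gamma).
# P[s][a][s_next] holds P^a_ss', R[s][a][s_next] holds R^a_ss'.
@dataclass
class MDP:
    states: list    # S, the state space
    actions: list   # A, the action space
    P: dict         # transition probabilities P^a_ss'
    R: dict         # immediate rewards R^a_ss'
    gamma: float    # discount factor

toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    P={
        "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
        "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
    },
    R={
        "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0, "s0": 0.0}},
        "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0}},
    },
    gamma=0.9,
)

# A stochastic policy pi(a|s): probability of each action in each state.
policy = {s: {"stay": 0.5, "go": 0.5} for s in toy.states}
```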
How is the value function for an MDP defined?
V^{pi}(s) = E_pi[R_t | S_t = s], the expected return R_t when starting in state s and following policy pi.
Why is the first term in the MRP value function R_s while the first term of the MDP value function is E[r_t+1 | S_t = s] (an expected value)?
The reward of an MDP depends on the state (s), the action (a) and the next state (s’). The action is drawn from the policy and the next state from the transition probabilities, so the immediate reward is a random variable and we have to take its expected value.
What’s the unpacked form of V^{pi}(s)
V^{pi}(s) = sum_a pi(a|s) sum_{s’} P^a_ss’ (R^a_ss’ + gamma V^{pi}(s’))
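As a hedged sketch (the dictionary layout P[s][a][s’], R[s][a][s’] and policy[s][a] is an assumption made for illustration), this is how the unpacked form can be computed for a single state:

```python
def bellman_expectation(s, V, policy, P, R, gamma):
    """Return sum_a pi(a|s) sum_s' P^a_ss' (R^a_ss' + gamma V(s'))."""
    total = 0.0
    for a, pi_a in policy[s].items():          # outer sum over actions
        for s_next, p in P[s][a].items():      # inner sum over next states
            total += pi_a * p * (R[s][a][s_next] + gamma * V[s_next])
    return total
```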
Describe the iterative policy evaluation algorithm
initialize V(s) to 0 for all s
repeat:
for each s in S:
V(s) = sum_a pi(a|s) sum_{s’} P^a_ss’ (R^a_ss’ + gamma V(s’))
until a termination criterion is met (e.g., the maximum change in V(s) over all states falls below a small threshold).
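A runnable sketch of this loop on a small hypothetical two-state MDP (the states, actions, rewards and tolerance are illustrative choices, not from the card):

```python
# P[s][a][s_next] = transition probability, R[s][a][s_next] = immediate reward.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0}},
}
policy = {s: {"stay": 0.5, "go": 0.5} for s in P}   # pi(a|s)
gamma, tol = 0.9, 1e-8

V = {s: 0.0 for s in P}          # initialize V(s) to 0 for all s
while True:
    V_new = {}
    for s in P:                  # sweep over all states
        V_new[s] = sum(
            pi_a * p * (R[s][a][s_next] + gamma * V[s_next])
            for a, pi_a in policy[s].items()
            for s_next, p in P[s][a].items()
        )
    delta = max(abs(V_new[s] - V[s]) for s in P)
    V = V_new                    # synchronous update: swap in the new values
    if delta < tol:              # termination criterion
        break

print(V)                         # approximate V^pi(s) for each state
```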
Describe the difference between synchronous and asynchronous updates in the iterative algorithm
Asynchronous (in-place) updates overwrite V(s) immediately as each state is processed, so later updates within the same sweep already use the new values; synchronous updates compute all new values from the old value function and only then replace it.
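The difference can be sketched as two sweep variants; backup(s, V) stands in for the Bellman expectation update from the previous card and is a hypothetical helper here:

```python
def sweep_synchronous(V, states, backup):
    # Every new value is computed from the old V; V is replaced only afterwards.
    return {s: backup(s, V) for s in states}

def sweep_asynchronous(V, states, backup):
    # V is overwritten in place, so later backups in the sweep see new values.
    for s in states:
        V[s] = backup(s, V)
    return V
```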
what is the state-action value function, Q^pi(s,a)?
It describes the expected return from taking action a in state s and then following policy pi. It is connected to the (state) value function by:
V^{pi}(s) = sum_a pi(a|s) Q^{pi}(s,a)
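A small sketch of both quantities, again assuming the illustrative P[s][a][s’] / R[s][a][s’] dictionaries and stochastic policy used above:

```python
def q_value(s, a, V, P, R, gamma):
    """Q^pi(s, a) = sum_s' P^a_ss' (R^a_ss' + gamma V^pi(s'))."""
    return sum(p * (R[s][a][s_next] + gamma * V[s_next])
               for s_next, p in P[s][a].items())

def v_from_q(s, V, policy, P, R, gamma):
    """V^pi(s) = sum_a pi(a|s) Q^pi(s, a)."""
    return sum(pi_a * q_value(s, a, V, P, R, gamma)
               for a, pi_a in policy[s].items())
```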
Does there always exist an optimal policy in MDPS?
Yes, there is always at least one optimal policy, though there may be several (they all achieve the same optimal value function).
How is an optimal policy defined?
A policy pi* is optimal if V_{pi*}(s) >= V_{pi}(s) for every other policy pi and every state s.
What does the Bellman optimality equation for V state for an MDP?
The value of a state under an optimal policy must equal the expected return of the best action in that state: V*(s) = max_a sum_{s’} P^a_ss’ (R^a_ss’ + gamma V*(s’)).
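As a sketch, the corresponding backup (the core step of value iteration) replaces the policy-weighted sum over actions with a max; the data layout is the same illustrative one as above:

```python
def optimality_backup(s, V, P, R, gamma):
    """V*(s) = max_a sum_s' P^a_ss' (R^a_ss' + gamma V*(s'))."""
    return max(
        sum(p * (R[s][a][s_next] + gamma * V[s_next])
            for s_next, p in P[s][a].items())
        for a in P[s]
    )
```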