MDP Flashcards
What’s the formal definition of a Markov Decision Process?
A tuple of:
- S, State space
- A, Action space
- P^a_ss’, the transition probability of going from state s to s’ when taking action a
- gamma, the discount factor
- R^a_ss’, the immediate reward received when transitioning from s to s’ under action a.
- pi(a|s), the policy: the probability of taking action a in state s; it can also be deterministic.
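As a concrete illustration, here is a minimal Python sketch of how such a tuple might be represented; the two-state toy MDP, its action names, and its reward values are invented for the example, not part of the definition.

```python
from dataclasses import dataclass

# Hypothetical container for the MDP tuple (S, A, P, R, gamma).
# P[s][a][s_next] holds P^a_ss', R[s][a][s_next] holds R^a_ss'.
@dataclass
class MDP:
    states: list    # S, the state space
    actions: list   # A, the action space
    P: dict         # transition probabilities P^a_ss'
    R: dict         # immediate rewards R^a_ss'
    gamma: float    # discount factor

toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    P={
        "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
        "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
    },
    R={
        "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0, "s0": 0.0}},
        "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0}},
    },
    gamma=0.9,
)

# A stochastic policy pi(a|s): probability of each action in each state.
policy = {s: {"stay": 0.5, "go": 0.5} for s in toy.states}
```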
How is the value function for an MDP defined?
V^{pi}(s) = E_pi[R_t | S_t = s], the expected return R_t when starting in state s and following policy pi.
Why is the first term in the MRP value function R_s while the first term of the MDP value function is E[r_t+1 | S_t = s] (an expected value)?
The reward of an MDP depends on the state (s), the action (a) and the next state (s’). The action is drawn from the policy and the next state from the transition probabilities, so the immediate reward is a random variable and we have to take its expected value.
What’s the unpacked form of V^{pi}(s)
V^{pi}(s) = sum_a pi(a|s) sum_{s’} P^a_ss’ (R^a_ss’ + gamma V^{pi}(s’))
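As a hedged sketch (the dictionary layout P[s][a][s’], R[s][a][s’] and policy[s][a] is an assumption made for illustration), this is how the unpacked form can be computed for a single state:

```python
def bellman_expectation(s, V, policy, P, R, gamma):
    """Return sum_a pi(a|s) sum_s' P^a_ss' (R^a_ss' + gamma V(s'))."""
    total = 0.0
    for a, pi_a in policy[s].items():          # outer sum over actions
        for s_next, p in P[s][a].items():      # inner sum over next states
            total += pi_a * p * (R[s][a][s_next] + gamma * V[s_next])
    return total
```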
Describe the iterative policy evaluation algorithm
initialize V(s) to 0 for all s
repeat:
for each s in S:
V(s) = sum_a pi(a|s) sum_{s’} P^a_ss’ (R^a_ss’ + gamma V(s’))
until a termination criterion is met (e.g., the maximum change in V(s) over all states falls below a small threshold).
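A runnable sketch of this loop on a small hypothetical two-state MDP (the states, actions, rewards and tolerance are illustrative choices, not from the card):

```python
# P[s][a][s_next] = transition probability, R[s][a][s_next] = immediate reward.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0}},
}
policy = {s: {"stay": 0.5, "go": 0.5} for s in P}   # pi(a|s)
gamma, tol = 0.9, 1e-8

V = {s: 0.0 for s in P}          # initialize V(s) to 0 for all s
while True:
    V_new = {}
    for s in P:                  # sweep over all states
        V_new[s] = sum(
            pi_a * p * (R[s][a][s_next] + gamma * V[s_next])
            for a, pi_a in policy[s].items()
            for s_next, p in P[s][a].items()
        )
    delta = max(abs(V_new[s] - V[s]) for s in P)
    V = V_new                    # synchronous update: swap in the new values
    if delta < tol:              # termination criterion
        break

print(V)                         # approximate V^pi(s) for each state
```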
Describe the difference between synchronous and asynchronous updates in the iterative algorithm
Asynchronous (in-place) updates overwrite V(s) immediately as each state is processed, so later updates within the same sweep already use the new values; synchronous updates compute all new values from the old value function and only then replace it.
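The difference can be sketched as two sweep variants; backup(s, V) stands in for the Bellman expectation update from the previous card and is a hypothetical helper here:

```python
def sweep_synchronous(V, states, backup):
    # Every new value is computed from the old V; V is replaced only afterwards.
    return {s: backup(s, V) for s in states}

def sweep_asynchronous(V, states, backup):
    # V is overwritten in place, so later backups in the sweep see new values.
    for s in states:
        V[s] = backup(s, V)
    return V
```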
what is the state-action value function, Q^pi(s,a)?
It describes the expected return from taking action a in state s and then following policy pi. It is connected to the (state) value function by:
V^{pi}(s) = sum_a pi(a|s) Q^{pi}(s,a)
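A small sketch of both quantities, again assuming the illustrative P[s][a][s’] / R[s][a][s’] dictionaries and stochastic policy used above:

```python
def q_value(s, a, V, P, R, gamma):
    """Q^pi(s, a) = sum_s' P^a_ss' (R^a_ss' + gamma V^pi(s'))."""
    return sum(p * (R[s][a][s_next] + gamma * V[s_next])
               for s_next, p in P[s][a].items())

def v_from_q(s, V, policy, P, R, gamma):
    """V^pi(s) = sum_a pi(a|s) Q^pi(s, a)."""
    return sum(pi_a * q_value(s, a, V, P, R, gamma)
               for a, pi_a in policy[s].items())
```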
Does there always exist an optimal policy in MDPS?
Yes, there is always at least one optimal policy, though there may be several (they all achieve the same optimal value function).
How is an optimal policy defined?
A policy pi* is optimal if V_{pi*}(s) >= V_{pi}(s) for every other policy pi and every state s.
What does the Bellman optimality equation for V state for an MDP?
The value of a state under an optimal policy must equal the expected return of the best action in that state: V*(s) = max_a sum_{s’} P^a_ss’ (R^a_ss’ + gamma V*(s’)).
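As a sketch, the corresponding backup (the core step of value iteration) replaces the policy-weighted sum over actions with a max; the data layout is the same illustrative one as above:

```python
def optimality_backup(s, V, P, R, gamma):
    """V*(s) = max_a sum_s' P^a_ss' (R^a_ss' + gamma V*(s'))."""
    return max(
        sum(p * (R[s][a][s_next] + gamma * V[s_next])
            for s_next, p in P[s][a].items())
        for a in P[s]
    )
```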