MDP Flashcards

1
Q

What’s the formal definition of a Markov Decision Process?

A

A tuple of:

  1. S, the state space
  2. A, the action space
  3. P^a_{ss'}, the transition probability matrix: the probability of moving from state s to state s' when action a is taken
  4. gamma, the discount factor
  5. R^a_{ss'}, the immediate reward received on the transition from s to s' under action a
  6. pi(a|s), the policy: the probability of taking action a in state s (it can also be deterministic)
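
For concreteness, a minimal sketch of how this tuple can be held in code, using numpy arrays; the sizes and names here (n_states, n_actions, P, R, pi) are made up for illustration and are not part of the card:

# Sketch of an MDP stored as numpy arrays; all values are illustrative.
import numpy as np

n_states, n_actions = 3, 2

# P[a, s, s'] = probability of moving from s to s' when taking action a.
P = np.full((n_actions, n_states, n_states), 1.0 / n_states)

# R[a, s, s'] = immediate reward for the transition s -> s' under action a.
R = np.zeros((n_actions, n_states, n_states))
R[:, :, 0] = 1.0          # e.g. every transition into state 0 pays a reward of 1

gamma = 0.9               # discount factor

# pi[s, a] = probability of taking action a in state s (here: uniform random).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Sanity checks: transition rows and policy rows must each sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
assert np.allclose(pi.sum(axis=1), 1.0)
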
2
Q

How is the value function for an MDP defined?

A

V^{pi}(s) = E_pi[ R_t | S_t = s ]

i.e. the expected return R_t = r_{t+1} + gamma r_{t+2} + gamma^2 r_{t+3} + ... when starting in state s and then following policy pi.
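
Because V^{pi}(s) is just an expectation over returns, it can be approximated by sampling: simulate many episodes that start in s and follow pi, then average their discounted returns. A minimal Monte Carlo sketch in Python, using a small made-up two-state MDP (all arrays and names here are illustrative, not part of the card):

# Monte Carlo sketch of V^pi(s) = E_pi[R_t | S_t = s]; the toy MDP is made up.
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.8, 0.2], [0.3, 0.7]],     # P[a, s, s']
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],     # R[a, s, s']
              [[0.5, 0.5], [1.5, 0.0]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])     # pi[s, a], uniform random policy
gamma = 0.9

def sampled_return(s, horizon=200):
    """One discounted return from a single simulated episode starting in s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):                 # truncate: gamma^200 is negligible
        a = rng.choice(2, p=pi[s])
        s_next = rng.choice(2, p=P[a, s])
        g += discount * R[a, s, s_next]
        discount *= gamma
        s = s_next
    return g

# Average many sampled returns to approximate the expectation.
estimate = np.mean([sampled_return(0) for _ in range(2000)])
print("Monte Carlo estimate of V^pi(0):", estimate)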

3
Q

Why is the first term in the MRP value function R_s, while the first term of the MDP value function is the expected value E[ r_{t+1} | S_t = s ]?

A

The reward of an MDP depends on the state s, the action a and the next state s'. The action is drawn from the policy and the next state from the transition probabilities, so the immediate reward is a random variable and we have to take its expected value. An MRP has no actions, so the expected immediate reward can be summarised as a single number R_s per state.
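
Written out in the notation of these cards, that expectation expands to:

E[ r_{t+1} | S_t = s ] = sum_a pi(a|s) sum_{s'} P^a_{ss'} R^a_{ss'}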

4
Q

What’s the unpacked form of V^{pi}(s)?

A

V^{pi}(s) = sum_a pi(a|s) sum_{s'} P^a_{ss'} ( R^a_{ss'} + gamma V^{pi}(s') )
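
A direct line-by-line translation of this backup for a single state s, written as a Python sketch; the arrays P[a, s, s'], R[a, s, s'], pi[s, a], the current estimate V and the discount gamma are assumed to be numpy arrays shaped as in the earlier sketch:

# One Bellman expectation backup for a single state s (sketch only;
# P, R, pi, V are assumed to be numpy arrays, gamma a float).
def bellman_backup(s, P, R, pi, V, gamma):
    total = 0.0
    for a in range(pi.shape[1]):                 # sum over actions a
        inner = 0.0
        for s_next in range(P.shape[2]):         # sum over next states s'
            inner += P[a, s, s_next] * (R[a, s, s_next] + gamma * V[s_next])
        total += pi[s, a] * inner
    return total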

5
Q

Describe the iterative policy evaluation algorithm

A

initialize V(s) to 0 for all s

repeat:
    for each s in S:
        V(s) = sum_a pi(a|s) sum_{s'} P^a_{ss'} ( R^a_{ss'} + gamma V(s') )
until the termination criterion is reached (e.g. the largest change in V over all states falls below a small threshold).
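
A runnable Python sketch of the synchronous version, using the same kind of small made-up MDP as above (arrays P[a, s, s'], R[a, s, s'], pi[s, a]); the threshold theta is an arbitrary choice:

# Iterative policy evaluation (synchronous sweeps) -- sketch with a toy MDP.
import numpy as np

P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
gamma, theta = 0.9, 1e-8                       # theta: convergence threshold

n_actions, n_states, _ = P.shape
V = np.zeros(n_states)                         # initialize V(s) = 0 for all s

while True:
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in range(n_states):
                V_new[s] += pi[s, a] * P[a, s, s_next] * (
                    R[a, s, s_next] + gamma * V[s_next])
    delta = np.max(np.abs(V_new - V))          # largest change this sweep
    V = V_new
    if delta < theta:                          # termination criterion
        break

print("V^pi =", V)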

6
Q

Describe the difference between synchronous and asynchronous updates in the iterative algorithm

A

Asynchronous (in-place) updates overwrite V(s) as soon as each new value is computed, so later backups in the same sweep already use the updated values; synchronous updates compute all new values from the old value function and only replace it once the whole sweep is finished.
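
A sketch of the difference inside one sweep, reusing the bellman_backup helper and the toy arrays from the earlier sketches:

# Synchronous sweep: every backup reads the old value function V,
# and V is replaced only after the whole sweep is done.
V_new = V.copy()
for s in range(n_states):
    V_new[s] = bellman_backup(s, P, R, pi, V, gamma)
V = V_new

# Asynchronous (in-place) sweep: V(s) is overwritten immediately, so later
# backups in the same sweep already see the updated values.
for s in range(n_states):
    V[s] = bellman_backup(s, P, R, pi, V, gamma)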

7
Q

What is the state-action value function, Q^{pi}(s,a)?

A

It describes how good it is to take action a in state s and then follow policy pi. It is connected to the (state) value function by:

V^{pi}(s) = sum_a pi(a|s) Q^{pi}(s,a)
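
Q^{pi} itself unpacks in the same way as V^{pi}:

Q^{pi}(s,a) = sum_{s'} P^a_{ss'} ( R^a_{ss'} + gamma V^{pi}(s') )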

8
Q

Does there always exist an optimal policy in MDPs?

A

Yes, there is always at least one optimal policy, but there may be more than one.

9
Q

How is an optimal policy defined?

A

A policy pi* is optimal if V_{pi*}(s) >= V_{pi}(s) for every other policy pi and every state s.

10
Q

What does the Bellman optimality equation for V state for an MDP?

A

The value of a state under an optimal policy must equal the expected return obtained by taking the best action in that state (and acting optimally thereafter).
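
In the notation of the earlier cards:

V*(s) = max_a sum_{s'} P^a_{ss'} ( R^a_{ss'} + gamma V*(s') )

Instead of averaging over the policy's action probabilities, as in the Bellman expectation equation, the optimal value takes the maximum over actions.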
