MP and MRP Flashcards
What is the formal definition of a Markov process?
A Markov process is a tuple (S, P):
1) S, the state space
2) P_ss', the state transition probability matrix
On what does the probability of the next state, S_t+1, depend?
Only on the current state S_t. It is independent of all earlier states (the Markov property).
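A minimal sketch of this idea (the 3-state chain and its probabilities are made up for illustration, using numpy): the transition matrix P_ss' is a row-stochastic matrix, and the next state is sampled using only the current state.

```python
import numpy as np

# Hypothetical 3-state Markov process; each row of P must sum to 1.
# P[s, s'] is the probability of moving from state s to state s'.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
])

rng = np.random.default_rng(0)

def step(s):
    # The distribution over the next state depends only on the current
    # state s (the Markov property); no earlier states are consulted.
    return rng.choice(len(P), p=P[s])

state = 0
trajectory = [state]
for _ in range(5):
    state = step(state)
    trajectory.append(state)
print(trajectory)
```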
What is the formal definition of a Markov Reward Process?
A Markov Reward Process is a tuple (S, P, R, gamma):
1) S, state space
2) P_ss', the state transition probability matrix
3) R_s, the immediate reward for leaving state s
4) gamma, the discount factor, usually in the range [0, 1]
Notice that this R is the "fancy" (calligraphic) R. A plain R is usually used for the return and should not be confused with the immediate reward.
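A minimal sketch, reusing the hypothetical 3-state chain from above and adding the reward vector and discount factor that turn it into an MRP:

```python
import numpy as np

# Hypothetical 3-state MRP: S = {0, 1, 2}.
P = np.array([
    [0.9, 0.1, 0.0],   # P[s, s'] = state transition probabilities
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
])
R = np.array([1.0, 0.0, -2.0])  # "fancy" R: immediate reward for leaving each state
gamma = 0.9                     # discount factor in [0, 1]
```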
Why is the discount factor usually smaller than 1?
1) Mathematically convenient (the return converges; see the sketch after this list)
2) Avoids infinite returns in cyclic Markov processes
3) Sometimes reflects reality, e.g. financial rewards (money now is worth more than money later)
4) Models uncertainty about future rewards
5) Research into human and animal decision making shows a preference for immediate rewards
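A small sketch of point 1: with gamma < 1 and bounded rewards, the return is a convergent geometric series (the constant reward of 1 per step is just an illustration):

```python
# Constant reward r per step: the return r + gamma*r + gamma^2*r + ...
# is a geometric series with limit r / (1 - gamma) when gamma < 1.
r, gamma = 1.0, 0.9
total = sum((gamma ** t) * r for t in range(1000))
print(total)            # ~10.0, already converged
print(r / (1 - gamma))  # closed-form limit: 10.0
```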
What is the idea of the Bellman equation?
The return can be decomposed into the immediate reward plus the discounted return from the next state onward, so the value of a state can be expressed in terms of the values of its successor states.
What are the most important forms of the Bellman equation for MRPs?
1) Expectation form: v(s) = E[r_t+1 + gamma*v(S_t+1) | S_t = s]
2) Sum form: v(s) = R_s + gamma * sum_s' [P_ss' * v(s')]
3) Vector form: v = R + gamma*P_ss'*v, where R is the vector of immediate rewards
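A minimal sketch of the sum form as code, assuming numpy arrays P, R and a current value estimate v (the function name bellman_backup is just a label for this card):

```python
import numpy as np

def bellman_backup(s, v, P, R, gamma):
    # Sum form for a single state: v(s) = R_s + gamma * sum_s' P_ss' * v(s')
    return R[s] + gamma * np.dot(P[s], v)
```

Applying this backup to every state at once is exactly the vector form v = R + gamma*P_ss'*v.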
Define the Return
R_t = r_t+1 + gamma*r_t+2 + gamma^2*r_t+3 + ...
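A small sketch computing this return for a sampled reward sequence (the rewards in the example call are made up):

```python
def discounted_return(rewards, gamma):
    # rewards = [r_t+1, r_t+2, ...]; returns r_t+1 + gamma*r_t+2 + gamma^2*r_t+3 + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, -2.0, 3.0], gamma=0.9))
```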
Define the state-value function
v(s) = E[R_t |S_t = s], R is the return
What is the direct solution of the Bellman equation for MRPs using the vector form?
v = (I - gamma*P_ss')^(-1) * R, where R is the vector of immediate rewards
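A minimal sketch of the direct solution with numpy, using a linear solver rather than forming the inverse explicitly; P, R and gamma are the hypothetical MRP from the earlier sketch:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, 0.0, -2.0])
gamma = 0.9

# Solve (I - gamma*P) v = R instead of computing the inverse directly.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```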
Why can't we always use the direct solution of the Bellman equation?
Computing the matrix inverse is O(n^3) in the number of states, which is too computationally expensive for large state spaces.
What 3 main iterative methods do we have for solving the Bellman equation?
1) Monte-Carlo evaluation
2) Dynamic Programming
3) Temporal-difference learning
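A minimal sketch of the dynamic-programming flavour of these: repeatedly apply the vector-form backup v <- R + gamma*P*v until it stops changing (P, R, gamma are again the hypothetical MRP from above):

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, 0.0, -2.0])
gamma = 0.9

v = np.zeros(len(R))
for _ in range(1000):
    v_new = R + gamma * P @ v       # Bellman backup in vector form
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)  # matches the direct solution (I - gamma*P)^(-1) R
```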
What happens when gamma, the discount factor, is close to 0 versus close to 1?
Close to 0 leads to a “short-sighted” evaluation while close to 1 leads to a “far-sighted” evaluation