MP and MRP Flashcards
What is the formal definition of a Markov process?
A Markov process is a tuple (S, P):
1) S, the state space
2) P_ss', the state transition probability matrix
On what does the probability of the next state, S_t+1, depend?
Only on the current state S_t. It is independent of all earlier states (the Markov property).
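A minimal sketch of this idea (the 3-state chain and its probabilities are made up for illustration, using numpy): the transition matrix P_ss' is a row-stochastic matrix, and the next state is sampled using only the current state.

```python
import numpy as np

# Hypothetical 3-state Markov process; each row of P must sum to 1.
# P[s, s'] is the probability of moving from state s to state s'.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
])

rng = np.random.default_rng(0)

def step(s):
    # The distribution over the next state depends only on the current
    # state s (the Markov property); no earlier states are consulted.
    return rng.choice(len(P), p=P[s])

state = 0
trajectory = [state]
for _ in range(5):
    state = step(state)
    trajectory.append(state)
print(trajectory)
```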
What is the formal definition of a Markov Reward Process?
A Markov Reward Process is a tuple (S, P, R, gamma):
1) S, state space
2) P_ss', the state transition probability matrix
3) R_s, the immediate reward for leaving state s
4) gamma, the discount factor, usually in the range [0, 1]
Notice that this R is the "fancy" (calligraphic) R. A plain R is usually used for the return and should not be confused with the immediate reward.
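A minimal sketch, reusing the hypothetical 3-state chain from above and adding the reward vector and discount factor that turn it into an MRP:

```python
import numpy as np

# Hypothetical 3-state MRP: S = {0, 1, 2}.
P = np.array([
    [0.9, 0.1, 0.0],   # P[s, s'] = state transition probabilities
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
])
R = np.array([1.0, 0.0, -2.0])  # "fancy" R: immediate reward for leaving each state
gamma = 0.9                     # discount factor in [0, 1]
```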
Why is the discount factor usually smaller than 1?
1) Mathematically convenient (the return converges; see the sketch after this list)
2) Avoids infinite returns in cyclic Markov processes
3) Sometimes reflects reality, e.g. financial rewards (money now is worth more than money later)
4) Models uncertainty about future rewards
5) Research into human and animal decision making shows a preference for immediate rewards
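A small sketch of point 1: with gamma < 1 and bounded rewards, the return is a convergent geometric series (the constant reward of 1 per step is just an illustration):

```python
# Constant reward r per step: the return r + gamma*r + gamma^2*r + ...
# is a geometric series with limit r / (1 - gamma) when gamma < 1.
r, gamma = 1.0, 0.9
total = sum((gamma ** t) * r for t in range(1000))
print(total)            # ~10.0, already converged
print(r / (1 - gamma))  # closed-form limit: 10.0
```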
What is the idea of the Bellman equation?
The return can be decomposed into the immediate reward plus the discounted return from the next state onward, so the value of a state can be expressed in terms of the values of its successor states.
What are the most important forms of the Bellman equation for MRPs?
1) Expectation form: v(s) = E[r_t+1 + gamma*v(S_t+1) | S_t = s]
2) Sum form: v(s) = R_s + gamma * sum_s' [P_ss' * v(s')]
3) Vector form: v = R + gamma*P_ss'*v, where R is the vector of immediate rewards
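A minimal sketch of the sum form as code, assuming numpy arrays P, R and a current value estimate v (the function name bellman_backup is just a label for this card):

```python
import numpy as np

def bellman_backup(s, v, P, R, gamma):
    # Sum form for a single state: v(s) = R_s + gamma * sum_s' P_ss' * v(s')
    return R[s] + gamma * np.dot(P[s], v)
```

Applying this backup to every state at once is exactly the vector form v = R + gamma*P_ss'*v.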
Define the Return
R_t = r_t+1 + gamma*r_t+2 + gamma^2*r_t+3 + ...
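A small sketch computing this return for a sampled reward sequence (the rewards in the example call are made up):

```python
def discounted_return(rewards, gamma):
    # rewards = [r_t+1, r_t+2, ...]; returns r_t+1 + gamma*r_t+2 + gamma^2*r_t+3 + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, -2.0, 3.0], gamma=0.9))
```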
Define the state-value function
v(s) = E[R_t |S_t = s], R is the return
What is the direct solution of the Bellman equation for MRPs using the vector form?
v = (I - gamma*P_ss')^(-1) * R, where R is the vector of immediate rewards
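A minimal sketch of the direct solution with numpy, using a linear solver rather than forming the inverse explicitly; P, R and gamma are the hypothetical MRP from the earlier sketch:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, 0.0, -2.0])
gamma = 0.9

# Solve (I - gamma*P) v = R instead of computing the inverse directly.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```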
Why can't we always use the direct solution of the Bellman equation?
Computing the matrix inverse is O(n^3) in the number of states, which is too computationally expensive for large state spaces.
What 3 main iterative methods do we have for solving the Bellman equation?
1) Monte-Carlo evaluation
2) Dynamic Programming
3) Temporal-difference learning
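A minimal sketch of the dynamic-programming flavour of these: repeatedly apply the vector-form backup v <- R + gamma*P*v until it stops changing (P, R, gamma are again the hypothetical MRP from above):

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, 0.0, -2.0])
gamma = 0.9

v = np.zeros(len(R))
for _ in range(1000):
    v_new = R + gamma * P @ v       # Bellman backup in vector form
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)  # matches the direct solution (I - gamma*P)^(-1) R
```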
What happens when gamma, the discount factor, is close to 0 versus close to 1?
Close to 0 leads to a “short-sighted” evaluation while close to 1 leads to a “far-sighted” evaluation