Module 8: Markov Decision Process Flashcards
Which of the following statements is true of a Markov decision process or MDP?
An MDP is defined for a fully observable, stochastic environment.
A solution or policy must specify what the agent should do for all states that it can reach.
Discounted rewards are not absolutely necessary, but a proper policy (one guaranteed to reach a terminal state) might not always exist, in which case discounting is needed to keep utilities finite; see, for example, Figure 17.2 (bottom right), and the equation below.
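As a quick reference for why discounting matters (written here in AIMA-style notation, which these cards appear to follow), the utility of executing policy π from state s is the expected discounted sum of rewards:

$$U^{\pi}(s) \;=\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(S_t)\right], \qquad 0 < \gamma \le 1.$$

With γ < 1 this sum is bounded even over infinite state sequences; with γ = 1 it stays finite only when a proper policy exists, which, as the card notes, is not always the case.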
Select all of the following components that the policy iteration algorithm's equation takes into account.
The probability of entering a state S' from state S after performing action A.
The utility associated with a state S.
All of the actions A that an agent can take.
The maximum reward that can be obtained after being in state S is used in value iteration, not in policy iteration (see the equations below).
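The equations behind this card, in AIMA-style notation (an assumption on my part, since the slides' exact notation isn't shown here): policy iteration alternates policy evaluation,

$$U^{\pi_i}(s) \;=\; R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U^{\pi_i}(s'),$$

with policy improvement,

$$\pi_{i+1}(s) \;=\; \operatorname*{argmax}_{a} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s').$$

Note that the evaluation step has no max over actions; the max appears only in value iteration's Bellman update, $U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, U_i(s')$, which is why the "max reward" option is the distractor here.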
T/F
Policy is implicitly updated in value iteration.
True
The policy corresponds to the action with the maximum Q-value in each state (see the sketch below).
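A minimal sketch of what "implicitly updated" means, on a made-up two-state MDP (all names and numbers here are hypothetical, not from the cards): the loop stores only utilities, and the policy is read off afterwards as the argmax-Q action.

```python
# Value iteration on a toy two-state MDP. Only utilities are stored during
# the loop; the policy is recovered afterwards as the argmax-Q action,
# which is what "implicitly updated" means.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[(s, a)] -> list of (next_state, probability); R[s] -> reward (made-up numbers).
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "move"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 0.8), ("s1", 0.2)],
}
R = {"s0": 0.0, "s1": 1.0}

def q_value(s, a, U):
    """Q(s, a) = R(s) + gamma * sum over s' of P(s'|s,a) * U(s')."""
    return R[s] + GAMMA * sum(p * U[s2] for s2, p in P[(s, a)])

U = {s: 0.0 for s in STATES}
for _ in range(100):  # fixed iteration budget instead of a convergence test
    U = {s: max(q_value(s, a, U) for a in ACTIONS) for s in STATES}

# Implicit policy: best action under the final utility estimate.
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, U)) for s in STATES}
print(U)       # approximate utilities
print(policy)  # {'s0': 'move', 's1': 'stay'}
```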
T/F
The utility function estimate must be completely accurate in order to get an optimal policy.
False
The utility estimate is an approximation, so it will rarely be exactly correct. It can still yield an optimal policy, though, as long as it ranks the actions in each state in the right order, as illustrated below.
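A tiny illustration of this point, reusing the toy MDP and q_value from the sketch above (all numbers made up): a utility estimate that is badly scaled still induces the same greedy policy, because it orders the Q-values of the actions the same way.

```python
# Exact vs. deliberately inaccurate utilities for the toy MDP above.
U_exact = {"s0": 8.8, "s1": 10.0}   # close to the converged values
U_rough = {"s0": 1.0, "s1": 3.0}    # wrong magnitudes, same ordering

for U in (U_exact, U_rough):
    greedy = {s: max(ACTIONS, key=lambda a: q_value(s, a, U)) for s in STATES}
    print(greedy)  # both print {'s0': 'move', 's1': 'stay'}
```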
T/F
Policy is explicitly updated in value iteration.
False
The policy is only implicitly updated in value iteration (see textbook Section 17.2); policy iteration, by contrast, maintains and explicitly updates a policy, as sketched below.
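For contrast with the value iteration sketch above, here is a policy iteration sketch on the same toy MDP (reusing its definitions; the fixed 50-sweep evaluation is a simplification, i.e., modified policy iteration): the policy is an explicit object that is alternately evaluated and improved.

```python
import random

# Explicit policy, updated directly on each improvement step.
policy = {s: random.choice(ACTIONS) for s in STATES}
while True:
    # Policy evaluation: approximate U^pi with a fixed number of sweeps.
    U = {s: 0.0 for s in STATES}
    for _ in range(50):
        U = {s: q_value(s, policy[s], U) for s in STATES}
    # Policy improvement: the explicit policy update that value iteration lacks.
    improved = {s: max(ACTIONS, key=lambda a: q_value(s, a, U)) for s in STATES}
    if improved == policy:  # stop when the policy no longer changes
        break
    policy = improved
print(policy)  # {'s0': 'move', 's1': 'stay'}
```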
What is the principle of MEU?
A rational agent should choose the action that maximizes its expected utility, i.e., the probability-weighted average of the utilities of the possible outcomes.
Refer to the MEU equation in the slides.
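The slides aren't reproduced here, but the standard AIMA form of the principle is:

$$EU(a) \;=\; \sum_{s'} P(\mathrm{Result}(a) = s')\, U(s'), \qquad a^{*} = \operatorname*{argmax}_{a} EU(a),$$

i.e., the rational action a* is the one whose probability-weighted average utility over outcomes s' is highest.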