States-based models with search optimization and MDP Flashcards

1
Q

MDP - Value Iteration

A

An algorithm that finds the optimal value Vopt as well as the optimal policy πopt. It is done as follows:

Initialization: for all states s, we have

V_opt^(0)(s) ⟵ 0

Iteration: for t from 1 to T_VI, we have

∀s, V_opt^(t)(s) ⟵ max_{a ∈ Actions(s)} Q_opt^(t−1)(s,a)

with

Q_opt^(t−1)(s,a) = ∑_{s′ ∈ States} T(s,a,s′)[Reward(s,a,s′) + γ V_opt^(t−1)(s′)]

Remark: if either γ<1 or the MDP graph is acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.
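As a rough illustration of the update above, here is a minimal Python sketch (not from the original card), assuming the MDP is given as hypothetical dictionaries T[(s, a)] = [(s_next, prob), ...] and reward[(s, a, s_next)], with actions(s) returning the available actions:

```python
# Value iteration sketch (illustrative; assumes a small, explicit MDP given as
# dictionaries T[(s, a)] = [(s_next, prob), ...] and reward[(s, a, s_next)]).
def value_iteration(states, actions, T, reward, gamma, num_iters):
    V = {s: 0.0 for s in states}                       # V_opt^(0)(s) <- 0

    def Q(V, s, a):                                    # Q_opt w.r.t. the current V
        return sum(p * (reward[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)])

    for _ in range(num_iters):                         # t = 1 .. T_VI
        V = {s: max((Q(V, s, a) for a in actions(s)), default=0.0) for s in states}

    # Greedy policy extraction from the converged values.
    pi = {s: max(actions(s), key=lambda a: Q(V, s, a)) for s in states if actions(s)}
    return V, pi
```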

2
Q

Game tree

A

A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.

3
Q

For what types of action costs would Uniform Cost Search fail?

A

The algorithm would not work for a problem with negative action costs. Adding a positive constant to every cost to make them non-negative would not fix this, since it changes the problem: paths with more actions are penalized more, so the minimum cost path may change.

4
Q

Minimax

A

The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent’s utility. It is done as follows:

Remark: we can extract πmax and πmin from the minimax value Vminimax.
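The recurrence itself is not reproduced on this card; a minimal recursive sketch, assuming hypothetical callables is_end, utility, player, actions and succ that describe the game, could look like:

```python
# Minimax sketch (illustrative) for a two-player zero-sum game described by
# hypothetical callables: is_end(s), utility(s), player(s), actions(s), succ(s, a).
def minimax_value(s, is_end, utility, player, actions, succ):
    if is_end(s):
        return utility(s)
    values = [minimax_value(succ(s, a), is_end, utility, player, actions, succ)
              for a in actions(s)]
    # The agent maximizes its utility; the adversary minimizes it.
    return max(values) if player(s) == "agent" else min(values)
```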

5
Q

A* heuristic - admissibility

A

A heuristic h is said to be ______ if we have:

h(s)⩽FutureCost(s)

6
Q

MDP - Utility

A

The discounted sum of the rewards on that path. In other words,

u(s_0, …, s_k) = ∑_{i=1}^{k} r_i γ^(i−1)

7
Q

Correctness Theorem

A

When a state s is popped from the frontier F and moved to the explored set E, its priority is equal to PastCost(s), the minimum cost of a path from s_start to s.

8
Q

A* heuristic - correctness

A

If h is ____, then A* returns the minimum cost path.

9
Q

Markov decision process

The objective of a Markov decision process is to maximize rewards. It is defined with:

A
  • a starting state s_start
  • possible actions Actions(s) from state s
  • transition probabilities T(s,a,s′) from s to s′ with action a
  • rewards Reward(s,a,s′) from s to s′ with action a
  • whether an end state was reached IsEnd(s)
  • a discount factor 0⩽γ⩽1
10
Q

Two-player zero-sum game

A

It is a game where each state is fully observed and such that players take turns. It is defined with:

  • a starting state sstart
  • possible actions Actions(s) from state s
  • successors Succ(s,a) from states s with actions a
  • whether an end state was reached IsEnd(s)
  • the agent’s utility Utility(s) at end state s
  • the player Player(s) who controls state s

Remark: we will assume that the agent’s utility has the opposite sign of the opponent’s.

11
Q

MDP - Policy

A

A function that maps each state s to an action a, i.e.

π: s↦a

12
Q

Uniform Cost Search

A

A search algorithm that aims at finding the shortest path from a state s_start to an end state s_end. It explores states s in increasing order of PastCost(s) and relies on the fact that all action costs are non-negative.
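A minimal sketch of the idea, assuming a hypothetical successors(s) callable returning (action, next_state, cost) triples with non-negative costs, and states that are hashable and mutually comparable:

```python
import heapq

# Uniform cost search sketch (illustrative): pop states in increasing order of
# the best known past cost; once popped, that cost is PastCost(s).
def uniform_cost_search(s_start, is_end, successors):
    frontier = [(0.0, s_start)]          # (past cost so far, state)
    explored = set()
    while frontier:
        past_cost, s = heapq.heappop(frontier)
        if s in explored:
            continue
        explored.add(s)                  # priority is now the true PastCost(s)
        if is_end(s):
            return past_cost
        for _action, s_next, cost in successors(s):
            if s_next not in explored:
                heapq.heappush(frontier, (past_cost + cost, s_next))
    return None                          # no path to an end state
```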

13
Q

MDP - Policy Evaluation

A

Given a policy π, this is an iterative algorithm that aims at estimating Vπ. It is done as follows:

Initialization: for all states s, we have

V_π^(0)(s) ⟵ 0

Iteration: for t from 1 to T_PE, we have

∀s, V_π^(t)(s) ⟵ Q_π^(t−1)(s, π(s))

with

Q_π^(t−1)(s, π(s)) = ∑_{s′ ∈ States} T(s,π(s),s′)[Reward(s,π(s),s′) + γ V_π^(t−1)(s′)]

Remark: by noting S the number of states, A the number of actions per state, S′ the number of successors, and T_PE the number of iterations, the time complexity is O(T_PE·S·S′).
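A minimal sketch of the iteration above, under the same assumed MDP layout as the value iteration sketch (hypothetical T and reward dictionaries, pi as a dict mapping states to actions):

```python
# Iterative policy evaluation sketch (illustrative): repeatedly back up V_pi
# using the fixed policy pi; end states (absent from pi) keep value 0.
def policy_evaluation(states, pi, T, reward, gamma, num_iters):
    V = {s: 0.0 for s in states}                       # V_pi^(0)(s) <- 0
    for _ in range(num_iters):                         # t = 1 .. T_PE
        V = {s: sum(p * (reward[(s, pi[s], sp)] + gamma * V[sp])
                    for sp, p in T[(s, pi[s])]) if s in pi else 0.0
             for s in states}
    return V
```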

14
Q

Tree Search

A

This category of states-based algorithms explores all possible states and actions. It is quite memory efficient and is suitable for huge state spaces, but the runtime can become exponential in the worst case.

15
Q

A* heuristic - consistency

A

A heuristic h is said to be _____ if it satisfies the following two properties:

  • For all states s and actions a
    • h(s)⩽Cost(s,a)+h(Succ(s,a))
  • The end state verifies the following:
    • h(send)=0
16
Q

Alpha-beta pruning

A

A domain-general exact method optimizing the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in α for the maximizing player and in β for the minimizing player). At a given step, the condition β<α means that the optimal path is not going to be in the current branch, as the earlier player had a better option at their disposal.
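A minimal sketch, reusing the hypothetical game interface from the minimax sketch (is_end, utility, player, actions, succ); the pruning test below uses the standard β ⩽ α cutoff:

```python
# Alpha-beta pruning sketch (illustrative): minimax with cutoffs. alpha is the
# best value the maximizer can guarantee so far, beta the best for the minimizer.
def alphabeta(s, is_end, utility, player, actions, succ,
              alpha=float("-inf"), beta=float("inf")):
    if is_end(s):
        return utility(s)
    if player(s) == "agent":                       # maximizing player
        value = float("-inf")
        for a in actions(s):
            value = max(value, alphabeta(succ(s, a), is_end, utility, player,
                                         actions, succ, alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:                      # opponent already has a better option
                break
        return value
    else:                                          # minimizing player
        value = float("inf")
        for a in actions(s):
            value = min(value, alphabeta(succ(s, a), is_end, utility, player,
                                         actions, succ, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value
```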

17
Q

Backtracking Search

A

A naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.

18
Q

Tree search algorithms summary - By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:

A
19
Q

Dynamic Programming -

A

A backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state s to an end state s_end. It can potentially have exponential savings compared to traditional graph search algorithms, and it only works for acyclic graphs. For any given state s, the future cost is computed as follows:
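A minimal sketch of the memoized future-cost computation, assuming an acyclic problem described by hypothetical is_end(s) and successors(s) callables (every non-end state having at least one successor) and hashable states:

```python
from functools import lru_cache

# Dynamic programming sketch (illustrative): memoized future cost on an acyclic
# search problem; successors(s) yields (action, next_state, cost) triples.
def future_cost(s_start, is_end, successors):
    @lru_cache(maxsize=None)                 # memoization: each state solved once
    def fc(s):
        if is_end(s):
            return 0.0
        return min(cost + fc(s_next) for _a, s_next, cost in successors(s))
    return fc(s_start)
```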

20
Q

Evaluation Function

A

A domain-specific and approximate estimate of the value Vminimax(s). It is noted Eval(s).

21
Q

Temporal difference (TD) learning

A

Used when we don’t know the transitions/rewards. The value estimate is based on an exploration policy. To be able to use it, we need to know the rules of the game, i.e. Succ(s,a). For each (s,a,r,s′), the update is done as follows:

w ⟵ w − η[V(s,w) − (r + γ V(s′,w))] ∇_w V(s,w)
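A minimal sketch of this update for a linear value function V(s, w) = w·φ(s), where φ is an assumed feature map (not part of the card):

```python
import numpy as np

# TD(0) update sketch (illustrative) with a linear value function V(s, w) = w . phi(s);
# eta is the learning rate, gamma the discount factor.
def td_update(w, phi, s, r, s_next, eta, gamma):
    target = r + gamma * np.dot(w, phi(s_next))
    prediction = np.dot(w, phi(s))
    # For a linear model, the gradient of V(s, w) w.r.t. w is phi(s).
    return w - eta * (prediction - target) * phi(s)
```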

22
Q

Epsilon-greedy

A

This policy algorithm balances exploration with probability ϵ and exploitation with probability 1−ϵ. For a given state s, the policy πact is computed as follows:
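The formula is not reproduced on this card; a minimal sketch, assuming a hypothetical Q dictionary keyed by (state, action) and an actions(s) callable returning a list:

```python
import random

# Epsilon-greedy sketch (illustrative): with probability epsilon pick a random
# action (exploration), otherwise the action with the best current Q-value
# (exploitation).
def epsilon_greedy(s, actions, Q, epsilon):
    if random.random() < epsilon:
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q.get((s, a), 0.0))
```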

23
Q

Depth-first search (DFS)

A

A search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.
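A minimal iterative sketch using a stack, with a hypothetical neighbors(s) callable returning the states reachable in one action:

```python
# Iterative depth-first search sketch (illustrative): follow each path as deep
# as possible by always expanding the most recently discovered state.
def depth_first_search(s_start, is_end, neighbors):
    stack, visited = [s_start], set()
    while stack:
        s = stack.pop()                 # most recently discovered state first
        if s in visited:
            continue
        visited.add(s)
        if is_end(s):
            return s
        stack.extend(neighbors(s))      # push successors; deepest path explored first
    return None
```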

24
Q

Relaxed Search Problem

A

The relaxation of a search problem P with costs Cost is noted P_rel with costs Cost_rel, and satisfies the inequality:

Cost_rel(s,a) ⩽ Cost(s,a)

25
Q

MDP - Optimal Q-Value

A

Q_opt(s,a) of state s with action a is defined to be the maximum Q-value attained by any policy, starting from state s and first taking action a. It is computed as follows:

Q_opt(s,a) = ∑_{s′ ∈ States} T(s,a,s′)[Reward(s,a,s′) + γ V_opt(s′)]

26
Q

MDP - Transition Probability

A

Specifies the probability of going to state s′ after action a is taken in state s. Each s′ ↦ T(s,a,s′) is a probability distribution, which means that:

∀s,a, ∑_{s′ ∈ States} T(s,a,s′) = 1

27
Q

Frontier F

A

States seen for which we are still figuring out how to get there with the cheapest cost

28
Q

Explored E

A

States for which the optimal path has already been found

29
Q

Minimax properties

A

By noting V the value function, there are 3 properties around minimax to have in mind:

  1. Property 1: if the agent were to change its policy to any π_agent, then the agent would be no better off.

∀π_agent, V(π_max, π_min) ⩾ V(π_agent, π_min)

  2. Property 2: if the opponent changes its policy from π_min to π_opp, then it will be no better off.

∀π_opp, V(π_max, π_min) ⩽ V(π_max, π_opp)

  3. Property 3: if the opponent is known to be not playing the adversarial policy, then the minimax policy might not be optimal for the agent.

∀π, V(π_max, π) ⩽ V(π_exptmax, π)

In the end, we have the following relationship:

V(π_exptmax, π_min) ⩽ V(π_max, π_min) ⩽ V(π_max, π) ⩽ V(π_exptmax, π)

30
Q

Iterative deepening

A

This trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant c⩾0.

31
Q

A* Search - heuristic

A

A function h over states s, where each h(s) aims at estimating FutureCost(s), the cost of the path from s to s_end.

32
Q

MDP - Value of a policy

A

The expected utility obtained by following policy π from state s over random paths. It is defined as follows:

Vπ(s)=Qπ(s,π(s))

Remark: Vπ(s) is equal to 0 if s is an end state.

33
Q

What are the two types of game policies?

A

Deterministic policies, noted π_p(s), which are actions that player p takes in state s.

Stochastic policies, noted π_p(s,a) ∈ [0,1], which are probabilities that player p takes action a in state s.

34
Q

Q-learning

A

An off-policy algorithm that produces an estimate for Qopt. On each (s,a,r,s′,a′), we have:
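The update rule itself is not reproduced on this card; a sketch of the standard Q-learning update, with Q stored in a dict keyed by (state, action) and next_actions the actions available in s′ (empty at an end state), could look like:

```python
# Q-learning update sketch (illustrative): off-policy bootstrapped estimate of
# Q_opt; eta is the learning rate, gamma the discount factor.
def q_learning_update(Q, s, a, r, s_next, next_actions, eta, gamma):
    best_next = max((Q.get((s_next, ap), 0.0) for ap in next_actions), default=0.0)
    target = r + gamma * best_next            # bootstrap with the max over next actions
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
    return Q
```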

35
Q

A* search - Algorithm

A

_____ is a search algorithm that aims at finding the shortest path from a state s to an end state s_end. It explores states s in increasing order of PastCost(s) + h(s). It is equivalent to a uniform cost search with edge costs Cost′(s,a) given by:

Cost′(s,a) = Cost(s,a) + h(Succ(s,a)) − h(s)
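A minimal sketch, assuming the same hypothetical successors(s) interface as the uniform cost search sketch plus a heuristic h, and states that are hashable and comparable:

```python
import heapq

# A* search sketch (illustrative): like uniform cost search, but states are
# ordered by PastCost(s) + h(s).
def a_star(s_start, is_end, successors, h):
    frontier = [(h(s_start), 0.0, s_start)]     # (priority, past cost, state)
    explored = set()
    while frontier:
        _, past_cost, s = heapq.heappop(frontier)
        if s in explored:
            continue
        explored.add(s)
        if is_end(s):
            return past_cost
        for _a, s_next, cost in successors(s):
            if s_next not in explored:
                new_cost = past_cost + cost
                heapq.heappush(frontier, (new_cost + h(s_next), new_cost, s_next))
    return None
```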

36
Q

Learning costs

A

Suppose we are not given the values of Cost(s,a); we want to estimate these quantities from a training set of minimum-cost-path action sequences (a_1, a_2, …, a_k).

37
Q

Graph Search

A

This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.

38
Q

Max heuristic

A

Let h1(s), h2(s) be two heuristics. We have the following property:

h1(s), h2(s) consistent⟹h(s)=max{h1(s), h2(s)} consistent

39
Q

MDP - Q-value

A

The expected utility from state s after taking action a and then following policy π. It is defined as follows:

Q_π(s,a) = ∑_{s′ ∈ States} T(s,a,s′)[Reward(s,a,s′) + γ V_π(s′)]

40
Q

State-action-reward-state-action (SARSA)

A

A bootstrapping method estimating Q_π by using both raw data and estimates as part of the update rule. For each (s,a,r,s′,a′), we have:

Remark: the SARSA estimate is updated on the fly as opposed to the model-free Monte Carlo one where the estimate can only be updated at the end of the episode.
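A sketch of the standard SARSA update, with Q_π stored in a dict keyed by (state, action); the names are illustrative:

```python
# SARSA update sketch (illustrative): on-policy bootstrapped estimate of Q_pi,
# using the action a_next actually taken in s_next.
def sarsa_update(Q, s, a, r, s_next, a_next, eta, gamma):
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
    return Q
```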

41
Q

Model-based Monte Carlo

A

Aims at estimating T(s,a,s′) and Reward(s,a,s′) using Monte Carlo simulation with:

These estimates will then be used to deduce Q-values, including Q_π and Q_opt.

Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not depend on the exact policy.
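A minimal sketch of the counting estimates, assuming transitions is a list of observed (s, a, r, s_next) tuples; the variable names are illustrative:

```python
from collections import defaultdict

# Model-based Monte Carlo sketch (illustrative): estimate transition
# probabilities and rewards by counting over observed transitions.
def estimate_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r
    T_hat = {sa: {sp: c / sum(nexts.values()) for sp, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {(s, a, sp): reward_sums[(s, a, sp)] / counts[(s, a)][sp]
             for (s, a, sp) in reward_sums}
    return T_hat, R_hat
```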

42
Q

MDP - Optimal Policy

A

πopt is defined as being the policy that leads to the optimal values. It is defined by:

∀s, π_opt(s) = argmax_{a ∈ Actions(s)} Q_opt(s,a)

43
Q

MDP - Optimal Value

A

Vopt(s) of state s is defined as being the maximum value attained by any policy. It is computed as follows:

V_opt(s) = max_{a ∈ Actions(s)} Q_opt(s,a)

44
Q

Model-free Monte Carlo

A

Aims at directly estimating Qπ, as follows:

where ut denotes the utility starting at step t of a given episode.

Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent on the policy π used to generate the data.
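A minimal sketch, assuming episodes is a list of (s, a, r) sequences generated by π; the data layout is illustrative:

```python
from collections import defaultdict

# Model-free Monte Carlo sketch (illustrative): estimate Q_pi(s, a) as the
# average of the observed utilities u_t following each occurrence of (s, a).
def model_free_mc(episodes, gamma):
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        for t, (s, a, _r) in enumerate(episode):
            # utility from step t: discounted sum of the subsequent rewards
            u_t = sum(r * gamma ** k for k, (_s, _a, r) in enumerate(episode[t:]))
            totals[(s, a)] += u_t
            counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}
```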

45
Q

Unexplored U

A

States not seen yet

46
Q

Breadth-first search (BFS)

A

A graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant c⩾0.
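A minimal sketch with a queue, assuming a hypothetical neighbors(s) callable returning successor states:

```python
from collections import deque

# Breadth-first search sketch (illustrative): level-by-level traversal with a queue.
def breadth_first_search(s_start, is_end, neighbors):
    queue, visited = deque([s_start]), {s_start}
    while queue:
        s = queue.popleft()              # earliest-discovered state first
        if is_end(s):
            return s
        for s_next in neighbors(s):
            if s_next not in visited:
                visited.add(s_next)
                queue.append(s_next)
    return None
```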

47
Q

Expectimax

A

For a given state s, the expectimax value V_exptmax(s) is the maximum expected utility of any agent policy when playing with respect to a fixed and known opponent policy π_opp. It is computed as follows:

Remark: expectimax is the analog of value iteration for MDPs.
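A minimal recursive sketch, reusing the hypothetical game interface from the minimax sketch and assuming pi_opp(s, a) returns the known opponent's probability of playing a in s:

```python
# Expectimax sketch (illustrative): the agent maximizes expected utility against
# a fixed, known stochastic opponent policy pi_opp.
def expectimax_value(s, is_end, utility, player, actions, succ, pi_opp):
    if is_end(s):
        return utility(s)
    if player(s) == "agent":
        return max(expectimax_value(succ(s, a), is_end, utility, player,
                                    actions, succ, pi_opp)
                   for a in actions(s))
    # Opponent node: expectation over the known opponent policy.
    return sum(pi_opp(s, a) * expectimax_value(succ(s, a), is_end, utility, player,
                                               actions, succ, pi_opp)
               for a in actions(s))
```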

48
Q

What are the two forces to balance when designing a heuristic?

A
  1. Computational efficiency
  2. Good approximation
49
Q

Structured perceptron

A

The structured perceptron is an algorithm aiming at iteratively learning the cost of each state-action pair. At each step, it:

  • decreases the estimated cost of each state-action of the true minimizing path y given by the training data,
  • increases the estimated cost of each state-action of the current predicted path y′ inferred from the learned weights.

Remark: there are several versions of the algorithm, one of which simplifies the problem to only learning the cost of each action a, and the other parametrizes Cost(s,a) to a feature vector of learnable weights.

50
Q

Search problem ― A search problem is defined with:

A

The objective is to find a path that minimizes the cost.

51
Q

Graph search algorithm summary - By noting N the number of total states, n of which are explored before the end state s_end, we have:

A