States-based models with search optimization and MDP Flashcards
MDP - Value Iteration
An algorithm that finds the optimal value Vopt as well as the optimal policy πopt. It is done as follows:
Initialization: for all states s, we have
$V_{\text{opt}}^{(0)}(s) \longleftarrow 0$
Iteration: for $t$ from 1 to $T_{\text{VI}}$, we have
$\forall s, \quad V_{\text{opt}}^{(t)}(s) \longleftarrow \max_{a \in \text{Actions}(s)} Q_{\text{opt}}^{(t-1)}(s,a)$
with
$Q_{\text{opt}}^{(t-1)}(s,a) = \sum_{s' \in \text{States}} T(s,a,s') \left[ \text{Reward}(s,a,s') + \gamma V_{\text{opt}}^{(t-1)}(s') \right]$
Remark: if either γ<1 or the MDP graph is acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.
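For illustration, here is a minimal Python sketch of these updates on a toy MDP; the dict-based representation, the states, transition probabilities, rewards, discount factor, and iteration count are all illustrative assumptions, not part of the cards.

```python
# Minimal value iteration sketch on a toy MDP (illustrative numbers only).
# Transitions fold T and Reward together: (s, a) -> [(s', T(s,a,s'), Reward(s,a,s'))].
GAMMA = 0.9                            # discount factor gamma
TRANSITIONS = {
    ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
    ("in", "quit"): [("end", 1.0, 10)],
}
ACTIONS = {"in": ["stay", "quit"], "end": []}

V = {s: 0.0 for s in ACTIONS}          # V_opt^(0)(s) <- 0
for t in range(100):                   # t = 1 .. T_VI
    V = {
        s: max(
            (sum(p * (r + GAMMA * V[s2]) for s2, p, r in TRANSITIONS[(s, a)])
             for a in ACTIONS[s]),
            default=0.0,               # end states keep value 0
        )
        for s in ACTIONS
    }

print(V["in"])  # approximate V_opt for the non-terminal state
```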
Game tree
A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.
For what types of action costs would Uniform Cost Search fail?
Uniform Cost Search would not work for a problem with negative action costs, and adding a positive constant to every action cost to make them non-negative would not solve this, since it changes the problem: each path's cost increases by the constant times its number of actions, so longer paths are penalized more and the optimal path can change.
Minimax
The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent does everything to minimize the agent's utility. It is done as follows:
$V_{\text{minimax}}(s) = \begin{cases} \text{Utility}(s) & \text{if } \text{IsEnd}(s) \\ \max\limits_{a \in \text{Actions}(s)} V_{\text{minimax}}(\text{Succ}(s,a)) & \text{if } \text{Player}(s) = \text{agent} \\ \min\limits_{a \in \text{Actions}(s)} V_{\text{minimax}}(\text{Succ}(s,a)) & \text{if } \text{Player}(s) = \text{opp} \end{cases}$
Remark: we can extract πmax and πmin from the minimax value Vminimax.
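As an illustration, here is a minimal Python sketch of this recurrence on a tiny hand-built game tree; the dictionary-based game interface (is_end, utility, actions, succ, player) and the leaf utilities are assumptions made for the example.

```python
# Minimal minimax sketch: the agent maximizes, the opponent minimizes.
def minimax(s, game):
    """Return V_minimax(s) for the toy game interface below."""
    if game["is_end"](s):
        return game["utility"](s)
    values = [minimax(game["succ"](s, a), game) for a in game["actions"](s)]
    return max(values) if game["player"](s) == "agent" else min(values)

# Toy game: agent moves at the root, opponent at depth 1, leaves hold utilities.
leaves = {"aa": 3, "ab": 5, "ba": 2, "bb": 9}
game = {
    "is_end": lambda s: s in leaves,
    "utility": lambda s: leaves[s],
    "actions": lambda s: ["a", "b"],
    "succ": lambda s, a: s + a,
    "player": lambda s: "agent" if len(s) == 0 else "opp",
}
print(minimax("", game))  # 3: the opponent picks the min in each branch, the agent picks the max
```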

A* heuristic - admissibility
A heuristic h is said to be ______ if we have:
h(s)⩽FutureCost(s)
MDP - Utility
The discounted sum of the rewards collected along a path $(s_0, \ldots, s_k)$. In other words,
$u(s_0, \ldots, s_k) = \sum_{i=1}^{k} r_i \gamma^{i-1}$
where $r_i$ is the reward collected in the $i$-th transition (from $s_{i-1}$ to $s_i$).
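As a small worked example (the discount and rewards below are illustrative): with $\gamma = 0.5$ and rewards $r_1 = 4$, $r_2 = 8$, $r_3 = 16$,
$u(s_0, s_1, s_2, s_3) = 4 + (0.5)(8) + (0.5)^2(16) = 4 + 4 + 4 = 12$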

Correctness Theorem
When a state s is popped from the frontier F and moved to the explored set E, its priority is equal to PastCost(s), the minimum cost of a path from sstart to s.
A* heuristic - correctness
If h is ____, then A∗ returns the minimum cost path.
Markov decision process
The objective of a Markov decision process is to maximize rewards. It is defined with:
- a starting state sstart
- possible actions Actions(s) from state s
- transition probabilities T(s,a,s′) from s to s′ with action a
- rewards Reward(s,a,s′) from s to s′ with action a
- whether an end state was reached IsEnd(s)
- a discount factor 0⩽γ⩽1
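For illustration, one possible way to encode these components in Python; the field names and the toy two-state MDP are assumptions made for the example, not a fixed API.

```python
# Minimal sketch of an MDP container; transitions fold T and Reward together.
from dataclasses import dataclass

@dataclass
class MDP:
    start_state: str   # s_start
    discount: float    # 0 <= gamma <= 1
    actions: dict      # s -> Actions(s)
    transitions: dict  # (s, a) -> [(s', T(s,a,s'), Reward(s,a,s'))]
    end_states: set    # IsEnd(s) iff s in end_states

toy = MDP(
    start_state="in",
    discount=0.9,
    actions={"in": ["stay", "quit"]},
    transitions={
        ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
        ("in", "quit"): [("end", 1.0, 10)],
    },
    end_states={"end"},
)
```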

Two-player zero-sum game
A game in which each state is fully observed and players take turns. It is defined with:
- a starting state sstart
- possible actions Actions(s) from state s
- successors Succ(s,a) from state s with action a
- whether an end state was reached IsEnd(s)
- the agent’s utility Utility(s) at end state s
- the player Player(s) who controls state s
Remark: we assume that the opponent's utility is the opposite (negative) of the agent's utility.
MDP - Policy
A function that maps each state s to an action a, i.e.
π: s↦a
Uniform Cost Search
A search algorithm that aims at finding the shortest path from a state sstart to an end state send. It explores states s in increasing order of PastCost(s) and relies on the fact that all action costs are non-negative.
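For illustration, here is a minimal Python sketch of uniform cost search with a priority queue; the successors interface and the toy graph are assumptions made for the example.

```python
# Minimal uniform cost search sketch: successors(s) returns (action, s', cost)
# triples with non-negative costs.
import heapq

def uniform_cost_search(s_start, is_end, successors):
    frontier = [(0, s_start)]          # (cost so far, state)
    explored = set()
    while frontier:
        past_cost, s = heapq.heappop(frontier)
        if s in explored:
            continue
        explored.add(s)                # priority now equals PastCost(s)
        if is_end(s):
            return past_cost
        for _action, s2, cost in successors(s):
            if s2 not in explored:
                heapq.heappush(frontier, (past_cost + cost, s2))
    return None                        # no path to an end state

# Toy graph: A -> B (1), A -> C (4), B -> C (1)
graph = {"A": [("go", "B", 1), ("go", "C", 4)], "B": [("go", "C", 1)], "C": []}
print(uniform_cost_search("A", lambda s: s == "C", lambda s: graph[s]))  # 2
```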

MDP - Policy Evaluation
Given a policy π, this is an iterative algorithm that aims at estimating Vπ. It is done as follows:
Initialization: for all states s, we have
$V_\pi^{(0)}(s) \longleftarrow 0$
Iteration: for $t$ from 1 to $T_{\text{PE}}$, we have
$\forall s, \quad V_\pi^{(t)}(s) \longleftarrow Q_\pi^{(t-1)}(s, \pi(s))$
with
$Q_\pi^{(t-1)}(s, \pi(s)) = \sum_{s' \in \text{States}} T(s, \pi(s), s') \left[ \text{Reward}(s, \pi(s), s') + \gamma V_\pi^{(t-1)}(s') \right]$
Remark: by noting S the number of states, A the number of actions per state, S′ the number of successors, and $T_{\text{PE}}$ the number of iterations, the time complexity is $O(T_{\text{PE}} S S')$.
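Here is a minimal Python sketch of this iteration, reusing the toy dict-based MDP layout from the value iteration sketch above; the policy, discount factor, and iteration count are illustrative assumptions.

```python
# Minimal iterative policy evaluation sketch on a toy MDP (illustrative numbers only).
GAMMA = 0.9
TRANSITIONS = {
    ("in", "stay"): [("in", 2/3, 4), ("end", 1/3, 4)],
    ("in", "quit"): [("end", 1.0, 10)],
}
policy = {"in": "stay"}                # pi(s); end states have no action

V = {"in": 0.0, "end": 0.0}            # V_pi^(0)(s) <- 0
for t in range(100):                   # t = 1 .. T_PE
    V = {
        s: (sum(p * (r + GAMMA * V[s2])
                for s2, p, r in TRANSITIONS[(s, policy[s])])
            if s in policy else 0.0)   # end states keep value 0
        for s in V
    }
print(V["in"])  # estimate of V_pi for the non-terminal state
```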
Tree Search
This category of states-based algorithms explores all possible states and actions. It is quite memory efficient and suitable for huge state spaces, but the runtime can become exponential in the worst case.
A* heuristic - consistency
A heuristic h is said to be _____ if it satisfies the following two properties:
- for all states s and actions a, h(s)⩽Cost(s,a)+h(Succ(s,a))
- the end state satisfies h(send)=0
Alpha-beta pruning
A domain-general exact method that optimizes the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in α for the maximizing player and in β for the minimizing player). At a given step, the condition β<α means that the optimal path is not going to be in the current branch, as the earlier player had a better option at their disposal.
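A minimal Python sketch of alpha-beta pruning follows, using the same kind of toy game interface as the minimax sketch above; the interface and leaf utilities are illustrative assumptions.

```python
# Minimal alpha-beta sketch: same result as minimax, with pruned branches.
def alphabeta(s, game, alpha=float("-inf"), beta=float("inf")):
    if game["is_end"](s):
        return game["utility"](s)
    if game["player"](s) == "agent":   # maximizing player updates alpha
        value = float("-inf")
        for a in game["actions"](s):
            value = max(value, alphabeta(game["succ"](s, a), game, alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:          # opponent already has a better option: prune
                break
        return value
    value = float("inf")               # minimizing player updates beta
    for a in game["actions"](s):
        value = min(value, alphabeta(game["succ"](s, a), game, alpha, beta))
        beta = min(beta, value)
        if beta <= alpha:
            break
    return value

leaves = {"aa": 3, "ab": 5, "ba": 2, "bb": 9}
game = {
    "is_end": lambda s: s in leaves,
    "utility": lambda s: leaves[s],
    "actions": lambda s: ["a", "b"],
    "succ": lambda s, a: s + a,
    "player": lambda s: "agent" if len(s) == 0 else "opp",
}
print(alphabeta("", game))  # 3, same as minimax, but leaf "bb" is never evaluated
```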

Backtracking Search
A naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.
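For illustration, a minimal Python sketch of backtracking search; the successors interface and the toy graph (which includes a negative cost) are assumptions made for the example.

```python
# Minimal backtracking search sketch: try every action sequence, keep the cheapest.
def backtracking_search(s, is_end, successors):
    if is_end(s):
        return 0
    best = float("inf")
    for _action, s2, cost in successors(s):
        best = min(best, cost + backtracking_search(s2, is_end, successors))
    return best

# Toy graph with a negative cost: A -> B (3), A -> C (-1), B -> C (1)
graph = {"A": [("go", "B", 3), ("go", "C", -1)], "B": [("go", "C", 1)], "C": []}
print(backtracking_search("A", lambda s: s == "C", lambda s: graph[s]))  # -1
```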
Tree search algorithms summary - By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:

Dynamic Programming
A backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from a state s to an end state send. It can potentially have exponential savings compared to traditional graph search algorithms, and only works for acyclic graphs. For any given state s, the future cost is computed as follows:
$\text{FutureCost}(s) = \begin{cases} 0 & \text{if } \text{IsEnd}(s) \\ \min\limits_{a \in \text{Actions}(s)} \left[ \text{Cost}(s,a) + \text{FutureCost}(\text{Succ}(s,a)) \right] & \text{otherwise} \end{cases}$
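A minimal Python sketch of this memoized recursion follows; the is_end/successors interface and the toy acyclic graph are illustrative assumptions.

```python
# Minimal dynamic programming sketch: future costs with memoization via functools.cache.
from functools import cache

def future_cost_fn(is_end, successors):
    @cache
    def future_cost(s):
        if is_end(s):
            return 0
        return min(cost + future_cost(s2) for _a, s2, cost in successors(s))
    return future_cost

# Toy acyclic graph: A -> B (1), A -> C (10), B -> C (2); C is the end state.
dag = {"A": [("go", "B", 1), ("go", "C", 10)], "B": [("go", "C", 2)], "C": []}
fc = future_cost_fn(lambda s: s == "C", lambda s: dag[s])
print(fc("A"))  # 3: the memoized minimum cost from A to the end state
```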

Evaluation Function
A domain-specific and approximate estimate of the value Vminimax(s). It is denoted Eval(s). For example, in chess, Eval(s) could combine material, mobility, king safety, and center control.