States-based models with search optimization and MDP Flashcards

1
Q

MDP - Value Iteration

A

An algorithm that finds the optimal value Vopt as well as the optimal policy πopt. It is done as follows:

Initialization: for all states s, we have

V_opt^(0)(s) ⟵ 0

Iteration: for t from 1 to T_VI, we have

∀s, V_opt^(t)(s) ⟵ max_{a ∈ Actions(s)} Q_opt^(t−1)(s,a)

with

Q_opt^(t−1)(s,a) = ∑_{s′ ∈ States} T(s,a,s′)[Reward(s,a,s′) + γ V_opt^(t−1)(s′)]

Remark: if either γ<1 or the MDP graph is acyclic, then the value iteration algorithm is guaranteed to converge to the correct answer.
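As a rough illustration of the update above, here is a minimal Python sketch (not from the original card), assuming the MDP is given as hypothetical dictionaries T[(s, a)] = [(s_next, prob), ...] and reward[(s, a, s_next)], with actions(s) returning the available actions:

```python
# Value iteration sketch (illustrative; assumes a small, explicit MDP given as
# dictionaries T[(s, a)] = [(s_next, prob), ...] and reward[(s, a, s_next)]).
def value_iteration(states, actions, T, reward, gamma, num_iters):
    V = {s: 0.0 for s in states}                       # V_opt^(0)(s) <- 0

    def Q(V, s, a):                                    # Q_opt w.r.t. the current V
        return sum(p * (reward[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)])

    for _ in range(num_iters):                         # t = 1 .. T_VI
        V = {s: max((Q(V, s, a) for a in actions(s)), default=0.0) for s in states}

    # Greedy policy extraction from the converged values.
    pi = {s: max(actions(s), key=lambda a: Q(V, s, a)) for s in states if actions(s)}
    return V, pi
```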

2
Q

Game tree

A

A game tree is a tree that describes the possibilities of a game. In particular, each node is a decision point for a player and each root-to-leaf path is a possible outcome of the game.

3
Q

For what types of action costs would Uniform Cost Search fail?

A

The algorithm would not work for a problem with negative action costs. Adding a positive constant to every cost to make them non-negative would not fix this, since it changes the problem: paths with more actions are penalized more, so the minimum cost path may change.

4
Q

Minimax

A

The goal of minimax policies is to find an optimal policy against an adversary by assuming the worst case, i.e. that the opponent is doing everything to minimize the agent’s utility. It is done as follows:

Remark: we can extract πmax and πmin from the minimax value Vminimax.
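The recurrence itself is not reproduced on this card; a minimal recursive sketch, assuming hypothetical callables is_end, utility, player, actions and succ that describe the game, could look like:

```python
# Minimax sketch (illustrative) for a two-player zero-sum game described by
# hypothetical callables: is_end(s), utility(s), player(s), actions(s), succ(s, a).
def minimax_value(s, is_end, utility, player, actions, succ):
    if is_end(s):
        return utility(s)
    values = [minimax_value(succ(s, a), is_end, utility, player, actions, succ)
              for a in actions(s)]
    # The agent maximizes its utility; the adversary minimizes it.
    return max(values) if player(s) == "agent" else min(values)
```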

5
Q

A* heuristic - admissibility

A

A heuristic h is said to be ______ if we have:

h(s)⩽FutureCost(s)

6
Q

MDP - Utility

A

The discounted sum of the rewards on that path. In other words,

u(s_0, …, s_k) = ∑_{i=1}^{k} r_i γ^(i−1)

7
Q

Correctness Theorem

A

When a state s is popped from the frontier F and moved to the explored set E, its priority is equal to PastCost(s), the minimum cost of a path from s_start to s.

8
Q

A* heuristic - correctness

A

If h is ____, then A* returns the minimum cost path.

9
Q

Markov decision process

The objective of a Markov decision process is to maximize rewards. It is defined with:

A
  • a starting state s_start
  • possible actions Actions(s) from state s
  • transition probabilities T(s,a,s′) from s to s′ with action a
  • rewards Reward(s,a,s′) from s to s′ with action a
  • whether an end state was reached IsEnd(s)
  • a discount factor 0⩽γ⩽1
10
Q

Two-player zero-sum game

A

It is a game where each state is fully observed and such that players take turns. It is defined with:

  • a starting state sstart
  • possible actions Actions(s) from state s
  • successors Succ(s,a) from states s with actions a
  • whether an end state was reached IsEnd(s)
  • the agent’s utility Utility(s) at end state s
  • the player Player(s) who controls state s

Remark: we will assume that the agent’s utility has the opposite sign of the opponent’s.

11
Q

MDP - Policy

A

A function that maps each state s to an action a, i.e.

π: s↦a

12
Q

Uniform Cost Search

A

A search algorithm that aims at finding the shortest path from a state s_start to an end state s_end. It explores states s in increasing order of PastCost(s) and relies on the fact that all action costs are non-negative.
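A minimal sketch of the idea, assuming a hypothetical successors(s) callable returning (action, next_state, cost) triples with non-negative costs, and states that are hashable and mutually comparable:

```python
import heapq

# Uniform cost search sketch (illustrative): pop states in increasing order of
# the best known past cost; once popped, that cost is PastCost(s).
def uniform_cost_search(s_start, is_end, successors):
    frontier = [(0.0, s_start)]          # (past cost so far, state)
    explored = set()
    while frontier:
        past_cost, s = heapq.heappop(frontier)
        if s in explored:
            continue
        explored.add(s)                  # priority is now the true PastCost(s)
        if is_end(s):
            return past_cost
        for _action, s_next, cost in successors(s):
            if s_next not in explored:
                heapq.heappush(frontier, (past_cost + cost, s_next))
    return None                          # no path to an end state
```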

13
Q

MDP - Policy Evaluation

A

Given a policy π, this is an iterative algorithm that aims at estimating Vπ. It is done as follows:

Initialization: for all states s, we have

V_π^(0)(s) ⟵ 0

Iteration: for t from 1 to T_PE, we have

∀s, V_π^(t)(s) ⟵ Q_π^(t−1)(s, π(s))

with

Q_π^(t−1)(s, π(s)) = ∑_{s′ ∈ States} T(s,π(s),s′)[Reward(s,π(s),s′) + γ V_π^(t−1)(s′)]

Remark: by noting S the number of states, A the number of actions per state, S′ the number of successors, and T_PE the number of iterations, the time complexity is O(T_PE·S·S′).
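A minimal sketch of the iteration above, under the same assumed MDP layout as the value iteration sketch (hypothetical T and reward dictionaries, pi as a dict mapping states to actions):

```python
# Iterative policy evaluation sketch (illustrative): repeatedly back up V_pi
# using the fixed policy pi; end states (absent from pi) keep value 0.
def policy_evaluation(states, pi, T, reward, gamma, num_iters):
    V = {s: 0.0 for s in states}                       # V_pi^(0)(s) <- 0
    for _ in range(num_iters):                         # t = 1 .. T_PE
        V = {s: sum(p * (reward[(s, pi[s], sp)] + gamma * V[sp])
                    for sp, p in T[(s, pi[s])]) if s in pi else 0.0
             for s in states}
    return V
```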

14
Q

Tree Search

A

This category of states-based algorithms explores all possible states and actions. It is quite memory efficient and is suitable for huge state spaces, but the runtime can become exponential in the worst case.

15
Q

A* heuristic - consistency

A

A heuristic h is said to be _____ if it satisfies the following two properties:

  • For all states s and actions a
    • h(s)⩽Cost(s,a)+h(Succ(s,a))
  • The end state verifies the following:
    • h(send)=0
16
Q

Alpha-beta pruning

A

A domain-general exact method optimizing the minimax algorithm by avoiding the unnecessary exploration of parts of the game tree. To do so, each player keeps track of the best value they can hope for (stored in α for the maximizing player and in β for the minimizing player). At a given step, the condition β<α means that the optimal path is not going to be in the current branch, as the earlier player had a better option at their disposal.
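A minimal sketch, reusing the hypothetical game interface from the minimax sketch (is_end, utility, player, actions, succ); the pruning test below uses the standard β ⩽ α cutoff:

```python
# Alpha-beta pruning sketch (illustrative): minimax with cutoffs. alpha is the
# best value the maximizer can guarantee so far, beta the best for the minimizer.
def alphabeta(s, is_end, utility, player, actions, succ,
              alpha=float("-inf"), beta=float("inf")):
    if is_end(s):
        return utility(s)
    if player(s) == "agent":                       # maximizing player
        value = float("-inf")
        for a in actions(s):
            value = max(value, alphabeta(succ(s, a), is_end, utility, player,
                                         actions, succ, alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:                      # opponent already has a better option
                break
        return value
    else:                                          # minimizing player
        value = float("inf")
        for a in actions(s):
            value = min(value, alphabeta(succ(s, a), is_end, utility, player,
                                         actions, succ, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value
```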

17
Q

Backtracking Search

A

A naive recursive algorithm that tries all possibilities to find the minimum cost path. Here, action costs can be either positive or negative.

18
Q

Tree search algorithms summary - By noting b the number of actions per state, d the solution depth, and D the maximum depth, we have:

A
19
Q

Dynamic Programming -

A

A backtracking search algorithm with memoization (i.e. partial results are saved) whose goal is to find a minimum cost path from state s to an end state s_end. It can potentially have exponential savings compared to traditional graph search algorithms, and it only works for acyclic graphs. For any given state s, the future cost is computed as follows:
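A minimal sketch of the memoized future-cost computation, assuming an acyclic problem described by hypothetical is_end(s) and successors(s) callables (every non-end state having at least one successor) and hashable states:

```python
from functools import lru_cache

# Dynamic programming sketch (illustrative): memoized future cost on an acyclic
# search problem; successors(s) yields (action, next_state, cost) triples.
def future_cost(s_start, is_end, successors):
    @lru_cache(maxsize=None)                 # memoization: each state solved once
    def fc(s):
        if is_end(s):
            return 0.0
        return min(cost + fc(s_next) for _a, s_next, cost in successors(s))
    return fc(s_start)
```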

20
Q

Evaluation Function

A

A domain-specific and approximate estimate of the value Vminimax(s). It is noted Eval(s).

21
Q

Temporal difference (TD) learning

A

Used when we don’t know the transitions/rewards. The value estimate is based on an exploration policy. To be able to use it, we need to know the rules of the game, i.e. Succ(s,a). For each (s,a,r,s′), the update is done as follows:

w ⟵ w − η[V(s,w) − (r + γ V(s′,w))] ∇_w V(s,w)
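A minimal sketch of this update for a linear value function V(s, w) = w·φ(s), where φ is an assumed feature map (not part of the card):

```python
import numpy as np

# TD(0) update sketch (illustrative) with a linear value function V(s, w) = w . phi(s);
# eta is the learning rate, gamma the discount factor.
def td_update(w, phi, s, r, s_next, eta, gamma):
    target = r + gamma * np.dot(w, phi(s_next))
    prediction = np.dot(w, phi(s))
    # For a linear model, the gradient of V(s, w) w.r.t. w is phi(s).
    return w - eta * (prediction - target) * phi(s)
```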

22
Q

Epsilon-greedy

A

This policy algorithm balances exploration with probability ϵ and exploitation with probability 1−ϵ. For a given state s, the policy πact is computed as follows:
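The formula is not reproduced on this card; a minimal sketch, assuming a hypothetical Q dictionary keyed by (state, action) and an actions(s) callable returning a list:

```python
import random

# Epsilon-greedy sketch (illustrative): with probability epsilon pick a random
# action (exploration), otherwise the action with the best current Q-value
# (exploitation).
def epsilon_greedy(s, actions, Q, epsilon):
    if random.random() < epsilon:
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q.get((s, a), 0.0))
```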

23
Q

Depth-first search (DFS)

A

A search algorithm that traverses a graph by following each path as deep as it can. We can implement it recursively, or iteratively with the help of a stack that stores at each step future nodes to be visited. For this algorithm, action costs are assumed to be equal to 0.
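A minimal iterative sketch using a stack, with a hypothetical neighbors(s) callable returning the states reachable in one action:

```python
# Iterative depth-first search sketch (illustrative): follow each path as deep
# as possible by always expanding the most recently discovered state.
def depth_first_search(s_start, is_end, neighbors):
    stack, visited = [s_start], set()
    while stack:
        s = stack.pop()                 # most recently discovered state first
        if s in visited:
            continue
        visited.add(s)
        if is_end(s):
            return s
        stack.extend(neighbors(s))      # push successors; deepest path explored first
    return None
```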

24
Q

Relaxed Search Problem

A

The relaxation of a search problem P with costs Cost is noted P_rel with costs Cost_rel, and satisfies the inequality:

Cost_rel(s,a) ⩽ Cost(s,a)

25
Q

MDP - Optimal Q-Value

A

Q_opt(s,a) of state s with action a is defined to be the maximum Q-value attained by any policy, starting from state s and first taking action a. It is computed as follows:

Q_opt(s,a) = ∑_{s′ ∈ States} T(s,a,s′)[Reward(s,a,s′) + γ V_opt(s′)]

26
Q

MDP - Transition Probability

A

Specifies the probability of going to state s′ after action a is taken in state s. Each s′ ↦ T(s,a,s′) is a probability distribution, which means that:

∀s,a, ∑_{s′ ∈ States} T(s,a,s′) = 1

27
Q

Frontier F

A

States seen for which we are still figuring out how to get there with the cheapest cost

28
Q

Explored E

A

States for which the optimal path has already been found

29
Q

Minimax properties

A

By noting V the value function, there are 3 properties around minimax to have in mind:

  1. Property 1: if the agent were to change its policy to any π_agent, then the agent would be no better off.

∀π_agent, V(π_max, π_min) ⩾ V(π_agent, π_min)

  2. Property 2: if the opponent changes its policy from π_min to π_opp, then it will be no better off.

∀π_opp, V(π_max, π_min) ⩽ V(π_max, π_opp)

  3. Property 3: if the opponent is known to be not playing the adversarial policy, then the minimax policy might not be optimal for the agent.

∀π, V(π_max, π) ⩽ V(π_exptmax, π)

In the end, we have the following relationship:

V(π_exptmax, π_min) ⩽ V(π_max, π_min) ⩽ V(π_max, π) ⩽ V(π_exptmax, π)

30
Q

Iterative deepening

A

This trick is a modification of the depth-first search algorithm so that it stops after reaching a certain depth, which guarantees optimality when all action costs are equal. Here, we assume that action costs are equal to a constant c⩾0.

31
Q

A* Search - heuristic

A

A function h over states s, where each h(s) aims at estimating FutureCost(s), the cost of the path from s to s_end.

32
Q

MDP - Value of a policy

A

The expected utility obtained by following policy π from state s over random paths. It is defined as follows:

Vπ(s)=Qπ(s,π(s))

Remark: Vπ(s) is equal to 0 if s is an end state.

33
Q

What are the two types of game policies?

A

Deterministic policies, noted π_p(s), which are actions that player p takes in state s.

Stochastic policies, noted π_p(s,a) ∈ [0,1], which are probabilities that player p takes action a in state s.

34
Q

Q-learning

A

An off-policy algorithm that produces an estimate for Qopt. On each (s,a,r,s′,a′), we have:
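The update rule itself is not reproduced on this card; a sketch of the standard Q-learning update, with Q stored in a dict keyed by (state, action) and next_actions the actions available in s′ (empty at an end state), could look like:

```python
# Q-learning update sketch (illustrative): off-policy bootstrapped estimate of
# Q_opt; eta is the learning rate, gamma the discount factor.
def q_learning_update(Q, s, a, r, s_next, next_actions, eta, gamma):
    best_next = max((Q.get((s_next, ap), 0.0) for ap in next_actions), default=0.0)
    target = r + gamma * best_next            # bootstrap with the max over next actions
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
    return Q
```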

35
Q

A* search - Algorithm

A

_____ is a search algorithm that aims at finding the shortest path from a state s to an end state s_end. It explores states s in increasing order of PastCost(s) + h(s). It is equivalent to a uniform cost search with edge costs Cost′(s,a) given by:

Cost′(s,a) = Cost(s,a) + h(Succ(s,a)) − h(s)
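A minimal sketch, assuming the same hypothetical successors(s) interface as the uniform cost search sketch plus a heuristic h, and states that are hashable and comparable:

```python
import heapq

# A* search sketch (illustrative): like uniform cost search, but states are
# ordered by PastCost(s) + h(s).
def a_star(s_start, is_end, successors, h):
    frontier = [(h(s_start), 0.0, s_start)]     # (priority, past cost, state)
    explored = set()
    while frontier:
        _, past_cost, s = heapq.heappop(frontier)
        if s in explored:
            continue
        explored.add(s)
        if is_end(s):
            return past_cost
        for _a, s_next, cost in successors(s):
            if s_next not in explored:
                new_cost = past_cost + cost
                heapq.heappush(frontier, (new_cost + h(s_next), new_cost, s_next))
    return None
```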

36
Q

Learning costs

A

Suppose we are not given the values of Cost(s,a); we want to estimate these quantities from a training set of minimum-cost-path action sequences (a_1, a_2, …, a_k).

37
Q

Graph Search

A

This category of states-based algorithms aims at constructing optimal paths, enabling exponential savings. In this section, we will focus on dynamic programming and uniform cost search.

38
Q

Max heuristic

A

Let h1(s), h2(s) be two heuristics. We have the following property:

h1(s), h2(s) consistent⟹h(s)=max{h1(s), h2(s)} consistent

39
Q

MDP - Q-value

A

The expected utility from state s after taking action a and then following policy π. It is defined as follows:

Q_π(s,a) = ∑_{s′ ∈ States} T(s,a,s′)[Reward(s,a,s′) + γ V_π(s′)]

40
Q

State-action-reward-state-action (SARSA)

A

A bootstrapping method estimating Q_π by using both raw data and estimates as part of the update rule. For each (s,a,r,s′,a′), we have:

Remark: the SARSA estimate is updated on the fly as opposed to the model-free Monte Carlo one where the estimate can only be updated at the end of the episode.
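A sketch of the standard SARSA update, with Q_π stored in a dict keyed by (state, action); the names are illustrative:

```python
# SARSA update sketch (illustrative): on-policy bootstrapped estimate of Q_pi,
# using the action a_next actually taken in s_next.
def sarsa_update(Q, s, a, r, s_next, a_next, eta, gamma):
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
    return Q
```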

41
Q

Model-based Monte Carlo

A

Aims at estimating T(s,a,s′) and Reward(s,a,s′) using Monte Carlo simulation with:

These estimates will then be used to deduce Q-values, including Q_π and Q_opt.

Remark: model-based Monte Carlo is said to be off-policy, because the estimation does not depend on the exact policy.
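A minimal sketch of the counting estimates, assuming transitions is a list of observed (s, a, r, s_next) tuples; the variable names are illustrative:

```python
from collections import defaultdict

# Model-based Monte Carlo sketch (illustrative): estimate transition
# probabilities and rewards by counting over observed transitions.
def estimate_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r
    T_hat = {sa: {sp: c / sum(nexts.values()) for sp, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {(s, a, sp): reward_sums[(s, a, sp)] / counts[(s, a)][sp]
             for (s, a, sp) in reward_sums}
    return T_hat, R_hat
```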

42
Q

MDP - Optimal Policy

A

πopt is defined as being the policy that leads to the optimal values. It is defined by:

∀s, π_opt(s) = argmax_{a ∈ Actions(s)} Q_opt(s,a)

43
Q

MDP - Optimal Value

A

Vopt(s) of state s is defined as being the maximum value attained by any policy. It is computed as follows:

V_opt(s) = max_{a ∈ Actions(s)} Q_opt(s,a)

44
Q

Model-free Monte Carlo

A

Aims at directly estimating Qπ, as follows:

where ut denotes the utility starting at step t of a given episode.

Remark: model-free Monte Carlo is said to be on-policy, because the estimated value is dependent on the policy π used to generate the data.
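A minimal sketch, assuming episodes is a list of (s, a, r) sequences generated by π; the data layout is illustrative:

```python
from collections import defaultdict

# Model-free Monte Carlo sketch (illustrative): estimate Q_pi(s, a) as the
# average of the observed utilities u_t following each occurrence of (s, a).
def model_free_mc(episodes, gamma):
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        for t, (s, a, _r) in enumerate(episode):
            # utility from step t: discounted sum of the subsequent rewards
            u_t = sum(r * gamma ** k for k, (_s, _a, r) in enumerate(episode[t:]))
            totals[(s, a)] += u_t
            counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}
```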

45
Q

Unexplored U

A

States not seen yet

46
Q

Breadth-first search (BFS)

A

A graph search algorithm that does a level-by-level traversal. We can implement it iteratively with the help of a queue that stores at each step future nodes to be visited. For this algorithm, we can assume action costs to be equal to a constant c⩾0.
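A minimal sketch with a queue, assuming a hypothetical neighbors(s) callable returning successor states:

```python
from collections import deque

# Breadth-first search sketch (illustrative): level-by-level traversal with a queue.
def breadth_first_search(s_start, is_end, neighbors):
    queue, visited = deque([s_start]), {s_start}
    while queue:
        s = queue.popleft()              # earliest-discovered state first
        if is_end(s):
            return s
        for s_next in neighbors(s):
            if s_next not in visited:
                visited.add(s_next)
                queue.append(s_next)
    return None
```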

47
Q

Expectimax

A

For a given state s, the expectimax value V_exptmax(s) is the maximum expected utility of any agent policy when playing with respect to a fixed and known opponent policy π_opp. It is computed as follows:

Remark: expectimax is the analog of value iteration for MDPs.
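A minimal recursive sketch, reusing the hypothetical game interface from the minimax sketch and assuming pi_opp(s, a) returns the known opponent's probability of playing a in s:

```python
# Expectimax sketch (illustrative): the agent maximizes expected utility against
# a fixed, known stochastic opponent policy pi_opp.
def expectimax_value(s, is_end, utility, player, actions, succ, pi_opp):
    if is_end(s):
        return utility(s)
    if player(s) == "agent":
        return max(expectimax_value(succ(s, a), is_end, utility, player,
                                    actions, succ, pi_opp)
                   for a in actions(s))
    # Opponent node: expectation over the known opponent policy.
    return sum(pi_opp(s, a) * expectimax_value(succ(s, a), is_end, utility, player,
                                               actions, succ, pi_opp)
               for a in actions(s))
```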

48
Q

What are the two forces to balance when designing a heuristic?

A
  1. Computational efficiency
  2. Good approximation
49
Q

Structured perceptron

A

The structured perceptron is an algorithm aiming at iteratively learning the cost of each state-action pair. At each step, it:

  • decreases the estimated cost of each state-action of the true minimizing path y given by the training data,
  • increases the estimated cost of each state-action of the current predicted path y′ inferred from the learned weights.

Remark: there are several versions of the algorithm, one of which simplifies the problem to only learning the cost of each action a, and the other parametrizes Cost(s,a) to a feature vector of learnable weights.

50
Q

Search problem ― A search problem is defined with:

A

The objective is to find a path that minimizes the cost.

51
Q

Graph search algorithm summary - By noting N the number of total states, n of which are explored before the end state s_end, we have:

A