Sequential decision problems Flashcards
What does MDP stand for?
Markov Decision Process
What are the components of an MDP?
- A set of states (with initial state s0)
- A set of actions for each state
- A transition model P(s' | s, a)
- And a reward function R(s)
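A minimal sketch of how these components might be written down as plain data (the toy states, probabilities, and rewards below are illustrative assumptions, not from the card):

```python
# A toy MDP as plain dictionaries (illustrative assumptions, not a full library).
states = ["s0", "s1", "s2"]                        # set of states, with initial state s0
actions = {s: ["left", "right"] for s in states}   # set of actions available in each state

# Transition model P(s' | s, a): for each (state, action), a distribution over successor states.
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
}

# Reward function R(s): the reward received for being in each state.
R = {"s0": -0.04, "s1": -0.04, "s2": 1.0}
```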
What is a solution to a sequential decision problem called?
A policy.
Often denoted π, where π(s) gives the recommended action for state s.
If the agent has a complete policy, the agent will always know what to do next, in any situation.
What is an optimal policy?
An optimal policy is one that yields the highest expected utility.
We denote that as π*
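In symbols (using U(s') for the utility of a successor state, a shorthand not spelled out on the card), an optimal policy picks in every state the action with the highest expected utility:
π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s')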
If the time the robot can act is fixed (a finite horizon),
what effect does that have on the optimal policy?
Since the time is limited, the optimal action in a given state can change as time runs out.
Let's say the robot has an hour to clean something in a grid; the best move can depend on how much of the hour is left.
We call this a nonstationary optimal policy,
which is the opposite of a stationary optimal policy,
where the robot has all the time it needs and the optimal action depends only on the current state.
What different ways of measuring utility
(reward types)
are there for stationary policies?
Additive rewards,
where we simply add the rewards of the states in the sequence together,
and
Discounted rewards,
where rewards received now count for more than rewards received later,
using the discount factor γ (gamma).
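Written out for a state sequence s0, s1, s2, ... (standard definitions, stated here for reference):
Additive:   U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
Discounted: U([s0, s1, s2, ...]) = R(s0) + γ·R(s1) + γ²·R(s2) + ...   with 0 < γ ≤ 1
Setting γ = 1 recovers additive rewards; a smaller γ puts more weight on near-term rewards.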
For an unbounded (infinite-horizon) MDP,
how do we find the optimal policy?
https://gyazo.com/fba5121adca8cf7912caa7a7b9a8fd80
We use value iteration: start with an initial guess at the utility function, and iteratively refine it.
Using the robot example with the following reward function:
https://gyazo.com/964d59c4ba976233baf94cec6bd158ad
By using the Bellman update function
We can do the first iteration
https://gyazo.com/c8d446fc9caee5eb8ff4bec73f8b79d6
And then the second iteration, and so forth.
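A minimal value-iteration sketch of this procedure, assuming the plain-dict representation from the MDP card above (the gamma and tolerance values are illustrative assumptions):

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Start from a guess at the utilities and repeat the Bellman update until they settle."""
    U = {s: 0.0 for s in states}                     # initial guess at the utility function
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            # Bellman update: U(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * U(s')
            best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions[s])
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:
            return U

def extract_policy(states, actions, P, U):
    """Read off π*: in each state, pick the action with the highest expected utility."""
    return {s: max(actions[s], key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()))
            for s in states}
```

With the toy states, actions, P, and R sketched earlier, value_iteration returns the converged utilities and extract_policy reads the optimal policy off them.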
Does the discount factor have an impact on the Bellman algorithm?
Yes: it changes the intermediate utility estimates and how quickly they settle, but for a fixed γ the algorithm still converges to one true optimal policy given enough iterations.
Are there other ways to find an optimal policy?
So far the MDP was completely known. What if the transition probabilities (and rewards) are unknown?
We can use reinforcement learning:
either online learning, where the agent learns (to optimize utility) while doing,
or batch learning, where we learn from recorded example traces.
See figure for the general approach.
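For the batch case, a rough sketch of learning utilities directly from example traces, with no transition model needed (the trace format and the TD(0)-style update are assumptions for illustration, not necessarily the method in the card's figure):

```python
def learn_from_traces(traces, gamma=0.9, alpha=0.1):
    """Estimate state utilities from recorded traces of (state, reward) pairs.

    Each trace is one run of some policy through the environment; we never look at
    P or R directly, only at what was actually observed.
    """
    U = {}
    for trace in traces:
        for (s, r), (s_next, _) in zip(trace, trace[1:]):
            u, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
            # nudge U(s) toward the observed reward plus the discounted estimate of the next state
            U[s] = u + alpha * (r + gamma * u_next - u)
    return U
```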
What is the tradeoff between exploration and exploitation?
Exploration is about trying new actions (routes) to learn more about the environment;
exploitation is when the robot sticks to the best-known track and keeps collecting reward from it.
Too much exploration wastes reward on poor routes; too much exploitation risks never finding better ones.
See figure.
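One common way to balance the two is an epsilon-greedy rule, sketched below (Q and the parameter values are illustrative assumptions):

```python
import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore: try a new route
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit: best-known action
```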