Sequential decision problems Flashcards

1
Q

What does MDP stand for?

A

Markov Decision Process

2
Q

What are the components of an MDP?

A
  1. A set of states (with initial state s0)
  2. A set of actions for each state
  3. A transition model P(s' | s, a)
  4. A reward function R(s)
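
One possible way to hold these four components in code (the state names, probabilities and rewards below are made up for illustration, not taken from the course):

  # Hypothetical MDP with three states; P maps (state, action) to (probability, next state) pairs
  mdp = {
      "states": ["s0", "s1", "s2"],                        # set of states, s0 is the initial state
      "actions": {"s0": ["left", "right"],                 # set of actions for each state
                  "s1": ["left", "right"],
                  "s2": []},                               # terminal state: no actions
      "P": {("s0", "right"): [(0.8, "s1"), (0.2, "s0")]},  # transition model P(s' | s, a)
      "R": {"s0": -0.04, "s1": -0.04, "s2": 1.0},          # reward function R(s)
  }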
3
Q

What is a solution to a sequential decision problem called?

A

A policy.

Often denoted π; π(s) is the action the policy recommends in state s.

If the agent has a complete policy, it always knows what to do next, in any situation.
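
A minimal sketch of a policy as a lookup table, assuming the made-up states and actions from the MDP sketch above:

  # π maps every state to the action to take there (None for the terminal state)
  policy = {"s0": "right", "s1": "right", "s2": None}
  next_action = policy["s0"]   # the agent always knows what to do next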

4
Q

What is an optimal policy?

A

An optimal policy is one that yields the highest expected utility.

We denote it π*.
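
In symbols (standard notation, not taken from the card itself), the optimal policy maximizes the expected sum of rewards over the states the agent visits:

  \pi^* = \operatorname{argmax}_{\pi} \, E\big[\textstyle\sum_{t} R(s_t) \mid \pi\big]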

5
Q

If the time the robot can move is fixed (a fixed-horizon sequential problem),

what effect does that have on the optimal policy?

A

Since the time is fixed, the optimal action in a given state can change over time.

Say the robot has an hour to clean something in a grid: how much time remains can have a large impact on which move is best.

We call this a nonstationary optimal policy,

as opposed to a stationary optimal policy, where

the robot has all the time it needs and the best action depends only on the current state.

6
Q

What different ways of assigning utility

(reward types)

are there for stationary policies?

A

Additive rewards

Where we add the rewards of the visited states together: U = R(s0) + R(s1) + R(s2) + ...

and

Discounted rewards

Where rewards received sooner count more than rewards received later,

using the discount factor γ (gamma): U = R(s0) + γR(s1) + γ²R(s2) + ...
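
A small sketch of both reward types on a made-up reward sequence (the numbers and the value of γ are illustrative):

  # R(s0), R(s1), R(s2), R(s3) along one path through the grid
  rewards = [-0.04, -0.04, -0.04, 1.0]
  gamma = 0.9                                                    # discount factor γ, 0 < γ <= 1

  additive = sum(rewards)                                        # R(s0) + R(s1) + ...
  discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # R(s0) + γR(s1) + γ²R(s2) + ...

With γ = 1, discounted rewards reduce to additive rewards.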

7
Q

For an unbounded MDP

How do we find the optimal strategy?

A

https://gyazo.com/fba5121adca8cf7912caa7a7b9a8fd80

Start with an initial guess at the utility function, and iteratively refine it.

Using the robot example with the following reward function:

https://gyazo.com/964d59c4ba976233baf94cec6bd158ad

By using the Bellman update function, we can do the first iteration:

https://gyazo.com/c8d446fc9caee5eb8ff4bec73f8b79d6

And then the second iteration, and so forth:

https://gyazo.com/504f36c38594f474ce20736fb75aed77
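
A minimal value-iteration sketch of this procedure, assuming the MDP is given as dictionaries like in the earlier sketch (the states, per-state actions, transition model P and reward function R are placeholders, not the exact grid from the figures):

  def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
      # states: list of states; actions[s]: actions available in s
      # P[(s, a)]: list of (probability, next_state) pairs; R[s]: reward in s
      U = {s: 0.0 for s in states}              # initial guess at the utility function
      while True:
          delta = 0.0
          U_new = {}
          for s in states:
              best = max((sum(p * U[s2] for p, s2 in P[(s, a)]) for a in actions[s]),
                         default=0.0)           # terminal states have no actions
              U_new[s] = R[s] + gamma * best    # Bellman update
              delta = max(delta, abs(U_new[s] - U[s]))
          U = U_new
          if delta < eps:                       # utilities have stopped changing
              return U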

8
Q

Does the discount factor have an impact on the Bellman algorithm?

A

Yes, it affects the utility values early on, but given enough iterations the algorithm converges to one true optimal policy (for the chosen discount factor).

9
Q

Are there other ways to find an optimal policy?

A
10
Q

So far the MDP was completely known. What if the transition probabilities (and rewards) are unknown?

A

We can use reinforcement learning:

learning (to optimize utility) while doing (online learning).

The alternative is batch learning,

where we learn from example traces instead.

See figure for the general approach.
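
One simple possibility for the batch setting (an assumption for illustration, not necessarily the approach shown in the course figure): estimate the unknown transition probabilities by counting over the example traces.

  from collections import Counter, defaultdict

  # Each trace is a list of observed (state, action, next_state) steps (made-up data)
  traces = [[("s0", "right", "s1"), ("s1", "right", "s2")]]

  counts = defaultdict(Counter)                # counts[(s, a)][s_next] = observation count
  for trace in traces:
      for s, a, s_next in trace:
          counts[(s, a)][s_next] += 1

  # Estimated P(s' | s, a) = count / total observations of (s, a)
  P_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
           for sa, c in counts.items()}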

11
Q

What is the tradeoff between exploration and exploitation?

A

Exploration is about trying new actions and routes to gather more information about the environment;

exploitation is when the robot sticks with the route it currently believes is best and collects the reward it already knows about.

Too much of either one lowers the total reward, so the agent has to balance the two.

see figure
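
One common way to balance the two is epsilon-greedy action selection (this particular scheme is an assumption for illustration, not something stated on the card):

  import random

  def choose_action(state, actions, value_estimate, epsilon=0.1):
      # value_estimate[(state, action)] holds the agent's current estimate of how good an action is
      if random.random() < epsilon:
          return random.choice(actions)                               # explore: try something new
      return max(actions, key=lambda a: value_estimate[(state, a)])   # exploit: best known action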
