Sequential decision problems Flashcards
What does MDP stand for?
Markov Decision Process
What are the components of an MDP?
- A set of states (with initial state s0)
- A set of actions for each state
- A transition model P(s' | s, a)
- And a reward function R(s)
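A minimal sketch of how these components might be written down as plain data (the toy states, probabilities, and rewards below are illustrative assumptions, not from the card):

```python
# A toy MDP as plain dictionaries (illustrative assumptions, not a full library).
states = ["s0", "s1", "s2"]                        # set of states, with initial state s0
actions = {s: ["left", "right"] for s in states}   # set of actions available in each state

# Transition model P(s' | s, a): for each (state, action), a distribution over successor states.
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
}

# Reward function R(s): the reward received for being in each state.
R = {"s0": -0.04, "s1": -0.04, "s2": 1.0}
```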
What is a solution to a sequential decision problem called?
A policy.
Often denoted π, where π(s) gives the recommended action for state s.
If the agent has a complete policy, the agent will always know what to do next, in any situation.
What is an optimal policy?
An optimal policy is one that yields the highest expected utility.
We denote that as π*
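In symbols (using U(s') for the utility of a successor state, a shorthand not spelled out on the card), an optimal policy picks in every state the action with the highest expected utility:
π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s')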
If the time the robot can act is fixed (a finite horizon),
what effect does that have on the optimal policy?
Since the time is limited, the optimal action in a given state can change as time runs out.
Let's say the robot has an hour to clean something in a grid; the best move can depend on how much of the hour is left.
We call this a nonstationary optimal policy,
which is the opposite of a stationary optimal policy,
where the robot has all the time it needs and the optimal action depends only on the current state.
What different ways of measuring utility
(reward types)
are there for stationary policies?
Additive rewards,
where we simply add the rewards of the states in the sequence together,
and
Discounted rewards,
where rewards received now count for more than rewards received later,
using the discount factor γ (gamma).
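Written out for a state sequence s0, s1, s2, ... (standard definitions, stated here for reference):
Additive:   U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
Discounted: U([s0, s1, s2, ...]) = R(s0) + γ·R(s1) + γ²·R(s2) + ...   with 0 < γ ≤ 1
Setting γ = 1 recovers additive rewards; a smaller γ puts more weight on near-term rewards.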
For an unbounded (infinite-horizon) MDP,
how do we find the optimal policy?
https://gyazo.com/fba5121adca8cf7912caa7a7b9a8fd80
We use value iteration: start with an initial guess at the utility function, and iteratively refine it.
Using the robot example with the following reward function:
https://gyazo.com/964d59c4ba976233baf94cec6bd158ad
By using the Bellman update function
We can do the first iteration
https://gyazo.com/c8d446fc9caee5eb8ff4bec73f8b79d6
And then the second iteration, and so forth.
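A minimal value-iteration sketch of this procedure, assuming the plain-dict representation from the MDP card above (the gamma and tolerance values are illustrative assumptions):

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Start from a guess at the utilities and repeat the Bellman update until they settle."""
    U = {s: 0.0 for s in states}                     # initial guess at the utility function
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            # Bellman update: U(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * U(s')
            best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions[s])
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:
            return U

def extract_policy(states, actions, P, U):
    """Read off π*: in each state, pick the action with the highest expected utility."""
    return {s: max(actions[s], key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()))
            for s in states}
```

With the toy states, actions, P, and R sketched earlier, value_iteration returns the converged utilities and extract_policy reads the optimal policy off them.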
Does the discount factor have an impact on the Bellman algorithm?
Yes: it changes the intermediate utility estimates and how quickly they settle, but for a fixed γ the algorithm still converges to one true optimal policy given enough iterations.
Are there other ways to find an optimal policy?
So far the MDP was completely known. What if the transition probabilities (and rewards) are unknown?
We can use reinforcement learning:
either online learning, where the agent learns (to optimize utility) while doing,
or batch learning, where we learn from recorded example traces.
See figure for the general approach.
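For the batch case, a rough sketch of learning utilities directly from example traces, with no transition model needed (the trace format and the TD(0)-style update are assumptions for illustration, not necessarily the method in the card's figure):

```python
def learn_from_traces(traces, gamma=0.9, alpha=0.1):
    """Estimate state utilities from recorded traces of (state, reward) pairs.

    Each trace is one run of some policy through the environment; we never look at
    P or R directly, only at what was actually observed.
    """
    U = {}
    for trace in traces:
        for (s, r), (s_next, _) in zip(trace, trace[1:]):
            u, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
            # nudge U(s) toward the observed reward plus the discounted estimate of the next state
            U[s] = u + alpha * (r + gamma * u_next - u)
    return U
```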
What is the tradeoff between exploration and exploitation?
Exploration is about trying new actions (routes) to learn more about the environment;
exploitation is when the robot sticks to the best-known track and keeps collecting reward from it.
Too much exploration wastes reward on poor routes; too much exploitation risks never finding better ones.
See figure.
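One common way to balance the two is an epsilon-greedy rule, sketched below (Q and the parameter values are illustrative assumptions):

```python
import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore: try a new route
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit: best-known action
```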