Week 9: Adding a 'World Model' (Only a Brief Outlook) Flashcards
RL is often introduced as a
Markov decision process (MDP)
MDP is (2)
a succession of states (e.g., positions on a chess board); in each state the agent chooses an action (leading to a new state); and each state has an associated immediate reward (which can also be zero).
State transitions/action choices can be probabilistic.
MDP: the set of states S is called the
state space
MDP: the set of actions A is called the
action space
MDP Policy:
a mapping from states to actions (e.g., given state s_i, I choose action a_j)
The goal of the MDP is to
find a good “policy” for the decision maker
At the start of RL we don’t know the best policy or the
value function
Set up of model-free RL:
The agent, following a default policy, performs an action; something changes in the environment, which potentially gives a reward and updates the state.
As compared to model-free RL, model-based RL
we assume we know how the environment works
Model-based RL: we assume we know how the environment works.
In other words … (2)
given state s and action a, there is some (known) probability that I will transition to state s’ (another state)
This is the ‘probability model’ or ‘world model’ (e.g., the rules of chess dictate the next possible states)
Since in model-based RL we have the probabilities of state transitions, we can
estimate the probability of future reward in state s’
s’
"s prime", the next state
Given the state-transition probabilities and the estimated future rewards, the goal in model-based RL is to
(2)
- Learn an optimal policy (best choice in each state)
- Learn an optimal value function (correctly attributing rewards to states)
The problem with this goal (2)
- Take chess: we cannot compute all board positions
- So we cannot simply calculate V (the value function) and P (the policy) directly
The solution to this problem
Do it iteratively, looking only a few steps ahead