Week 9: Adding a 'World Model' (Only a Brief Outlook) Flashcards

1
Q

RL is often introduced as a

A

Markov Decision Process (MDP)

2
Q

MDP is (2)

A

a succession of states (e.g., positions on a chess board); in each state the agent chooses an action (leading to a new state), and each state has an associated immediate reward (which can also be zero).

State transitions/action choices can be probabilistic.

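As a minimal illustration of these pieces (a sketch only: the states, actions, probabilities, and rewards below are invented for the example, not taken from the lecture), an MDP can be written down in Python as:

    # A tiny MDP: states, actions, probabilistic transitions, and immediate rewards.
    states = ["start", "mid", "goal"]
    actions = ["left", "right"]

    # P[(s, a)] maps each possible next state s' to its probability.
    P = {
        ("start", "right"): {"mid": 0.8, "start": 0.2},
        ("start", "left"):  {"start": 1.0},
        ("mid", "right"):   {"goal": 0.9, "mid": 0.1},
        ("mid", "left"):    {"start": 1.0},
        ("goal", "right"):  {"goal": 1.0},
        ("goal", "left"):   {"goal": 1.0},
    }

    # R[s] is the immediate reward associated with state s (zero is allowed).
    R = {"start": 0.0, "mid": 0.0, "goal": 1.0}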
3
Q

MDP: S, the set of states, is called the

A

state space

4
Q

MDP: A, the set of actions, is called the

A

action space

5
Q

MDP Policy:

A

a mapping from states to actions (e.g., given state s_i, we choose action a_j)

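A tiny illustration (reusing the invented state and action names from the sketch above): a deterministic policy is just a lookup table from states to actions.

    # In "start" choose "right", in "mid" choose "right", in "goal" choose "left".
    policy = {"start": "right", "mid": "right", "goal": "left"}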
6
Q

The goal of the MDP process is to

A

find a good “policy” for the decision maker

7
Q

At the start of RL we don’t know the best policy or the

A

value function

8
Q

Set-up of model-free RL:

A

The agent, with a default policy, performs an action; something changes in the environment, which potentially gives a reward and updates the state.

9
Q

As compared to model-free RL, in model-based RL

A

we assume we know how the environment works

10
Q

Model-based RL: we assume we know how the environment works,

In other words … (2)

A

given state s and action a, there is some (known) probability that we will transition to state s’ (another state)

This is the ‘probability model’ or ‘world model’ (e.g., the rules of chess dictate the next possible states).

11
Q

Since in model-based RL we have the probabilities of state transitions, we can

A

estimate the probability of future reward in state s’

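Continuing the invented toy model from above (same P and R), a sketch of how the known transition probabilities let us estimate the expected reward one step ahead:

    def expected_next_reward(s, a, P, R):
        """Expected immediate reward of the next state s', using the known world model."""
        return sum(prob * R[s_next] for s_next, prob in P[(s, a)].items())

    # e.g. expected_next_reward("mid", "right", P, R) -> 0.9 * 1.0 + 0.1 * 0.0 = 0.9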
12
Q

s’

A

State prime (next state)

13
Q

Since model-based RL has the state-transition probabilities and we can estimate the probability of future reward in the next state, the goal is to

(2)

A
  1. Learn an optimal policy (the best choice in each state)
  2. Learn an optimal value function (correctly attributing rewards to states)
14
Q

The problem with this goal - (2)

A
  • Take chess: we cannot enumerate all possible board positions
  • So we cannot simply compute V and P directly
15
Q

The solution to this problem

A

Do it iteratively, looking a few steps ahead

16
Q

Using the formula of V, we change it slightly to (achieve the goal) - (4)

A

use the policy that maximizes reward

separate r0 (the current reward) from the rest

write the equation in recursive form

This is called Bellman’s equation

17
Q

The formula of V (used to turn into Bellman’s equation) is (2)

A

the expected future reward given state s0 (i.e., now): the current reward plus the sum of discounted future rewards over each possible future state

Formula taken from model-free RL
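Written out in standard notation (a sketch, consistent with the cards above; gamma is the discount factor and P the state-transition probabilities), the value of the current state s0 under a fixed action a is

    V(s_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big]
           = r_0 + \gamma \sum_{s'} P(s' \mid s_0, a)\, V(s')

and picking the reward-maximizing action in each state gives Bellman’s equation:

    V(s) = r(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s')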

18
Q

What does it mean to write the expected-reinforcement equation (given the state now) in recursive form (i.e., as Bellman’s equation)? - (4)

A

If the value function we are trying to optimize is a recursive version of itself we can break up the problem into smaller individual problems.

Thus, we optimize locally and then put things together; it can be shown that the global solution is then still optimal.

Also, since the optimal policy is the one that maximizes reward, having a value function makes decision-making easy: we can extract the policy by simply picking the action that maximizes V.

Similarly, if we have the policy we can create the value function (by picking the next state according to the policy)
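A minimal sketch of ‘extract the policy by picking the action that maximizes V’ (reusing the invented tuple-keyed transition table P from the earlier sketch; since the immediate reward here depends only on the state, it drops out of the argmax):

    def greedy_policy(V, P, states, actions):
        """In every state, pick the action with the highest expected value of the next state."""
        return {
            s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
            for s in states
        }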

19
Q

We build up V and P:

So, if we know the rules of the game but don’t yet know the optimal policy or value function, then

A

Iteratively build the optimal policy and value function (solution)

20
Q

We can build the optimal policy and value function - (2)

A

One way (somewhat reminiscent of our Q-learning football-kicking robot) is to start with a small problem and then work your way back (local -> global). In the robot example, only after we had learned to kick the ball when already in the field with the ball could we then, in the next learning episode, learn to walk into the field with the ball.

Similarly, start with a winning case, then optimise the step before, building up the value function recursively (value iteration).

21
Q

Two algorithms to solve the problem of building V and P

A

value iteration and policy iteration

22
Q

Value iteration is where we

A

iteratively (episode by episode) update V for the states we go through (a lot like the football robot)

23
Q

For value iteration, we can think of V as

A

one large table with entries for all possible states (initialised, e.g., with 0)

24
Q

Value iteration is different from Q-learning because

A

we know the probabilities of state transitions P, so we can evaluate (1) for all possible actions (with our initial V).

Pick the best action and update the current estimate of V (that is, V(s)), then perform the state transition and repeat the process, and so on …

Once you have a good estimate of V, you can extract the optimal policy as the policy that maximises reward.

Compare this to the football robot, which took actions at random.
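A compact sketch of value iteration on the same invented toy model (written as the standard sweep over all states rather than the episode-by-episode variant described above; the discount factor and iteration count are arbitrary example values):

    def value_iteration(states, actions, P, R, gamma=0.9, n_iters=100):
        """Repeatedly apply V(s) <- r(s) + gamma * max_a sum_s' P(s'|s,a) * V(s')."""
        V = {s: 0.0 for s in states}  # the table of values, initialised with 0
        for _ in range(n_iters):
            V = {
                s: R[s] + gamma * max(
                    sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for a in actions
                )
                for s in states
            }
        return V

Once V has stabilised, the greedy_policy sketch from earlier extracts the corresponding policy.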

25
Q

Are these two formulas the same?

A

Yes, both are the same:

Bellman’s equation, with the state-transition probabilities P included (2)

26
Q

Policy iteration (5)

A

Rather than finding the action that maximizes V, we lock in a policy at the start (our best estimate).

Then we build up V.

Then iterate through alternative actions to refine the action choice (policy)

Then lock in the updated policy, and do another value iteration …

This can be computationally faster than value iteration.
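And a matching sketch of policy iteration for the same invented toy model (policy evaluation is done here by simple repeated sweeps rather than by solving the linear system exactly):

    def policy_iteration(states, actions, P, R, gamma=0.9, n_eval=50):
        """Alternate between evaluating the locked-in policy and greedily improving it."""
        policy = {s: actions[0] for s in states}  # lock in an initial (arbitrary) policy
        while True:
            # 1. Policy evaluation: build up V for the current policy.
            V = {s: 0.0 for s in states}
            for _ in range(n_eval):
                V = {
                    s: R[s] + gamma * sum(p * V[s2]
                                          for s2, p in P[(s, policy[s])].items())
                    for s in states
                }
            # 2. Policy improvement: refine the action choice in every state.
            new_policy = {
                s: max(actions, key=lambda a: sum(p * V[s2]
                                                  for s2, p in P[(s, a)].items()))
                for s in states
            }
            if new_policy == policy:  # policy stable -> done
                return policy, V
            policy = new_policy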