Week 9: Adding a 'World Model' (Only a Brief Outlook) Flashcards

1
Q

RL is often introduced as a

A

Markov Decision Process (MDP)

2
Q

An MDP is (2)

A

a succession of states (e.g. positions on a chess board); in each state the agent chooses an action (leading to a new state), and each state has an associated immediate reward (which can also be zero).

State transitions/action choices can be probabilistic.

3
Q

MDP: S is the set of states, called the

A

state space

4
Q

MDP: A is the set of actions, called the

A

action space

5
Q

MDP Policy:

A

a mapping from states to actions (e.g., given state s_i, I choose action a_j)
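
The pieces defined so far (state space, action space, immediate rewards, probabilistic transitions, and a policy) can be sketched as plain Python dictionaries. The three-state corridor, its probabilities, and all names below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical 3-state corridor: "left" -> "middle" -> "goal".
STATES = ["left", "middle", "goal"]      # state space S
ACTIONS = ["stay", "move_right"]         # action space A

# Immediate reward attached to each state (zero is allowed).
REWARD = {"left": 0.0, "middle": 0.0, "goal": 1.0}

# Probabilistic transitions: TRANSITIONS[state][action] -> {next_state: probability}
TRANSITIONS = {
    "left":   {"stay": {"left": 1.0},   "move_right": {"middle": 0.9, "left": 0.1}},
    "middle": {"stay": {"middle": 1.0}, "move_right": {"goal": 0.9, "middle": 0.1}},
    "goal":   {"stay": {"goal": 1.0},   "move_right": {"goal": 1.0}},
}

# A deterministic policy: one chosen action for each state.
POLICY = {"left": "move_right", "middle": "move_right", "goal": "stay"}


def step(state, action, rng):
    """Sample the next state and return it with that state's immediate reward."""
    outcomes = TRANSITIONS[state][action]
    next_state = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return next_state, REWARD[next_state]


def rollout(start, policy, n_steps, seed=0):
    """Follow the policy for n_steps and collect the visited states and total reward."""
    rng = random.Random(seed)
    state, total = start, 0.0
    trajectory = [state]
    for _ in range(n_steps):
        state, reward = step(state, policy[state], rng)
        total += reward
        trajectory.append(state)
    return trajectory, total


trajectory, total_reward = rollout("left", POLICY, n_steps=10)
print(trajectory, total_reward)
```

A different policy is just a different dictionary, which is what makes "finding a good policy" a search over such mappings.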

6
Q

The goal of an MDP is to

A

find a good “policy” for the decision maker

7
Q

At the start of RL we don’t know the best policy or the

A

value function. The agent starts with a default policy and performs an action; something changes in the environment, which gives a reward and updates the state.

8
Q

Setup of model-free RL:

A

an agent with a default policy performs an action; something changes in the environment, which potentially gives a reward and updates the state

9
Q

As compared to model-free RL, model-based RL

A

assumes we know how the environment works

10
Q

Model-based RL: we assume we know how the environment works.

In other words … (2)

A

given state s and action a, there is some (known) probability that I will transition to state s’ (another state)

This is the ‘probability model’ or ‘world model’ (e.g., the rules of chess dictate the next possible states)

11
Q

Since in model-based RL we have the probabilities of state transitions, we can

A

estimate the probability of future reward in state s’
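
As a minimal sketch of what a known world model buys us, the table below stores P(s’ | s, a) directly and uses it to compute the expected reward of the next state. The states `s0`-`s2`, the actions `a0`/`a1`, the probabilities, and the rewards are all made-up illustrative values.

```python
# Hypothetical world model: MODEL[(s, a)] -> {s_next: P(s_next | s, a)}
MODEL = {
    ("s0", "a0"): {"s0": 0.8, "s1": 0.2},
    ("s0", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s2": 1.0},
    ("s2", "a0"): {"s2": 1.0},
    ("s2", "a1"): {"s2": 1.0},
}

# Hypothetical immediate reward received on arriving in each state.
REWARD = {"s0": 0.0, "s1": 0.0, "s2": 1.0}


def transition_probability(s, a, s_next):
    """Known probability of landing in s_next after taking action a in state s."""
    return MODEL[(s, a)].get(s_next, 0.0)


def expected_next_reward(s, a):
    """Average the next-state rewards, weighted by the transition probabilities."""
    return sum(p * REWARD[s_next] for s_next, p in MODEL[(s, a)].items())


print(transition_probability("s0", "a1", "s2"))
print(expected_next_reward("s0", "a1"))
```

In model-free RL neither function could be written down in advance; the agent would have to estimate these quantities from experience.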

12
Q

S’

A

State prime (next state)

13
Q

Once we can estimate the probability of future reward in the next state (because model-based RL has the probabilities of state transitions), the goal is to

(2)

A
  1. Learn an optimal policy (best choice in each state)
  2. Learn an optimal value function (correctly attributing rewards to states)
14
Q

The problem with reaching this goal - (2)

A
  • Take chess: we cannot compute all board positions and simply calculate V and P (the value function and the policy)
15
Q

The solution to this problem

A

Do it iteratively, looking a few steps ahead

16
Q

Starting from the formula for V, we change it slightly (to achieve the goal) - (4)

A

Use the policy that gives maximum reward

Separate r0 (the current reward) from the rest

Write the equation in recursive form

This is called Bellman’s equation

17
Q

The formula for V (which is rewritten into Bellman’s equation) is (2)

A

the expected future reward for each possible state: the current reward plus the sum of discounted future rewards, given state s0 (i.e., now)

Formula taken from model-free RL
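
The formulas themselves are not written out on these cards. As a sketch in standard notation, the steps from V to Bellman’s equation are commonly written as below. The symbols are assumptions: γ is the discount factor, r_t the reward at step t, π the policy (written π here to keep it apart from the transition probabilities P), and R(s, a, s’) the reward for a transition.

```latex
% Value of the current state s_0 under a policy \pi:
% the expected sum of discounted rewards.
V^{\pi}(s_0) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 \right]

% Separate the current reward r_0 from the rest.
V^{\pi}(s_0) = \mathbb{E}_{\pi}\!\left[ r_0 + \gamma \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 \right]

% The remaining sum is the value of the next state s', which gives the recursive form.
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ r_0 + \gamma V^{\pi}(s') \,\middle|\, s \right]

% Use the policy that gives maximum reward and write the expectation
% with the known transition probabilities P (Bellman's optimality equation).
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]
```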

18
Q

What does it mean to write the expected-reward equation, given the current state, in recursive form (as Bellman’s equation)? - (4)

A

If the value function we are trying to optimize is a recursive version of itself, we can break the problem up into smaller individual problems.

Thus, optimize locally, then put things together. It can be shown that the global solution is then still optimal.

Also, since the optimal policy is the one that maximizes reward, having a value function makes decision making easy, i.e., we can extract the policy: we just pick the action that maximizes V

Similarly, if we have the policy, we can create the value function (by picking the next state according to the policy)
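
Both directions (value function to policy, and policy to value function) can be sketched in a few lines. This is an illustration under stated assumptions, not the course's own code: the toy model with states `A`, `B`, `WIN`, the discount factor 0.9, and the helper names are all hypothetical.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic model: MODEL[s][a] -> list of (probability, next_state, reward)
MODEL = {
    "A":   {"stay": [(1.0, "A", 0.0)],   "go": [(1.0, "B", 0.0)]},
    "B":   {"stay": [(1.0, "B", 0.0)],   "go": [(1.0, "WIN", 1.0)]},
    "WIN": {"stay": [(1.0, "WIN", 0.0)], "go": [(1.0, "WIN", 0.0)]},
}


def action_value(V, s, a):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in MODEL[s][a])


def extract_policy(V):
    """Value function -> policy: in each state pick the action with the highest value."""
    return {s: max(MODEL[s], key=lambda a: action_value(V, s, a)) for s in MODEL}


def evaluate_policy(policy, sweeps=200):
    """Policy -> value function: repeatedly back up values along the policy's choices."""
    V = {s: 0.0 for s in MODEL}
    for _ in range(sweeps):
        V = {s: action_value(V, s, policy[s]) for s in MODEL}
    return V


V_given = {"A": 0.9, "B": 1.0, "WIN": 0.0}
policy = extract_policy(V_given)
V_back = evaluate_policy(policy)
print(policy, V_back)
```

Note that "the action that maximizes V" is implemented here as the action whose immediate reward plus discounted next-state value is largest, which is the usual reading when a transition model is available.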

19
Q

We build up V and P in formula of

So if we know the rules of the game but don’t know the optimal policy or value function, then we

A

Iteratively build the optimal policy and value function (solution)

20
Q

How can we build the optimal policy and value function? - (2)

A

One way (somewhat reminiscent of our Q-learning football-kicking robot) is to start with a small problem, then work your way back (local -> global). In the robot example, only after we had learned to kick the ball when in the field with the ball could we, in the next learning episode, learn to walk into the field with the ball.

Similarly, start with a winning case, then optimise the step before, building up the value function recursively (value iteration)

21
Q

Two algorithms to solve the problem of building V and P

A

value iteration and policy iteration

22
Q

Value iteration is where we

A

iteratively (episode by episode) update V for the states we go through (a lot like the football robot)

23
Q

In value iteration, we can think of V as

A

one large table with entries for all possible states (initialised, e.g., with 0)

24
Q

Value iteration is different from Q-learning because

A

we know the state transition probabilities P, so we can evaluate (1) for all possible actions (with our initial V).

Pick the best action and update the current estimate of V (that is, V(s)) => perform the state transition and repeat the process, and so on …

Once you have a good estimate of V, you can extract the optimal policy as the policy that maximises reward

Compare this with the football robot, which took actions at random
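
A minimal sketch of value iteration on a toy problem follows. One deliberate simplification: instead of updating V only for the states visited along an episode, this version sweeps over every state in the table on each pass, which is the common textbook variant. The slippery three-state model, the discount factor, and the stopping tolerance are hypothetical.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical model: MODEL[s][a] -> list of (probability, next_state, reward).
# "go" succeeds with probability 0.8 and otherwise leaves the state unchanged.
MODEL = {
    "A":   {"stay": [(1.0, "A", 0.0)],   "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B":   {"stay": [(1.0, "B", 0.0)],   "go": [(0.8, "WIN", 1.0), (0.2, "B", 0.0)]},
    "WIN": {"stay": [(1.0, "WIN", 0.0)], "go": [(1.0, "WIN", 0.0)]},
}


def action_value(model, V, s, a, gamma):
    """Evaluate one action using the known transition probabilities."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])


def value_iteration(model, gamma=GAMMA, tol=1e-8, max_sweeps=10_000):
    V = {s: 0.0 for s in model}  # one table entry per state, initialised with 0
    for _ in range(max_sweeps):
        delta = 0.0
        for s in model:
            # Evaluate all possible actions and keep the value of the best one.
            best = max(action_value(model, V, s, a, gamma) for a in model[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            break
    return V


def greedy_policy(model, V, gamma=GAMMA):
    """Extract the policy that picks the best-valued action in each state."""
    return {s: max(model[s], key=lambda a: action_value(model, V, s, a, gamma))
            for s in model}


V = value_iteration(MODEL)
policy = greedy_policy(MODEL, V)
print(V, policy)
```

No action is ever sampled at random here: because P is known, every action can be scored before one is chosen.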

25
Q

Are these two formulas the same?

A

Both are the same

Both are Bellman’s equation with the state transition probabilities P in it (2)

26
Q

Policy iteration (5)

A

Rather than finding the action that maximizes V, we lock in a policy at start (our best estimate)

Then we build up V.

Then iterate through alternative actions to refine the action choice (policy)

Then lock in the updated policy, and do another value iteration …

This can be computationally faster than value iteration
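
The alternation described above (lock in a policy, build V for it, refine the action choices, repeat) can be sketched as follows. The toy model, the discount factor, the fixed number of evaluation sweeps, and the choice of the first listed action as the starting policy are all assumptions made for illustration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical model: MODEL[s][a] -> list of (probability, next_state, reward).
MODEL = {
    "A":   {"stay": [(1.0, "A", 0.0)],   "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B":   {"stay": [(1.0, "B", 0.0)],   "go": [(0.8, "WIN", 1.0), (0.2, "B", 0.0)]},
    "WIN": {"stay": [(1.0, "WIN", 0.0)], "go": [(1.0, "WIN", 0.0)]},
}


def action_value(model, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])


def evaluate(model, policy, gamma, sweeps=500):
    """Build up V for a locked-in policy (no maximisation over actions)."""
    V = {s: 0.0 for s in model}
    for _ in range(sweeps):
        for s in model:
            V[s] = action_value(model, V, s, policy[s], gamma)
    return V


def improve(model, V, gamma):
    """Iterate through alternative actions and keep the best one per state."""
    return {s: max(model[s], key=lambda a: action_value(model, V, s, a, gamma))
            for s in model}


def policy_iteration(model, gamma=GAMMA, max_rounds=100):
    # Starting guess: the first action listed for each state (here "stay").
    policy = {s: next(iter(model[s])) for s in model}
    V = evaluate(model, policy, gamma)
    for _ in range(max_rounds):
        new_policy = improve(model, V, gamma)
        if new_policy == policy:  # no action changed: the policy is stable
            break
        policy = new_policy       # lock in the updated policy
        V = evaluate(model, policy, gamma)
    return policy, V


policy, V = policy_iteration(MODEL)
print(policy, V)
```

On this toy model the loop first learns to "go" from `B` (the state next to the win) and only in the following round from `A`, which mirrors the local-to-global build-up described on the earlier cards.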