Markov Decision Process Flashcards

1
Q

What is reinforcement learning?

A

This is teaching an agent by rewarding it when it takes a desirable action

2
Q

What are the 2 different phases of reinforcement learning?

A

Exploration phase:

> Trying different actions

> Exploring the outcomes of different actions

Credit assignment:

> Assigning a reward to an outcome

> The reward needs to come immediately after the desired outcome happens

3
Q

In the following example, what are the following?

> Agent

> Goal

> Movement

> States

> Actions

> Transitions [Picture 15]

A

> Agent: Robot

> Goal: Treasure

> Movement: Cardinal directions

> States: Each cell

> Actions: Up/Down/Left/Right

> Transitions: How the environment changes as a result of its actions

4
Q

What is the concept of state?

A

The decisions at a certain point affect the decisions at later points. Time matters

5
Q

What is supervised learning?

A

When the agent has samples of correct answers (for instance the optimal action for certain states)

6
Q

What is unsupervised learning?

A

When the agent receives no feedback on its actions

7
Q

What type of learning is reinforcement learning?

A

Neither supervised nor unsupervised

8
Q

What is the formal notation of the Markov Decision Process?

A

〈S, A, T, r〉

S = Set of states

A = Set of actions

T = Transition function

r = Reward function
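
As an illustration, a tiny made-up MDP 〈S, A, T, r〉 could be written down directly as Python dictionaries; the states, actions and numbers below are invented for the sketch.

```python
# A minimal, hypothetical MDP <S, A, T, r> as plain Python data structures.
S = ["A", "B"]                 # set of states
A = ["a", "b"]                 # set of actions

# T[(s, a)] maps each possible next state s' to p(s' | s, a)
T = {
    ("A", "a"): {"A": 0.1, "B": 0.9},
    ("A", "b"): {"A": 1.0},
    ("B", "a"): {"A": 0.8, "B": 0.2},
    ("B", "b"): {"B": 1.0},
}

# r[(s, a, s')] is the reward for the transition s --a--> s'
r = {
    ("A", "a", "A"): 0.0, ("A", "a", "B"): 2.0,
    ("A", "b", "A"): 0.0,
    ("B", "a", "A"): 10.0, ("B", "a", "B"): 0.0,
    ("B", "b", "B"): -1.0,
}
```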

9
Q

What is the markov property?

A

p(s_{t+1}, r_t | s_t, a_t, s_{t-1}, a_{t-1}, … , s_0, a_0) = p(s_{t+1}, r_t | s_t, a_t)

In general, the next state and reward could depend on all the previous states and actions in the episode.

However, for the process to be Markovian, this must equal the probability conditioned only on the most recent state and action.

10
Q

What is the transition function?

A

T(s, a, s') = p(s_{t+1} = s' | s_t = s, a_t = a)

This tells us which state will follow from taking an action in a given state. It is probabilistic: it gives the probability of the next state conditioned on the current state and the action taken, i.e. the probability of ending up in s' given that I was in state s and took action a.
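
Using the hypothetical dictionary representation sketched earlier, sampling a successor state according to T(s, a, ·) might look like this:

```python
import random

def sample_next_state(T, s, a):
    """Sample s' with probability p(s' | s, a) from the transition table T."""
    next_states = list(T[(s, a)].keys())
    probabilities = list(T[(s, a)].values())
    return random.choices(next_states, weights=probabilities, k=1)[0]

# e.g. sample_next_state(T, "A", "a") returns "B" roughly 90% of the time
# for the made-up table above
```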

11
Q

What is the reward function?

A

r(s, a, s’)

This is the reward received for transitioning from state s to s' by taking action a

12
Q

What is having the Markov property about?

A

Being able to observe all the necessary details OR remembering the necessary details observed in the past

13
Q

What is the equation for the immediate reward R at time t?

A

R_t = r(s_t, a_t, s_{t+1})

14
Q

For episodic tasks, what is the equation for the long term reward? What does it show?

A

G_t ≡ R_{t+1} + R_{t+2} + ⋯ + R_T

This is the total reward from time t to the end of the episode at time T

15
Q

For infinitely long tasks, what is the equation for the long term reward? What does it show?

A

G_t ≡ R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯ ≡ ∑_{k=0}^{∞} γ^k R_{t+k+1}

This is the total reward from time t to infinity. We use an exponential discount factor, γ, as otherwise G_t would be infinite. Rewards that are further into the future are discounted more
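
A minimal sketch of computing this discounted return for a finite list of sampled rewards (the function name is just for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k>=0} gamma^k * R_{t+k+1}, where rewards = [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# e.g. discounted_return([1, 1, 1], gamma=0.5) == 1 + 0.5 + 0.25 == 1.75
```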

16
Q

What is the range of γ?

A

0 ≤ γ < 1 in general; γ can only ever be 1 in episodic MDPs. This means that the equation G_t ≡ ∑_{k=0}^{∞} γ^k R_{t+k+1} can be used for both infinite and episodic tasks

17
Q

What is an absorbing state?

A

This is a state that can never be left and in which every action returns a reward of zero.

18
Q

What is a behavior?

A

This is a function which, for every state, returns the action to execute in that state. The function is π(s_t) = a_t. This function is called a policy

19
Q

What is the function for a policy?

A

π(s_t) = a_t

20
Q

What is the function for a policy that is probabilistic?

A

π(a | s) = p(a_t = a | s_t = s)

21
Q

How do we improve a policy?

A

We improve it by measuring how good the current policy is. We define the value of a state under a given policy as the expected return from that state while following the policy

22
Q

What is the equation for the value of a state?

A

v_π(s) ≡ E_π[G_t | S_t = s] = E_π[R_{t+1} | S_t = s] + γE_π[G_{t+1} | S_t = s] = E_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]

23
Q

What is the Bellman equation?

A

v_π(s) ≡ E_π[G_t | S_t = s] = E_π[R_{t+1} | S_t = s] + γE_π[G_{t+1} | S_t = s] = E_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]

24
Q

Evaluate this policy [Picture 16]

A

> For the bottom 3 rows, this policy is useless

> For the top row, it is effective

> Overall the policy is not effective

25
Q

Calculate v_π(A) and v_π(B) for the following state diagram, given that we always choose action ‘a’ and γ = 0.5

[Picture 17]

A

v_π(B):

> v_π(B) = 10

v_π(A):

> v_π(A) = 2 + γv_π(B)

> v_π(A) = 2 + 0.5 × 10 = 7

26
Q

Calculate v_π(A) and v_π(B) for the following state diagram, given that the probability of choosing ‘a’ is 0.8, the probability of choosing ‘b’ is 0.2, and γ = 0.5

[Picture 17]

A

v_π(B):

> v_π(B) = 0.8 × 10 + 0.2 × (-20) = 4

v_π(A):

> v_π(A) = 2 + γv_π(B)

> v_π(A) = 2 + 0.5 × 4 = 4

27
Q

Calculate v_π(A) and v_π(B) for the following state diagram, given that the probability of choosing ‘a’ is 0.8 and of choosing ‘b’ is 0.2, and within ‘b’ the probability of transition bD is 0.6 and of bE is 0.4, with γ = 0.5

[Picture 18]

A

v_π(B):

> v_π(B) = 0.8 × 10 + 0.2 × (0.6 × (-20) + 0.4 × 50) = 9.6

v_π(A):

> v_π(A) = 2 + γv_π(B) = 6.8
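
The same arithmetic, written out as a quick check (numbers taken from the card):

```python
gamma = 0.5
v_B = 0.8 * 10 + 0.2 * (0.6 * (-20) + 0.4 * 50)   # = 8 + 0.2 * 8 = 9.6
v_A = 2 + gamma * v_B                             # = 2 + 0.5 * 9.6 = 6.8
```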

28
Q

What is the generalised value equation?

A

v_π(s) = ∑_a π(a | s) [∑_{s'} p(s' | s, a) (r(s, a, s') + γv_π(s'))]

π(a | s) = Probability of taking each action

∑_{s'} p(s' | s, a) (r(s, a, s') + γv_π(s')) = Consequence of taking that action

p(s' | s, a) = Probability of each possible next state

r(s, a, s') + γv_π(s') = Value of each possible next state
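
A sketch of evaluating a policy by repeatedly applying this equation, assuming the dictionary-based S, A, T and r from the earlier sketch and a policy table pi[(a, s)] = π(a | s); all names are illustrative.

```python
def evaluate_policy(S, A, T, r, pi, gamma, sweeps=100):
    """Iteratively apply
    v(s) = sum_a pi(a|s) * sum_s' p(s'|s,a) * (r(s,a,s') + gamma * v(s'))
    starting from v = 0; (s, a) pairs missing from T are treated as unavailable."""
    v = {s: 0.0 for s in S}
    for _ in range(sweeps):
        new_v = {}
        for s in S:
            total = 0.0
            for a in A:
                if (s, a) not in T:
                    continue
                # expected one-step reward plus discounted value of the successor
                consequence = sum(
                    p * (r[(s, a, s_next)] + gamma * v[s_next])
                    for s_next, p in T[(s, a)].items()
                )
                total += pi[(a, s)] * consequence
            new_v[s] = total
        v = new_v
    return v
```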

29
Q

For the following example what are the actions and why are the costs negative?

[Picture 19]

A

Actions:

> Walk

> Bus

> Train

> Drive

Costs are negative because we want to minimise the time, i.e. maximise the negative of the time

30
Q

When we want to maximise the reward how do we express this?

A

π*(s) = argmax_a q*(s, a)
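
In code, taking this argmax could look like the minimal sketch below, assuming q_star[(s, a)] already holds the optimal action values (names are hypothetical):

```python
def greedy_action(q_star, s, actions):
    """pi*(s) = argmax_a q*(s, a) over the available actions."""
    return max(actions, key=lambda a: q_star[(s, a)])
```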

31
Q

How many optimal policies can there be and are they the same?

A

> The optimal value function is unique

> There can be one or more optimal policies that achieve it

> In the following example, if you take either the black or the red route then you are still following an optimal policy

[Picture 20]

32
Q

What is the equation for the value of taking an action?

A

q_π(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ ∑_{a'} π(a' | s') q_π(s', a'))

33
Q

What is the equation for the value of taking an optimal action (when following an optimal policy)?

A

q*(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ max_{a'} q*(s', a'))

34
Q

What is the equation for the update rule?

A

q_{k+1}(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ max_{a'} q_k(s', a'))

35
Q

What is the update rule algorithm?

A

> This is the algorithm that defines dynamic programming

> It applies this update rule over all state-action pairs until convergence, as in the sketch below
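
A sketch of that algorithm (Q-value iteration) over the dictionary-based MDP used in the earlier sketches; the tolerance and names are illustrative.

```python
def q_value_iteration(S, A, T, r, gamma, tol=1e-6):
    """Repeatedly apply
    q_{k+1}(s, a) = sum_s' p(s'|s,a) * (r(s,a,s') + gamma * max_a' q_k(s', a'))
    over all state-action pairs until the values stop changing
    (assumes every state has at least one available action)."""
    q = {(s, a): 0.0 for s in S for a in A if (s, a) in T}
    while True:
        new_q = {}
        delta = 0.0
        for (s, a) in q:
            new_q[(s, a)] = sum(
                p * (r[(s, a, s_next)]
                     + gamma * max(q[(s_next, a2)] for a2 in A if (s_next, a2) in T))
                for s_next, p in T[(s, a)].items()
            )
            delta = max(delta, abs(new_q[(s, a)] - q[(s, a)]))
        q = new_q
        if delta < tol:
            return q
```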

36
Q

What is policy iteration?

A

> We want the agent to learn without knowing the transition probabilities (p(s' | s, a)) or the reward function (r(s, a, s'))

> Transitions and rewards are unknown: MDP M = 〈S, A, ?, ?〉

> The agent starts with an arbitrary policy (or an equivalent action-value function)

37
Q

How do we act optimally?

A

> We have to take an action in the environment, experience the transition and the reward, and use this experience to learn

38
Q

What are the 2 steps for policy iteration?

A

Step 1: Policy evaluation

Step 2: Policy improvement

39
Q

Evaluate the 4 actions for this state (γ = 0.5)

[Picture 21]

A

q(<4,2>, right) = 0 + γ(-100) = -50

q(<4,2>, up) = 0 + γ(50) = 25

q(<4,2>, left) = 5 + γ^3 (-100) = -17.5

q(<4,2>, down) = 0 + γ^3 (-100) = -12.5

40
Q

For generalised policy iteration, when do we stop iterating?

A

When the action-value function is optimal. This happens when the left- and right-hand sides of the update equation

q_{k+1}(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ max_{a'} q_k(s', a'))

are the same

41
Q

What is a model?

A

This is a map/model of the environment

42
Q

What is the equation for learning without a model?

A

q_{k+1}(s, a) = (1 - α) q_k(s, a) + α × target = q_k(s, a) + α (target - q_k(s, a))

43
Q

Describe the equation for learning without a model?

A

> We take the current estimate q_k and move one step of size α in the direction of the target

> We start with the estimate and take a sample of the value we want to estimate (a sample of the target), which is used for the update

> The learning rate α controls how much weight we give to the sample (see the sketch below)
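
A one-line sketch of that update for a dictionary of action-value estimates (names are illustrative):

```python
def update_towards_target(q, s, a, target, alpha):
    """q <- q + alpha * (target - q), equivalently (1 - alpha) * q + alpha * target."""
    q[(s, a)] = q[(s, a)] + alpha * (target - q[(s, a)])
```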

44
Q

What is the bellman equation for learning without a model?

A

q_π(s, a) = ∑_{s', r} p(s', r | s, a) (r + γ ∑_{a'} π(a' | s') q_π(s', a'))

45
Q

How do we calculate the target for learning without a model using Monte Carlo?

A

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯
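
A sketch of using this Monte Carlo return as the target in the update rule above, reusing the discounted_return helper from the earlier sketch (the episode format is an assumption):

```python
def monte_carlo_update(q, episode, gamma, alpha):
    """episode = [(s_t, a_t, R_{t+1}), ...] for a finished episode; each visited
    (s, a) is moved one step towards its observed return G_t."""
    rewards = [rew for (_, _, rew) in episode]
    for t, (s, a, _) in enumerate(episode):
        target = discounted_return(rewards[t:], gamma)   # G_t for this step
        q[(s, a)] = q[(s, a)] + alpha * (target - q[(s, a)])
```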

46
Q

What is exploration?

A

Once in a while, the agent should take a random action that differs from the one prescribed by the policy under evaluation

47
Q

We want the agent to cover every state-action pair. How can this be done?

A

> By doing random restarts

> Exploration

48
Q

What is the exploration-exploitation trade-off?

A

This is the trade-off between the agent diverging from the policy to explore other possible options, versus sticking to the policy and refining (exploiting) it

49
Q

What is ε-greedy for exploration?

A

Probability of taking a random action: ε

Probability of following the policy: (1 - ε)
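
A minimal sketch of ε-greedy action selection over a dictionary of action-value estimates (names are illustrative):

```python
import random

def epsilon_greedy(q, s, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit the policy."""
    if random.random() < epsilon:
        return random.choice(actions)                # explore
    return max(actions, key=lambda a: q[(s, a)])     # exploit: greedy w.r.t. q
```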

50
Q

How can we ensure that we explore all states?

A

By having a non-zero ε

51
Q

What are the 2 problems with monte carlo updates?

A

Problem 1: Valid only for episodic tasks

Problem 2: It cannot learn during an episode

52
Q

What can be said about bias and variance for using monte carlo?

A

No bias, high variance

53
Q

What does TD method stand for?

A

Temporal Difference Method