Markov Decision Process Flashcards

1
Q

What is reinforcement learning?

A

This is teaching an agent by rewarding it when it takes a desirable action

2
Q

What are the 2 different phases of reinforcement learning?

A

Exploration phase:

> Trying different actions

> Exploring the outcomes of different actions

Credit assignment:

> Assigning a reward to an outcome

> The reward needs to come immediately after the desired outcome happens

3
Q

In the following example, what are the following?

> Agent

> Goal

> Movement

> States

> Actions

> Transitions [Picture 15]

A

> Agent: Robot

> Goal: Treasure

> Movement: Cardinal directions

> States: Each cell

> Actions: Up/Down/Left/Right

> Transitions: How the environment changes as a result of its actions

4
Q

What is the concept of state?

A

The decisions at a certain point affect the decisions at later points. Time matters

5
Q

What is supervised learning?

A

When the agent has samples of correct answers (for instance the optimal action for certain states)

6
Q

What is unsupervised learning?

A

When the agent receives no feedback on its actions

7
Q

What type of learning is reinforcement learning?

A

Neither supervised nor unsupervised

8
Q

What is the formal notation of the Markov Decision Process?

A

〈S, A, T, r〉

S = Set of states

A = Set of actions

T = Transition function

r = Reward function
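
As an illustration, a tiny made-up MDP 〈S, A, T, r〉 could be written down directly as Python dictionaries; the states, actions and numbers below are invented for the sketch.

```python
# A minimal, hypothetical MDP <S, A, T, r> as plain Python data structures.
S = ["A", "B"]                 # set of states
A = ["a", "b"]                 # set of actions

# T[(s, a)] maps each possible next state s' to p(s' | s, a)
T = {
    ("A", "a"): {"A": 0.1, "B": 0.9},
    ("A", "b"): {"A": 1.0},
    ("B", "a"): {"A": 0.8, "B": 0.2},
    ("B", "b"): {"B": 1.0},
}

# r[(s, a, s')] is the reward for the transition s --a--> s'
r = {
    ("A", "a", "A"): 0.0, ("A", "a", "B"): 2.0,
    ("A", "b", "A"): 0.0,
    ("B", "a", "A"): 10.0, ("B", "a", "B"): 0.0,
    ("B", "b", "B"): -1.0,
}
```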

9
Q

What is the markov property?

A

p(s_{t+1}, r_t | s_t, a_t, s_{t-1}, a_{t-1}, … , s_0, a_0) = p(s_{t+1}, r_t | s_t, a_t)

In general, the next state and reward could depend on all the previous states and actions in the episode.

However, for the process to be Markovian, this must equal the probability conditioned only on the most recent state and action.

10
Q

What is the transition function?

A

T(s, a, s') = p(s_{t+1} = s' | s_t = s, a_t = a)

This tells us which state will follow from taking an action in a given state. It is probabilistic: it gives the probability of the next state conditioned on the current state and the action taken, i.e. the probability of ending up in s' given that I was in state s and took action a.
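
Using the hypothetical dictionary representation sketched earlier, sampling a successor state according to T(s, a, ·) might look like this:

```python
import random

def sample_next_state(T, s, a):
    """Sample s' with probability p(s' | s, a) from the transition table T."""
    next_states = list(T[(s, a)].keys())
    probabilities = list(T[(s, a)].values())
    return random.choices(next_states, weights=probabilities, k=1)[0]

# e.g. sample_next_state(T, "A", "a") returns "B" roughly 90% of the time
# for the made-up table above
```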

11
Q

What is the reward function?

A

r(s, a, s’)

This is the reward received for transitioning from state s to s' by taking action a

12
Q

What is having the Markov property about?

A

Being able to observe all the necessary details OR remembering the necessary details observed in the past

13
Q

What is the equation for the immediate reward R at time t?

A

R_t = r(s_t, a_t, s_{t+1})

14
Q

For episodic tasks, what is the equation for the long term reward? What does it show?

A

G_t ≡ R_{t+1} + R_{t+2} + ⋯ + R_T

This is the total reward from time t to the end of the episode at time T

15
Q

For infinitely long tasks, what is the equation for the long term reward? What does it show?

A

G_t ≡ R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯ ≡ ∑_{k=0}^{∞} γ^k R_{t+k+1}

This is the total reward from time t to infinity. We use an exponential discount factor, γ, as otherwise G_t would be infinite. Rewards that are further into the future are discounted more
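
A minimal sketch of computing this discounted return for a finite list of sampled rewards (the function name is just for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k>=0} gamma^k * R_{t+k+1}, where rewards = [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# e.g. discounted_return([1, 1, 1], gamma=0.5) == 1 + 0.5 + 0.25 == 1.75
```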

16
Q

What is the range of γ?

A

0 ≤ γ < 1 in general; γ can only ever be 1 in episodic MDPs. This means that the equation G_t ≡ ∑_{k=0}^{∞} γ^k R_{t+k+1} can be used for both infinite and episodic tasks

17
Q

What is an absorbing state?

A

This is a state that can never be left and in which every action returns a reward of zero.

18
Q

What is a behavior?

A

This is a function which, for every state, returns the action to execute in that state. The function is π(s_t) = a_t. This function is called a policy

19
Q

What is the function for a policy?

A

π(s_t) = a_t

20
Q

What is the function for a policy that is probabilistic?

A

π(a | s) = p(a_t = a | s_t = s)

21
Q

How do we improve a policy?

A

We improve it by measuring how good the current policy is. We define the value of a state under a given policy as the expected return from that state while following the policy

22
Q

What is the equation for the value of a state?

A

v_π(s) ≡ E_π[G_t | S_t = s] = E_π[R_{t+1} | S_t = s] + γE_π[G_{t+1} | S_t = s] = E_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]

23
Q

What is the Bellman equation?

A

v_π(s) ≡ E_π[G_t | S_t = s] = E_π[R_{t+1} | S_t = s] + γE_π[G_{t+1} | S_t = s] = E_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s]

24
Q

Evaluate this policy [Picture 16]

A

> For the bottom 3 rows, this policy is useless

> For the top row, it is effective

> Overall the policy is not effective

25
Q

Calculate v_π(A) and v_π(B) for the following state diagram, given that we always choose action ‘a’ and γ = 0.5

[Picture 17]

A

v_π(B):

> v_π(B) = 10

v_π(A):

> v_π(A) = 2 + γv_π(B)

> v_π(A) = 2 + 0.5 × 10 = 7

26
Q

Calculate v_π(A) and v_π(B) for the following state diagram, given that the probability of choosing ‘a’ is 0.8, the probability of choosing ‘b’ is 0.2, and γ = 0.5

[Picture 17]

A

v_π(B):

> v_π(B) = 0.8 × 10 + 0.2 × (-20) = 4

v_π(A):

> v_π(A) = 2 + γv_π(B)

> v_π(A) = 2 + 0.5 × 4 = 4

27
Q

Calculate v_π(A) and v_π(B) for the following state diagram, given that the probability of choosing ‘a’ is 0.8 and of choosing ‘b’ is 0.2, and within ‘b’ the probability of transition bD is 0.6 and of bE is 0.4, with γ = 0.5

[Picture 18]

A

v_π(B):

> v_π(B) = 0.8 × 10 + 0.2 × (0.6 × (-20) + 0.4 × 50) = 9.6

v_π(A):

> v_π(A) = 2 + γv_π(B) = 6.8
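
The same arithmetic, written out as a quick check (numbers taken from the card):

```python
gamma = 0.5
v_B = 0.8 * 10 + 0.2 * (0.6 * (-20) + 0.4 * 50)   # = 8 + 0.2 * 8 = 9.6
v_A = 2 + gamma * v_B                             # = 2 + 0.5 * 9.6 = 6.8
```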

28
Q

What is the generalised value equation?

A

v_π(s) = ∑_a π(a | s) [∑_{s'} p(s' | s, a) (r(s, a, s') + γv_π(s'))]

π(a | s) = Probability of taking each action

∑_{s'} p(s' | s, a) (r(s, a, s') + γv_π(s')) = Consequence of taking that action

p(s' | s, a) = Probability of each possible next state

r(s, a, s') + γv_π(s') = Value of each possible next state
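
A sketch of evaluating a policy by repeatedly applying this equation, assuming the dictionary-based S, A, T and r from the earlier sketch and a policy table pi[(a, s)] = π(a | s); all names are illustrative.

```python
def evaluate_policy(S, A, T, r, pi, gamma, sweeps=100):
    """Iteratively apply
    v(s) = sum_a pi(a|s) * sum_s' p(s'|s,a) * (r(s,a,s') + gamma * v(s'))
    starting from v = 0; (s, a) pairs missing from T are treated as unavailable."""
    v = {s: 0.0 for s in S}
    for _ in range(sweeps):
        new_v = {}
        for s in S:
            total = 0.0
            for a in A:
                if (s, a) not in T:
                    continue
                # expected one-step reward plus discounted value of the successor
                consequence = sum(
                    p * (r[(s, a, s_next)] + gamma * v[s_next])
                    for s_next, p in T[(s, a)].items()
                )
                total += pi[(a, s)] * consequence
            new_v[s] = total
        v = new_v
    return v
```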

29
Q

For the following example what are the actions and why are the costs negative?

[Picture 19]

A

Actions:

> Walk

> Bus

> Train

> Drive

Costs are negative because we want to minimise the time, i.e. maximise the negative of the time

30
Q

When we want to maximise the reward how do we express this?

A

π*(s) = argmax_a q*(s, a)
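
In code, taking this argmax could look like the minimal sketch below, assuming q_star[(s, a)] already holds the optimal action values (names are hypothetical):

```python
def greedy_action(q_star, s, actions):
    """pi*(s) = argmax_a q*(s, a) over the available actions."""
    return max(actions, key=lambda a: q_star[(s, a)])
```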

31
Q

How many optimal policies can there be and are they the same?

A

> The optimal value function is unique

> There can be one or more optimal policies that achieve it

> In the following example, if you take either the black or the red route then you are still following an optimal policy

[Picture 20]

32
Q

What is the equation for the value of taking an action?

A

q_π(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ ∑_{a'} π(a' | s') q_π(s', a'))

33
Q

What is the equation for the value of taking an optimal action (when following an optimal policy)?

A

q*(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ max_{a'} q*(s', a'))

34
Q

What is the equation for the update rule?

A

q_{k+1}(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ max_{a'} q_k(s', a'))

35
Q

What is the update rule algorithm?

A

> This is the algorithm that defines dynamic programming

> It applies this update rule over all state-action pairs until convergence, as in the sketch below
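
A sketch of that algorithm (Q-value iteration) over the dictionary-based MDP used in the earlier sketches; the tolerance and names are illustrative.

```python
def q_value_iteration(S, A, T, r, gamma, tol=1e-6):
    """Repeatedly apply
    q_{k+1}(s, a) = sum_s' p(s'|s,a) * (r(s,a,s') + gamma * max_a' q_k(s', a'))
    over all state-action pairs until the values stop changing
    (assumes every state has at least one available action)."""
    q = {(s, a): 0.0 for s in S for a in A if (s, a) in T}
    while True:
        new_q = {}
        delta = 0.0
        for (s, a) in q:
            new_q[(s, a)] = sum(
                p * (r[(s, a, s_next)]
                     + gamma * max(q[(s_next, a2)] for a2 in A if (s_next, a2) in T))
                for s_next, p in T[(s, a)].items()
            )
            delta = max(delta, abs(new_q[(s, a)] - q[(s, a)]))
        q = new_q
        if delta < tol:
            return q
```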

36
Q

What is policy iteration?

A

> We want the agent to learn without knowing the transition probabilities (p(s' | s, a)) or the reward function (r(s, a, s'))

> Transitions and rewards are unknown: MDP M = 〈S, A, ?, ?〉

> The agent starts with an arbitrary policy (or an equivalent action-value function)

37
Q

How do we act optimally?

A

> We have to take an action in the environment, experience the transition and the reward, and use this experience to learn

38
Q

What are the 2 steps for policy iteration?

A

Step 1: Policy evaluation

Step 2: Policy improvement

39
Q

Evaluate the 4 actions for this state (γ = 0.5)

[Picture 21]

A

q(<4,2>, right) = 0 + γ(-100) = -50

q(<4,2>, up) = 0 + γ(50) = 25

q(<4,2>, left) = 5 + γ^3 (-100) = -17.5

q(<4,2>, down) = 0 + γ^3 (-100) = -12.5

40
Q

For generalised policy iteration, when do we stop iterating?

A

When the action-value function is optimal. This happens when the left- and right-hand sides of the update equation

q_{k+1}(s, a) = ∑_{s'} p(s' | s, a) (r(s, a, s') + γ max_{a'} q_k(s', a'))

are the same

41
Q

What is a model?

A

This is a map/model of the environment

42
Q

What is the equation for learning without a model?

A

q_{k+1}(s, a) = (1 - α) q_k(s, a) + α × target = q_k(s, a) + α (target - q_k(s, a))

43
Q

Describe the equation for learning without a model?

A

> We take the current estimate q_k and move one step of size α in the direction of the target

> We start with the estimate and take a sample of the value we want to estimate (a sample of the target), which is used for the update

> The learning rate α controls how much weight we give to the sample (see the sketch below)
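
A one-line sketch of that update for a dictionary of action-value estimates (names are illustrative):

```python
def update_towards_target(q, s, a, target, alpha):
    """q <- q + alpha * (target - q), equivalently (1 - alpha) * q + alpha * target."""
    q[(s, a)] = q[(s, a)] + alpha * (target - q[(s, a)])
```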

44
Q

What is the bellman equation for learning without a model?

A

q_π(s, a) = ∑_{s', r} p(s', r | s, a) (r + γ ∑_{a'} π(a' | s') q_π(s', a'))

45
Q

How do we calculate the target for learning without a model using Monte Carlo?

A

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯
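
A sketch of using this Monte Carlo return as the target in the update rule above, reusing the discounted_return helper from the earlier sketch (the episode format is an assumption):

```python
def monte_carlo_update(q, episode, gamma, alpha):
    """episode = [(s_t, a_t, R_{t+1}), ...] for a finished episode; each visited
    (s, a) is moved one step towards its observed return G_t."""
    rewards = [rew for (_, _, rew) in episode]
    for t, (s, a, _) in enumerate(episode):
        target = discounted_return(rewards[t:], gamma)   # G_t for this step
        q[(s, a)] = q[(s, a)] + alpha * (target - q[(s, a)])
```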

46
Q

What is exploration?

A

Once in a while, the agent should take a random action that differs from the one prescribed by the policy under evaluation

47
Q

We want the agent to cover every state-action pair. How can this be done?

A

> By doing random restarts

> Exploration

48
Q

What is the exploration-exploitation trade-off?

A

This is the trade-off between the agent diverging from the policy to explore other possible options, versus sticking to the policy and refining (exploiting) it

49
Q

What is ε-greedy for exploration?

A

Probability of taking a random action: ε

Probability of following the policy: (1 - ε)
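
A minimal sketch of ε-greedy action selection over a dictionary of action-value estimates (names are illustrative):

```python
import random

def epsilon_greedy(q, s, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit the policy."""
    if random.random() < epsilon:
        return random.choice(actions)                # explore
    return max(actions, key=lambda a: q[(s, a)])     # exploit: greedy w.r.t. q
```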

50
Q

How can we ensure that we explore all states?

A

By having a non-zero ε

51
Q

What are the 2 problems with monte carlo updates?

A

Problem 1: Valid only for episodic tasks

Problem 2: It cannot learn during an episode

52
Q

What can be said about bias and variance for using monte carlo?

A

No bias, high variance

53
Q

What does TD method stand for?

A

Temporal Difference Method