7 - Reinforcement Learning 2 Flashcards
Finite Markov Decision Processes
Mathematically idealised form of the reinforcement learning problem
e.g. a transition graph with states drawn as large circles and actions as small circles, where the arrows are labelled with transition probabilities and rewards
Markov Property
Hint: States depend on…
The next state s’ depends on the current state s and the decision maker’s action a,
but, given s and a, s’ is conditionally independent of all previous states and actions.
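In symbols (a sketch in standard MDP notation, restating the card rather than adding anything new):
\Pr\{S_{t+1}=s' \mid S_t=s, A_t=a, S_{t-1}, A_{t-1}, \ldots, S_0\} = \Pr\{S_{t+1}=s' \mid S_t=s, A_t=a\}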
Markov Chains
- Multiple states.
- An agent that transitions between states.
- Time measured in discrete time steps.
- A set of states, e.g. {Happy, Hungry, Sad}.
- A transition function (gives the probability of switching from one state to another).
Hidden Markov Model
Like a Markov chain, but the states are hidden; we only see observations.
For the mood example, an observation function gives the probability of each observation given each state, e.g. O(Hungry, Eating) = 0.5,
O(Hungry, Crying) = 0.5.
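A minimal Python sketch of sampling an observation from the hidden state; only the two O(Hungry, ·) values come from the card, the other rows are made up for illustration:

import random

# Observation probabilities for the mood example: P(observation | hidden state).
# Only the Hungry row comes from the card; the other rows are assumed.
O = {
    "Hungry": {"Eating": 0.5, "Crying": 0.5},
    "Happy":  {"Playing": 0.8, "Eating": 0.2},
    "Sad":    {"Crying": 0.9, "Eating": 0.1},
}

def sample_observation(hidden_state):
    observations = list(O[hidden_state])
    weights = [O[hidden_state][o] for o in observations]
    return random.choices(observations, weights=weights, k=1)[0]

print(sample_observation("Hungry"))   # "Eating" or "Crying", each with probability 0.5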
Markov chain Transition function
Gives the probability of switching from one state to another in a chain,
e.g. T(Happy -> Hungry) = 0.4
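A minimal Python sketch of a Markov chain driven by such a transition function; only T(Happy -> Hungry) = 0.4 comes from the card, the other probabilities are assumed so that each row sums to 1:

import random

# Transition function for the mood Markov chain: T[s][s'] = P(next state s' | state s).
# Only T(Happy -> Hungry) = 0.4 comes from the card; the rest is illustrative.
T = {
    "Happy":  {"Happy": 0.5, "Hungry": 0.4, "Sad": 0.1},
    "Hungry": {"Happy": 0.3, "Hungry": 0.3, "Sad": 0.4},
    "Sad":    {"Happy": 0.2, "Hungry": 0.3, "Sad": 0.5},
}

def step(state):
    next_states = list(T[state])
    weights = [T[state][s] for s in next_states]
    return random.choices(next_states, weights=weights, k=1)[0]

state = "Happy"
for t in range(5):            # simulate a few time steps of the chain
    state = step(state)
    print(t, state)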
Markov Decision Process, compared to markov chain
Like a Markov chain, with the addition of actions.
Example: T(Happy, Play -> Hungry) = 0.3
Each state can have one or more actions.
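A small sketch of how such an action-conditioned transition function might be represented; apart from T(Happy, Play -> Hungry) = 0.3, the entries and action names are illustrative:

# MDP transition function: T[(s, a)][s'] = P(next state s' | state s, action a).
# Only T(Happy, Play -> Hungry) = 0.3 comes from the card; the rest is assumed.
T = {
    ("Happy", "Play"):  {"Happy": 0.7, "Hungry": 0.3},
    ("Happy", "Feed"):  {"Happy": 0.9, "Hungry": 0.1},
    ("Hungry", "Play"): {"Hungry": 0.6, "Sad": 0.4},
    ("Hungry", "Feed"): {"Happy": 0.8, "Hungry": 0.2},
}

def transition_prob(s, a, s_next):
    return T[(s, a)].get(s_next, 0.0)

print(transition_prob("Happy", "Play", "Hungry"))   # 0.3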
Partially Observable MDP
An MDP where the states are hidden; as in an HMM, the agent only receives observations of the state.
MDP Agent-environment Interaction
The environment gives a state to the agent.
The agent responds with an action.
The environment gives a reward and the next state.
What happens next must depend only on the current state and action (the Markov property).
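A toy Python sketch of this interaction loop; ToyEnv and RandomAgent are placeholders invented for the sketch, not part of the course material or any real library:

import random

class ToyEnv:
    states = ["Happy", "Hungry", "Sad"]
    def reset(self):
        return "Happy"                        # environment gives the first state
    def step(self, action):
        next_state = random.choice(self.states)
        reward = 1.0 if next_state == "Happy" else 0.0
        return next_state, reward             # environment gives reward and next state

class RandomAgent:
    actions = ["Play", "Feed"]
    def act(self, state):
        return random.choice(self.actions)    # agent gives an action

env, agent = ToyEnv(), RandomAgent()
state = env.reset()
for t in range(5):
    action = agent.act(state)
    state, reward = env.step(action)
    print(t, action, state, reward)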
MDP Reward hypothesis
Goals and purposes can be thought of as the maximisation of the expected value of the cumulative sum of a received scalar signal (the reward)
Does the sum of rewards converge to a finite value?
Yes, provided γ < 1 and the rewards are bounded:
Gt = Rt+1 + γRt+2 + (γ^2)Rt+3 + …
= sum over k from 0 to ∞ of (γ^k)Rt+k+1
γ (gamma) is the discount factor (0 ≤ γ < 1): it sets how much future rewards are worth relative to immediate ones, not the chance of receiving the reward.
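A small Python sketch computing a discounted return directly from this definition; the value γ = 0.9 and the reward values are assumed, not taken from the card:

# Discounted return computed directly from the definition of Gt.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]          # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)                              # 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349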
Sum of rewards when we already know the return from the next time step
Gt = Rt+1 + γGt+1
(γ is the discount rate, 0 ≤ γ < 1)
Think of the original sum of rewards expression:
Gt = Rt+1 + γRt+2 + (γ^2)Rt+3 + …
Factorise γ out of every term after Rt+1; the remaining sum is exactly Gt+1, giving Gt = Rt+1 + γGt+1.
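A small sketch using this recursion to compute returns backwards through an episode (same assumed γ and rewards as in the previous sketch); the first entry matches the direct sum:

# Returns computed backwards with Gt = Rt+1 + gamma * Gt+1.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]          # R_1, R_2, R_3, R_4 of a four-step episode

G = 0.0
returns = []
for r in reversed(rewards):             # work from the end of the episode backwards
    G = r + gamma * G
    returns.append(G)
returns.reverse()                       # returns[t] is G_t
print(returns[0])                       # 3.349, matching the direct sum above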
Reward vs Value
An immediate reward might be available, but the action may be counterproductive in the long term; value captures the expected long-term return rather than the immediate payoff.
action-value function qπ
From the sum of rewards Gt
qπ(s,a) = Eπ[Gt|St=s,At=a]
(Gt is the sum of rewards;
Eπ denotes the expected value of a random variable given that the agent
follows policy π, and t is any time step)
We can estimate it from experience
Monte Carlo Methods for estimating action-value function
Sample and average the returns observed for each state-action pair (like the bandit methods).
The difference is that there are multiple states, each acting like a different bandit problem.
Objective of Monte Carlo Methods
To learn vπ(s),
the value function at state s under policy π.
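A minimal sketch of the sample-and-average idea from the Monte Carlo card above; the (state, action, return) samples are made up, and a real implementation would generate them by running episodes under policy π:

from collections import defaultdict

# Monte Carlo estimation sketch: average the sampled returns for each state-action pair.
returns_sum = defaultdict(float)
returns_count = defaultdict(int)

samples = [
    (("Happy", "Play"), 3.3),
    (("Happy", "Play"), 2.1),
    (("Hungry", "Feed"), 4.0),
]

for (s, a), G in samples:
    returns_sum[(s, a)] += G
    returns_count[(s, a)] += 1

q_estimate = {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
print(q_estimate[("Happy", "Play")])    # (3.3 + 2.1) / 2 = 2.7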