7 - Reinforcement Learning 2 Flashcards
Finite Markov Decision Processes
Mathematically idealised form of the reinforcement learning problem
e.g. a transition graph with states drawn as large circles and actions as small circles, where the arrows are labelled with transition probabilities and rewards
Markov Property
Hint: States depend on…
The next state s’ depends on the current state s and the decision maker’s action a,
but, given s and a, s’ is conditionally independent of all previous states and actions.
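In symbols (a sketch in standard MDP notation, restating the card rather than adding anything new):
\Pr\{S_{t+1}=s' \mid S_t=s, A_t=a, S_{t-1}, A_{t-1}, \ldots, S_0\} = \Pr\{S_{t+1}=s' \mid S_t=s, A_t=a\}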
Markov Chains
- Multiple states.
- An agent that transitions between states.
- Time measured in discrete time steps.
- A set of states, e.g. {Happy, Hungry, Sad}.
- A transition function (gives the probability of switching from one state to another).
Hidden Markov Model
Like a Markov chain, but the states are hidden; we only see observations.
For the mood example, an observation function gives the probability of each observation given each state, e.g. O(Hungry, Eating) = 0.5,
O(Hungry, Crying) = 0.5.
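A minimal Python sketch of sampling an observation from the hidden state; only the two O(Hungry, ·) values come from the card, the other rows are made up for illustration:

import random

# Observation probabilities for the mood example: P(observation | hidden state).
# Only the Hungry row comes from the card; the other rows are assumed.
O = {
    "Hungry": {"Eating": 0.5, "Crying": 0.5},
    "Happy":  {"Playing": 0.8, "Eating": 0.2},
    "Sad":    {"Crying": 0.9, "Eating": 0.1},
}

def sample_observation(hidden_state):
    observations = list(O[hidden_state])
    weights = [O[hidden_state][o] for o in observations]
    return random.choices(observations, weights=weights, k=1)[0]

print(sample_observation("Hungry"))   # "Eating" or "Crying", each with probability 0.5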
Markov chain Transition function
Gives the probability of switching from one state to another in a chain,
e.g. T(Happy -> Hungry) = 0.4
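A minimal Python sketch of a Markov chain driven by such a transition function; only T(Happy -> Hungry) = 0.4 comes from the card, the other probabilities are assumed so that each row sums to 1:

import random

# Transition function for the mood Markov chain: T[s][s'] = P(next state s' | state s).
# Only T(Happy -> Hungry) = 0.4 comes from the card; the rest is illustrative.
T = {
    "Happy":  {"Happy": 0.5, "Hungry": 0.4, "Sad": 0.1},
    "Hungry": {"Happy": 0.3, "Hungry": 0.3, "Sad": 0.4},
    "Sad":    {"Happy": 0.2, "Hungry": 0.3, "Sad": 0.5},
}

def step(state):
    next_states = list(T[state])
    weights = [T[state][s] for s in next_states]
    return random.choices(next_states, weights=weights, k=1)[0]

state = "Happy"
for t in range(5):            # simulate a few time steps of the chain
    state = step(state)
    print(t, state)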
Markov Decision Process, compared to markov chain
Like a Markov chain, with the addition of actions.
Example: T(Happy, Play -> Hungry) = 0.3
Each state can have one or more actions.
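A small sketch of how such an action-conditioned transition function might be represented; apart from T(Happy, Play -> Hungry) = 0.3, the entries and action names are illustrative:

# MDP transition function: T[(s, a)][s'] = P(next state s' | state s, action a).
# Only T(Happy, Play -> Hungry) = 0.3 comes from the card; the rest is assumed.
T = {
    ("Happy", "Play"):  {"Happy": 0.7, "Hungry": 0.3},
    ("Happy", "Feed"):  {"Happy": 0.9, "Hungry": 0.1},
    ("Hungry", "Play"): {"Hungry": 0.6, "Sad": 0.4},
    ("Hungry", "Feed"): {"Happy": 0.8, "Hungry": 0.2},
}

def transition_prob(s, a, s_next):
    return T[(s, a)].get(s_next, 0.0)

print(transition_prob("Happy", "Play", "Hungry"))   # 0.3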
Partially Observable MDP
An MDP where the states are hidden; as in an HMM, the agent only receives observations of the state.
MDP Agent-environment Interaction
The environment gives a state to the agent.
The agent responds with an action.
The environment gives a reward and the next state.
What happens next must depend only on the current state and action (the Markov property).
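A toy Python sketch of this interaction loop; ToyEnv and RandomAgent are placeholders invented for the sketch, not part of the course material or any real library:

import random

class ToyEnv:
    states = ["Happy", "Hungry", "Sad"]
    def reset(self):
        return "Happy"                        # environment gives the first state
    def step(self, action):
        next_state = random.choice(self.states)
        reward = 1.0 if next_state == "Happy" else 0.0
        return next_state, reward             # environment gives reward and next state

class RandomAgent:
    actions = ["Play", "Feed"]
    def act(self, state):
        return random.choice(self.actions)    # agent gives an action

env, agent = ToyEnv(), RandomAgent()
state = env.reset()
for t in range(5):
    action = agent.act(state)
    state, reward = env.step(action)
    print(t, action, state, reward)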
MDP Reward hypothesis
Goals and purposes can be thought of as the maximisation of the expected value of the cumulative sum of a received scalar signal (the reward)
Does the sum of rewards converge to a finite value?
Yes, provided γ < 1 and the rewards are bounded:
Gt = Rt+1 + γRt+2 + (γ^2)Rt+3 + …
= sum over k from 0 to ∞ of (γ^k)Rt+k+1
γ (gamma) is the discount factor (0 ≤ γ < 1): it sets how much future rewards are worth relative to immediate ones, not the chance of receiving the reward.
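A small Python sketch computing a discounted return directly from this definition; the value γ = 0.9 and the reward values are assumed, not taken from the card:

# Discounted return computed directly from the definition of Gt.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]          # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)                              # 1 + 0.9*0 + 0.81*2 + 0.729*1 = 3.349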
Sum of rewards when we already know the return from the next time step
Gt = Rt+1 + γGt+1
(γ is the discount rate, 0 ≤ γ < 1)
Think of the original sum of rewards expression:
Gt = Rt+1 + γRt+2 + (γ^2)Rt+3 + …
Factorise γ out of every term after Rt+1; the remaining sum is exactly Gt+1, giving Gt = Rt+1 + γGt+1.
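A small sketch using this recursion to compute returns backwards through an episode (same assumed γ and rewards as in the previous sketch); the first entry matches the direct sum:

# Returns computed backwards with Gt = Rt+1 + gamma * Gt+1.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]          # R_1, R_2, R_3, R_4 of a four-step episode

G = 0.0
returns = []
for r in reversed(rewards):             # work from the end of the episode backwards
    G = r + gamma * G
    returns.append(G)
returns.reverse()                       # returns[t] is G_t
print(returns[0])                       # 3.349, matching the direct sum above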
Reward vs Value
An immediate reward might be available, but the action may be counterproductive in the long term; value captures the expected long-term return rather than the immediate payoff.
action-value function qπ
From the sum of rewards Gt
qπ(s,a) = Eπ[Gt|St=s,At=a]
(Gt is the sum of rewards;
Eπ denotes the expected value of a random variable given that the agent
follows policy π, and t is any time step)
We can estimate it from experience
Monte Carlo Methods for estimating action-value function
Sample and average the returns observed for each state-action pair (like the bandit methods).
The difference is that there are multiple states, each acting like a different bandit problem.
Objective of Monte Carlo Methods
To learn vπ(s),
the value function at state s under policy π.
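A minimal sketch of the sample-and-average idea from the Monte Carlo card above; the (state, action, return) samples are made up, and a real implementation would generate them by running episodes under policy π:

from collections import defaultdict

# Monte Carlo estimation sketch: average the sampled returns for each state-action pair.
returns_sum = defaultdict(float)
returns_count = defaultdict(int)

samples = [
    (("Happy", "Play"), 3.3),
    (("Happy", "Play"), 2.1),
    (("Hungry", "Feed"), 4.0),
]

for (s, a), G in samples:
    returns_sum[(s, a)] += G
    returns_count[(s, a)] += 1

q_estimate = {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
print(q_estimate[("Happy", "Play")])    # (3.3 + 2.1) / 2 = 2.7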