Markov Decision Process Flashcards
What is reinforcement learning?
This is teaching an agent by rewarding it when it performs a desirable action
What are the 2 different phases of reinforcement learning?
Exploration phase:
> Trying different actions
> Exploring the outcomes of different actions
Credit assignment:
> Assigning a reward to an outcome
> This reward needs to come immediately after the desired outcome happens
In the example shown in [Picture 15], what are the following?
> Agent
> Goal
> Movement
> States
> Actions
> Transitions
> Agent: Robot
> Goal: Treasure
> Movement: Cardinal directions
> States: Each cell
> Actions: Up/Down/Left/Right
> Transitions: How the environment changes as a result of the agent's actions
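A minimal code sketch of these components, assuming a made-up 2x2 grid (the layout, coordinates, and goal cell are illustration values, not taken from the picture):

```python
# A tiny grid-world sketch: the robot (agent) moves through cells (states)
# using cardinal-direction actions until it reaches the treasure (goal).
states = [(0, 0), (0, 1), (1, 0), (1, 1)]   # each cell is a state
actions = ["up", "down", "left", "right"]   # cardinal-direction actions
goal = (1, 1)                               # treasure cell (the goal)

def step(state, action):
    """Transition: how the environment changes as a result of the agent's action."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    row, col = state
    dr, dc = moves[action]
    new_state = (row + dr, col + dc)
    # Moving off the grid leaves the agent in the same cell.
    return new_state if new_state in states else state

print(step((0, 0), "right"))  # agent moves from cell (0, 0) to cell (0, 1)
```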
What is the concept of state?
The decisions made at a certain point in time affect the decisions available at later points: time matters
What is supervised learning?
When the agent has samples of correct answers (for instance the optimal action for certain states)
What is unsupervised learning?
When the agent receives no feedback on its actions (no correct answers are provided)
What type of learning is reinforcement learning?
Neither supervised nor unsupervised
What is the formal notation of the Markov Decision Process?
〈S, A, T, r〉
S = Set of states
A = Set of actions
T = Transition function
r = Reward function
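A minimal sketch of the tuple 〈S, A, T, r〉 as a plain data structure; the two-state example values below are made up purely for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    S: List[str]                              # set of states
    A: List[str]                              # set of actions
    T: Dict[Tuple[str, str], Dict[str, float]]  # transitions: (s, a) -> {s': probability}
    r: Callable[[str, str, str], float]       # reward function: r(s, a, s') -> reward

# A made-up two-state example, just to show the shape of each component.
mdp = MDP(
    S=["s0", "s1"],
    A=["stay", "go"],
    T={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s0": 1.0},
    },
    r=lambda s, a, s_next: 1.0 if s_next == "s1" else 0.0,
)
```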
What is the markov property?
p(s(t+1), rt | st, at, s(t-1), a(t-1), … , s0, a0) = p(s(t+1), rt | st, at)
In general, the current state and reward could depend on all the previous states and actions (the whole history of the episode). However, for the process to be Markovian, this probability must equal the probability conditioned only on the previous state and action.
What is the transition function?
T(s, a, s’ ) = p(s(t+1) = s’ | st = s, at = a)
This tells us what state will follow when an action is taken in a given state. It is probabilistic: it is the probability of the next state conditioned on the previous state and on taking a certain action. In other words, the probability of ending up in s’ if I was in state s and took the action a.
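A small sketch of a probabilistic transition function stored as a lookup table; the state names and probabilities are assumed for illustration:

```python
import random

# T(s, a, s') stored as a table: (s, a) -> {s': probability of ending up in s'}.
# The states "dry"/"wet" and the probabilities are made-up illustration values.
T = {
    ("dry", "water"): {"wet": 0.8, "dry": 0.2},
    ("dry", "wait"):  {"dry": 1.0},
    ("wet", "wait"):  {"dry": 0.3, "wet": 0.7},
}

def transition_prob(s, a, s_next):
    """T(s, a, s') = p(s(t+1) = s' | st = s, at = a)."""
    return T[(s, a)].get(s_next, 0.0)

def sample_next_state(s, a):
    """Draw the next state according to the probabilities in T."""
    next_states = list(T[(s, a)].keys())
    probs = list(T[(s, a)].values())
    return random.choices(next_states, weights=probs)[0]

print(transition_prob("dry", "water", "wet"))  # 0.8
print(sample_next_state("dry", "water"))       # "wet" most of the time
```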
What is the reward function?
r(s, a, s’)
This is the reward for transitioning from state s to s’ by taking action a
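A matching sketch of r(s, a, s’), reusing the made-up states from the transition-table sketch above:

```python
def r(s, a, s_next):
    """Reward for transitioning from state s to s' by taking action a.
    The values below are made-up illustration numbers."""
    if s == "dry" and a == "water" and s_next == "wet":
        return 1.0   # the desired transition is rewarded
    return 0.0       # every other transition gives no reward

print(r("dry", "water", "wet"))  # 1.0
```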
What is having a markov property about?
Being able to observe all the necessary details OR remembering the necessary details observed in the past
What is the equation for the immediate reward R at time t?
Rt = r(st, at, s(t+1))
For episodic tasks, what is the equation for the long term reward? What does it show?
Gt ≡ R(t+1) + R(t+2) + ⋯+ RT
This is the total reward from time t to the end of the episode at time T
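A short sketch of computing Gt for an episodic task from a recorded list of rewards (the reward values are made up):

```python
def episodic_return(rewards, t):
    """Gt = R(t+1) + R(t+2) + ... + RT: the sum of all rewards after time t."""
    return sum(rewards[t:])

# rewards[i] holds R(i+1), the reward received after time step i (made-up values).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(episodic_return(rewards, 0))  # 6.0: total reward over the whole episode
print(episodic_return(rewards, 3))  # 5.0: reward remaining from time step 3
```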
For infinitely long tasks, what is the equation for the long term reward? What does it show?
Gt ≡ R(t+1) + γ R(t+2) + γ^2 R(t+3) + ⋯
Gt ≡ ∑(k=0 to ∞) γ^k R(t+k+1)
This is the total reward from time t to infinity. We use an exponential discount factor, γ, otherwise Gt would be infinite. Rewards that are further in the future are discounted more.
What is the range of γ?
0 ≤ γ ≤ 1. γ can only ever be 1 in episodic MDPs; for infinitely long tasks γ < 1 is required. This means that the equation Gt ≡ ∑(k=0 to ∞) γ^k R(t+k+1) can be used for both infinite and episodic tasks.
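A short sketch of the discounted return; with γ < 1 later rewards count for less, and with γ = 1 (only valid for episodic tasks) it reduces to the plain episodic sum:

```python
def discounted_return(rewards, t, gamma):
    """Gt = sum over k of gamma^k * R(t+k+1)."""
    return sum(gamma**k * reward for k, reward in enumerate(rewards[t:]))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]        # made-up episodic rewards
print(discounted_return(rewards, 0, 0.9))  # 0.9^2 * 1 + 0.9^4 * 5 ≈ 4.09
print(discounted_return(rewards, 0, 1.0))  # 6.0, the undiscounted episodic return
```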
What is an absorbing state?
This is a state that can never be left and in which every action returns a reward of zero.
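In a transition-table sketch like the one above, an absorbing state is one where every action loops back to the same state with probability 1 and reward 0 (the state and action names are assumptions):

```python
# "end" is an absorbing state: every action keeps the agent in "end"
# with probability 1, and every transition out of it gives reward 0.
T_absorbing = {
    ("end", "stay"): {"end": 1.0},
    ("end", "go"):   {"end": 1.0},
}

def r_absorbing(s, a, s_next):
    return 0.0 if s == "end" else 1.0  # zero reward once the absorbing state is reached
```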
What is a behavior?
This is a function which, for every state, returns the action to execute in that state. The function is: π(s) = at. This function is called a policy.
What is the function for a policy?
π(s) = at
What is the function for a policy that is probabilistic?
π(a | s) = p(at = a | st = s)
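A sketch of the two kinds of policy, one deterministic and one probabilistic; the state and action names are made up:

```python
import random

# Deterministic policy: pi(s) returns the single action to execute in state s.
pi_deterministic = {"dry": "water", "wet": "wait"}

# Stochastic policy: pi(a | s) gives the probability of choosing action a in state s.
pi_stochastic = {
    "dry": {"water": 0.9, "wait": 0.1},
    "wet": {"wait": 1.0},
}

def act(state):
    """Sample an action from the stochastic policy in the given state."""
    actions, probs = zip(*pi_stochastic[state].items())
    return random.choices(actions, weights=probs)[0]

print(pi_deterministic["dry"])  # "water"
print(act("dry"))               # "water" about 90% of the time, "wait" about 10%
```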
How do we improve a policy?
We improve it by measuring how good the current policy is. We define the value of a state under a given policy as the expected return from that state while following the policy.
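A rough sketch of estimating the value of a state under a policy by averaging returns over simulated episodes (Monte Carlo rollouts); the two-state MDP, rewards, and policy below are assumptions for illustration:

```python
import random

# Made-up MDP: from "start", the action "go" usually reaches "goal" (reward 1).
T = {
    ("start", "go"):   {"goal": 0.8, "start": 0.2},
    ("start", "wait"): {"start": 1.0},
}
policy = {"start": "go"}   # the policy we want to evaluate
gamma = 0.9

def rollout(state, max_steps=50):
    """Follow the policy from `state` and return the discounted return G."""
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        if state == "goal":                  # absorbing goal state: episode over
            break
        action = policy[state]
        next_states, probs = zip(*T[(state, action)].items())
        next_state = random.choices(next_states, weights=probs)[0]
        reward = 1.0 if next_state == "goal" else 0.0
        G += discount * reward
        discount *= gamma
        state = next_state
    return G

# Value of "start" under the policy: the expected return, estimated by averaging.
returns = [rollout("start") for _ in range(10_000)]
print(sum(returns) / len(returns))  # approximately the value of "start" under this policy
```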