RL: Chapter 3: Finite Markov Decision Processes Flashcards
Agent-Environment Interface
Agent: the learner and decision maker.
Environment: everything outside the agent.
They interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.
The environment also gives rise to rewards, special numerical values that the agent seeks to maximise over time through its choice of actions.
State, Action, Reward
At each time step t:
- the agent receives some representation of the environment’s state,
- and on that basis selects an action.
- One step later, in part as a consequence of its action, the agent receives a numerical reward.
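In the book's notation (not on the original card), this interaction gives rise to a trajectory S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, …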
Markov decision process
One in which the distribution of the next state and reward depends only on the immediately preceding state and action.
Given the previous state and action, it does not depend at all on earlier states and actions.
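Stated with the chapter's four-argument dynamics function (standard notation, added here rather than taken from the original card):
p(s′, r | s, a) = Pr{S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a}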
Reward Hypothesis
All we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
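In symbols (standard notation from the chapter, not stated on the original card), the quantity to be maximised is the expected return, which in the simplest case is
G_t = R_{t+1} + R_{t+2} + R_{t+3} + ⋯ + R_T, with T the final time step of an episode.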
Episodic tasks
Tasks that break naturally into separate episodes.
An episode is a sequence of states ending in a special state, called the terminal state, followed by a reset to a standard starting state.
Continuing tasks
Tasks where the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
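For continuing tasks the simple sum of rewards can diverge, so the chapter uses the discounted return (the discount factor γ, 0 ≤ γ ≤ 1, is standard in the book but not defined on these cards):
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ = Σ_{k=0}^{∞} γ^k R_{t+k+1}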
Value functions
Functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
The notion of “how good” is defined in terms of expected future rewards.
Policy
A policy is a mapping from states to probabilities of selecting each possible action.
If the agent is following policy π at time t, then π(a|s) is the probability of selecting A_t = a if S_t = s.
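Since π(·|s) is a probability distribution over actions, Σ_a π(a|s) = 1 for every state s (an implication spelled out here; not stated on the original card).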
State-value function for a policy
The value of a state under a policy.
v_π(s) is the expected return when starting in s and following π thereafter.
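Written out (standard book notation, not on the original card): v_π(s) = E_π[G_t | S_t = s], where G_t is the return defined above.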
Action-value function for a policy
The value of taking an action in a state under a policy.
q_π(s, a) is the expected return starting from s, taking action a, and thereafter following policy π.
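Written out (standard book notation, not on the original card): q_π(s, a) = E_π[G_t | S_t = s, A_t = a].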
Optimal policy
A policy that is better than or equal to all other policies.
A policy π is defined to be better than or equal to a policy π’ if its expected return is greater than or equal to that of π’ for all states.
I.e. π ≥ π’ iff v_π(s) ≥ v_π’(s) for all s ∈ S.
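All optimal policies share the same optimal value functions (standard book notation, not on the original card): v_*(s) = max_π v_π(s) and q_*(s, a) = max_π q_π(s, a).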
Greedy
The term greedy is used in Computer Science to describe:
Any search or decision procedure that selects alternatives based only on local or immediate considerations, without considering the possibility that such a selection may prevent future access to even better alternatives.
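A standard result from the chapter (not on the original card): a policy that is greedy with respect to the optimal value function, i.e. one that picks argmax_a q_*(s, a) in each state, is nevertheless optimal, because v_* and q_* already account for the long-term consequences of each action.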
The solution to the Bellman optimality equation depends on 3 assumptions
- The dynamics of the environment are accurately known
- Computational resources are sufficient to complete the calculation
- The states have the Markov property.
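For reference, the Bellman optimality equation these assumptions refer to, in the book's standard form (not on the original card):
v_*(s) = max_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v_*(s′) ]
Below is a minimal value-iteration sketch that solves this equation when the three assumptions hold. The tiny two-state MDP, the variable names, and the constants are hypothetical, chosen only for illustration.

```python
# Minimal value-iteration sketch for the Bellman optimality equation.
# The MDP below (two states, two actions), gamma, and theta are all
# hypothetical illustration values, not taken from the book or the cards.

# dynamics[s][a] is a list of (probability, next_state, reward) triples,
# i.e. a tabular form of p(s', r | s, a).
dynamics = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)],
        1: [(1.0, 1, 0.0)]},
}
gamma = 0.9    # discount factor (assumed)
theta = 1e-8   # convergence threshold (assumed)

V = {s: 0.0 for s in dynamics}   # value estimates, initialised to zero
while True:
    delta = 0.0
    for s in dynamics:
        # Bellman optimality backup:
        #   v(s) <- max_a sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
            for a in dynamics[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

print(V)  # approximate optimal state values v_*(s)
```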