Reinforcement_Learning Flashcards
- Reinforcement Learning: An Introduction
Reinforcement Learning
Computational approach to understanding and automating goal-directed learning and decision making
Simultaneously:
1. Problems
2. Solution Methods
3. Field of Study (of the given problems and their respective solution methods)
GOAL: Find the optimal policy for a given environment (CONTROL PROBLEM)
Reward
Scalar value/signal, produced by the environment in response to an action taken upon it by the Agent, representing the environment's immediate (primary) feedback on that action
Primary, Immediate
Numerical value, returned by the environment, that the agent seeks to maximize over time through its choice of actions
Our way of communicating to the agent WHAT we want achieved (not how we want it achieved)
Basis for evaluating the actions the agent decides to take
Model of the Environment
Optional element of the Agent, allows inferences to be made about how the environment will behave. Used for planning
Value
The expected cumulative (usually discounted) reward the Agent would receive starting in the current state and following the current policy. (Secondary to reward). Represents long-term desirability of states.
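In the book's standard notation: v_pi(s) = E_pi[ G_t | S_t = s ], i.e. the expected return when starting in state s and following policy pi thereafter.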
Planning
Any way of deciding on a course of action by first considering possible future situations before they are actually experienced
Model-Free Methods
Methods that DO NOT include the optional model of the environment; they rely exclusively on trial-and-error learning
Model-Based Methods
Methods that INCLUDE the optional model of the environment and use planning
Main sub-elements of a RL system:
- Policy
- Reward Signal
- Value Function
- Model (Optional)
Tabular Solution Methods
Solution methods where the corresponding environment state and action spaces are small enough for said method to represent the value function as an array/table
Most Important Feature Distinguishing RL from other types of ML?
Uses training information that EVALUATES the actions taken rather than INSTRUCTS by giving correct actions
k-Armed Bandit Problem
RL problem where you are faced repeatedly with a choice among k different actions and a single state. After each action you receive a reward drawn from a probability distribution that depends on the action selected. The objective is to maximize the expected total reward over time
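A minimal sketch of such a problem, assuming stationary Gaussian reward distributions as in the book's 10-armed testbed (class and variable names are illustrative, not from the book):

    import numpy as np

    class KArmedBandit:
        """Stationary k-armed bandit: one state, k actions, noisy rewards."""
        def __init__(self, k=10, seed=0):
            self.rng = np.random.default_rng(seed)
            self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

        def step(self, action):
            # Reward drawn from a distribution that depends only on the chosen action
            return self.rng.normal(self.q_star[action], 1.0)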
Action-Value Methods
Methods for estimating the values of actions
Action-Selection Methods
Methods to select actions given action-values
Sample-Average
Action-Value Method: each value estimate is an average of the sample of relevant rewards
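In the book's notation: Q_t(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t), with some default value (e.g. 0) used while a has not yet been taken.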
Greedy Behavior
Selecting the action that has the highest value. We are exploiting our current knowledge of the values of the actions
Nongreedy (Exploratory) Behavior
Selecting an action that does NOT have the highest estimated value. We are exploring because this enables us to improve our estimate of that action's value
Epsilon-Greedy Action-Selection Method
Select the greedy action most of the time, but every once in a while, with small probability epsilon, instead select randomly from among all actions with equal probability, independent of the action-value estimates
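A minimal sketch, assuming the action-value estimates are held in a NumPy array Q (names are illustrative):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
        """Pick a random action with probability epsilon, else the greedy one."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))   # explore: uniform over all actions
        return int(np.argmax(Q))               # exploit (ties broken by lowest index here)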
Error
[Target - OldEstimate]
Incremental Update Rule
NewEstimate = OldEstimate + StepSize[Target - OldEstimate]
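A one-line sketch of the rule; with step_size = 1/n it reproduces the sample-average, while a constant step size weights recent rewards more heavily (useful for nonstationary problems):

    def incremental_update(old_estimate, target, step_size):
        """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
        return old_estimate + step_size * (target - old_estimate)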
Stationary Problem
A RL problem where the reward probabilities do NOT change over time
Nonstationary Problem
A RL problem where the reward probabilities DO change over time
Optimistic Initial Values
Exploration method where the value function is initialized with large (optimistic) values so as to encourage exploration until the estimates update to become more realistic
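A minimal sketch, assuming a 10-armed testbed whose true action values are near zero, so an initial estimate of +5 is wildly optimistic (numbers are illustrative):

    import numpy as np

    k = 10
    Q = np.full(k, 5.0)   # optimistic initial estimates; early rewards will be
                          # "disappointing", so the agent keeps trying other actions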
Upper-Confidence-Bound (UCB) Action-Selection Method
For each action, we track the uncertainty (or variance) in the estimate of that action's value.
Exploration is achieved by selecting actions whose estimates are highly uncertain, which drives down the uncertainty of all the value estimates over time
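A minimal sketch of the selection rule, assuming arrays Q (value estimates) and N (selection counts), the current time step t, and an exploration constant c (names and defaults are illustrative):

    import numpy as np

    def ucb_select(Q, N, t, c=2.0):
        """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], trying untried actions first."""
        untried = np.flatnonzero(N == 0)
        if untried.size > 0:
            return int(untried[0])             # an untried action is maximally uncertain
        return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))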
Gradient Bandit Algorithms
Instead of learning to estimate the action values for each action, we instead learn a numerical “preference” for each action, which we denote H(a).
The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward; only the relative preference of one action OVER another action is important
Action Probabilities are determined via a soft-max distribution
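A minimal sketch of one preference update, assuming a float preference array H, a running average of past rewards as the baseline, and step size alpha (all names illustrative):

    import numpy as np

    def gradient_bandit_update(H, action, reward, baseline, alpha=0.1):
        """Raise the preference of the taken action if the reward beats the baseline,
        lower it otherwise; the other actions' preferences move the opposite way."""
        pi = np.exp(H - H.max())
        pi /= pi.sum()                         # soft-max action probabilities
        indicator = np.zeros_like(H)
        indicator[action] = 1.0
        return H + alpha * (reward - baseline) * (indicator - pi)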
Soft-Max Distribution
Probability distribution (sums to 1) over a set of mutually exclusive events. Takes a vector of real-valued inputs (e.g. action preferences) and returns a vector of probabilities, with larger inputs receiving larger probabilities
Markov Decision Process (MDP)
Classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent states, and through those future rewards.
Mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made
Proposes that any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between the agent and the environment (actions, rewards, states)
Agent
Learner and decision maker of the RL problem
Objective is to maximize the amount of reward it receives over time (the expected return)
Needs to be able to sense the environment, take action to change the environment, and process received reward
Environment
The entity that the agent interacts with, comprising everything outside the agent
Action
A choice made by the agent to take upon the environment, in service of maximizing expected return.
In general, actions can be any decisions we want to learn how to make
Trajectory / Sequence
The sequence of states, actions, and rewards over all time-steps taken: S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...
Dynamics
p(s’,r|s,a)
Defines discrete probability distributions that depend only on the immediately preceding state and action
The probability of transitioning to state s' and receiving reward r given the current state-action pair
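Because p defines a probability distribution for each state-action pair, sum over s' and r of p(s', r | s, a) = 1 for all s and a; quantities such as the state-transition probabilities p(s' | s, a) and the expected rewards r(s, a) can be computed from it.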
State
Representation of the environment to the agent. Returned to the agent in response to an action on the environment during the agent-environment interface (along with the reward)
Must include information about all aspects of the past sequence that makes a difference for the future (Markovian)
In general, can be anything we can know that might be useful in making decisions about which action to take
Basis for how the agent makes decisions
Markov Property
A state has this property if it includes information about all aspects of the past sequence that make a difference for the future
Agent-Environment Boundary
Boundary between what we consider the agent and what we consider the environment.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of the environment
Represents the limit of the agent’s absolute control, not of its knowledge
Agent-Environment Interface
The continual interaction between the agent and environment, where the agent selects actions and the environment responds to these actions by returning a reward and presenting a new state to the agent
Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)
Return
G
Secondary, Delayed
Sum of the rewards received after time-step t, through the Terminal State
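In the book's notation, for an episodic task ending at final time step T: G_t = R_(t+1) + R_(t+2) + ... + R_T.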
Episode
One complete run of the agent-environment interaction loop, from the starting state to the Terminal State
Terminal State
A state of an environment that terminates the agent-environment interaction loop
Followed by a reset to a standard starting state
Episodic Task
Any task that can be broken up into episodes
Continuing Task
Any task that is not broken up into episodes
Discounting
A process by which future rewards are lessened in order to give priority to more immediate rewards, via the Discount Rate
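In the book's notation, with discount rate gamma (0 <= gamma <= 1): G_t = R_(t+1) + gamma*R_(t+2) + gamma^2*R_(t+3) + ... = sum over k >= 0 of gamma^k * R_(t+k+1).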
Absorbing State
Special state that transitions only to itself and generates only rewards of zero
Used to unify the return notation for episodic and continuing tasks: an episodic task can then be treated as a continuing one whose return remains finite