Reinforcement Learning Flashcards

- Reinforcement Learning: An Introduction

1
Q

Reinforcement Learning

A

Computational approach to understanding and automating goal-directed learning and decision making

Simultaneously:
1. Problems
2. Solution Methods
3. Field of Study (of the given problems and their respective solution methods)

GOAL: Find the optimal policy for a given environment (CONTROL PROBLEM)

2
Q

Reward

A

Scalar value/signal, produced by the environment in response to an action taken upon it by the Agent, indicating the immediate/primary feedback for that action

Primary, Immediate

Numerical value, returned by the environment, that the agent seeks to maximize over time through its choice of actions

Our way of communicating to the agent WHAT we want achieved (not how we want it achieved)

Basis for evaluating the actions the agent decides to take

3
Q

Model of the Environment

A

Optional element of the Agent, allows inferences to be made about how the environment will behave. Used for planning

4
Q

Value

A

The expected cumulative (usually discounted) reward the Agent would receive starting in the current state and following the current policy. (Secondary to reward). Represents long-term desirability of states.

5
Q

Planning

A

Any way of deciding on a course of action by first considering possible future situations before they are actually experienced

6
Q

Model-Free Methods

A

Methods that DO NOT include the optional model of the environment. Exclusively trial-and-error learning

7
Q

Model-Based Methods

A

Methods that INCLUDE the optional model of the environment and use planning

8
Q

Main sub-elements of a RL system:

A
  1. Policy
  2. Reward Signal
  3. Value Function
  4. Model (Optional)
9
Q

Tabular Solution Methods

A

Solution methods where the corresponding environment state and action spaces are small enough for said method to represent the value function as an array/table

10
Q

Most Important Feature Distinguishing RL from other types of ML?

A

Uses training information that EVALUATES the actions taken rather than INSTRUCTS by giving correct actions

11
Q

k-Armed Bandit Problem

A

RL problem where you are faced repeatedly with a choice among k different actions and a single state. After each action you receive a reward drawn from a probability distribution that depends on the action selected. The objective is to maximize the expected total reward over time

12
Q

Action-Value Methods

A

Methods for estimating the values of actions

13
Q

Action-Selection Methods

A

Methods to select actions given action-values

14
Q

Sample-Average

A

Action-Value Method: each value estimate is an average of the sample of relevant rewards

15
Q

Greedy Behavior

A

Selecting the action that has the highest value. We are exploiting our current knowledge of the values of the actions

16
Q

Nongreedy (Exploratory) Behavior

A

Selecting an action that does NOT have the highest estimated value. We are exploring because this enables us to improve our estimate of that action’s value

17
Q

Epsilon-Greedy Action-Selection Method

A

Select the greedy action most of the time, but every once in a while, with small probability epsilon, instead select randomly from among all actions with equal probability, independent of the action-value estimates
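
A minimal sketch of this rule, assuming Q is an array of current action-value estimates and epsilon is a small exploration probability (both names are illustrative, not from the text):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng=None):
        # Q: current action-value estimates; epsilon: small exploration probability
        if rng is None:
            rng = np.random.default_rng()
        if rng.random() < epsilon:
            # Explore: choose uniformly at random among ALL actions
            return int(rng.integers(len(Q)))
        # Exploit: choose a greedy (highest-estimate) action
        return int(np.argmax(Q))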

18
Q

Error

A

[Target - OldEstimate]

19
Q

Incremental Update Rule

A

NewEstimate = OldEstimate + StepSize[Target - OldEstimate]
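
A sketch of the rule in code, together with the sample-average special case where StepSize is 1/n (variable names are illustrative):

    def incremental_update(old_estimate, target, step_size):
        # NewEstimate = OldEstimate + StepSize[Target - OldEstimate]
        return old_estimate + step_size * (target - old_estimate)

    # Sample-average example: updating with step_size = 1/n after the n-th reward
    # makes Q exactly the average of the rewards observed so far.
    Q, n = 0.0, 0
    for R in [1.0, 0.0, 2.0]:
        n += 1
        Q = incremental_update(Q, R, 1.0 / n)
    # Q is now 1.0, the average of the three rewards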

20
Q

Stationary Problem

A

A RL problem where the reward probabilities do NOT change over time

21
Q

Nonstationary Problem

A

A RL problem where the reward probabilities DO change over time

22
Q

Optimistic Initial Values

A

Exploration method where the value function is initialized with large (optimistic) values so as to encourage exploration until the estimates update to become more realistic

23
Q

Upper-Confidence-Bound (UCB) Action-Selection Method

A

For each action, we track the “uncertainty” or variance in the estimate of a’s value.

Exploration is achieved by selecting actions with high uncertainty, in order to reduce the uncertainty of all value estimates
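
A sketch of UCB selection for the bandit setting, assuming Q holds value estimates, N holds per-action selection counts, t is the current time step, and c controls the degree of exploration (all illustrative names):

    import numpy as np

    def ucb_select(Q, N, t, c=2.0):
        # Actions never tried are treated as maximally uncertain and selected first
        untried = np.where(N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        # Estimate plus an uncertainty bonus that shrinks as an action is selected more often
        bonus = c * np.sqrt(np.log(t) / N)
        return int(np.argmax(Q + bonus))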

24
Q

Gradient Bandit Algorithms

A

Instead of learning to estimate the action values for each action, we instead learn a numerical “preference” for each action, which we denote H(a).

The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward - only the relative performance of one action OVER another action is important

Action Probabilities are determined via a soft-max distribution
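
A sketch of a single gradient-bandit step, keeping a preference H(a) per action and converting preferences to probabilities with a soft-max (H, alpha, and baseline are illustrative names; the baseline is typically the average reward observed so far):

    import numpy as np

    def softmax(H):
        # Soft-max distribution over preferences (max subtracted for numerical stability)
        z = np.exp(H - np.max(H))
        return z / z.sum()

    def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
        # If the reward beats the baseline, raise the preference of the chosen action
        # and lower the others; if it falls short, do the opposite.
        pi = softmax(H)
        H = H - alpha * (reward - baseline) * pi
        H[action] += alpha * (reward - baseline)
        return H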

25
Q

Soft-Max Distribution

A

Probability distribution (sums to 1) over a set of mutually exclusive events. Takes a vector of inputs and returns a vector of outputs, where each output represents the probability of the input belonging to a particular output class

26
Q

Markov Decision Process (MDP)

A

Classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent states, and through those future rewards.

Mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made

Proposes that any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between the agent and the environment (actions, rewards, states)

27
Q

Agent

A

Learner and decision maker of the RL problem

Objective is to maximize the amount of reward it receives over time (expected return)

Needs to be able to sense the environment, take action to change the environment, and process received reward

28
Q

Environment

A

The entity that the agent interacts with, comprising everything outside the agent

29
Q

Action

A

A choice made by the agent to take upon the environment, in service of maximizing expected return.

In general, actions can be any decisions we want to learn how to make

30
Q

Trajectory / Sequence

A

The sequence of [State, Action, Reward] triples, one per time-step, accumulated over all time-steps taken

31
Q

Dynamics

A

p(s’,r|s,a)

Defined discrete probability distributions dependent only on the preceding state and action

The probability of going to state s’ and receiving reward r given the current state-action pair (s, a).
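
One minimal way to represent these dynamics in code: a table mapping each (s, a) pair to its possible (s’, r) outcomes and their probabilities (the tiny two-state MDP below is purely illustrative):

    # p[(s, a)] lists (next_state, reward, probability) triples that sum to 1
    p = {
        ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
        ("s0", "move"): [("s1", 1.0, 1.0)],
        ("s1", "stay"): [("s1", 0.0, 1.0)],
        ("s1", "move"): [("s0", 0.0, 1.0)],
    }

    def prob(next_state, reward, state, action):
        # p(s', r | s, a)
        return sum(pr for (s2, r, pr) in p[(state, action)]
                   if s2 == next_state and r == reward)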

32
Q

State

A

Representation of the environment to the agent. Returned to the agent (along with the reward) in response to an action during the agent-environment interaction

Must include information about all aspects of the past sequence that makes a difference for the future (Markovian)

In general, can be anything we can know that might be useful in making decisions about which action to take

Basis for how the agent makes decisions

33
Q

Markov Property

A

A state has the Markov property if it includes information about all aspects of the past sequence that make a difference for the future

34
Q

Agent-Environment Boundary

A

Boundary between what we consider the agent and what we consider the environment.

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of the environment

Represents the limit of the agent’s absolute control, not of its knowledge

35
Q

Agent-Environment Interface

A

The continual interaction between the agent and environment, where the agent selects actions and the environment responds to these actions by returning a reward and presenting a new state to the agent

36
Q

Reward Hypothesis

A

That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)

37
Q

Return

A

G

Secondary, Delayed

Sum of the (possibly discounted) rewards received starting at time-step t and ending at the Terminal State

38
Q

Episode

A

A complete sequence of agent-environment interaction, from a starting state through to the Terminal State

39
Q

Terminal State

A

A state of an environment that terminates the agent-environment interaction loop

Followed by a reset to a standard starting state

40
Q

Episodic Task

A

Any task that can be broken up into episodes

41
Q

Continuing Task

A

Any task that is not broken up into episodes

42
Q

Discounting

A

A process by which future rewards are lessened in order to give priority to more immediate rewards via the Discounting Rate

43
Q

Absorbing State

A

Special state that transitions only to itself and generates only rewards of zero

Used to unify the notation for episodic and continuing tasks: the return of an episodic task can then be written as an infinite sum

44
Q

Value Function

A

v(s) or q(s,a)

Estimates how good it is for an agent to be in a given state or perform a given action in a given state

Expected return when starting in state s and following policy pi thereafter

v_pi(s) = E_pi[G_t | S_t = s]

45
Q

Policy

A

pi(s)

Mapping from states to PROBABILITIES of selecting each possible action

Subcomponent of the agent

Reinforcement learning methods specify how the agent’s policy is changed as a result of its experience

46
Q

Backup Diagrams

A

Diagrams of the relationships that form the basis of the update or backup operations that are at the heart of reinforcement learning methods

Assist in providing graphical summaries of the RL algorithms

47
Q

Optimal Policy

A

pi_*

The policy that is better than or equal to all other policies

48
Q

Tabular Reinforcement Learning

A

RL problems that use small, finite state sets that are modeled using arrays/tables with one entry for each state (or state-action pair)

49
Q

Approximated Reinforcement Learning

A

RL problems that have state spaces too large for arrays/tables and must be approximated using some sort of more compact parameterized function representation

50
Q

Complete Knowledge

A

The agent has a complete and accurate model of the environment’s dynamics (p)

51
Q

Bellman Equations

A

Expresses a recursive relationship between the current value of a state and the values of its successor states

States that the value of a state must equal the discounted value of the expected next state, plus the reward expected along the way
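
Written out for state values under a policy pi, in the plain-text notation used by the other cards here:

  v_pi(s) = sum over a of pi(a|s) * sum over s',r of p(s',r|s,a) * [ r + gamma * v_pi(s') ]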

52
Q

Discounting Rate

A

0 <= gamma <= 1

Determines the present value of future rewards: a reward received k time steps in the future is only worth gamma^(k-1) times what it would be worth if it were received immediately

0 = myopic = prioritize only immediate rewards

1 = farsighted = prioritize future rewards equal to immediate rewards
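
A small worked example of the effect of gamma, assuming gamma = 0.9 and a constant reward of +1 at every step of a continuing task:

  G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
      = 1 + 0.9 + 0.81 + ...
      = 1 / (1 - 0.9)
      = 10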

53
Q

Backup Operations

A

Operations that “transfer” value information back to previous states (or state-action pairs) from successor states (or state-action pairs)

54
Q

Optimal Value Function

A

v_*(s) or q_*(s,a)

The value function shared by all Optimal Policies

55
Q

Bellman Optimality Equations

A

Expresses that the value of a state under an optimal policy must equal the expected return for the best action from that state

56
Q

Dynamic Programming

A

Refers to a collection of algorithms that can be used to compute optimal policies GIVEN A PERFECT MODEL OF THE ENVIRONMENT as a MDP

We use a value function to organize and structure the search for good policies

DP algorithms are obtained by turning Bellman equations into assignments, i.e., into update rules for improving approximations of the desired value functions

Depth = Minimum (1)
Width = Maximum (Expected Updates)

One-Step Expected-Update Method

57
Q

Policy Evaluation (Prediction Problem)

A

Process of producing a value function given an input policy

58
Q

Iterative Policy Evaluation

A

Repeatedly applying the policy evaluation update for a given policy, from a single sweep up to full convergence of the value function
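
A minimal in-place sketch, assuming states and actions are iterables, policy(s) returns a dict of action probabilities, and dynamics(s, a) returns (s', r, prob) triples (all illustrative names):

    def iterative_policy_evaluation(states, actions, policy, dynamics,
                                    gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:  # one sweep over the state space
                v_old = V[s]
                # Expected update: average over actions and over next states/rewards
                V[s] = sum(policy(s)[a] * sum(pr * (r + gamma * V[s2])
                                              for (s2, r, pr) in dynamics(s, a))
                           for a in actions)
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:  # stop when the value function has stabilized
                return V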

59
Q

Expected Update

A

Replacing the old value of s with a new value obtained from the old values of the successor states of s and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated

“Expected” because they are based on an expectation over all possible next states rather than on a sample next state

60
Q

Sweep

A

Visiting every state in the state space in a deterministic order

61
Q

Policy Improvement

A

Process of producing a policy given an input value function

The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy

62
Q

Policy Iteration

A

Iteration between:

  1. Full Policy Evaluation (go until the value function does not change upon update)

followed by

  2. Full Policy Improvement (go until the policy does not change upon update)
63
Q

Value Iteration

A

Similar to Policy Iteration however, Iteration between:

  1. Single sweep of Policy Evaluation

followed by

  2. Single sweep of Policy Improvement
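
A minimal sketch under the same illustrative states/actions/dynamics assumptions as the policy-evaluation sketch above; each sweep backs up the value of the best action, folding the improvement step into the evaluation sweep:

    def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_old = V[s]
                # Back up the BEST action value rather than the policy average
                V[s] = max(sum(pr * (r + gamma * V[s2])
                               for (s2, r, pr) in dynamics(s, a))
                           for a in actions)
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                return V
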
64
Q

Asynchronous Dynamic Programming

A

Same as Dynamic Programming but instead of updating states in sweeps, algorithm updates the values of the states in any order whatsoever, using whatever values of other states happen to be available

65
Q

Generalized Policy Iteration

A

General idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes

66
Q

Monte Carlo Methods

A

Ways of solving the reinforcement learning problem based on averaging sample returns

Average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value

Estimates for each state are independent, and bootstrapping does not occur

Provide an alternative policy evaluation process (compared to DP). Rather than use a model to COMPUTE the value of each state, they simply average many returns that start in the state.

67
Q

Experience (RL)

A

Sample sequences of states, actions, and rewards from actual or simulated interaction with an environment

68
Q

First-Visit Monte Carlo

A

Estimates value of state as an average of the returns following FIRST visits to s
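
A sketch of first-visit MC prediction, assuming each episode is a list of (S_t, R_{t+1}) pairs generated under the policy being evaluated (illustrative names):

    from collections import defaultdict

    def first_visit_mc(episodes, gamma=1.0):
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            # Walk the episode backwards, accumulating the discounted return G
            for t in range(len(episode) - 1, -1, -1):
                state, reward = episode[t]
                G = gamma * G + reward
                # Record G only for the FIRST visit to this state in the episode
                if state not in [s for s, _ in episode[:t]]:
                    returns[state].append(G)
        # V(s) is the average of the returns recorded after first visits to s
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}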

69
Q

Every-Visit Monte Carlo

A

Estimates value of state as an average of the returns following EVERY visit to s

70
Q

Exploring Starts

A

A method of exploration where every episode starts in a randomly selected state-action pair, thereby guaranteeing that all state-action pairs will (eventually) be visited

71
Q

Control Problem

A

The problem of finding the optimal policy for a given environment

72
Q

On-Policy Methods

A

Attempt to evaluate and improve the policy that is used to make decisions

73
Q

Off-Policy Methods

A

Attempt to evaluate or improve a policy different from that used to generate the data

Learning is from data “off” the target policy

74
Q

Soft vs Hard Policy

A

Soft meaning that every action has a nonzero probability of being selected (stochastic policy)

Hard meaning that a single action is always selected in each state (deterministic policy)

75
Q

Target Policy

A

Policy being learned

76
Q

Behavior Policy

A

Policy used to generate behavior

77
Q

Importance Sampling

A

A general technique for estimating expected values under one distribution given samples from another distribution

78
Q

Importance-Sampling Ratio

A

Rho (ρ)

Weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies
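
In the plain-text notation of these cards, for the portion of a trajectory from time t through T-1:

  rho_{t:T-1} = product over k = t,...,T-1 of [ pi(A_k|S_k) / b(A_k|S_k) ]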

79
Q

Ordinary Importance Sampling

A

Importance Sampling done as a simple average

80
Q

Weighted Importance Sampling

A

Importance Sampling done as a weighted average
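
The two estimators side by side, assuming G_i and rho_i are the return and importance-sampling ratio of the i-th return observed from state s, and n is the number of such returns (notation illustrative):

  Ordinary:  V(s) = [ sum_i rho_i * G_i ] / n
  Weighted:  V(s) = [ sum_i rho_i * G_i ] / [ sum_i rho_i ]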

81
Q

Coverage

A

The requirement in off-policy learning that every action taken under the target policy (pi) is also taken, at least occasionally, under the behavior policy (b)

82
Q

Trial-and-Error Search

A

TODO

83
Q

Delayed Reward

A

TODO

84
Q

Exploration

A

Taking actions to better understand environment

85
Q

Exploitation

A

Taking the actions currently believed to be best, i.e., those that generate the maximum return according to the agent’s current value estimates

86
Q

Exploration-Exploitation Dilemma

A

In order to exploit more in the long run, we need to explore, but while we are exploring we are not exploiting.

On any single action selection we can either explore or exploit, but we cannot do both

87
Q

Weak Solution Methods

A

Solution methods based on general principles, such as searching or learning

88
Q

Strong Solution Methods

A

Solution methods based on specific domain knowledge

89
Q

Q: What do all algorithms in Part I have in common?

A
  1. Seek to estimate value function
  2. Operate by backing up values along actual or possible state transitions
  3. Follow general strategy of Generalized Policy Iteration (GPI)
90
Q

Q: On what two dimensions do the algorithms in Part I differ and what do they mean?

A

Both dimensions describe the kind of update used to improve the value function.

  1. Depth of Update
    - The degree of bootstrapping
  2. Width of Update
    - Are sample updates used or expected updates?
  • Expected Updates = exact expectation of future reward by summing over all possible next states and actions, weighted by their probabilities.
  • Sample Updates = approximation of the expected value by averaging over a sample of trajectories
91
Q

Expected Updates

A

= Exact expectation of future reward by summing over all possible next states and actions, weighted by their probabilities.

Ex. Used in DP

92
Q

Sample Updates

A

= Approximation of the expected value by averaging over a sample of trajectories.

Ex. Used in MC and TD

93
Q

Q: What is the difference between Expected and Sample updates?

A

Expected = Exact expectation of future reward by summing over all possible next states and actions, weighted by their probabilities.

Sample = Approximation of the expected value by averaging over a sample of trajectories

94
Q

Classify “Dynamic Programming” along with DEPTH and WIDTH dimensions

A

Depth = Minimum (1)
Width = Maximum (Expected Updates)

One-Step Expected-Update Method

95
Q

Classify “Monte Carlo Methods” along with DEPTH and WIDTH dimensions

A

Depth = Maximum (No Bootstrapping)

Width = Minimum (Sample Updates)

Non-bootstrapping Sample-Update Method

96
Q

Classify “Temporal-Difference Learning” along with DEPTH and WIDTH dimensions

A

Depth = Minimum (1-step Bootstrapping)

Width = Minimum (Sample Update)

1-step Bootstrapping Sample-Update Method

97
Q

Classify “TD(n)” along with DEPTH and WIDTH dimensions

A

TD(n=1) = one-step Temporal-Difference

Depth = Minimum (1-step Bootstrapping)

Width = Minimum (Sample Update)

1-step Bootstrapping Sample-Update Method

TD(n=∞) = Monte-Carlo

Depth = Maximum (No Bootstrapping)

Width = Minimum (Sample Updates)

Non-bootstrapping Sample-Update Method

98
Q

Classify “TD(λ)” along with DEPTH and WIDTH dimensions

A

TD(λ=0) = Temporal-Difference

Depth = Minimum (1-step Bootstrapping)

Width = Minimum (Sample Update)

1-step Bootstrapping Sample-Update Method

TD(λ=1) = Monte-Carlo

Depth = Maximum (No Bootstrapping)

Width = Minimum (Sample Updates)

Non-bootstrapping Sample-Update Method

99
Q

Classify “Exhaustive Search” along with DEPTH and WIDTH dimensions

A

Depth = Maximum (No Bootstrapping)
Width = Maximum (Expected Updates)

100
Q

Q: What is a solution method?

A

A way of structuring the search for the optimal policy

101
Q

Q: What are the 4 advantages of MC over DP?

A
  1. Can learn optimal behavior directly from interaction with environment, no model of environment dynamics needed
  2. Can be used with simulation or “sample models”
  3. Easy and efficient to focus on a small subset of states. Do not NEED to sweep over ALL states.
  4. Less harmed by violations of the Markov property because MC does not update value estimates on the basis of the value estimates of successor states (no bootstrapping, unlike DP and TD)
102
Q

Q: If a model is not available, why is it useful to estimate action values rather than state values?

A

Without a model, we don’t know which action in a given state will produce the highest reward. Previously, this was dictated by the environment dynamics and could be computed directly. Now, we must learn this for ourselves through sample updates.

103
Q

Maximization Bias (6.7)

A

TODO

104
Q

Double Learning (6.7)

A

TODO

105
Q

Q: What is the difference between the environment state, observation, and agent state?

A

TODO

106
Q

Q: Does MC or TD have lower bias? Why?

A

MC has lower bias (it is unbiased) because its estimate is an average of actual returns, whose expected value is the true value, while TD bootstraps off its own (possibly inaccurate) estimates and therefore introduces bias

107
Q

Q: Does MC or TD have lower variance? Why?

A

TD has lower variance as target only depends on a single random action, transition, and reward while MC depends on a whole trajectory of actions, transitions, and rewards.

108
Q

Q: What is the difference between Sarsa, Q-Learning, and Expected Sarsa?

A

TODO
