Reinforcement Learning Flashcards
- Reinforcement Learning: An Introduction (Sutton & Barto)
Reinforcement Learning
Computational approach to understanding and automating goal-directed learning and decision making
Simultaneously:
1. Problems
2. Solution Methods
3. Field of Study (of the given problems and their respective solution methods)
GOAL: Find the optimal policy for a given environment (CONTROL PROBLEM)
Reward
Scalar value/signal, produced by the environment in response to an action taken upon it by the Agent, indicating the immediate/primary feedback of the action by the environment
Primary, Immediate
Numerical value, returned by the environment, that the agent seeks to maximize over time through its choice of actions
Our way of communicating to the agent WHAT we want achieved (not how we want it achieved)
Basis for evaluating the actions the agent decides to take
Model of the Environment
Optional element of the Agent, allows inferences to be made about how the environment will behave. Used for planning
Value
The expected cumulative (usually discounted) reward the Agent would receive starting in the current state and following the current policy. (Secondary to reward). Represents long-term desirability of states.
Planning
Any way of deciding on a course of action by first considering possible future situations before they are actually experienced
Model-Free Methods
Methods that DO NOT use a model of the environment. Exclusively trial-and-error learning
Model-Based Methods
Methods that USE a model of the environment and planning
Main sub-elements of a RL system:
- Policy
- Reward Signal
- Value Function
- Model (Optional)
Tabular Solution Methods
Solution methods where the corresponding environment state and action spaces are small enough for said method to represent the value function as an array/table
Most Important Feature Distinguishing RL from other types of ML?
Uses training information that EVALUATES the actions taken rather than INSTRUCTS by giving correct actions
k-Armed Bandit Problem
RL problem where you are faced repeatedly with a choice among k different actions and a single state. After each action you receive a reward drawn from a probability distribution that depends on the action selected. The objective is to maximize the expected total reward over time
Action-Value Methods
Methods for estimating the values of actions
Action-Selection Methods
Methods to select actions given action-values
Sample-Average
Action-Value Method: each value estimate is an average of the sample of relevant rewards
Greedy Behavior
Selecting the action that has the highest value. We are exploiting our current knowledge of the values of the actions
Nongreedy (Exploratory) Behavior
Selecting an action that does NOT have the highest estimated value. We are exploring because this enables us to improve our estimate of that action's value
Epsilon-Greedy Action-Selection Method
Select the greedy action most of the time, but every once in a while, with small probability epsilon, instead select randomly from among all actions with equal probability, independent of the action-value estimates
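A minimal sketch (not the book's pseudocode) of epsilon-greedy selection over sample-average action-value estimates for a k-armed bandit; the Gaussian reward distributions and default parameters are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(k=10, steps=1000, epsilon=0.1):
    """Epsilon-greedy action selection with sample-average value estimates (illustrative)."""
    true_means = [random.gauss(0, 1) for _ in range(k)]   # hypothetical stationary bandit
    q = [0.0] * k   # action-value estimates Q(a)
    n = [0] * k     # times each action has been selected

    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                    # explore: random action
        else:
            a = max(range(k), key=lambda i: q[i])      # exploit: greedy action
        r = random.gauss(true_means[a], 1)             # sample a reward for this action
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                      # incremental sample average
    return q
```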
Error
[Target - OldEstimate]
Incremental Update Rule
NewEstimate = OldEstimate + StepSize[Target - OldEstimate]
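The same rule as a one-line function, a sketch where step_size is 1/n for sample averages or a small constant for tracking a nonstationary problem.

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# step_size = 1/n reproduces the sample average; a constant step_size (e.g., 0.1)
# weights recent rewards more heavily, which suits nonstationary problems.
```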
Stationary Problem
A RL problem where the reward probabilities do NOT change over time
Nonstationary Problem
A RL problem where the reward probabilities DO change over time
Optimistic Initial Values
Exploration method where the value function is initialized with large (optimistic) values so as to encourage exploration until the estimates update to become more realistic
Upper-Confidence-Bound (UCB) Action-Selection Method
For each action, we track the "uncertainty" or variance in the estimate of that action's value.
Exploration is achieved by selecting the action with the highest upper confidence bound on its value, so actions with uncertain estimates get tried and their uncertainty is reduced over time
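A sketch of the UCB selection rule, Q(a) + c*sqrt(ln t / N(a)); the tie-breaking and the treatment of untried actions are assumptions of this sketch.

```python
import math

def ucb_select(q, n, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(t) / N(a)).

    q: value estimates, n: selection counts, t: current time step (t >= 1).
    Actions not yet tried (N(a) == 0) are treated as maximizing and selected first.
    """
    for a, count in enumerate(n):
        if count == 0:
            return a
    return max(range(len(q)), key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]))
```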
Gradient Bandit Algorithms
Instead of learning to estimate the action values for each action, we instead learn a numerical “preference” for each action, which we denote H(a).
The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward - only the relative preference of one action over another is important
Action Probabilities are determined via a soft-max distribution
Soft-Max Distribution
Probability distribution (sums to 1) over a set of mutually exclusive events. Takes a vector of real-valued inputs (e.g., action preferences) and returns a vector of probabilities, one per input, by exponentiating each input and normalizing
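A small sketch of the soft-max distribution as used by gradient bandit algorithms to turn preferences H(a) into action probabilities; the example preference values are made up.

```python
import math

def softmax(preferences):
    """Convert numerical preferences H(a) into a probability distribution over actions."""
    m = max(preferences)                        # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Only relative preferences matter: adding the same constant to every H(a)
# leaves the resulting probabilities unchanged.
print(softmax([1.0, 2.0, 0.5]))
```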
Markov Decision Process (MDP)
Classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent states, and through those future rewards.
Mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made
Proposes that any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between the agent and the environment (actions, rewards, states)
Agent
Learner and decision maker of the RL problem
Objective is to maximize the amount of reward it receives over time (return / expected value)
Needs to be able to sense the environment, take action to change the environment, and process received reward
Environment
The entity that the agent interacts with, comprising everything outside the agent
Action
A choice made by the agent to take upon the environment, in service of maximizing expected return.
In general, actions can be any decisions we want to learn how to make
Trajectory / Sequence
The sequence of states, actions, and rewards over time steps: S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...
Dynamics
p(s’,r|s,a)
The probability of transitioning to state s' and receiving reward r, given the preceding state s and action a.
A well-defined discrete probability distribution that depends only on the preceding state and action
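One possible plain-Python encoding of the dynamics p(s', r | s, a), shown on a tiny hypothetical two-state MDP; the state names, actions, rewards, and probabilities are all made up for illustration.

```python
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples that sum to 1
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

def p(next_state, reward, state, action):
    """p(s', r | s, a) read off the table above."""
    return sum(pr for s2, r, pr in dynamics[(state, action)]
               if s2 == next_state and r == reward)
```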
State
Representation of the environment to the agent. Returned to the agent in response to an action on the environment during the agent-environment interface (along with the reward)
Must include information about all aspects of the past sequence that makes a difference for the future (Markovian)
In general, can be anything we can know that might be useful in making decisions about which action to take
Basis for how the agent makes decisions
Markov Property
A state has the Markov property if it includes information about all aspects of the past agent-environment interaction that make a difference for the future
Agent-Environment Boundary
Boundary between what we consider the agent and what we consider the environment.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of the environment
Represents the limit of the agent’s absolute control, not of its knowledge
Agent-Environment Interface
The continual interaction between the agent and environment, in which the agent selects actions and the environment responds to these actions by returning a reward and presenting a new state to the agent
Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)
Return
G
Secondary, Delayed
Sum of the (possibly discounted) rewards received after time step t, up to the terminal state (or over an infinite horizon for continuing tasks)
Episode
One complete sequence of agent-environment interaction, from a starting state to the terminal state (e.g., one play of a game)
Terminal State
A state of an environment that terminates the agent-environment interaction loop
Followed by a reset to a standard starting state
Episodic Task
Any task that can be broken up into episodes
Continuing Task
Any task that is not broken up into episodes
Discounting
A process by which future rewards are lessened in order to give priority to more immediate rewards via the Discounting Rate
Absorbing State
Special state that transitions only to itself and generates only rewards of zero
Used to unify episodic and continuing tasks: with it, the return of an episodic task can be written as a sum over an infinite sequence of rewards (all zero after termination)
Value Function
v(s) or q(s,a)
Estimates how good it is for an agent to be in a given state, or to perform a given action in a given state
Expected return when starting in state s and following policy pi thereafter:
v_pi(s) = E_pi[G_t | S_t = s]
Policy
pi(a|s)
Mapping from states to PROBABILITIES of selecting each possible action
Subcomponent of the agent
Reinforcement learning methods specify how the agent’s policy is changed as a result of its experience
Backup Diagrams
Diagram relationships that form the basis of the update or backup operations that are at the heart of reinforcement learning methods
Assist in providing graphical summaries of the RL algorithms
Optimal Policy
pi_*
The policy that is better than or equal to all other policies
Tabular Reinforcement Learning
RL problems that use small, finite state sets that are modeled using arrays/tables with one entry for each state (or state-action pair)
Approximated Reinforcement Learning
RL problems that have state spaces too large for arrays/tables and must be approximated using some sort of more compact parameterized function representation
Complete Knowledge
The agent has a complete and accurate model of the environment's dynamics (p)
Bellman Equations
Expresses a recursive relationship between the current value of a state and the values of its successor states
States that the value of a state must equal the discounted value of the expected next state, plus the reward expected along the way
Discounting Rate
0 <= gamma <= 1
Determines the present value of future rewards: a reward received k time steps in the future is only worth gamma^(k-1) times what it would be worth if it were received immediately
gamma = 0: myopic, prioritize only immediate rewards
gamma = 1: farsighted, weight future rewards equally with immediate rewards
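A short sketch computing a discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... from a hypothetical reward sequence.

```python
def discounted_return(rewards, gamma):
    """Accumulate G = r1 + gamma*r2 + gamma^2*r3 + ... by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```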
Backup Operations
Operations that “transfer” value information back to previous states (or state-action pairs) from successor states (or state-action pairs)
Optimal Value Function
v_*(s) or q_*(s,a)
The value function shared by all optimal policies
Bellman Optimality Equations
Expresses that the value of a state under an optimal policy must equal the expected return for the best action from that state
Dynamic Programming
Refers to a collection of algorithms that can be used to compute optimal policies GIVEN A PERFECT MODEL OF THE ENVIRONMENT as a MDP
We use a value function to organize and structure the search for good policies
DP algorithms are obtained by turning Bellman equations into assignments, that is, into update rules for improving approximations of the desired value functions
Depth = Minimum (1)
Width = Maximum (Expected Updates)
One-Step Expected-Update Method
Policy Evaluation (Prediction Problem)
Process of producing a value function given an input policy
Iterative Policy Evaluation
Repeatedly applying the expected update to every state (one or more sweeps) until the value function converges to v_pi for the given policy
Expected Update
Replacing the old value of s with a new value obtained from the old values of the successor states of s, the expected immediate rewards, and the probabilities of the one-step transitions possible under the policy being evaluated
“Expected” because they are based on an expectation over all possible next states rather than on a sample next state
Sweep
One pass through the state set, applying the update operation to every state (in some deterministic order)
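A minimal sketch of iterative policy evaluation with expected updates and full sweeps; it assumes the hypothetical dynamics-table format sketched earlier (dict of (s, a) -> (s', r, probability) triples) and a policy given as pi(a|s) probabilities, both illustrative assumptions rather than the book's notation.

```python
def iterative_policy_evaluation(states, actions, dynamics, policy, gamma=0.9, theta=1e-6):
    """Sweep over all states, replacing V(s) with its expected update, until convergence.

    dynamics[(s, a)] -> list of (next_state, reward, probability); policy[s][a] -> pi(a|s).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:                              # one sweep over the state set
            v_new = 0.0
            for a in actions:
                for s2, r, pr in dynamics[(s, a)]:
                    v_new += policy[s][a] * pr * (r + gamma * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                              # expected update (in place)
        if delta < theta:
            return V
```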
Policy Improvement
Process of producing a policy given an input value function
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy
Policy Iteration
Iteration between:
- Full Policy Evaluation (go until the value function does not change upon update)
followed by
- Full Policy Improvement (go until the policy does not change upon update)
Value Iteration
Similar to Policy Iteration however, Iteration between:
- Single sweep of Policy Evaluation
followed by
- Single sweep of Policy Improvement (each sweep effectively applies the Bellman optimality update, taking a max over actions; see the sketch below)
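A sketch of value iteration using the same hypothetical dynamics format as the earlier sketch; each sweep applies the Bellman optimality update (a max over actions), and a greedy policy is read off at the end.

```python
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    """V(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * V(s'))."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(pr * (r + gamma * V[s2]) for s2, r, pr in dynamics[(s, a)])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Deterministic policy greedy with respect to the final value function.
    policy = {s: max(actions,
                     key=lambda a: sum(pr * (r + gamma * V[s2]) for s2, r, pr in dynamics[(s, a)]))
              for s in states}
    return V, policy
```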
Asynchronous Dynamic Programming
Same as Dynamic Programming but instead of updating states in sweeps, algorithm updates the values of the states in any order whatsoever, using whatever values of other states happen to be available
Generalized Policy Iteration
General idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes
Monte Carlo Methods
Ways of solving the reinforcement learning problem based on averaging sample returns
Average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value
Estimates for each state are independent, and bootstrapping does not occur
Provide an alternative policy evaluation process (compared to DP). Rather than use a model to COMPUTE the value of each state, they simply average many returns that start in the state.
Experience (RL)
Sample sequences of states, actions, and rewards from actual or simulated interaction with an environment
First-Visit Monte Carlo
Estimates the value of state s as an average of the returns following FIRST visits to s
Every-Visit Monte Carlo
Estimates the value of state s as an average of the returns following EVERY visit to s
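A sketch of first-visit Monte Carlo prediction; the episode format (a list of (state, reward) pairs, with the reward received after leaving that state) is an assumption of this sketch, not the book's notation. Changing which time steps are recorded turns it into every-visit MC.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the FIRST visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:                     # episode: [(state, reward), ...]
        # Compute the return G_t at every time step by working backwards.
        gs = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            gs[t] = g
        # Record the return only at the first occurrence of each state.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(gs[t])             # every-visit MC would append at every t
    return {s: sum(v) / len(v) for s, v in returns.items()}
```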
Exploring Starts
A method of exploration where every episode starts from a randomly selected state-action pair, thereby guaranteeing that (in the limit) all state-action pairs will be visited
Control Problem
The problem of finding the optimal policy for a given environment
On-Policy Methods
Attempt to evaluate and improve the policy that is used to make decisions
Off-Policy Methods
Attempt to evaluate or improve a policy different from that used to generate the data
Learning is from data “off” the target policy
Soft vs Hard Policy
Soft meaning that every action has a nonzero probability of being selected (stochastic)
Hard meaning that a single action is selected with probability 1 (deterministic)
Target Policy
Policy being learned
Behavior Policy
Policy used to generate behavior
Importance Sampling
A general technique for estimating expected values under one distribution given samples from another distribution
Importance-Sampling Ratio
rho (ρ)
Weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies
Ordinary Importance Sampling
Importance Sampling done as a simple average
Weighted Importance Sampling
Importance Sampling done as a weighted average
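A small sketch contrasting ordinary and weighted importance sampling, given a batch of observed returns and their importance-sampling ratios; the numbers are hypothetical.

```python
def ordinary_is(returns, ratios):
    """Ordinary importance sampling: simple average of ratio-weighted returns (unbiased)."""
    return sum(rho * g for g, rho in zip(returns, ratios)) / len(returns)

def weighted_is(returns, ratios):
    """Weighted importance sampling: ratio-weighted average (biased, lower variance)."""
    total = sum(ratios)
    return sum(rho * g for g, rho in zip(returns, ratios)) / total if total else 0.0

returns = [1.0, 0.0, 2.0]   # hypothetical returns observed under the behavior policy
ratios = [0.5, 2.0, 1.0]    # hypothetical importance-sampling ratios rho for each return
print(ordinary_is(returns, ratios), weighted_is(returns, ratios))
```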
Coverage
The requirement in off-policy learning that every action taken under the target policy (pi) is also taken, at least occasionally, under the behavior policy (b)
Trial-and-Error Search
The agent must discover which actions yield the most reward by trying them, rather than being told the correct actions (as in supervised learning). One of the two most important distinguishing features of RL (along with delayed reward)
Delayed Reward
Actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards, so the consequences of an action can be delayed. One of the two most important distinguishing features of RL (along with trial-and-error search)
Exploration
Taking actions in order to improve knowledge of the environment and the estimates of action values
Exploitation
Taking the actions currently estimated to be best, in order to obtain as much reward as possible from what is already known
Exploration-Exploitation Dilemma
To discover which actions are best to exploit, the agent must explore; but while exploring it is not exploiting.
On any single action selection the agent can either explore or exploit, not both, so it must balance the two over time
Weak Solution Methods
Solution methods based on general principles, such as searching or learning
Strong Solution Methods
Solution methods based on specific domain knowledge
Q: What do all algorithms in Part I have in common?
- Seek to estimate value function
- Operate by backing up values along actual or possible state transitions
- Follow general strategy of Generalized Policy Iteration (GPI)
Q: On what two dimensions do the algorithms in Part I differ and what do they mean?
Both dimensions describe the kind of update used to improve the value function.
- Depth of Update: the degree of bootstrapping
- Width of Update: are sample updates used or expected updates?
  - Expected Updates = exact expectation of future reward, summing over all possible next states and actions, weighted by their probabilities
  - Sample Updates = approximation of the expected value by averaging over a sample of trajectories
Expected Updates
= Exact expectation of future reward, computed by summing over all possible next states and actions, weighted by their probabilities.
Ex. Used in DP
Sample Updates
= Approximation of the expected value obtained by averaging over a sample of trajectories.
Ex. Used in MC and TD
Q: What is the difference between Expected and Sample updates?
Expected = exact expectation of future reward, summing over all possible next states and actions, weighted by their probabilities.
Sample = approximation of the expected value by averaging over a sample of trajectories
Classify “Dynamic Programming” along with DEPTH and WIDTH dimensions
Depth = Minimum (1)
Width = Maximum (Expected Updates)
One-Step Expected-Update Method
Classify “Monte Carlo Methods” along with DEPTH and WIDTH dimensions
Depth = Maximum (No Bootstrapping)
Width = Minimum (Sample Updates)
Non-bootstrapping Sample-Update Method
Classify “Temporal-Difference Learning” along with DEPTH and WIDTH dimensions
Depth = Minimum (1-step Bootstrapping)
Width = Minimum (Sample Update)
1-step Bootstrapping Sample-Update Method
Classify “TD(n)” along with DEPTH and WIDTH dimensions
TD(n=1) = one-step Temporal-Difference (TD(0))
Depth = Minimum (1-step Bootstrapping)
Width = Minimum (Sample Update)
1-step Bootstrapping Sample-Update Method
TD(n→∞) = Monte-Carlo
Depth = Maximum (No Bootstrapping)
Width = Minimum (Sample Updates)
Non-bootstrapping Sample-Update Method
Classify “TD(λ)” along with DEPTH and WIDTH dimensions
TD(λ=0) = Temporal-Difference
Depth = Minimum (1-step Bootstrapping)
Width = Minimum (Sample Update)
1-step Bootstrapping Sample-Update Method
TD(λ=1) = Monte-Carlo
Depth = Maximum (No Bootstrapping)
Width = Minimum (Sample Updates)
Non-bootstrapping Sample-Update Method
Classify “Exhaustive Search” along with DEPTH and WIDTH dimensions
Depth = Maximum (No Bootstrapping)
Width = Maximum (Expected Updates)
Q: What is a solution method?
A way of structuring the search for the optimal policy
Q: What are the 4 advantages of MC over DP?
- Can learn optimal behavior directly from interaction with environment, no model of environment dynamics needed
- Can be used with simulation or “sample models”
- Easy and efficient to focus on a small subset of states. Do not NEED to sweep over ALL states.
- Less harmed by violations of the Markov property because MC does not update value estimates on the basis of the value estimates of successor states (no bootstrapping, unlike DP and TD)
Q: If a model is not available, why is it useful to estimate action values rather than state values?
Without a model, state values alone are not sufficient to choose actions: picking the greedy action from v(s) requires a one-step look-ahead through the environment's dynamics (p), which we no longer have. With action-value estimates q(s,a), the agent can select the best action directly from its estimates, learning them for itself from sample experience.
Maximization Bias (6.7)
Using the maximum of estimated action values as an estimate of the maximum true value introduces a positive bias: when estimates are noisy, the max tends to overestimate (e.g., the max over Q(S', a) in Q-learning targets)
Double Learning (6.7)
Maintain two independent value estimates (e.g., Q1 and Q2). Use one estimate to select the maximizing action and the other to evaluate it, which removes the maximization bias (as in Double Q-learning)
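A tabular sketch of the double Q-learning update described above; the dictionary-based value tables and default parameters are assumptions of this sketch, and the environment step is assumed to happen elsewhere.

```python
import random
from collections import defaultdict

def double_q_update(q1, q2, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
    """With probability 0.5 update Q1, evaluating Q1's greedy action with Q2; else the reverse."""
    if random.random() < 0.5:
        a_star = max(actions, key=lambda x: q1[(s2, x)])
        q1[(s, a)] += alpha * (r + gamma * q2[(s2, a_star)] - q1[(s, a)])
    else:
        a_star = max(actions, key=lambda x: q2[(s2, x)])
        q2[(s, a)] += alpha * (r + gamma * q1[(s2, a_star)] - q2[(s, a)])

q1, q2 = defaultdict(float), defaultdict(float)   # action-value tables keyed by (state, action)
```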
Q: What is the difference between the environment state, observation, and agent state?
The environment state is the environment's own internal representation; the observation is the signal the environment actually emits to the agent at each step; the agent state is the agent's internal summary of its history used to select actions. In a fully observable MDP the observation equals the environment state and can serve directly as the agent state
Q: Does MC or TD have lower bias? Why?
MC has lower bias (it is unbiased): its targets are actual returns, whose expectation is the true value, while TD bootstraps from current estimates and therefore introduces bias
Q: Does MC or TD have lower variance? Why?
TD has lower variance, as its target depends only on a single random action, transition, and reward, while the MC target depends on an entire trajectory of actions, transitions, and rewards
Q: What is the difference between Sarsa, Q-Learning, and Expected Sarsa?
Sarsa (on-policy TD control): the target is R + gamma*Q(S', A'), where A' is the action actually selected in S' by the current policy.
Q-Learning (off-policy TD control): the target is R + gamma*max_a Q(S', a), regardless of which action is actually taken next.
Expected Sarsa: the target is R + gamma*sum_a pi(a|S')*Q(S', a), the expectation over next actions under the current policy; it moves deterministically in the direction Sarsa moves in expectation, reducing variance
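A sketch of the three TD targets side by side; q and pi are assumed to be dictionaries keyed by (state, action), an illustrative convention rather than the book's notation.

```python
def sarsa_target(r, q, s2, a2, gamma):
    """On-policy: bootstrap from the action A' actually taken in S'."""
    return r + gamma * q[(s2, a2)]

def q_learning_target(r, q, s2, actions, gamma):
    """Off-policy: bootstrap from the greedy (maximizing) action in S'."""
    return r + gamma * max(q[(s2, a)] for a in actions)

def expected_sarsa_target(r, q, s2, actions, pi, gamma):
    """Bootstrap from the expectation over next actions under the policy pi(a|S')."""
    return r + gamma * sum(pi[(s2, a)] * q[(s2, a)] for a in actions)

# Each target plugs into the same update: Q(S, A) += alpha * (target - Q(S, A)).
```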