Chapter 1 Flashcards

1
Q

2 most important characteristics of RL

A
  1. Trial-and-error search
  2. Delayed reward

2
Q

Difference between RL and supervised learning (SL)

A

SL -> Learn from a training set of labeled examples provided by a knowledgeable external supervisor. Each example is a description of a situation together with a label of the correct action that the system should take in that situation. Objective: generalize its responses so that it acts correctly in situations not present in the training set.

RL -> For interactive problems, it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory, an agent must be able to learn from its own experience.

3
Q

Difference between RL and unsupervised learning (UL)

A

UL ->
Finding structure hidden in collections of unlabeled data

RL ->
Maximize reward instead of trying to find hidden structure

4
Q

Key challenge unique to RL

A

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best.
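A minimal sketch of one common way to manage this trade-off, ε-greedy action selection (the function name, the value table, and the example numbers are illustrative assumptions, not from the text): with probability ε the agent explores a random action, otherwise it exploits the action with the highest current value estimate.

  import random

  def epsilon_greedy(values, actions, epsilon=0.1):
      # values: dict mapping action -> current estimated reward (learned elsewhere)
      # actions: list of available actions
      if random.random() < epsilon:
          return random.choice(actions)                          # explore: try something else
      return max(actions, key=lambda a: values.get(a, 0.0))      # exploit: best action found so far

  # Usage: estimates favour "b", but roughly 10% of selections will still explore.
  estimates = {"a": 0.2, "b": 0.7, "c": 0.4}
  action = epsilon_greedy(estimates, list(estimates), epsilon=0.1)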

5
Q

Shared features of RL examples

A

All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about its environment.

The agent’s actions are permitted to affect the future state of the environment (e.g., the next chess position, the level of reservoirs of the refinery, the robot’s next location and the future charge level of its battery), thereby affecting the actions and opportunities available to the agent at later times.

Correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning.

6
Q

6 subelements of RL

A

Agent

Environment

Policy -

  • Mapping from perceived states of the environment to actions to be taken when in those states.
  • Corresponds to what in psychology would be called a set of stimulus–response rules or associations.
  • May be a simple function or a lookup table.
  • The core of a reinforcement learning agent, in the sense that it alone is sufficient to determine behavior.
  • Policies may be stochastic, specifying probabilities for each action.

Reward Signal -

  • Defines the goal of a reinforcement learning problem.
  • On each time step, the environment sends the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run.
  • The primary basis for altering the policy: if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.

Value Function -

  • Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.
  • Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
  • Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.
  • To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.
  • We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run.
  • Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.

Model (optional) -

  • Something that mimics the behavior of the environment or, more generally, that allows inferences to be made about how the environment will behave.
  • Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. (A minimal code sketch of these subelements follows this list.)
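A minimal sketch, purely illustrative (the toy states, names, and transitions are assumptions, not from the text), of how these subelements can be represented as plain data structures:

  # Policy: lookup table from perceived state to action.
  policy = {"s0": "right", "s1": "right", "s2": "stay"}

  # Value function: state -> estimate of total reward obtainable from that state onward.
  value = {"s0": 0.5, "s1": 0.8, "s2": 1.0}

  # Reward signal: immediate scalar feedback from the environment; it defines the goal.
  def reward_signal(state, action):
      return 1.0 if state == "s2" else 0.0

  # Model (optional): predicts how the environment will behave; used for planning.
  def model(state, action):
      transitions = {("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "stay"): "s2"}
      return transitions.get((state, action), state)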
7
Q

How does RL differ from genetic algorithms, genetic programming, simulated annealing, and other optimization methods?

A

Only RL methods estimate value functions; evolutionary and other direct optimization methods do not.

These methods apply multiple static policies each interacting over an extended period of time with a separate instance of the environment. The policies that obtain the most reward, and random variations of them, are carried over to the next generation of policies, and the process repeats.

8
Q

Explain how tic-tac-toe is solved with classical optimization methods

A

Classical optimization methods for sequential decision problems, such as dynamic programming, can compute an optimal solution for any opponent, but require as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state.

About the best one can do on this problem is first to learn a model of the opponent’s behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model.

9
Q

Explain how tic-tac-toe is solved with evolutionary methods

A

Evolutionary methods directly search the space of possible policies for one with a high probability of winning against the opponent.

Here, a policy is a rule that tells the player what move to make for every state of the game—every possible configuration of Xs and Os on the three-by-three board.

For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. But each policy change is made only after many games, and only the final outcome of each game is used: what happens during the games is ignored.

This evaluation would then direct which policy or policies were considered next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate a population of policies. Literally hundreds of different optimization methods could be applied.
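A minimal sketch of the hill-climbing idea described above, assuming the caller supplies an evaluate(policy) routine that plays some games and returns the win frequency, and a legal_moves(state) helper (both hypothetical, not from the text):

  import random

  def hill_climb(evaluate, initial_policy, legal_moves, iterations=500):
      # initial_policy: dict mapping board state -> move (a complete policy table)
      best = dict(initial_policy)
      best_score = evaluate(best)                          # win frequency over many games
      for _ in range(iterations):
          candidate = dict(best)
          state = random.choice(list(candidate))           # perturb one state's move at random
          candidate[state] = random.choice(legal_moves(state))
          score = evaluate(candidate)                      # only the final outcomes of games are used
          if score > best_score:                           # keep the perturbation only if it helped
              best, best_score = candidate, score
      return best

Note how this matches the text: evaluation touches only whole-game outcomes, and the policy changes only after many games.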

10
Q

Explain how tic-tac-toe is solved with a value function

A

Set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state’s value, and the whole table is the learned value function.

State A has higher value than state B, or is considered “better” than state B, if the current estimate of the probability of our winning from A is higher than it is from B.

Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are filled up, the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.

We then play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see.

While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we “back up” the value of the state after each greedy move to the state before the move. More precisely, the current value of the earlier state is updated to be closer to the value of the later state.
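A minimal sketch of the value table, the move selection, and the backup described above (the state encoding and the surrounding game loop are omitted; the names and constants are illustrative assumptions):

  import random
  from collections import defaultdict

  values = defaultdict(lambda: 0.5)    # state -> estimated P(win); wins are 1.0, losses/draws 0.0
  ALPHA = 0.1                          # step size: how far to move toward the later state's value
  EPSILON = 0.1                        # fraction of exploratory (random) moves

  def choose_move(afterstates):
      # afterstates: the states reachable from the current position, one per blank square
      if random.random() < EPSILON:
          return random.choice(afterstates)                 # exploratory move
      return max(afterstates, key=lambda s: values[s])      # greedy move: highest estimated P(win)

  def backup(earlier_state, later_state):
      # V(s) <- V(s) + alpha * (V(s') - V(s)): nudge the earlier value toward the later one
      values[earlier_state] += ALPHA * (values[later_state] - values[earlier_state])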

11
Q

Exercise 1.1: Self-Play: Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

A

The algorithm will continue to adapt until it reaches an equilibrium, which may be either fixed (always making the same moves) or cyclical. It may reach a higher skill level than against a fixed opponent, but it may also miss paths toward stronger overall play because it is only reacting to itself.

If two players always choose the best move for themselves, they reach what is called a Nash equilibrium: roughly, a point at which no player is willing to change its moves, because doing so would decrease its objective function.

If an agent is playing against itself, then since both sides use the same policy they will, in my opinion, converge to such an equilibrium: both learn the best moves under the same policy, and so both always play the move they believe is best.

12
Q

Exercise 1.2: Symmetries Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

A

We could improve our reinforcement learning algorithm by taking advantage of symmetry, simplifying the definition of the “state” and “action” upon which the algorithm works. By reducing the dimension of the state space in this way, we can be more confident that our learned results will be statistically significant, since the state space we operate in is smaller. If our opponent were taking advantage of symmetries in tic-tac-toe, our algorithm should too, since that would make us a better player against this type of opponent. If our opponent does not use symmetries, then our algorithm should not either (except to reduce the state space as discussed above), since enforcing a symmetry on our opponent that is not in fact there would decrease our performance against this type of opponent. So symmetrically equivalent positions need not have the same value against such an opponent.

13
Q

Exercise 1.3: Greedy Play Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

A

As will be discussed later in the book, a purely greedy approach will not be able to learn more optimal moves as play unfolds. There is a natural trade-off between exploring the space of possibilities and selecting the action on this step that has the greatest estimated reward. The problem with purely greedy play is that our player fails to discover moves that lead to improved rewards, because it never takes a chance on unknown (or unexplored) actions.

14
Q

Exercise 1.4: Learning from Exploration Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

A

If we do not learn from exploratory moves, then the state probabilities learned would effectively be incomplete, in that we are not updating our understanding of what happens when a given action is taken in a given state.

If we do learn from our exploratory moves, then our limiting probabilities should be those of the actual distribution of state and action selections. A more complete understanding of these probabilities should result in better play, since the player better understands the “game” he or she is playing.

15
Q

Exercise 1.5: Other Improvements Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

A

One possible improvement would be to keep a saved library of scripted plays: when in a set of known states, always execute the corresponding moves. This is somewhat like chess, where there are various “opening” positions that expert players have deemed good. This might expedite the overall learning process, or at least improve our reinforcement player's initial play.

Since the tic-tac-toe problem is so simple, we can also solve it exactly (using recursion), computing all possible opponent moves and selecting at each step the move that optimizes our chance of winning.
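A minimal recursive sketch of that exhaustive solution, here as a minimax search that assumes the opponent also plays perfectly; the 9-character string board encoding and the function names are illustrative assumptions:

  from functools import lru_cache

  LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

  def winner(board):
      for a, b, c in LINES:
          if board[a] != " " and board[a] == board[b] == board[c]:
              return board[a]
      return None

  @lru_cache(maxsize=None)
  def best_value(board, player):
      # Best achievable outcome for X under perfect play: +1 win, 0 draw, -1 loss.
      w = winner(board)
      if w is not None:
          return 1 if w == "X" else -1
      moves = [i for i, cell in enumerate(board) if cell == " "]
      if not moves:
          return 0                                          # full board, no winner: a draw
      nxt = "O" if player == "X" else "X"
      scores = [best_value(board[:i] + player + board[i + 1:], nxt) for i in moves]
      return max(scores) if player == "X" else min(scores)  # X maximizes, O minimizes

  # The empty board with X to move is a draw under perfect play, so this prints 0.
  print(best_value(" " * 9, "X"))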

16
Q

3 threads of RL that converged

A
  1. Trial-and-error learning - originated in the psychology of animal learning
  2. Optimal control and its solution using value functions and dynamic programming
  3. Temporal-difference (TD) methods - the agent learns from an environment through episodes, with no prior knowledge of the environment, updating its estimates from other estimates as it goes. TD learning is model-free; you can think of it as a form of learning from trial and error. Example: Q-learning (see the sketch below).
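A minimal sketch of the tabular Q-learning update as a concrete TD example (the table, step size, and discount factor shown here are illustrative choices, not from the text):

  from collections import defaultdict

  Q = defaultdict(float)    # (state, action) -> estimated long-run return
  ALPHA = 0.1               # step size
  GAMMA = 0.9               # discount factor

  def q_update(state, action, reward, next_state, next_actions):
      # TD update: move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
      best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
      target = reward + GAMMA * best_next
      Q[(state, action)] += ALPHA * (target - Q[(state, action)])

The update needs no model of the environment: only the observed transition (state, action, reward, next state).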

17
Q

What do we mean by optimal control?

A

The term “optimal control” came into use in the late 1950s to describe the problem of designing a controller to minimize or maximize a measure of a dynamical system’s behavior over time.

18
Q

Key problem with general stochastic optimal control problems

A

Dynamic programming suffers from what Bellman called “the curse of dimensionality,” meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other general method.

19
Q

Basic credit-assignment problem for complex reinforcement learning systems

A

How do you distribute credit for success among the many decisions that may have been involved in producing it? All of the methods we discuss in this book are, in a sense, directed toward solving this problem.

20
Q

When does stationarity in your policies apply?

A

A stationary policy applies as long as the horizon is infinite:
π(s) -> a

With a finite horizon, the best action may depend on time as well as state, so the policy becomes time-dependent:
π(s, t) -> a

21
Q

3 ways to balance exploration with exploitation

A
  1. Random exploration (ε-greedy, softmax)
  2. Optimism in the face of uncertainty
    - Prefer to explore the states/actions with the highest uncertainty, e.g. via upper-confidence bounds (see the sketch after this list)
  3. Information state space
    - Augment the state with what the agent knows about the environment, so the value of exploring can itself be planned for
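A minimal sketch of the optimism-in-the-face-of-uncertainty idea, using an upper-confidence-bound (UCB1-style) score for a bandit-like action choice (the names and the exploration constant are illustrative assumptions):

  import math

  def ucb_select(actions, value_estimates, counts, total_pulls, c=2.0):
      # value_estimates: action -> mean observed reward so far
      # counts: action -> number of times the action has been tried
      def score(a):
          if counts.get(a, 0) == 0:
              return float("inf")      # never tried: maximally uncertain, so try it first
          bonus = c * math.sqrt(math.log(total_pulls) / counts[a])   # shrinks as the action is tried more
          return value_estimates[a] + bonus
      return max(actions, key=score)

Actions that are rarely tried keep a large uncertainty bonus, so the agent stays “optimistic” about them until it has the experience to know better.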
22
Q

What are the types of exploration?

A
  1. State/action exploration - pick a different action each time you return to a state
  2. Parameter exploration - perturb the policy's parameters and follow the resulting behavior for a while (e.g. a different walking gait)
23
Q

Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

A

The MDP framework fails when the state cannot be fully observed. For example, one may try to control the temperature of a house using a state space represented by one thermometer in one room. Our agent would control the temperature observed by that thermometer well, but would be blind to temperatures that aren’t measured by said thermometer elsewhere in the house.

24
Q

Exercise 3.3: Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

A

There’s a trade-off between state-action space complexity, computational expense, and accuracy. If we draw the boundary at the brain, we would create a state-action space contingent on the number of neurons in the brain and their interplay with physical actions like turning the steering wheel, which would be far too large to store or compute over efficiently. Equally, if we draw the boundary at the journey level, then we miss the detail required to act on second-by-second state changes on the road that could lead to a crash. The fundamental limiting factor in this selection is whether the goal can be achieved safely at your chosen layer of abstraction; indeed, this feels like one of the tasks of engineering more widely.