Module 2: Chapter 6 - Reinforcement Learning Flashcards

1
Q

What is Reinforcement Learning?

A

Reinforcement learning is concerned with developing a policy for a series of decisions that maximizes a long-term reward. In reinforcement learning, the learner is presented with feedback on the quality of each decision, in the form of a reward, in a process analogous to trial and error. The technique is advantageous when decisions need to be made repeatedly, so that the algorithm can learn based on the rewards or sanctions received in previous rounds. Unlike both unsupervised and supervised learning, the “output” from reinforcement learning applications is a recommended action given the circumstances, rather than a prediction, classification, or cluster.

2
Q

Define the important terms in Reinforcement Learning terminology

A

(1) The agent – the person or algorithm making the decision.

(2) Actions, A – these are the possible choices that the agent can select from at each timestep.

(3) State, S – the circumstances, or a description of the environment, in which the decision is being made at each timestep.

(4) Reward, R – the feedback that the agent receives based on its previous action. Note that this could be either positive or negative (a reward or a sanction/penalty, respectively).

(5) Expected Future Rewards, G – the expected value of future rewards. The objective is usually to maximize G.

(6) Policy, π – this is the plan of action that the agent takes based on observing the current state. The policy maps the states to actions that will maximize the reward.

(7) Value function, V(·) – this is also known as the state-value function. It relates the expected reward to a given state; in other words, it measures how good a state is.

(8) Action-value function, Q(·) – this relates the expected reward to an action taken in a given state; in other words, it measures how good an action is, given a certain state. It is very important and useful for comparing the different actions and using them to optimize our policy.

The capital letters S and A are used to denote the set of states and the set of actions in general, whereas their lower-case counterparts denote specific states or actions. Time subscripts are generally suppressed unless they are specifically required for clarity, such as when describing the transition from the state in one period to the next.
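
To see how these pieces fit together, here is a minimal Python sketch of the agent–environment loop. The environment dynamics, rewards, and policy below are hypothetical stand-ins chosen purely for illustration; only the roles of S, A, R, G, and π come from the definitions above.

import random

# Hypothetical illustration of the agent-environment loop; all dynamics are made up.
states = ["low", "high"]       # S: the set of states
actions = ["hold", "invest"]   # A: the set of actions
gamma = 0.9                    # discount factor applied to future rewards

def policy(state):
    # pi: maps the observed state to an action
    return "invest" if state == "high" else "hold"

def environment_step(state, action):
    # Returns the reward R for the action and the next state
    reward = 1.0 if (state == "high" and action == "invest") else 0.0
    return reward, random.choice(states)

state = random.choice(states)
G = 0.0                        # realized discounted future rewards (the return)
for t in range(5):
    action = policy(state)                      # the agent selects an action
    reward, state = environment_step(state, action)
    G += (gamma ** t) * reward                  # accumulate discounted rewards
print("Discounted return G:", G)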

3
Q

What are the three possible strategies employed in Reinforcement Learning?

A

Greedy strategy (exploitation)
Random strategy (exploration)
ε-greedy strategy (exploitation & exploration)

4
Q

What is the Greedy Strategy?

A

The greedy strategy is a simple strategy in which the agent always chooses the action with the best rewards seen so far. This strategy focuses on the idea of exploitation of the information gained from the agent’s experience so far. It may appear to be a good strategy, but it has problems. If we find one slot machine that seems to pay out well and stick with it, we may end up with a suboptimal result because we did not experiment with other slot machines, which may be better. So, we should explore other possible actions. An extreme case of such exploration would be purely random selection.
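
As a rough sketch of the idea, a greedy agent in the slot-machine setting simply tracks the average payout of each machine and always plays the current best one. The payout probabilities below are assumed for illustration; note how the agent can lock onto a suboptimal machine.

import random

true_payout_probs = [0.2, 0.5, 0.3]   # hypothetical payout probability of each slot machine
counts = [0, 0, 0]                    # number of times each machine has been played
estimates = [0.0, 0.0, 0.0]           # running average payout observed for each machine

for t in range(1000):
    # Greedy: always exploit the machine with the best estimated payout so far
    machine = max(range(3), key=lambda m: estimates[m])
    reward = 1.0 if random.random() < true_payout_probs[machine] else 0.0
    counts[machine] += 1
    estimates[machine] += (reward - estimates[machine]) / counts[machine]

# The agent may get stuck on machine 0 and never discover that machine 1 pays best
print("Estimated payouts:", [round(e, 2) for e in estimates])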

5
Q

What is the Random Strategy?

A

The random strategy is an intuitive strategy where we randomly select a slot machine to play. Whereas the greedy strategy only chooses the action with the best payout up to that point, the random strategy is useful for exploring other possible actions. But it does not exploit the knowledge gained from past rewards to make more informed decisions over time.

We can see from the preceding two strategies that neither exploitation nor exploration alone is promising in the multi-armed bandit (MAB) problem. To address the shortcomings of a pure exploitation or pure exploration strategy, a combination of the two, called ε-greedy, was developed.

6
Q

What is the ε-greedy strategy?

A

The ε-greedy strategy combines greedy exploitation and random exploration. Here ε is a hyperparameter (with 0 < ε < 1) that determines whether a random selection is made to explore, or a greedy selection is made to exploit. Usually, a random number between 0 and 1 is drawn. If that number is below ε, we explore by choosing a random slot machine, otherwise we choose the machine that had the best payoff up to that point. This helps us to continue exploiting the slot machines that have provided the best payouts up to that point, while still exploring other options.

Usually, a small value such as ε = 0.05 or 0.1 is used so that the agent mostly relies on its existing accumulated knowledge, experimenting with new strategies relatively infrequently. Although the ε-greedy strategy does not rely purely on random selection, and it is more adaptive in the face of time-varying reward structures, it might still select an obviously suboptimal action during its random exploration trials.

A refinement to the ε-greedy strategy is to allow ε to vary systematically throughout the exercise, so that it is initially larger, allowing a lot of experimentation while the amount of accumulated knowledge about the relationships between actions and rewards is low. Then the hyperparameter is gradually reduced as more information becomes known and the benefit of additional exploration is diminished because the algorithm has already learned more about the best strategy. A popular approach is to use a decay factor, β, and set ε = β^(t − 1), where t is the trial number and 0 < β < 1.
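
A minimal sketch of ε-greedy selection with this decaying ε, using the same kind of hypothetical slot-machine setup as in the greedy example (the payout probabilities and the decay factor β = 0.99 are illustrative assumptions):

import random

true_payout_probs = [0.2, 0.5, 0.3]   # hypothetical payout probability of each slot machine
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]
beta = 0.99                           # decay factor, 0 < beta < 1

for t in range(1, 1001):
    epsilon = beta ** (t - 1)         # epsilon = beta^(t - 1): large early on, small later
    if random.random() < epsilon:
        machine = random.randrange(3)                        # explore: random machine
    else:
        machine = max(range(3), key=lambda m: estimates[m])  # exploit: best machine so far
    reward = 1.0 if random.random() < true_payout_probs[machine] else 0.0
    counts[machine] += 1
    estimates[machine] += (reward - estimates[machine]) / counts[machine]

print("Estimated payouts:", [round(e, 2) for e in estimates])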

7
Q

What is a Markov Decision Process?

A

Markov decision processes (MDPs) provide a simple setting for modeling environment dynamics, in which the environment changes based on the actions of the agent. MDPs are processes that have no memory, which means that only the current state, and not any of the previous states, is relevant for determining the most appropriate current action.

MDPs are useful for modeling decision making in cases where the agent is not fully in control of the evolution of the states. The use of MDPs establishes a straightforward framework where there are m states, denoted s, each of which will occur with a given probability, and there is also a fixed probability of being in a particular state s_{t+1} at time t + 1 given that the state at time t was s_t.
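
As a toy illustration of these fixed transition probabilities, the two-state, two-action MDP below is entirely hypothetical; it simply shows one common way to represent P(s_{t+1} | s_t, a_t) and sample from it.

import random

# Hypothetical MDP: transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "bull": {"buy":  [(0.7, "bull", 1.0), (0.3, "bear", -1.0)],
             "sell": [(0.7, "bull", 0.0), (0.3, "bear",  0.5)]},
    "bear": {"buy":  [(0.4, "bull", 1.0), (0.6, "bear", -1.0)],
             "sell": [(0.4, "bull", 0.0), (0.6, "bear",  0.5)]},
}

def step(state, action):
    # The next state depends only on the current state and action (no memory)
    outcomes = transitions[state][action]
    probabilities = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probabilities)[0]
    return next_state, reward

print(step("bull", "buy"))   # e.g. ('bull', 1.0) with probability 0.7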

8
Q

What are the two approaches to maximize Q?

A

(1) Value-based approaches that work on maximizing the reward by determining the best action in each state. Several algorithms such as the Temporal Difference Method, Q-learning, SARSA, and Deep Q-learning belong to this category.

(2) Policy-based approaches that find the optimal policy to map a state into an action. The policy gradient approach is a commonly used policy-based algorithm.

9
Q

What are the Bellman Equations?

A

The Bellman equations provide a framework for evaluating policies. The equations can be written either in terms of values, V, or in terms of action-values, Q. They establish an updating mechanism, or in other words a recursive algorithm that sets the current value or action-value function for a given policy equal to the reward from the action a at time t, plus the discounted value function of future rewards at time t + 1. The equations make it possible to pinpoint actions that will lead to greater future values of R for each given state.

Solving the Bellman equations gives the optimal policy, π*. If all the rewards and transition probabilities are known, then dynamic programming can be used to determine π*. But they are unlikely to be known in practice, in which case an iterative technique is required. There are two common ways to solve reinforcement learning problems iteratively: the Monte Carlo and Temporal Difference methods.
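
For reference, the Bellman expectation equations can be stated in their standard textbook form (the card does not write them out explicitly; γ is the discount factor and the expectation is taken under the policy π):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma \, V^{\pi}(S_{t+1}) \mid S_t = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma \, Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]

The optimal policy then selects, in each state, the action with the highest optimal action-value: \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).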

10
Q

What is the Monte Carlo Method?

A

In the Monte Carlo method, we conduct trials (termed episodes) repeatedly, assuming a random initialization for each state and estimating the average reward for each state over all episodes for each policy π. Through repeated trials, the algorithm develops an estimate of the expected value of taking action a in state s. Usually, we perform trials until we meet a convergence criterion, which depends on the problem domain or the environment. The resulting action-value function, Q(s, a), gives the value of taking action a in state s, and it is this action-value function that the Monte Carlo method works with directly.

The Monte Carlo method for reinforcement learning has two main disadvantages. First, it can only be used for processes that have a finite horizon (in other words, the rewards do not continue into the indefinite future). Second, convergence to the true reward for each state could be infeasibly slow, particularly for long episodes with many states.
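
A minimal (every-visit) Monte Carlo sketch is shown below. The small terminating “walk” environment, the random behaviour policy, and the discount factor are all illustrative assumptions; the point is simply that Q(s, a) is estimated by averaging the returns observed across complete episodes.

import random
from collections import defaultdict

# Hypothetical episodic environment: a short walk along states 0-3 that terminates at
# state 3, so the finite-horizon requirement of the Monte Carlo method is satisfied.
STATES, ACTIONS, GAMMA = [0, 1, 2, 3], ["left", "right"], 0.9

def step(state, action):
    next_state = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3   # (next state, reward, done)

returns_sum = defaultdict(float)   # sum of observed returns for each (state, action)
returns_count = defaultdict(int)   # number of observed returns for each (state, action)

for episode in range(5000):
    state, trajectory, done = random.choice(STATES[:-1]), [], False
    while not done and len(trajectory) < 20:
        action = random.choice(ACTIONS)              # behaviour policy: purely random
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    G = 0.0
    for s, a, r in reversed(trajectory):             # work backwards through the episode
        G = r + GAMMA * G                            # discounted return from this step onward
        returns_sum[(s, a)] += G
        returns_count[(s, a)] += 1

# Averaging the returns over all episodes gives the Monte Carlo estimate of Q(s, a)
Q = {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
print(round(Q[(2, "right")], 2), round(Q[(2, "left")], 2))   # "right" near the goal looks better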

11
Q

What is the Temporal Difference (TD) Method?

A

This technique begins by assuming that the agent is only aware of the states and possible actions that could be taken; the likelihood of each state and the transition probabilities are initially unknown. The objective is to estimate V_t(S), but unlike the Monte Carlo method, there is no need to wait until the end of the episode before assigning the rewards. Instead, the temporal difference method focuses on the reward following a transition from the current state to the next state only. Consequently, the temporal difference method can be used for very long episodes or those with infinite lifetimes.

The temporal difference method combines the immediate reward and the discounted value of the next state. Suppose that at time t the algorithm takes action a in state s and a reward of R is earned.
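
A minimal sketch of the standard TD(0) update implied here, assuming a learning rate alpha and discount factor gamma (the single transition used to call it is hypothetical):

def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    # TD(0): move V(s) towards the immediate reward plus the discounted value of the
    # next state, without waiting for the episode to end.
    td_target = reward + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

V = {}                                                   # value estimates, keyed by state
V = td_update(V, s="bear", reward=0.5, s_next="bull")    # one hypothetical transition
print(V)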

12
Q

What is the Curse of Dimensionality and Neural Network Approximation?

A

For the methods discussed so far in the chapter, we are still able to define the value function and action-value function as lookup tables, where the values are stored in a table indexed by their states and actions. This is possible for simple reinforcement learning problems with a small number of actions and states. But it becomes impractical when the number of actions and states is very large, or even infinite, as in problems such as self-driving cars, which would require an infinitely large table to store all possible actions and states. This also leads to an exponentially growing computational cost.

In many real-world applications, there are simply too many states, making it impractical to tabulate them all, so a different approach is required. In such circumstances, the task of the model is to learn the value of each action as a function of the current state by averaging over many (but not all) possible future rewards arising after the action taken at this stage. Instead of calculating and then using a possibly infinitely large lookup table, we can estimate the Q value function for a given state and action with a significantly smaller number of parameters than the tabular approach would require us to store.

Neural networks can be used to estimate the complete table from the observations that are available. This approach is called deep reinforcement learning. Here, the Q value function is estimated by training a neural network. The gradient descent method is used to estimate the weights of the network by minimizing a loss function, namely the sum of the squared differences between the target Q values and the neural network’s estimates. Once the network is trained, it can be used to generate Q values for any input set of state variables, replacing the need to generate a very large lookup table for the Q values.
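
A minimal sketch of this idea using PyTorch as one possible framework. The network size, state dimension, and the single-transition “training step” below are illustrative assumptions, not a full deep Q-learning implementation.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # illustrative sizes and discount factor

# A small neural network maps a state vector to one Q value per action,
# replacing the (possibly enormous) lookup table.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(state, action, reward, next_state):
    # Target: immediate reward plus the discounted best Q value of the next state
    with torch.no_grad():
        target = reward + GAMMA * q_net(next_state).max()
    estimate = q_net(state)[action]
    loss = (estimate - target) ** 2          # squared difference, minimized by gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One hypothetical transition (random tensors stand in for real state observations)
s, s_next = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
print(train_step(s, action=1, reward=0.5, next_state=s_next))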
