13 Reinforcement Learning Flashcards
Reinforcement learning
Reinforcement learning (RL) is a form of unsupervised learning: the agent receives no examples and starts with no model of the environment and no utility information. It must learn by trial and error, using the rewards (reinforcement) it receives to guide learning.
Examples:
• Learning a game by making moves until the game is won or lost: the reward arrives only at the end
• Learning to ride a bicycle without any assistance: rewards are received more frequently
RL can be seen as encompassing all of AI: an agent must learn to behave in an unknown environment.
Variations of RL
Accessible environment (agent can use percepts) vs. inaccessible (must have some model).
Agent may have some initial knowledge, or not have any domain model.
Rewards can be received only in terminal states, or in any state.
Rewards can be part of the actual utility, or just hint at the actual utility.
The agent can be a passive (watching) or an active (exploring) learner.
RL uses results from sequential decision processes.
Sequential decision processes
In a sequential decision process, the agent’s utility depends on a sequence of decisions. Such problems involve utility (of states) and uncertainty (of action outcomes).
The agent needs a policy that tells it what to do in any state it might reach: a = π(s). An optimal policy π*(s) is a policy that gives the highest expected utility.
Example
- 4x3 environment
- Agent starts in (1, 1)
- Probabilistic transition model
- The sequence [Up, Up, Right, Right, Right] reaches the +1 state at (4, 3) with probability only 0.8^5 ≈ 0.33
- Terminal rewards are +1 at (4, 3) and -1 at (4, 2)
- All other states have reward -0.04
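To make the example concrete, here is a minimal Python sketch of how the 4x3 world could be encoded (names such as REWARD, STATES and transitions are illustrative, not from the notes). It assumes the usual transition model in which an action moves in the intended direction with probability 0.8 and perpendicular to it with probability 0.1 each way:

```python
REWARD = {(4, 3): +1.0, (4, 2): -1.0}   # terminal rewards
STEP_REWARD = -0.04                     # reward in every non-terminal state
TERMINALS = {(4, 3), (4, 2)}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {(2, 2)}  # (2, 2) is a wall

MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                 'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def R(s):
    """Reward received in state s."""
    return REWARD.get(s, STEP_REWARD)

def transitions(s, a):
    """Return a list of (probability, next_state) pairs for P(s' | s, a)."""
    if s in TERMINALS:
        return []
    def go(direction):
        dx, dy = MOVES[direction]
        s2 = (s[0] + dx, s[1] + dy)
        return s2 if s2 in STATES else s   # bumping into a wall or the edge: stay put
    left, right = PERPENDICULAR[a]
    return [(0.8, go(a)), (0.1, go(left)), (0.1, go(right))]

# Probability that [Up, Up, Right, Right, Right] goes exactly as intended:
print(0.8 ** 5)   # 0.32768, i.e. roughly 0.33
```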
Markov Decision Processes (MDP)
An MDP is a sequential decision process with the following characteristics:
- Fully observable environment (agent knows where it is at any time)
- State transitions are Markovian, i.e. P(s’ | s,a) (probability of reaching s’ from s by action a) depends only on s and a, not earlier state history
- Agent receives a reward R(s) in each state s
- The total utility U(s) of s is the sum of the rewards received, from s until a terminal state is reached
Optimal policy and utility of states
The utility of a state s depends on rewards received but also on the policy π followed:
U^π(s) = E[ Σ_t R(S_t) ],  where S_0 = s and the expectation is taken over the state sequences generated by following π
Of all the possible policies the agent could follow, one gives the highest expected utility:
π* = argmax_π U^π(s)
This is the optimal policy. Under certain assumptions, it is independent of the starting state. For an MDP with a known transition model and reward function, and assuming the agent follows the optimal policy, we can calculate the utility U(s) of each state s.
Knowing U(s) allows the agent to select the optimal action: the one that maximizes the expected utility of the next state, π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s').
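As an illustration, assuming utilities U(s) and a transitions(s, a) function like the one in the grid-world sketch above, optimal action selection is a maximum-expected-utility computation (the function names are illustrative):

```python
def expected_utility(s, a, U, transitions):
    """Expected utility of doing action a in state s, given state utilities U."""
    return sum(p * U[s2] for p, s2 in transitions(s, a))

def best_action(s, U, actions, transitions):
    """pi*(s) = argmax_a  sum_s' P(s' | s, a) U(s')."""
    return max(actions, key=lambda a: expected_utility(s, a, U, transitions))
```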
Bellman equations
We need to be able to calculate the utilities U(s) in order to define the optimal policy. We can exploit the dependence between states: the utility of a state s is the reward R(s) plus the maximum expected utility of the next state:
U(s) = R(s) + max_a Σ_{s'} P(s' | s, a) U(s')
This is the Bellman equation. There are n equations for n states, containing utilities U(s) as n unknowns. Solving the equations yields the utilities U(s).
Value iteration
The Bellman equations are nonlinear (because of the max operator) and cannot be solved with linear algebra. An iterative approach can be used instead:
• Start with arbitrary initial values
• Calculate the right-hand sides of the equations
• Plug the resulting values into the left-hand sides U(s)
• Iterate until the values stabilize (within a margin)
This VALUE-ITERATION algorithm is guaranteed to converge to a unique solution.
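A minimal value-iteration sketch under the same assumptions as the grid-world code above (states, actions, R and transitions are passed in; epsilon is the stabilization margin; all names are illustrative):

```python
def value_iteration(states, actions, R, transitions, terminals, epsilon=1e-4):
    """Iterate Bellman updates until all utilities change by less than epsilon."""
    U = {s: 0.0 for s in states}               # arbitrary initial values
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            if s in terminals:
                U_new[s] = R(s)                # a terminal's utility is just its reward
            else:
                U_new[s] = R(s) + max(
                    sum(p * U[s2] for p, s2 in transitions(s, a))
                    for a in actions)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon:
            return U
```

With the grid-world definitions above, value_iteration(STATES, list(MOVES), R, transitions, TERMINALS) should converge to the utilities of the 4x3 world.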
Policy iteration
Instead of iterating on utilities U(s) and then deriving an optimal policy, we can iterate directly over policies:
Policy evaluation: given a policy π_i, calculate U_i, the utility of each state if π_i is followed.
Policy improvement: calculate a new policy π_{i+1} that in each state selects the action maximizing the expected utility of the successor states (MEU).
Repeat until the policy no longer changes.
This POLICY-ITERATION algorithm is guaranteed to converge and to produce an optimal policy.
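A sketch of policy iteration under the same assumptions; for brevity it uses simplified (iterative) policy evaluation with a fixed number of sweeps rather than an exact linear solve:

```python
import random

def policy_iteration(states, actions, R, transitions, terminals):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    pi = {s: random.choice(actions) for s in states if s not in terminals}
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the (linear) Bellman equations for pi.
        for _ in range(50):
            for s in states:
                if s in terminals:
                    U[s] = R(s)
                else:
                    U[s] = R(s) + sum(p * U[s2] for p, s2 in transitions(s, pi[s]))
        # Policy improvement: pick the MEU action in each state.
        unchanged = True
        for s in pi:
            def eu(a):
                return sum(p * U[s2] for p, s2 in transitions(s, a))
            best = max(actions, key=eu)
            if eu(best) > eu(pi[s]):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```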
Reinforcement learning of MDP
We could find an optimal policy for an MDP if we knew the transition model P(s' | s, a). However, an agent in an unknown environment does not know the transition model, nor does it know in advance what rewards it will receive in new states.
We want the agent to learn to behave rationally in an unsupervised process. The purpose of RL is to learn the optimal policy based only on received rewards.
Different RL agent designs
• Utility-based agents learn a utility function on states and use it to select actions that maximize expected utility. This also requires a model of where actions lead.
• Q-learning agents learn an action-utility function (Q-function) giving the expected utility of taking a given action in a given state. They can select actions without knowing where the actions lead, at the expense of not being able to look ahead (see the update sketch after this list).
• Reflex agents learn a policy that maps directly from states to actions, i.e. π*.
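As an illustration of the Q-learning design, here is a sketch of the standard Q-learning update rule (alpha is a learning rate and gamma a discount factor; both values are illustrative, and the notes' own formulation is undiscounted):

```python
from collections import defaultdict

Q = defaultdict(float)              # Q[(state, action)], initialised to 0
alpha, gamma = 0.1, 1.0             # learning rate and discount (illustrative values)

def q_update(s, a, r, s2, actions):
    """One update after observing reward r in s and the transition s -> s2:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s2,a') - Q(s,a))."""
    best_next = max(Q[(s2, a2)] for a2 in actions) if actions else 0.0  # 0 if s2 is terminal
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```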
Direct utility estimation
In passive learning, the agent's policy is fixed; it only needs to learn how good the policy is. The agent runs a number of trials, starting in (1, 1) and continuing until it reaches a terminal state. The utility of a state is the expected total remaining reward (reward-to-go). Each trial provides a sample of the reward-to-go for each visited state. The agent keeps a running average for each state, which will converge to the true value. This is called direct utility estimation.
Direct utility estimation converges slowly.
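A minimal sketch of direct utility estimation, assuming each completed trial is recorded as a list of (state, reward) pairs from the start state through to a terminal state:

```python
from collections import defaultdict

def direct_utility_estimation(trials):
    """Estimate U(s) as the running average of the observed reward-to-go
    (total remaining reward) over every visit to s in every trial."""
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:
        rewards = [r for _, r in trial]
        for t, (s, _) in enumerate(trial):
            reward_to_go = sum(rewards[t:])
            totals[s] += reward_to_go
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```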
Exploiting state dependencies
Direct utility estimation fails to exploit the fact that the utilities of states are dependent, as expressed by the Bellman equations.
Learning can be sped up by using these dependencies.
Direct utility estimation can be seen as searching a hypothesis space that is too large, one containing many hypotheses that violate the Bellman equations.
Adaptive Dynamic Programming (ADP)
An ADP agent uses dependencies between states to speed up value estimation. It follows a policy π and can use observed transitions to incrementally build the transition model P(s’ | s, π(s)). It can then plug the learned transition model and observed rewards R(s) into the Bellman equations to get U(s). The equations are linear because there is no max operator, and therefore easier to solve. The result is U(s) for the given policy π.
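A sketch of how a passive ADP agent might be organised under these assumptions: transition probabilities are estimated from counts of observed transitions under the fixed policy, and the resulting (linear) Bellman equations are then solved, here simply by iterating them (an exact linear solve would also work):

```python
from collections import defaultdict

class PassiveADPAgent:
    """Learns P(s' | s, pi(s)) from observed transitions and evaluates the fixed policy."""

    def __init__(self, states, terminals, R):
        self.states, self.terminals, self.R = states, terminals, R
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[s][s2]
        self.U = {s: 0.0 for s in states}

    def observe(self, s, s2):
        """Record one observed transition s -> s2 under the fixed policy."""
        self.counts[s][s2] += 1

    def update_utilities(self, sweeps=100):
        """Iterate the linear Bellman equations for the estimated model."""
        for _ in range(sweeps):
            for s in self.states:
                if s in self.terminals or not self.counts[s]:
                    self.U[s] = self.R(s)
                    continue
                n = sum(self.counts[s].values())
                expected = sum(c / n * self.U[s2] for s2, c in self.counts[s].items())
                self.U[s] = self.R(s) + expected
        return self.U
```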
Temporal Difference (TD) learning
TD is another passive utility value learning algorithm using Bellman equations. Instead of solving the equations, TD uses the observed transitions to adjust the utilities of the observed states to agree with Bellman. TD uses a learning rate parameter α to select the rate of change of utility adjustment. TD does not need a transition model to perform its updates, only the observed transitions.
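A one-function sketch of the TD update after an observed transition s -> s' with reward r received in s (undiscounted, matching the formulation in these notes; alpha is the learning rate):

```python
def td_update(U, s, r, s2, alpha=0.1):
    """U(s) <- U(s) + alpha * (r + U(s') - U(s))."""
    u_s, u_s2 = U.get(s, 0.0), U.get(s2, 0.0)
    U[s] = u_s + alpha * (r + u_s2 - u_s)
    return U
```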
Passive learning
• Direct utility estimation
• Adaptive Dynamic Programming
• Temporal Difference learning
Active learning
While a passive RL agent executes a fixed policy π, an active RL agent has to decide which actions to take. An active RL agent extends a passive one, e.g. the passive ADP agent, with the following:
• It needs to learn a complete transition model for all actions (not just for π), using passive ADP learning
• Utilities need to reflect the optimal policy π*, as expressed by the Bellman equations
• The equations can be solved with the VALUE-ITERATION or POLICY-ITERATION methods described earlier
• The action to execute is selected as the optimal (maximizing) one
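A sketch of what the active extension might look like: estimate P(s' | s, a) for all actions from observed counts, recompute utilities with the value_iteration sketch given earlier, and pick the maximizing action (a real agent would also need an exploration strategy, which is omitted here; all names are illustrative):

```python
from collections import defaultdict

class ActiveADPAgent:
    """Learns a full transition model P(s' | s, a) and acts greedily on the
    utilities obtained by value iteration on that learned model."""

    def __init__(self, states, actions, terminals, R):
        self.states, self.actions = states, actions
        self.terminals, self.R = terminals, R
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s2]
        self.U = {s: 0.0 for s in states}

    def observe(self, s, a, s2):
        """Record one observed outcome of taking action a in state s."""
        self.counts[(s, a)][s2] += 1

    def learned_transitions(self, s, a):
        """Estimated (probability, next_state) pairs for P(s' | s, a)."""
        n = sum(self.counts[(s, a)].values())
        return [(c / n, s2) for s2, c in self.counts[(s, a)].items()] if n else []

    def choose_action(self, s):
        # Re-solve for U on the learned model, then act greedily (MEU).
        self.U = value_iteration(self.states, self.actions, self.R,
                                 self.learned_transitions, self.terminals)
        return max(self.actions,
                   key=lambda a: sum(p * self.U[s2]
                                     for p, s2 in self.learned_transitions(s, a)))
```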