Chapter 8 Flashcards
reinforcement learning
inspired by operant conditioning
contrasts with the supervised-learning method
requires no labeled training examples
an AGENT—the learning program—performs ACTIONS in an ENVIRONMENT (usually a computer simulation) and occasionally receives REWARDS from the environment. These intermittent rewards are the only feedback the agent uses for learning.
The promise of reinforcement learning
the agent can learn flexible strategies on its own simply by performing actions in the world and occasionally receiving rewards (that is, reinforcement) without humans having to MANUALLY WRITE RULES or DIRECTLY TEACH THE AGENT EVERY POSSIBLE CIRCUMSTANCE
state
the state of an agent at a given time is the agent’s perception of its current situation.
In the purest form of reinforcement learning, the learning agent doesn’t remember its previous states.
What does the algorithm do?
tells the agent how to learn from its experiences.
Reinforcement learning occurs by
having the agent take actions over a series of learning EPISODES, each of which consists of some number of ITERATIONS.
What does the agent learn?
upon receiving a reward, the agent learns only about:
the STATE and the ACTION that immediately preceded the reward
the value of an action
The value of action A in state S is a number reflecting the agent’s current prediction of how much reward it will EVENTUALLY obtain if, when in state S, it performs action A, AND THEN CONTINUES PERFORMING HIGH-VALUE ACTIONS
the goal of reinforcement learning
for the agent to learn values that are good predictions of upcoming rewards (assuming that the agent keeps doing the right thing after taking the action in question)
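The cards don’t write out how the values are actually adjusted; as one standard formalization (the one-step Q-learning update, with learning rate α and discount factor γ introduced here as assumed symbols, not taken from the cards):

```latex
Q(S, A) \leftarrow Q(S, A) + \alpha \,\bigl[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \bigr]
```

where R is any reward received and S' is the state reached after taking action A in state S.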
Q-table
a table of states, actions, and values
Given a state, each action in that state has a numerical value; these values will change, becoming more accurate predictions of upcoming rewards, as Rosie continues to learn.
Here, reinforcement learning is the gradual updating of values in the Q-table.
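As a concrete sketch, a Q-table can be as simple as a dictionary mapping (state, action) pairs to numbers; the states and actions below are invented placeholders, not the chapter’s actual “Rosie” setup:

```python
# Q-table: maps (state, action) pairs to value estimates.
# State and action names here are illustrative placeholders.
states = ["one_step_away", "next_to_can", "facing_wall"]
actions = ["Forward", "Turn", "Bend Down"]

# Start every value at 0; learning gradually updates these numbers so
# they become better predictions of upcoming reward.
q_table = {(s, a): 0.0 for s in states for a in actions}

# Example of a value being updated after a reward (hypothetical number):
q_table[("next_to_can", "Bend Down")] = 10.0
```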
essence of Q-learning
Rosie can now learn something about the action (Forward) she took in the immediately previous state (one step away).
so the agent needs a memory of its previous state and action (unlike the “purest form” above).
exploration versus exploitation balance
Deciding how much to explore new actions and how much to exploit (stick with) actions already known to have high values.
A naive strategy would be to always choose the action with the highest value for the current state in the Q-table.
Achieving the right balance is a core issue for making reinforcement learning successful.
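The chapter doesn’t prescribe a particular method for striking this balance, so treat this epsilon-greedy sketch as just one common illustrative option: exploit the highest-valued action most of the time, but explore a random action with small probability epsilon.

```python
import random

def choose_action(q_table, state, actions, epsilon=0.1):
    """Epsilon-greedy selection: mostly exploit, occasionally explore."""
    if random.random() < epsilon:
        # Explore: try a random action, even if its current value is low.
        return random.choice(actions)
    # Exploit: pick the action with the highest current value for this state.
    return max(actions, key=lambda a: q_table[(state, a)])
```

With epsilon = 0 this collapses into the naive always-take-the-best strategy; larger epsilon means more exploration.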
two major stumbling blocks that might arise in extrapolating our “training Rosie” example to reinforcement learning in real-world tasks.
- the Q-table
> in complex real-world tasks, it’s impossible to define a small set of “states” that could be listed in a table, so learning via a Q-table like the one in the “Rosie” example is out of the question.
> For this reason, most modern approaches to reinforcement learning use a neural network instead of a Q-table.
- the difficulty, in the real world, of actually carrying out the learning process over many episodes using a real robot
> You just wouldn’t have enough time.
> Moreover, you might risk the robot damaging itself by choosing the wrong action.
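A rough sketch of the idea behind swapping the table for a network: a function maps a state’s feature vector to estimated values for all actions, so states never have to be enumerated. The tiny one-hidden-layer numpy network below is purely illustrative, not the chapter’s (or any real system’s) architecture:

```python
import numpy as np

class TinyQNetwork:
    """Maps a state feature vector to one estimated value per action."""
    def __init__(self, n_features, n_actions, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(n_features, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, n_actions))

    def action_values(self, state_features):
        hidden = np.maximum(0, state_features @ self.w1)  # ReLU hidden layer
        return hidden @ self.w2                           # one value per action

# Usage: any state that can be described by features gets value estimates,
# with no need to list every possible state in a table.
net = TinyQNetwork(n_features=8, n_actions=3)
values = net.action_values(np.random.rand(8))
```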
the best-known reinforcement-learning successes have been in the domain of game playing.
episode of Q-learning
at each iteration the learning agent does the following:
- it figures out its current state
- looks up that state in the Q-table
- uses the values in the table to choose an action
- performs that action, possibly receives a reward
- the learning step: updates the values in its Q-table.
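Putting one episode’s steps together as a minimal Python sketch, assuming a gym-style environment object with reset() and step() methods, plus a learning rate alpha and discount gamma; all of these are illustrative assumptions, not details from the cards:

```python
import random

def run_episode(env, q_table, actions, alpha=0.5, gamma=0.9,
                epsilon=0.1, max_iterations=100):
    """One learning episode: figure out state, look it up, choose, act, update."""
    state = env.reset()  # figure out the current (starting) state
    for _ in range(max_iterations):
        # Look up the state in the Q-table and choose an action
        # (epsilon-greedy, as in the earlier sketch).
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q_table[(state, a)])
        # Perform the action; the environment returns the next state,
        # a reward (often 0), and whether the episode is over.
        next_state, reward, done = env.step(action)
        # Learning step: nudge the value of (state, action) toward the
        # reward plus the best value predicted from the next state.
        best_next = max(q_table[(next_state, a)] for a in actions)
        q_table[(state, action)] += alpha * (
            reward + gamma * best_next - q_table[(state, action)])
        if done:
            break
        state = next_state
    return q_table
```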