Week 5 Flashcards
When is learning preferred over mini-max search?
- When good heuristics cannot be found
- When the branching factor is very high
- When equilibria are hard to compute
- In games of incomplete information
- In general-sum games
- In games with many players
Describe learning through “self play”
Two independent learning agents play against each other, receiving states from the game environment and returning actions. They are rewarded based on the quality of their actions and learn to optimise that reward.
Typically starting from naive play, they incrementally improve in an “arms race”
Each player in turn
1. Takes an action
2. Observes the new game position and any rewards (win/lose, etc)
3. Strengthens moves leading to wins; suppresses those leading to losses.
4. By playing many games, learns to become a strong player (we hope)
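A minimal sketch of this self-play loop in Python (the game/agent interface used here is hypothetical, not from the notes):
```python
def self_play_episode(game, agents):
    """Play one game between two learning agents and update both from the outcome."""
    state = game.reset()
    histories = {0: [], 1: []}              # (state, action) pairs made by each player
    player = 0
    while not game.is_over(state):
        action = agents[player].choose_action(state)
        histories[player].append((state, action))
        state = game.step(state, action, player)
        player = 1 - player                 # players take turns
    for p in (0, 1):
        reward = game.outcome(state, p)     # e.g. +1 win, -1 loss, 0 draw
        for s, a in histories[p]:
            agents[p].update(s, a, reward)  # strengthen winning moves, suppress losing ones
```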
What happens if you have a perfect learner in self-play?
Self-play won’t work: the other learner will never win a game and thus never learns how to win
What makes self-play hard? (characteristic of reinforcement learning)
Feedback from the outcome of the game reveals that you did something right or wrong, but not what you did right or wrong
A whole sequence of moves is played before the outcome is revealed, so you don’t know which move was crucial to the outcome
What is reinforcement learning?
Learning from rewards:
1. Observe the situation (the state)
2. Perform an action
3. Receive a reward, positive or negative
By doing this repeatedly, the agent learns to:
- maximise the positive rewards
- minimise the negative rewards
Learn an effective policy simply by taking actions and observing rewards
- Learning by trial and error
- No teacher
What type of learning should be used when the best action/response is not known?
Reinforcement learning
There may be an appropriate action for every situation, but:
- no expert can say what it is
- no labelled training data exists
The best actions need to be discovered by trial and error.
What is a policy in reinforcement learning?
A mapping from situations (states) to actions: what action to take in any state
What is the exploration/exploitation trade-off?
Exploration - Find new states or new actions which may lead to higher rewards
Exploitation - Perform those actions which have worked in the past (use the knowledge already acquired)
We need learning algorithms to do both
What is an epsilon-greedy policy?
An approach to balancing exploration and exploitation
0 <= epsilon <= 1
with probability 1 - epsilon play the believed best action
with probability epsilon play a random action
As the agent learns and improves it can be beneficial to decrease epsilon (less exploration)
Theoretical reasons suggest decreasing epsilon as O(1/t) (see notes for examples)
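A minimal sketch of epsilon-greedy action selection (the list V of estimated action values and the decay schedule are illustrative assumptions):
```python
import random

def epsilon_greedy(V, epsilon):
    """V: list of estimated action values. Returns an action index."""
    if random.random() < epsilon:
        return random.randrange(len(V))                # explore: random action
    return max(range(len(V)), key=lambda i: V[i])      # exploit: believed best action

def decayed_epsilon(epsilon0, t):
    """Optional O(1/t) decay as the agent improves (t = 1, 2, 3, ...)."""
    return epsilon0 / t
```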
What is immediate reward reinforcement learning?
The agent can observe the current state of the world at time t
It chooses an action a_t to perform
It receives a reward r_t at time t, which is a (possibly stochastic, i.e. probabilistic) function of the state and the action
A “one-shot” game: the agent does not have to play any further actions to receive the reward
Describe the tabular reinforcement learning approach for games with no state
Maintain a table V(i), where:
- V(i) - approximation to the average payoff when slot machine i is played
- The values of V(i) will be learned by playing the machines
To choose a machine, use epsilon-greedy:
With probability (1 - epsilon): choose the machine argmax_i V(i)
With probability epsilon: choose a machine at random
If at time t, machine i was played and received reward r_t, and alpha is a small, positive learning rate:
V[i] <- V[i] + alpha(r_t - V[i])
Only the value associated with the machine which was played is updated
Could also update using:
V[i] <- V[i] + (r_t - V[i])/t_i
where t_i is the number of times machine i has been played up to this point
The second method is equivalent to a time-dependent learning rate alpha(t_i) = 1/t_i
This maintains V[i] as the mean of the rewards when i was played.
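A sketch of this stateless tabular learner in Python, assuming a hypothetical pull(i) call that returns the reward from machine i:
```python
import random

def run_bandit(pull, n_machines, steps, epsilon=0.1, alpha=None):
    """If alpha is None, use the 1/t_i rule (running mean); otherwise a constant learning rate."""
    V = [0.0] * n_machines        # V[i]: estimated average payoff of machine i
    counts = [0] * n_machines     # t_i: how many times machine i has been played
    for _ in range(steps):
        if random.random() < epsilon:
            i = random.randrange(n_machines)                # explore
        else:
            i = max(range(n_machines), key=lambda j: V[j])  # exploit
        r = pull(i)
        counts[i] += 1
        rate = alpha if alpha is not None else 1.0 / counts[i]
        V[i] += rate * (r - V[i])  # only the played machine's value is updated
    return V
```
For example, run_bandit(lambda i: random.gauss(means[i], 1.0), 3, 10000) would estimate the mean payoff of three Gaussian machines with made-up means.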
When to use a constant learning rate for tabular learning
Use a constant rate in a changing environment: a decreasing learning rate loses the ability to respond to changes.
When to use a decreasing learning rate for tabular learning
Use a decreasing rate in an unchanging environment: it produces a more precise estimate of the average reward.
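A small illustration of the difference (the reward distributions and the alpha value are made-up examples):
```python
import random

def track(rewards, constant_alpha=None):
    """Estimate the mean reward online with either a constant or a 1/t learning rate."""
    v, t = 0.0, 0
    for r in rewards:
        t += 1
        rate = constant_alpha if constant_alpha is not None else 1.0 / t
        v += rate * (r - v)
    return v

# A changing environment: the machine's true mean payoff jumps from 0 to 5 halfway through.
rewards = [random.gauss(0, 1) for _ in range(1000)] + [random.gauss(5, 1) for _ in range(1000)]
print(track(rewards, constant_alpha=0.1))  # constant rate tracks the change (close to 5)
print(track(rewards))                      # 1/t rate gives the overall mean (about 2.5)
```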
Do we always know whether an environment is changing or not?
No
Explain the tabular learning approach for games with state
The agent is represented by a big table, with either:
State-action values:
Q(s_t, a_t) - the expected reward for taking action a_t from state s_t at time t. Called Q-learning
State values:
V(s_t) - the expected reward of state s_t, assuming a policy for choosing the action from that state. Called state-value learning
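A minimal Q-table sketch for an immediate-reward game with state, assuming the same running update rule as the stateless case (the notes do not spell out the stateful update here):
```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]: estimated expected reward of action a in state s
alpha = 0.1              # small, positive learning rate (illustrative value)

def update(state, action, reward):
    Q[(state, action)] += alpha * (reward - Q[(state, action)])

def best_action(state, actions):
    """Greedy choice; combine with epsilon-greedy for exploration."""
    return max(actions, key=lambda a: Q[(state, a)])
```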
Why do we need to approximate the Q function?
There are too many states, so it would take too long to learn a value for each one
Solutions for the video poker game:
- A (hand-coded) representation of poker hands which uses a single representation for equivalent hands
- Use a supervised-learning function approximation:
  - Input: a representation of the hand and the action
  - Desired output: the expected reward
  - Using a neural network, SVM, linear or polynomial regression, etc.
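A sketch of the simplest of these options, a linear function approximator trained online by gradient descent (the feature encoding of the hand and action is left as a hypothetical encode step):
```python
import numpy as np

class LinearQApproximator:
    """Approximates the expected reward of (hand, action) features with linear regression."""

    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)   # one weight per feature
        self.lr = lr                    # learning rate for gradient descent

    def predict(self, features):
        return float(self.w @ features)

    def update(self, features, reward):
        # One stochastic gradient descent step on the squared error
        error = reward - self.predict(features)
        self.w += self.lr * error * features
```
Here features would come from a hypothetical encode(hand, action) function producing a fixed-length numpy array.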