Week 5 Flashcards

1
Q

When is learning preferred over mini-max search?

A

- When good heuristics cannot be found
- When the branching factor is very high
- When equilibria are hard to compute
- Games of incomplete information
- General-sum games
- Games with many players

2
Q

Describe learning through “self play”

A

Two learners receive states from a game environment and return actions. They are rewarded based on the quality of their actions and learn to optimise that reward.

Two independent learning agents play against each other.

Typically starting as naive players, they incrementally improve in an “arms race”.

Each player in turn
1. Takes an action
2. Observes the new game position and any rewards (win/lose, etc)
3. Strengthens moves leading to wins; suppresses those leading to losses.
4. By playing many games, learns to become a strong player (we hope)

3
Q

What happens if you have a perfect learner in self-play?

A

Self-play won’t work: the other learner will never win a game and thus never learns how to win.

4
Q

What makes self-play hard? (characteristic of reinforcement learning)

A

Feedback from the outcome of the game reveals that you did something right or wrong, but not what you did right or wrong.

A sequence of moves is required before the outcome is revealed, so you don’t know which move was crucial to the outcome.

5
Q

What is reinforcement learning?

A

Learning from rewards:
1. Observe the situation (the state)
2. Perform an action
3. Receive a reward, positive or negative
By doing this repeatedly - learn to:
- maximise the positive rewards
- minimise the negative rewards

Learn an effective policy simply by taking actions and observing rewards
- Learning by trial and error
- No teacher

6
Q

What type of learning should be used when the best action/response is not known?

A

Reinforcement learning

There may be an appropriate action for every situation, but:
- no expert can say what it is
- no labelled training data exists
The best actions need to be discovered, by trial and error.

7
Q

What is a policy in reinforcement learning?

A

What to do in any situation, i.e. what action to take in any situation or state.

8
Q

What is the exploration/exploitation trade-off?

A

Exploration - Find new states or new actions which may lead to higher rewards

Exploitation - Perform those actions which have worked in the past (use the knowledge already acquired)

We need learning algorithms to do both

9
Q

What is an epsilon-greedy policy?

A

An approach to balancing exploration and exploitation

0 <= epsilon <= 1

With probability 1 - epsilon, play the believed best action

With probability epsilon, play a random action

As the agent learns and improves, it can be beneficial to decrease epsilon (less exploration)

Theoretical results suggest decreasing epsilon as O(1/t) (see notes for examples)
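
A minimal sketch of epsilon-greedy action selection with the suggested O(1/t) decay, in Python. The value estimates V and the step counter t are assumptions for illustration, not part of the notes.

```python
import random

def epsilon_greedy(V, t, eps0=1.0):
    """Pick an action index from value estimates V using epsilon-greedy.

    epsilon decays as O(1/t): eps = eps0 / t, with t counted from 1.
    """
    eps = eps0 / t
    if random.random() < eps:
        return random.randrange(len(V))                # explore: random action
    return max(range(len(V)), key=lambda i: V[i])      # exploit: believed best action
```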

10
Q

What is immediate reward reinforcement learning?

A

The agent can observe the current state of the world at time t

It chooses an action a_t to perform

It receives a reward r_t at time t, which is a (possibly stochastic, i.e. probabilistic) function of the state and the action

A “one-shot” game: the agent does not have to take further actions to receive the reward

11
Q

Describe the tabular reinforcement learning approach for games with no state

A

Maintain a table V(i), where:
- V(i) - an approximation to the average payoff when slot machine i is played
- The values of V(.) will be learned by playing the machines

To choose a machine, use epsilon-greedy:
With probability (1 - e): choose the machine argmax_i V(i)
With probability e: choose a machine at random

If at time t machine i was played with reward r_t, let a be a small, positive learning rate:
V[i] <- V[i] + a(r_t - V[i])

Only the value associated with the machine which was played is updated.

Could also update using:
V[i] <- V[i] + (r_t - V[i]) / t_i
where t_i is the number of times machine i has been played up to this point.

The second method is equivalent to a time-dependent learning rate a(t_i) = 1/t_i.
This maintains V[i] as the mean of the rewards received when i was played.
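
A sketch of this tabular bandit learning in Python, assuming a pull(i) function that returns the reward from playing machine i (an assumption for illustration). Both the constant learning rate and the 1/t_i (sample-mean) update are supported.

```python
import random

def play_bandits(pull, n_machines, steps=10000, eps=0.1, alpha=None):
    """Tabular value learning for a row of slot machines (no state).

    pull(i) - assumed function returning a (possibly random) reward for machine i
    alpha   - constant learning rate; if None, use 1/t_i (keeps V[i] as the sample mean)
    """
    V = [0.0] * n_machines        # V[i]: estimated average payoff of machine i
    counts = [0] * n_machines     # t_i: number of times machine i has been played
    for _ in range(steps):
        if random.random() < eps:                            # explore
            i = random.randrange(n_machines)
        else:                                                # exploit
            i = max(range(n_machines), key=lambda j: V[j])
        r = pull(i)
        counts[i] += 1
        rate = alpha if alpha is not None else 1.0 / counts[i]
        V[i] += rate * (r - V[i])  # only the played machine's value is updated
    return V
```

For example, with pull = lambda i: random.gauss(i, 1.0) and n_machines = 3, the learned V should come to favour the machine with the highest mean payoff.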

12
Q

When to use a constant learning rate for tabular learning

A

Use in a changing environment. A decreasing learning rate loses the ability to respond to changes.

13
Q

When to use a decreasing learning rate for tabular learning

A

Use in an unchanging environment. Produces a more precise estimate of the average reward.

14
Q

Do we always know whether an environment is changing or not?

A

No

15
Q

Explain the tabular learning approach for games with state

A

The agent is represented by a big table with either:

State-action values:
Q(s_t, a_t) - the expected reward for taking action a_t from state s_t at time t. Called Q-learning.

State values:
V(s_t) - the expected reward of a state s_t, assuming a policy for choosing the action from that state. Called state-value learning.
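
One possible way to hold such tables in code (a sketch, not from the notes): key the Q-table by (state, action) pairs and the V-table by state.

```python
from collections import defaultdict

# Q[(state, action)]: expected reward for taking that action from that state (Q-learning)
Q = defaultdict(float)

# V[state]: expected reward of the state under the current policy (state-value learning)
V = defaultdict(float)

def greedy_action(state, actions):
    """Look up the action with the highest estimated reward from this state."""
    return max(actions, key=lambda a: Q[(state, a)])
```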

16
Q

Why do we need to approximate the Q function?

A

There are too many states; it would take too long to learn.

Solutions for the video poker game:
- A (hand-coded) representation of poker hands which uses a single representation for equivalent hands
- Use a supervised learning function approximation
  - Input: a representation of the hand and the action
  - Desired output: the expected reward, using a neural network, SVM, linear or polynomial regression, etc.

17
Q

How to use a tabular approach after training?

A

First train, then use:
- Observe the state
- Determine all possible actions
- Look in the table for the action which produces the highest reward.

A “greedy” policy

18
Q

What is function approximation?

A

A supervised regression (not classification) model R(s, a|W)
- Uses a vector of learnable parameters (e.g. weights) W = (w_1, …, w_d)

This can perhaps generalise to unseen state-action pairs.

19
Q

What’s a method for function approximation?

A

Gradient descent

20
Q

What is gradient descent?

A

An iterative, local improvement algorithm, like stochastic hill-climbing, but using information about the most improving direction (minus the gradient in this case)
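
A minimal sketch of gradient descent minimising a mean-squared-error loss for the approximation R(s, a|W). A linear model W . x(s, a) is assumed purely for simplicity (the notes also mention neural networks, SVMs, etc.); features is a hypothetical function mapping a state-action pair to a feature vector.

```python
def gradient_descent_step(W, batch, features, lr=0.01):
    """One gradient-descent step on the mean squared error.

    batch    - list of (state, action, target_reward) examples
    features - assumed function: (state, action) -> list of feature values
    W        - current weight vector (same length as the feature vector)
    """
    grad = [0.0] * len(W)
    for s, a, target in batch:
        x = features(s, a)
        pred = sum(w * xi for w, xi in zip(W, x))   # R(s, a | W) = W . x
        err = pred - target
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(batch)    # d(MSE)/dW_i
    # step in the most improving direction: minus the gradient
    return [w - lr * g for w, g in zip(W, grad)]
```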

21
Q

Work through the aside on gradient ascent

A

See notes

22
Q

Why is immediate reward RL not applicable to game learning?

A

In extensive form games, a sequence of actions is required to receive a reward, but values are required for all decision nodes

How do you associate individual moves with final performance? (Called learning with delayed rewards)

23
Q

How do you learn when rewards are delayed (learning with delayed rewards)?

A

Play to the end of the game, then back up results to the moves made

Learn to predict expected future rewards
- using one-step look ahead

24
Q

What is the value of a state in learning with delayed rewards?

A

The current reward plus the predicted future reward from that state, assuming a given policy is followed from now on.

25
Q

What is the value of a move from a state in learning with delayed rewards?

A

The current reward plus the predicted future reward of making that move from that state, assuming a given policy is followed from now on.

26
Q

What is the future reward?

A

At time t, the future reward is
r_t + r_(t+1) + r_(t+2) + …

27
Q

What is the discounted future reward?

A

Rewards become less valuable the longer you have to wait for them, discounted by a factor 0 <= y <= 1.

At time t, the discounted future reward is
r_t + y*r_(t+1) + y^2*r_(t+2) + …

28
Q

What does the Q-table predict in learning with delayed rewards?

A
- The immediate reward
- The expected (discounted) reward, assuming the best actions are taken from this point onward

29
Q

What should the Q function be in learning with delayed rewards?

A

Q(s_t, a_t) = the immediate reward + an estimate of future rewards
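
The standard tabular Q-learning update implements exactly this: the target is the immediate reward plus the discounted value of the best next action. A sketch, assuming Q is a defaultdict(float) keyed by (state, action) as in the earlier table sketch; gamma is the discount factor y and alpha a small learning rate.

```python
def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning update after taking action a in state s, receiving reward r,
    and observing the next state s_next with legal actions next_actions."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    target = r + gamma * best_next              # immediate reward + estimated future reward
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move the table entry towards the target
```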

30
Q

What does Q(s,a) converge to?

A
- The expected future reward one gets taking action a from state s
- discounted by the time to get it
- assuming the best action is chosen from this point forward

31
Q

Under what condition is Q-Learning proven to converge to the correct answer?

A

If every state is visited and every action is used an infinite number of times, i.e. in the limit of infinite learning.

32
Q

How does Q-learning work in a game setting?

A
1. When a player makes a move, the updated state is returned only after the opponent makes a move
2. When either player gets a reward, the opponent gets minus that reward
   - which is associated with the last move each made

33
Q

How would we train Q-learning to play a game?

A

Have two learning agents play games against each other many, many times.

Only after the learning phase is over can the agents be used to play against real players.

34
Q

Describe how function approximation is used with Q-Learning

A

There are too many nodes in a typical game tree.

As before, we can define a parameterised approximation function, where W is a vector of parameters.

Then use gradient descent to minimise the prediction error.
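
A sketch of how the two pieces fit together, under the same linear-model assumption as the earlier gradient-descent sketch: the regression target for each observed transition is the Q-learning target (reward plus discounted best next prediction). q_approx and features are illustrative helpers, not from the notes.

```python
def q_approx(W, features, s, a):
    """Parameterised approximation Q(s, a | W), here a linear model (an assumption)."""
    return sum(w * xi for w, xi in zip(W, features(s, a)))

def td_target(W, features, r, s_next, next_actions, gamma=0.9):
    """Q-learning target: immediate reward + discounted best predicted future value."""
    best_next = max((q_approx(W, features, s_next, a2) for a2 in next_actions), default=0.0)
    return r + gamma * best_next

# Gradient descent (as in the earlier sketch) then minimises
# (td_target(...) - q_approx(W, features, s, a))**2 over observed transitions.
```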

35
Q

What are some problems with Q-Learning?

A

Q-Learning can still be too slow. A heuristic speed-up is often used (this is TD(lambda)).

36
Q

What is Q-Learning with a heuristic speed up?

A

TD(Lambda)

37
Q

What speed is TD(0)?

A

It is very slow, but is provably correct

Initially only the state which led immediately to a reward has its value updated. Only through many sequences does learning work its way backwards towards the initial states

Why not update the value of every state in the sequence that led to the reward?

38
Q

Describe TD(lambda) learning

A

When a reward is received, assign credit (or blame) to the actions:
1. the one just previous, with a weight of 1
2. the one before that, with weight (discount * lambda)
3. the one before that, with weight (discount * lambda) squared, and so on

Here lambda is a parameter between 0 and 1 (a sketch using eligibility traces follows below)
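
A sketch of this credit assignment using eligibility traces, one common way TD(lambda) is implemented: every visited state keeps an eligibility that decays by (discount * lambda) per step, so states are credited with weights 1, (discount * lambda), (discount * lambda) squared, and so on. V is assumed to be a defaultdict(float); names are illustrative.

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Update state values V from one episode of (state, reward, next_state) steps."""
    e = defaultdict(float)                     # eligibility of each visited state
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        e[s] += 1.0                            # the just-visited state becomes fully eligible
        for n in e:                            # spread the error backwards by eligibility
            V[n] += alpha * delta * e[n]
            e[n] *= gamma * lam                # older states get geometrically less credit
    return V
```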

39
Q

What does e(n) denote?

A

The “eligibility” of node n. This is related to how recently in the sequence the node n was visited

40
Q

Using what method do rats navigate / ants find the shortest path?

A

TD(lambda)