Week 5 Flashcards

1
Q

When is learning preferred over mini-max search?

A

- When good heuristics cannot be found
- When the branching factor is very high
- When equilibria are hard to compute
- Games of incomplete information
- General-sum games
- Games with many players

2
Q

Describe learning through “self play”

A

Two learners receive states from a game environment and return actions. They are rewarded based on the quality of their actions and learn to optimise that reward.

Two independent learning agents play against each other.

Typically starting as naive players, they incrementally improve in an “arms race”.

Each player in turn
1. Takes an action
2. Observes the new game position and any rewards (win/lose, etc)
3. Strengthens moves leading to wins; suppresses those leading to losses.
4. By playing many games, learns to become a strong player (we hope)

3
Q

What happens if you have a perfect learner in self-play?

A

Self-play won’t work: the other learner will never win a game and thus never learns how to win.

4
Q

What makes self-play hard? (characteristic of reinforcement learning)

A

Feedback from the outcome of the game reveals that you did something right or wrong, but not what you did right or wrong.

A sequence of moves is required before the outcome is revealed, so you don’t know which move was crucial to the outcome.

5
Q

What is reinforcement learning?

A

Learning from rewards:
1. Observe the situation (the state)
2. Perform an action
3. Receive a reward, positive or negative
By doing this repeatedly - learn to:
- maximise the positive rewards
- minimise the negative rewards

Learn an effective policy simply by taking actions and observing rewards
- Learning by trial and error
- No teacher

6
Q

What type of learning should be used when the best action/response is not known?

A

Reinforcement learning

There may be an appropriate action for every situation, but:
- no expert can say what it is
- no labelled training data exists
The best actions need to be discovered, by trial and error.

7
Q

What is a policy in reinforcement learning?

A

What to do in any situation, i.e. what action to take in any situation or state.

8
Q

What is the exploration/exploitation trade-off?

A

Exploration - Find new states or new actions which may lead to higher rewards

Exploitation - Perform those actions which have worked in the past (use the knowledge already acquired)

We need learning algorithms to do both

9
Q

What is an epsilon-greedy policy?

A

An approach to balancing exploration and exploitation

0 <= epsilon <= 1

With probability 1 - epsilon, play the believed best action

With probability epsilon, play a random action

As the agent learns and improves, it can be beneficial to decrease epsilon (less exploration)

Theoretical results suggest decreasing epsilon as O(1/t) (see notes for examples)
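
A minimal sketch of epsilon-greedy action selection with the suggested O(1/t) decay, in Python. The value estimates V and the step counter t are assumptions for illustration, not part of the notes.

```python
import random

def epsilon_greedy(V, t, eps0=1.0):
    """Pick an action index from value estimates V using epsilon-greedy.

    epsilon decays as O(1/t): eps = eps0 / t, with t counted from 1.
    """
    eps = eps0 / t
    if random.random() < eps:
        return random.randrange(len(V))                # explore: random action
    return max(range(len(V)), key=lambda i: V[i])      # exploit: believed best action
```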

10
Q

What is immediate reward reinforcement learning?

A

The agent can observe the current state of the world at time t

It chooses an action a_t to perform

It receives a reward r_t at time t, which is a (possibly stochastic, i.e. probabilistic) function of the state and the action

A “one-shot” game: the agent does not have to take further actions to receive the reward

11
Q

Describe the tabular reinforcement learning approach for games with no state

A

Maintain a table V(i), where:
- V(i) - an approximation to the average payoff when slot machine i is played
- The values of V(.) will be learned by playing the machines

To choose a machine, use epsilon-greedy:
With probability (1 - e): choose the machine argmax_i V(i)
With probability e: choose a machine at random

If at time t machine i was played with reward r_t, let a be a small, positive learning rate:
V[i] <- V[i] + a(r_t - V[i])

Only the value associated with the machine which was played is updated.

Could also update using:
V[i] <- V[i] + (r_t - V[i]) / t_i
where t_i is the number of times machine i has been played up to this point.

The second method is equivalent to a time-dependent learning rate a(t_i) = 1/t_i.
This maintains V[i] as the mean of the rewards received when i was played.
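
A sketch of this tabular bandit learning in Python, assuming a pull(i) function that returns the reward from playing machine i (an assumption for illustration). Both the constant learning rate and the 1/t_i (sample-mean) update are supported.

```python
import random

def play_bandits(pull, n_machines, steps=10000, eps=0.1, alpha=None):
    """Tabular value learning for a row of slot machines (no state).

    pull(i) - assumed function returning a (possibly random) reward for machine i
    alpha   - constant learning rate; if None, use 1/t_i (keeps V[i] as the sample mean)
    """
    V = [0.0] * n_machines        # V[i]: estimated average payoff of machine i
    counts = [0] * n_machines     # t_i: number of times machine i has been played
    for _ in range(steps):
        if random.random() < eps:                            # explore
            i = random.randrange(n_machines)
        else:                                                # exploit
            i = max(range(n_machines), key=lambda j: V[j])
        r = pull(i)
        counts[i] += 1
        rate = alpha if alpha is not None else 1.0 / counts[i]
        V[i] += rate * (r - V[i])  # only the played machine's value is updated
    return V
```

For example, with pull = lambda i: random.gauss(i, 1.0) and n_machines = 3, the learned V should come to favour the machine with the highest mean payoff.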

12
Q

When to use a constant learning rate for tabular learning

A

Use in a changing environment. A decreasing learning rate loses the ability to respond to changes.

13
Q

When to use a decreasing learning rate for tabular learning

A

Use in an unchanging environment. Produces a more precise estimate of the average reward.

14
Q

Do we always know whether an environment is changing or not?

A

No

15
Q

Explain the tabular learning approach for games with state

A

The agent is represented by a big table with either:

State-action values:
Q(s_t, a_t) - the expected reward for taking action a_t from state s_t at time t. Called Q-learning.

State values:
V(s_t) - the expected reward of a state s_t, assuming a policy for choosing the action from that state. Called state-value learning.
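
One possible way to hold such tables in code (a sketch, not from the notes): key the Q-table by (state, action) pairs and the V-table by state.

```python
from collections import defaultdict

# Q[(state, action)]: expected reward for taking that action from that state (Q-learning)
Q = defaultdict(float)

# V[state]: expected reward of the state under the current policy (state-value learning)
V = defaultdict(float)

def greedy_action(state, actions):
    """Look up the action with the highest estimated reward from this state."""
    return max(actions, key=lambda a: Q[(state, a)])
```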

16
Q

Why do we need to approximate the Q function?

A

There are too many states; it would take too long to learn.

Solutions for the video poker game:
- A (hand-coded) representation of poker hands which uses a single representation for equivalent hands
- Use a supervised learning function approximation
  - Input: a representation of the hand and the action
  - Desired output: the expected reward, using a neural network, SVM, linear or polynomial regression, etc.

17
Q

How to use a tabular approach after training?

A

First train, then use:
- Observe the state
- Determine all possible actions
- Look in the table for the action which produces the highest reward.

A “greedy” policy

18
Q

What is function approximation?

A

A supervised regression (not classification) model R(s, a|W)
- Uses a vector of learnable parameters (e.g. weights) W = (w_1, …, w_d)

This can perhaps generalise to unseen state-action pairs.

19
Q

What’s a method for function approximation?

A

Gradient descent

20
Q

What is gradient descent?

A

An iterative, local improvement algorithm, like stochastic hill-climbing, but using information about the most improving direction (minus the gradient in this case)
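
A minimal sketch of gradient descent minimising a mean-squared-error loss for the approximation R(s, a|W). A linear model W . x(s, a) is assumed purely for simplicity (the notes also mention neural networks, SVMs, etc.); features is a hypothetical function mapping a state-action pair to a feature vector.

```python
def gradient_descent_step(W, batch, features, lr=0.01):
    """One gradient-descent step on the mean squared error.

    batch    - list of (state, action, target_reward) examples
    features - assumed function: (state, action) -> list of feature values
    W        - current weight vector (same length as the feature vector)
    """
    grad = [0.0] * len(W)
    for s, a, target in batch:
        x = features(s, a)
        pred = sum(w * xi for w, xi in zip(W, x))   # R(s, a | W) = W . x
        err = pred - target
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(batch)    # d(MSE)/dW_i
    # step in the most improving direction: minus the gradient
    return [w - lr * g for w, g in zip(W, grad)]
```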

21
Q

Work through the aside on gradient ascent

A

See notes

22
Q

Why is immediate reward RL not applicable to game learning?

A

In extensive form games, a sequence of actions is required to receive a reward, but values are required for all decision nodes

How do you associate individual moves with final performance? (Called learning with delayed rewards)

23
Q

How do you learn when rewards are delayed (learning with delayed rewards)?

A

Play to the end of the game, then back up results to the moves made

Learn to predict expected future rewards
- using one-step look ahead

24
Q

What is the value of a state in learning with delayed rewards?

A

The current reward plus the predicted future reward from that state, assuming a given policy is followed from now on.

25
Q

What is the value of a move from a state in learning with delayed rewards?

A

The current reward plus the predicted future reward of making that move from that state, assuming a given policy is followed from now on.

26
Q

What is the future reward?

A

At time t, the future reward is
r_t + r_(t+1) + r_(t+2) + …

27
Q

What is the discounted future reward?

A

Rewards become less valuable the longer you have to wait for them, discounted by a factor 0 <= y <= 1.

At time t, the discounted future reward is
r_t + y*r_(t+1) + y^2*r_(t+2) + …

28
Q

What does the Q-table predict in learning with delayed rewards?

A
- The immediate reward
- The expected (discounted) reward, assuming the best actions are taken from this point onward

29
Q

What should the Q function be in learning with delayed rewards?

A

Q(s_t, a_t) = the immediate reward + an estimate of future rewards
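
The standard tabular Q-learning update implements exactly this: the target is the immediate reward plus the discounted value of the best next action. A sketch, assuming Q is a defaultdict(float) keyed by (state, action) as in the earlier table sketch; gamma is the discount factor y and alpha a small learning rate.

```python
def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning update after taking action a in state s, receiving reward r,
    and observing the next state s_next with legal actions next_actions."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    target = r + gamma * best_next              # immediate reward + estimated future reward
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move the table entry towards the target
```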

30
Q

What does Q(s,a) converge to?

A
- The expected future reward one gets taking action a from state s
- discounted by the time to get it
- assuming the best action is chosen from this point forward

31
Q

Under what condition is Q-Learning proven to converge to the correct answer?

A

If every state is visited and every action is used an infinite number of times, i.e. in the limit of infinite learning.

32
Q

How does Q-learning work in a game setting?

A
1. When a player makes a move, the updated state is returned only after the opponent makes a move
2. When either player gets a reward, the opponent gets minus that reward
   - which is associated with the last move each made

33
Q

How would we train Q-learning to play a game?

A

Have two learning agents play games against each other many, many times.

Only after the learning phase is over can the agents be used to play against real players.

34
Q

Describe how function approximation is used with Q-Learning

A

There are too many nodes in a typical game tree.

As before, we can define a parameterised approximation function, where W is a vector of parameters.

Then use gradient descent to minimise the prediction error.
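
A sketch of how the two pieces fit together, under the same linear-model assumption as the earlier gradient-descent sketch: the regression target for each observed transition is the Q-learning target (reward plus discounted best next prediction). q_approx and features are illustrative helpers, not from the notes.

```python
def q_approx(W, features, s, a):
    """Parameterised approximation Q(s, a | W), here a linear model (an assumption)."""
    return sum(w * xi for w, xi in zip(W, features(s, a)))

def td_target(W, features, r, s_next, next_actions, gamma=0.9):
    """Q-learning target: immediate reward + discounted best predicted future value."""
    best_next = max((q_approx(W, features, s_next, a2) for a2 in next_actions), default=0.0)
    return r + gamma * best_next

# Gradient descent (as in the earlier sketch) then minimises
# (td_target(...) - q_approx(W, features, s, a))**2 over observed transitions.
```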

35
Q

What are some problems with Q-Learning?

A

Q-Learning can still be too slow. A heuristic speed-up is often used (this is TD(lambda)).

36
Q

What is Q-Learning with a heuristic speed up?

A

TD(Lambda)

37
Q

What speed is TD(0)?

A

It is very slow, but is provably correct

Initially only the state which led immediately to a reward has its value updated. Only through many sequences does learning work its way backwards towards the initial states

Why not update the value of every state in the sequence that led to the reward?

38
Q

Describe TD(lambda) learning

A

When a reward is received, assign credit (or blame) to the actions:
1. the one just previous, with a weight of 1
2. the one before that, with weight (discount * lambda)
3. the one before that, with weight (discount * lambda) squared, and so on

Here lambda is a parameter between 0 and 1 (a sketch using eligibility traces follows below)
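
A sketch of this credit assignment using eligibility traces, one common way TD(lambda) is implemented: every visited state keeps an eligibility that decays by (discount * lambda) per step, so states are credited with weights 1, (discount * lambda), (discount * lambda) squared, and so on. V is assumed to be a defaultdict(float); names are illustrative.

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Update state values V from one episode of (state, reward, next_state) steps."""
    e = defaultdict(float)                     # eligibility of each visited state
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        e[s] += 1.0                            # the just-visited state becomes fully eligible
        for n in e:                            # spread the error backwards by eligibility
            V[n] += alpha * delta * e[n]
            e[n] *= gamma * lam                # older states get geometrically less credit
    return V
```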

39
Q

What does e(n) denote?

A

The “eligibility” of node n. This is related to how recently in the sequence the node n was visited

40
Q

Using what method do rats navigate / ants find the shortest path?

A

TD(lambda)