Week 5 (N/A) - Machine Learning in Games Flashcards
What is “Book Learning”?
- Learning a sequence of moves for important positions
e.g. #1 - the opening moves in Chess
e.g. #2 - learning from mistakes - identify moves that led to a loss and check whether there was a better alternative
How do you recognise which moves are important?
Page 4 Lecture 5
What is “Search Control Learning”?
- Learn how to make search more efficient
- ADJUSTING SEARCH PARAMETERS
e.g. #1 - the order of move generation affects alpha-beta pruning
- Learning a preferred order for generating possible moves so that more subtrees can be pruned (see the sketch after this card)
e.g. #2 - vary the cut-off depth
- Learn a classifier to predict how deep the algorithm should search depending on the current game state
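A minimal sketch (not from the lecture) of where a learned move ordering could plug into alpha-beta pruning; the toy nested-list tree and the `order` hook are made-up for illustration:

```python
# Toy alpha-beta search where `order` stands in for a learned heuristic
# that ranks child nodes; searching promising children first lets the
# alpha >= beta cut-off prune more of the remaining subtrees.

def alphabeta(node, alpha, beta, maximizing, order=lambda children: children):
    if not isinstance(node, list):                 # leaf: its utility value
        return node
    children = order(node)                         # learned preferred order goes here
    if maximizing:
        value = float("-inf")
        for child in children:
            value = max(value, alphabeta(child, alpha, beta, False, order))
            alpha = max(alpha, value)
            if alpha >= beta:                      # prune remaining children
                break
        return value
    value = float("inf")
    for child in children:
        value = min(value, alphabeta(child, alpha, beta, True, order))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

# A 2-ply toy tree: MAX over two MIN nodes. Ordering the better MIN node
# first ([6, 9] before [3, 5]) would allow a cut-off inside the second one.
print(alphabeta([[3, 5], [6, 9]], float("-inf"), float("inf"), True))   # 6
```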
How can a game-playing agent deal with the challenge of adjusting its search parameters?
Search control learning
How to handle each stage of the game separately in game playing?
Compile a “book” of openings or endgames - Book Learning
What is “Learning evaluation function weights”?
- Adjust weights in evaluation function based on experience of their ability to predict the true final utility
Eval(s) = w1*f1(s) + w2*f2(s) + … + wn*fn(s)
        = Σi wi*fi(s)
        = w · f(s)
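As a quick illustration (the features and weights below are made-up, not from the lecture), the weighted sum can be computed like this:

```python
# Linear evaluation function: Eval(s) = w1*f1(s) + ... + wn*fn(s) = w · f(s).
# A state is represented here just by its feature vector; in a real engine
# f(s) would be computed from the board position (material, mobility, ...).

def evaluate(weights, feature_vector):
    return sum(w * f for w, f in zip(weights, feature_vector))

weights = [1.0, 0.1]            # w
f_s = [3.0, 12.0]               # f(s) for some state s
print(evaluate(weights, f_s))   # 1.0*3.0 + 0.1*12.0 = 4.2
```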
What is Gradient Descent Learning?
A supervised learning method to train our agent to be able to predict the true minimax utility value of a state.
It achieves this by learning the weights of an evaluation function.
Training set = a set of feature vectors
- Each feature vector represents a state, with that state’s true minimax utility value as the target label
The iterative step applies the weight update rule:
repeatedly update the weights on each training example until the weights converge.
How do you quantify how well the predicted output is compared to the actual output in Gradient Descent Learning?
Define an error function E
- E is (half) the sum over all training examples of the square of the differences between the predicted output (z) and the actual output (t)
error E = 1/2 * Σ (t - z)^2
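A small sketch of computing E over a training set (the data layout is an assumption; each example pairs a feature vector with its target utility):

```python
# E = 1/2 * sum over all training examples of (t - z)^2,
# where z = Eval(s; w) is the prediction and t is the target utility.

def error(weights, examples):
    total = 0.0
    for f_s, t in examples:                                  # (features, target)
        z = sum(w * f for w, f in zip(weights, f_s))         # predicted utility z
        total += (t - z) ** 2
    return 0.5 * total

print(error([1.0, 0.1], [([3.0, 12.0], 5.0), ([1.0, 2.0], -1.0)]))
```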
Why is it that the Gradient Descent Learning algorithm moves in the steepest downhill direction?
The aim is to find a set of weight values for which E is very low - to minimise the errors between predicted and actual target output.
The fewer the errors, the better our agent/classifier is.
Picturing the error E as a height, it defines an error landscape over the weight space.
error E = 1/2 * Σ (t - z)^2
t = target (true) output, z = predicted output
For a linear evaluation function this landscape is a smooth convex bowl with a single minimum at the bottom, so repeatedly stepping in the steepest downhill direction reduces E and leads towards that minimum.
Page 9 Lecture 5
Explain the application of the weight update rule and derive it.
Used for Gradient Descent Learning.
Chain rule: if y = y(u) and u = u(x), then dy/dx = (dy/du)(du/dx)
So, with z = Eval(s; w) the predicted value and t the true utility of state s, the error on a single example is E = 1/2 (t - z)^2 and
dE/dw = d/dz [1/2 (t - z)^2] * dz/dw
      = (z - t) * dz/dw
      = (z - t) * f(s)     (since Eval is linear in the weights, dz/dw = f(s))
Stepping against this gradient with learning rate η gives the weight update rule:
w ← w - η (z - t) f(s),  i.e.  wi ← wi + η (t - z) fi(s)
Repeatedly update the weights on each training example until the weights converge.
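A minimal sketch of the whole loop under these definitions (the learning rate, epoch count and data are arbitrary placeholders; a real implementation would test for convergence rather than run a fixed number of epochs):

```python
# Gradient descent on the weights of a linear evaluation function:
# for each example, z = w · f(s) and each weight is moved by
# w_i += eta * (t - z) * f_i(s), i.e. a step against dE/dw.

def train(examples, num_features, eta=0.01, epochs=1000):
    w = [0.0] * num_features
    for _ in range(epochs):
        for f_s, t in examples:                              # (features, target)
            z = sum(wi * fi for wi, fi in zip(w, f_s))       # prediction z
            w = [wi + eta * (t - z) * fi for wi, fi in zip(w, f_s)]
    return w

# Made-up training data: learn w so that w · f(s) ≈ t for each example.
data = [([1.0, 2.0], 1.0), ([2.0, 0.5], -1.0)]
print(train(data, num_features=2))
```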
What are the problems with Gradient Descent Learning?
- Delayed reinforcement
- Reward resulting from an action may not be received until several time steps later, which also slows down the learning
- Credit assignment
- Need to know which action(s) was responsible for the outcome
What is Temporal Difference Learning and how is it different from Supervised learning?
Temporal Difference (TD) is a form of REINFORCEMENT LEARNING
Supervised learning is for single step prediction
- predict Saturday’s weather based on Friday’s weather
TD learning is for multi-step prediction
- predict Saturday’s weather based on Monday’s weather, then update the prediction based on Tuesday’s, Wednesday’s, …, etc.
- predict outcome of game based on first move, then update prediction as more moves are made
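A tiny sketch of that multi-step idea (a TD(0)-style update over a recorded sequence of predictions; the step size and numbers are made-up):

```python
# Each intermediate prediction is nudged towards the NEXT prediction,
# and the final prediction towards the actual outcome, instead of
# waiting until the end to correct all of them against the outcome.

def td_update(predictions, outcome, alpha=0.1):
    targets = predictions[1:] + [outcome]        # successor prediction, or outcome
    return [p + alpha * (target - p) for p, target in zip(predictions, targets)]

# e.g. Mon..Fri predictions of Saturday's outcome, true outcome 1.0
print(td_update([0.5, 0.55, 0.4, 0.7, 0.9], 1.0))
```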
What are the problems with Temporal Difference Learning?
- Correctness of prediction is not known until several steps later
What is the TDLeaf algorithm?
Combines temporal difference learning with minimax search
The basic idea is to:
- Update the weights in the evaluation function to reduce the differences between the rewards predicted at different levels of the search tree
- A good evaluation function should be stable from one move to the next
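A heavily simplified sketch of that idea for a linear evaluation function (the data layout, λ, and learning rate are assumptions; real TDLeaf(λ) recomputes the minimax principal variation with the current weights at every position):

```python
# Simplified TDLeaf(lambda)-style update for a LINEAR evaluation function.
# Assumes that for each position visited during a game we recorded the
# feature vector and evaluation of the leaf of the principal variation
# found by minimax, plus the final game result.

def tdleaf_update(w, leaf_features, leaf_values, result, alpha=0.01, lam=0.7):
    T = len(leaf_values)
    # temporal differences between successive principal-variation evaluations,
    # with the last evaluation compared against the actual game result
    diffs = [leaf_values[t + 1] - leaf_values[t] for t in range(T - 1)]
    diffs.append(result - leaf_values[-1])
    for t in range(T):
        # lambda-discounted sum of the differences from step t onwards
        future = sum((lam ** (j - t)) * diffs[j] for j in range(t, T))
        # for a linear Eval, the gradient w.r.t. w is just the leaf's features
        w = [wi + alpha * fi * future for wi, fi in zip(w, leaf_features[t])]
    return w

# Made-up example: 3 positions, 2 features, game eventually won (result = 1.0)
w = tdleaf_update([0.0, 0.0],
                  [[1.0, 0.5], [0.8, 0.2], [1.2, 0.1]],
                  [0.1, 0.3, 0.6],
                  result=1.0)
print(w)
```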
What are the different ways to train a classifier/agent?
Learning from labelled examples
Learning by playing against a skilled opponent
Learning by playing against random moves
Learning by playing against yourself