Week 6 Flashcards
What class of algorithm is Q learning part of?
Temporal Difference (TD) Learning or TD(0) Learning
What is the goal of temporal difference learning?
Predict the expected (possibly discounted) future reward:
- Given a particular policy for choosing subsequent actions.
What is a policy in temporal difference learning?
The choice of action in a given state:
a_t = π(s_t)
The policy is conventionally denoted π in reinforcement learning; it is called a strategy in game theory.
Does Q-Learning have to get to the terminal node to get information?
Yes: in games the reward typically arrives only at terminal states, so value information has to propagate back from the end of the game over repeated episodes.
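A minimal sketch of the tabular Q-learning update (a TD(0)-style bootstrap), assuming a simple environment with reset(), step(action) and actions(state) methods; the interface and the hyperparameter values are illustrative, not from the notes:

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of tabular Q-learning (a TD(0)-style method).

    Q maps state -> action -> estimated value; env is a hypothetical
    environment with reset(), step(action) and actions(state) methods.
    """
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy policy: explore with probability epsilon.
        if random.random() < epsilon:
            action = random.choice(env.actions(state))
        else:
            action = max(env.actions(state), key=lambda a: Q[state][a])

        next_state, reward, done = env.step(action)

        # TD update: bootstrap on the estimated value of the best next action.
        best_next = 0.0 if done else max(Q[next_state][a] for a in env.actions(next_state))
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state
    return Q

# Usage (with a hypothetical env):
# Q = defaultdict(lambda: defaultdict(float))
# for _ in range(10_000):
#     q_learning_episode(env, Q)
```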
What is Deep-Q Learning?
Deep-Q Learning uses a neural network to approximate the Q-function.
Often a convolutional neural network
Convolutional neural networks are very effective at processing large images
- Treats the input as an image rather than a collection of features
- e.g. the board
Used by DeepMind to create agents that can play video games, and later to learn to play Go.
Convolutional Neural Networks
- Learn local feature detectors
- Weights constrained to learn the same feature at different parts of the image
- Higher-level layers build features of features
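A minimal PyTorch sketch of a convolutional Q-network that treats the board as an image of feature planes; the layer sizes, the 8x8 board and the 3 input planes are illustrative assumptions, not from the notes:

```python
import torch
import torch.nn as nn

class BoardQNetwork(nn.Module):
    """Approximates the Q-function: input is a board image, output is one Q-value per action."""

    def __init__(self, in_planes=3, board_size=8, num_actions=64):
        super().__init__()
        self.net = nn.Sequential(
            # Local feature detectors: the same 3x3 filters are applied at
            # every board position (weights shared across the image).
            nn.Conv2d(in_planes, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            # Higher layer builds features of features.
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * board_size * board_size, num_actions),
        )

    def forward(self, board):
        # board: (batch, in_planes, board_size, board_size)
        return self.net(board)

# Example: Q-values for a single empty 8x8 board encoded as 3 planes.
q_values = BoardQNetwork()(torch.zeros(1, 3, 8, 8))  # shape (1, 64)
```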
What assumption in statistics, machine learning, data science, etc. makes learning in games less effective?
That the data are independent and stationary.
But in games, subsequent data are not independent.
And in video games, subsequent data are frames, which are highly correlated.
What is the difference between the TD-Learning and Monte Carlo learning styles?
In TD-Learning you learn, then you play
In Monte-Carlo learning you learn while you play. You learn between moves. (Like minimax search).
Monte Carlo Board Evaluation
A simple way to generate a heuristic evaluation function; the random games are often called “playouts”.
For position v:
- Do the following N times: play the game to the end, with all players choosing random moves, and record the payoff.
- Return the average payoff over the N playouts
Estimates the strength of the board position by the average payoff over the terminal positions reachable from it.
The downside is the time cost of performing N playouts for every board position evaluated. Although the opponent will not actually play randomly, the estimate can still be very useful.
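A sketch of Monte Carlo board evaluation by random playouts; the position interface (copy / legal_moves / play / is_terminal / payoff) is a hypothetical game API used only for illustration:

```python
import random

def monte_carlo_evaluate(position, n_playouts=100):
    """Estimate the value of a board position by N random playouts.

    Hypothetical interface: position.copy(), .is_terminal(), .legal_moves(),
    .play(move) and .payoff() for the player being evaluated.
    """
    total = 0.0
    for _ in range(n_playouts):
        game = position.copy()
        # Play the game to the end with all players choosing random moves.
        while not game.is_terminal():
            game.play(random.choice(game.legal_moves()))
        total += game.payoff()
    # The average payoff is the heuristic estimate of the position's strength.
    return total / n_playouts
```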
What is MCTS?
Monte Carlo Tree Search
Elements of MCTS
The game tree is built incrementally but asymmetrically.
For each iteration, a tree policy is used to find the most urgent node to expand, using a strategy which balances exploration and exploitation.
A simulation is run from the selected node using a default policy (often random choice of action)
Update the search tree according to the result. This updates the value of the selected node and all its ancestors.
Return the best child of the current root node
What are the four phases of MCTS?
Selection
- Starting at the root, a child selection policy is recursively applied to traverse the game tree, looking for the most urgent expandable node. A node is expandable if it is non-terminal and has unvisited children
Expansion
- Increase the size of the current partial tree (typically by one node) by expanding the selected node (e.g. visiting an unvisited child)
Play-out (simulation)
Run a simulation from the added node by random self-play.
Back-up
The statistics of the added node(s) and its ancestors are updated
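Putting the four phases together, a minimal sketch of one MCTS iteration, reusing the hypothetical game interface from the playout sketch above; payoff sign handling for alternating players is omitted for brevity:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []                        # expanded children
        self.untried = list(state.legal_moves())  # moves not yet expanded
        self.N = 0                                # visit count
        self.Q = 0.0                              # sum of payoffs through this node

def mcts_iteration(root, c=1.4):
    # 1. Selection: descend with UCT until an expandable node is reached.
    node = root
    while not node.untried and node.children:
        node = max(node.children,
                   key=lambda ch: ch.Q / ch.N
                   + c * math.sqrt(2 * math.log(node.N) / ch.N))
    # 2. Expansion: add one unvisited child (if the node is non-terminal).
    if node.untried:
        move = node.untried.pop()
        child_state = node.state.copy()
        child_state.play(move)
        child = Node(child_state, parent=node)
        node.children.append(child)
        node = child
    # 3. Play-out: random self-play from the added node to a terminal state.
    sim = node.state.copy()
    while not sim.is_terminal():
        sim.play(random.choice(sim.legal_moves()))
    payoff = sim.payoff()
    # 4. Back-up: update the added node and all of its ancestors.
    while node is not None:
        node.N += 1
        node.Q += payoff
        node = node.parent
```

After enough iterations, a move is chosen from the root's children using one of the criteria described in the cards below.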
What is UCT?
Upper Confidence Trees (UCT) is a preferred selection method
- An exploration strategy, like epsilon greedy
- Always chooses unselected (unvisited) children first
- Between two children of the same parent with the same average value, chooses child selected less often.
- In the long run, children with lower true values will be chosen less often. The best child will be chosen exponentially more often.
In UCT what does each node store?
Two quantities:
Q(v) – the sum of all payoffs received over all playouts that passed through node v
N(v) – the number of times node v has been visited
So Q(v) / N(v) is an estimate of the value of node v
How does the explore term change in UCB?
Gets bigger when child is not explored but parent is;
Explore term gets smaller when child is explored
C determines exploration-exploitation trade-off (tuned experimentally)
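A sketch of the resulting UCT score for a child node, assuming the usual UCB1-style explore term (the constant c = 1.4 and the factor of 2 under the square root are illustrative assumptions):

```python
import math

def uct_score(q_child, n_child, n_parent, c=1.4):
    """UCT value of a child node: exploit term plus explore term."""
    if n_child == 0:
        return float("inf")  # unselected children are always chosen first
    exploit = q_child / n_child  # average payoff Q(v) / N(v)
    # The explore term grows as the parent is visited (n_parent up) while the
    # child is not, and shrinks as the child itself is visited (n_child up).
    explore = c * math.sqrt(2 * math.log(n_parent) / n_child)
    return exploit + explore
```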
Possibilities when ending the search and making a move in MCTS
Max child (default): Choose the child with the highest average reward (highest Q/N)
Robust child: Choose the child most often visited (highest N)
Max-Robust child: Choose the child which is maximal in both reward and visits. If none exists, run longer until one does exist.
Secure child: Choose the child which maximises a lower confidence bound, e.g. the UCB value but with a minus sign instead of a plus sign
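A short sketch of the default (max child) and robust child choices, assuming child nodes that carry the Q and N statistics described above:

```python
def max_child(children):
    # Default: the child with the highest average reward Q/N.
    return max(children, key=lambda ch: ch.Q / ch.N)

def robust_child(children):
    # The child most often visited (highest N).
    return max(children, key=lambda ch: ch.N)
```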
Main differences between TD-Learning and MCTS
TD-Learning
- Needs to learn in self play before used as a game playing agent.
- Learns board value-function or Q table
Monte Carlo Tree Search
- Learns while playing the game
- Uses a form of exploration-exploitation strategy (UCT) to select nodes to evaluate.
- Uses random play-out to evaluate them
MCTS summary
Actions from the current board position are evaluated by:
- building a partial subtree from the current position
Selection: to select the part of the tree to expand
Expansion: to expand from the selected node
Simulation: to simulate the game from that point (e.g. random roll-outs) to reach terminal nodes and estimate the values from the current position
A method for choosing the best action
How can you vary MCTS with a heuristic?
Could use heuristics for selection
- Select for expansion the node with the highest value of the heuristic
Could use heuristics for roll-outs
- During roll-outs, use the heuristic to select the action for each player
- But the heuristic must be quick to calculate
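A sketch of a heuristic-guided roll-out step, assuming a cheap, hypothetical heuristic(state) function and the same hypothetical game interface as above:

```python
def heuristic_rollout_move(state, heuristic):
    """Choose the roll-out move whose resulting position scores best under the heuristic.

    Replaces the random move choice during play-outs; the heuristic must be
    quick, since it is evaluated for every candidate move at every step.
    """
    def score(move):
        nxt = state.copy()
        nxt.play(move)
        return heuristic(nxt)
    return max(state.legal_moves(), key=score)
```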
Do MCTS algorithms rely on repeated random sampling?
Yes
They use randomness to find answers which could, in principle, be computed deterministically.