Week 6 Flashcards
What class of algorithm is Q learning part of?
Temporal Difference (TD) Learning or TD(0) Learning
What is the goal of temporal difference learning?
Predict the expected (possibly discounted) future reward:
- Given a particular policy for choosing subsequent actions.
What is a policy in temporal difference learning?
The choice of action in a given state:
a_t = π(s_t)
A policy is conventionally denoted π in reinforcement learning; in game theory it is called a strategy.
Does Q-Learning have to get to the terminal node to get information?
Yes. In games the reward typically arrives only at a terminal position (win, lose or draw), so value information has to propagate back from completed games.
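A minimal sketch of the tabular Q-learning update, in Python. The environment interface (reset, step, actions) and the hyperparameters alpha, gamma and epsilon are illustrative assumptions, not something specified in these notes:

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular Q-learning with an epsilon-greedy policy.

    Q maps (state, action) -> estimated value.
    env is assumed (hypothetically) to expose reset(), step(action) ->
    (next_state, reward, done), and actions(state) -> list of legal actions.
    """
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy choice of action in the current state (the policy pi).
        if random.random() < epsilon:
            action = random.choice(env.actions(state))
        else:
            action = max(env.actions(state), key=lambda a: Q[(state, a)])

        next_state, reward, done = env.step(action)

        # TD(0) target: bootstrap from the best action available in the next state.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

        state = next_state

# Usage: Q = defaultdict(float); call q_learning_episode(env, Q) over many episodes.
```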
What is Deep-Q Learning?
Deep-Q Learning uses a neural network to approximate the Q-function.
Often a convolutional neural network
Convolutional neural networks are very effective at processing large images
- Treats the input as an image rather than a collection of features
- e.g. the board
Used by DeepMind to create agents that can play video games, and later to learn to play Go.
Convolutional Neural Networks
- Learns local feature detectors
- Weights constrained to learn the same feature at different parts of the image
- Higher-level layers build features of features
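As a concrete illustration, here is a minimal sketch of a convolutional Q-network in PyTorch: the board is treated as an image and the network outputs one Q-value per action. The board size, number of input planes and number of actions below are assumptions for the example, not values from the notes:

```python
import torch
import torch.nn as nn

class BoardQNet(nn.Module):
    """Sketch of a Deep-Q network that approximates the Q-function.

    planes, board and n_actions are illustrative assumptions.
    """
    def __init__(self, planes=2, board=8, n_actions=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, 32, kernel_size=3, padding=1),  # local feature detectors
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),      # features of features
            nn.ReLU(),
        )
        self.head = nn.Linear(64 * board * board, n_actions)  # one Q-value per action

    def forward(self, x):
        x = self.conv(x)                # shared weights detect the same feature everywhere
        return self.head(x.flatten(1))  # flatten the feature maps, then score each action

# Usage: q_values = BoardQNet()(torch.zeros(1, 2, 8, 8)); action = q_values.argmax(dim=1)
```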
What assumption, common in statistics, machine learning, data science, etc., makes learning in games less effective?
That the data are independent and stationary.
But in games, subsequent data points are not independent.
And in video games, subsequent data points are frames, which are highly correlated.
What is the difference between the TD-learning and Monte Carlo learning styles?
In TD-learning you learn first, then you play.
In Monte Carlo learning you learn while you play: you learn between moves (like minimax search).
Monte Carlo Board Evaluation
Simple way to generate a heuristic function. Often called “playouts”.
For position v, do the following N times:
- Play the game to the end, all players choose random moves.
- Return the average payoff
Estimates the strength of the board position by the average payoff over the terminal positions reachable from it.
The downside is the time cost of performing the N playouts per board position. Although your opponent will not actually play randomly, the estimate can still be very useful (a playout sketch follows below).
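A minimal sketch of Monte Carlo board evaluation by playouts. The position interface used here (is_terminal, legal_moves, play, payoff) is a hypothetical one for illustration:

```python
import random

def monte_carlo_evaluate(position, n_playouts=100):
    """Estimate the value of a board position by N random playouts."""
    total = 0.0
    for _ in range(n_playouts):
        current = position
        # Play the game to the end, every player choosing uniformly random moves.
        while not current.is_terminal():
            current = current.play(random.choice(current.legal_moves()))
        total += current.payoff()
    # The average payoff over the N playouts is the heuristic value.
    return total / n_playouts
```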
What is MCTS
Monte Carlo Tree Search
Elements of MCTS
The game tree is built incrementally but asymmetrically.
For each iteration, a tree policy is used to find the most urgent node to expand, using a strategy which balances exploration and exploitation.
A simulation is run from the selected node using a default policy (often random choice of action)
Update the search tree according to the result. This updates the value of the selected node and all its ancestors.
Return the best child of the current root node
What are the four phases of MCTS?
Selection
- Starting at the root, child selection policy is recursively applied to traverse the game tree, looking for the most important expandable node. A node is expandable if it is non-terminal and has unvisited children
Expansion
- Increase the size of the current partial tree (typically by one node) by expanding nodes (e.g. visit an unvisited child)
Play-out (simulation)
Run a simulation from the added node by random self-play.
Back-up
The statistics of the added node(s) and their ancestors are updated (a sketch covering all four phases follows below).
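A compact sketch of one MCTS loop showing all four phases, reusing the hypothetical position interface from the playout sketch above (is_terminal, legal_moves, play, payoff). Handling of alternating player perspectives during back-up is omitted for brevity:

```python
import math
import random

class Node:
    """One node of the incrementally, asymmetrically built search tree."""
    def __init__(self, position, parent=None):
        self.position = position
        self.parent = parent
        self.children = []
        self.untried_moves = list(position.legal_moves())
        self.Q = 0.0   # sum of payoffs of all playouts through this node
        self.N = 0     # number of times this node has been visited

def uct_select(node, c=1.4):
    # Tree policy: balance exploitation (Q/N) against an exploration bonus.
    return max(node.children,
               key=lambda ch: ch.Q / ch.N + c * math.sqrt(math.log(node.N) / ch.N))

def mcts(root_position, iterations=1000):
    root = Node(root_position)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend with the tree policy until an expandable node.
        while not node.untried_moves and node.children:
            node = uct_select(node)
        # 2. Expansion: add one unvisited child (if the node is non-terminal).
        if node.untried_moves:
            move = node.untried_moves.pop()
            child = Node(node.position.play(move), parent=node)
            node.children.append(child)
            node = child
        # 3. Play-out: random self-play from the selected node to the end.
        position = node.position
        while not position.is_terminal():
            position = position.play(random.choice(position.legal_moves()))
        payoff = position.payoff()
        # 4. Back-up: update the added node and all of its ancestors.
        while node is not None:
            node.N += 1
            node.Q += payoff
            node = node.parent
    # Return the best child of the root (here: the most visited one).
    return max(root.children, key=lambda ch: ch.N).position
```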
What is UCT?
UCT (Upper Confidence bounds applied to Trees) is a preferred selection method.
- An exploration strategy, like epsilon-greedy
- Always chooses unvisited children first
- Between two children of the same parent with the same average value, chooses child selected less often.
- In the long run, children with lower true values will be chosen less often. The best child will be chosen exponentially more often.
In UCT what does each node store?
Two quantities:
Q(v) – the sum of the payoffs of all playouts that passed through this node
N(v) – the number of times the node has been visited
So Q(v) / N(v) is an estimate of the value of the node v
How does the explore term change in UCB?
Gets bigger when the child has not been explored but its parent has;
Gets smaller each time the child is explored.
C determines the exploration-exploitation trade-off (tuned experimentally).
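A sketch of the standard UCB1-style score used for UCT selection; the exploration constant c and the exact formulation are assumptions and may differ from the version used in the course:

```python
import math

def uct_score(q_child, n_child, n_parent, c=1.4):
    """Exploitation term plus exploration term for one child.

    Unvisited children (n_child == 0) are normally selected first,
    before this score is ever computed.
    """
    exploit = q_child / n_child                            # average payoff so far
    explore = c * math.sqrt(math.log(n_parent) / n_child)  # shrinks as the child is visited
    return exploit + explore

# The explore term grows as the parent is visited without the child, and
# shrinks as the child itself is visited:
# uct_score(5, 10, 20)  > uct_score(5, 10, 15)   (more parent visits -> bigger bonus)
# uct_score(10, 20, 40) < uct_score(5, 10, 40)   (more child visits -> smaller bonus)
```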