W6 Two-Agent RL Flashcards
What are the differences between AlphaGo, AlphaGo Zero, and AlphaZero?
AlphaGo: the program that beat the human Go champion. It combines supervised learning from grandmaster games with reinforcement learning from self-play games.
AlphaGo Zero: it performs tabula rasa learning of Go, based solely on self-play. It plays stronger than AlphaGo and learns faster, because self-play acts as curriculum learning.
* the biggest problem that was overcome by AlphaGo Zero: instability
* 2 architectural elements of AlphaGo Zero?
A neural network with two heads (a policy head and a value head) and an MCTS player
AlphaZero is a generalization of this program that
also plays chess and shogi.
More details about AlphaGo Zero
AlphaGo Zero uses a model-based actor-critic approach with a planner that improves a single value/policy network. For policy improvement it uses MCTS, for learning a single deep residual network with a policy head and a value head (Sect. B.2.6), see Fig. 6.9.
MCTS improves the quality of the training examples in each iteration (left panel), and the net is trained with these better examples, improving its quality (right panel).
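A minimal sketch of this loop, assuming a hypothetical helper `play_self_play_game` (which runs MCTS guided by the current net to produce improved move probabilities) and a two-headed `net` with a `train` method; it illustrates the structure, not the actual AlphaGo Zero code:

```python
def training_iteration(net, replay_buffer, num_games=100):
    # Left panel: MCTS self-play produces improved training examples.
    # Each example is (state, pi, z): pi are the MCTS visit-count
    # probabilities, z is the final game outcome seen from that state.
    for _ in range(num_games):
        examples = play_self_play_game(net)   # hypothetical helper
        replay_buffer.extend(examples)

    # Right panel: the network is trained on these better examples,
    # the policy head towards pi and the value head towards z.
    net.train(replay_buffer.sample())         # hypothetical method
    return net
```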
How was stability achieved in AlphaGo Zero?
coverage of the state space: by playing a large number of games and by MCTS look-ahead
correlation between training examples: reduced through an experience replay buffer (a minimal sketch follows below)
convergence of training: improved by using on-policy MCTS and by taking small training steps (a small learning rate)
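A minimal sketch of such a replay buffer (a generic implementation with assumed names and capacity, not the original code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, pi, z) examples from recent self-play games.

    Sampling random minibatches breaks the correlation between
    consecutive positions of the same game."""

    def __init__(self, capacity=500_000):          # capacity is illustrative
        self.examples = deque(maxlen=capacity)     # oldest examples fall out

    def extend(self, new_examples):
        self.examples.extend(new_examples)

    def sample(self, batch_size=32):
        pool = list(self.examples)
        return random.sample(pool, min(batch_size, len(pool)))
```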
What is MCTS? Why is MCTS more useful than minimax?
Monte Carlo Tree Search
MCTS is a variable-depth, adaptive, planning-based search algorithm that does not need a heuristic evaluation function; instead it uses random playouts to estimate board strength.
MCTS has 2 advantages:
1. First, MCTS is based on averaging single lines of play, instead of recursively traversing subtrees. The computational complexity of a path from the root to a leaf is polynomial in the search depth, whereas minimax's full subtree traversal is exponential in the search depth (see the illustration after this list).
2. Second, MCTS does not need a heuristic evaluation function. It plays out a line of play from the root to an end position and uses the actual game outcome instead.
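A back-of-the-envelope illustration of advantage 1, with assumed Go-like numbers (branching factor b = 250, depth d = 10); the figures are only for intuition:

```python
# Minimax must traverse a full-width subtree; one MCTS simulation only
# walks a single path from the root to a leaf.
b, d = 250, 10                 # assumed branching factor and search depth
minimax_nodes = b ** d         # ~9.5e23 nodes for the full subtree
mcts_path_nodes = d            # 10 nodes per simulated line of play
print(f"minimax subtree: {minimax_nodes:.1e} nodes, one MCTS path: {mcts_path_nodes} nodes")
```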
Is MCTS on-policy or off-policy?
MCTS is on-policy: the values that are backed up are those of the nodes that were selected.
What are the four steps of MCTS?
select: traverse the tree from root to a leaf using UCT
expand: add a new child to the tree (usually one per iteration)
playout/rollout/sampling: play random moves until the end of the game (self-play)
backpropagation: propagate the reward back upwards in the tree
What does UCT do?
it calculates the value of a child a, to decide which child to select next (and eventually expand)
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
second term is for exploration
Give the UCT formula. How is P-UCT different?
DRL book Page 168
it calculates the value of a child a, to decide which child to select next (and eventually expand)
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
balance between winrate (exploitation) and newness (exploration): the second term is for exploration
predictor-UCT (P-UCT): it additionally uses input from the policy head of the deep network; besides winrate and newness, the policy prior pi(a|s) is added to the exploration term
P-UCT(a) = wins_a / visits_a + C_p * pi(a|s) * sqrt(visits_parent) / (1 + visits_a)
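The two rules as a small sketch (variable names follow the formulas above; C_p = 1.4 is just an example value, and the handling of unvisited children is an assumption):

```python
import math

def uct(wins_a, visits_a, visits_parent, c_p=1.4):
    # exploitation: winrate of child a; exploration: newness of child a
    if visits_a == 0:
        return float("inf")            # unvisited children are tried first
    return wins_a / visits_a + c_p * math.sqrt(math.log(visits_parent) / visits_a)

def p_uct(wins_a, visits_a, visits_parent, prior_a, c_p=1.4):
    # prior_a = pi(a|s) from the policy head scales the exploration term
    winrate = wins_a / visits_a if visits_a > 0 else 0.0
    return winrate + c_p * prior_a * math.sqrt(visits_parent) / (1 + visits_a)
```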
What is zero-sum? How to implement it?
In two-agent RL, my win is your loss and vice versa. The two agents work against each other.
A popular way to do so is to implement the environment’s actions with self-play:
we replace the environment by a copy of ourselves. In this way we let ourselves play against an opponent that has all the knowledge that we currently have (both agents know the transition function, i.e. the rules of the game), and the agents learn from each other.
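A minimal sketch of this idea, assuming a hypothetical game environment (`reset`, `terminal`, `step`, `outcome`) and an `agent.choose` method; the same agent plays both sides:

```python
def self_play_episode(agent, env):
    """Zero-sum self-play: the opponent is a copy of ourselves."""
    state = env.reset()
    player = +1                                # +1 = first player, -1 = opponent
    while not env.terminal(state):
        action = agent.choose(state, player)   # both sides use the same knowledge
        state = env.step(state, action)        # both sides know the rules (transition function)
        player = -player                       # swap sides: we are our own opponent
    return env.outcome(state)                  # +1 win, 0 draw, -1 loss for the first player
```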
3 levels of self-play?
- Move-level: in the MCTS playouts the opponent is a copy of ourselves; we use the agent to play against itself, as its own opponent. Hence, self-play at the level of game moves.
- Example-level: the training examples for the policy and value approximator are generated by our own games. Hence, self-play at the level of the value/policy network.
- Tournament-level: the self-play loop creates a training curriculum that starts tabula rasa and ends at world-champion level. The system trains at the level of the player against itself. Hence, self-play of the third kind.
Describe the function of each of the four operations of MCTS.
- Select: in the selection step the tree is traversed from the root node down until a leaf of the MCTS search tree is reached, where a new child is selected that is not yet part of the tree. At each internal state the selection rule is followed to determine which action to take, and thus which state to go to next. The UCT rule works well in many applications. The selections at these states are part of the policy π(s) of actions of the state.
- Expand: then, in the expansion step, a child is added to the tree. In most cases only one child is added. In some MCTS versions all successors of a leaf are added to the tree.
- Playout: subsequently, during the playout step, random moves are played in a form of self-play until the end of the game is reached. (These nodes are not added to the MCTS tree, but their search result is, in the backpropagation step.) The reward r of this simulated game is +1 in case of a win for the first player, 0 in case of a draw, and −1 in case of a win for the opponent.
- Backpropagation: in the backpropagation step, reward r is propagated back upwards in the tree, through the nodes that were traversed down previously. Two counts are updated: the visit count, for all nodes, and the win count, depending on the reward value. Note that in a two-agent game, nodes in the MCTS tree alternate color. If white has won, then only white-to-play nodes are incremented; if black has won, then only the black-to-play nodes.
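A compact sketch tying the four operations together. It assumes a generic two-player game state with `legal_actions()`, `apply(action)`, `is_terminal()`, `winner()` and a `player_just_moved` attribute (all hypothetical); it illustrates the structure, not the book's code:

```python
import math
import random

class Node:
    """One node of the MCTS tree for a two-player, alternating-move game."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried = list(state.legal_actions())  # actions without a child yet
        self.visits = 0
        self.wins = 0.0   # wins from the view of state.player_just_moved

    def uct_select(self, c_p=1.4):
        # Select: pick the child with the highest UCT value.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c_p * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, n_simulations=1000, c_p=1.4):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root

        # 1. Select: walk down with UCT until a node with untried actions
        #    (or a terminal node) is reached.
        while not node.untried and node.children:
            node = node.uct_select(c_p)

        # 2. Expand: add one child for a not-yet-tried action.
        if node.untried:
            action = node.untried.pop(random.randrange(len(node.untried)))
            node.children.append(Node(node.state.apply(action), parent=node))
            node = node.children[-1]

        # 3. Playout: random moves (self-play) until the end of the game.
        #    These playout positions are not added to the tree.
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_actions()))
        winner = state.winner()   # e.g. 'white', 'black', or None for a draw

        # 4. Backpropagation: visit counts for all traversed nodes,
        #    win counts only for the nodes of the winning color (none on a draw).
        while node is not None:
            node.visits += 1
            if winner == node.state.player_just_moved:
                node.wins += 1
            node = node.parent

    # The move leading to the most-visited child is usually played.
    return max(root.children, key=lambda ch: ch.visits)
```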
How does UCT achieve trading off exploration and exploitation? Which inputs does it use?
By tuning Cp. The formula uses the number of wins in child a, the number of times child a has been visited, and the number of times the parent node has been visited.
When Cp is small, does MCTS explore more or exploit more?
Larger Cp means more exploration.
Smaller Cp means more exploitation.
What's the state-space complexity of chess and of Go?
10^47 for chess, 10^170 for Go
For small numbers of node expansions, would you prefer more exploration or more exploitation?
More exploitation (a smaller Cp): with only a few node expansions there is too little budget for exploration to pay off, so it is better to deepen the most promising lines.