W6 Two-Agent RL Flashcards
What are the differences between AlphaGo, AlphaGo Zero, and AlphaZero?
AlphaGo: the program that beat the human Go champion. It combines supervised learning from grandmaster games with reinforcement learning from self-play games.
AlphaGo Zero: it performs tabula rasa learning of Go, based solely on self-play. It plays stronger than AlphaGo and learns faster, because self-play acts as curriculum learning.
* the biggest problem that was overcome by AlphaGo Zero: instability
* 2 architectural elements of AlphaGo Zero?
A neural network with two heads (a policy head and a value head) and an MCTS player
AlphaZero is a generalization of this program that
also plays chess and shogi.
More details about AlphaGo Zero
AlphaGo Zero uses a model-based actor-critic approach with a planner that improves a single value/policy network. For policy improvement it uses MCTS, for learning a single deep residual network with a policy head and a value head (Sect. B.2.6), see Fig. 6.9.
MCTS improves the quality of the training examples in each iteration (left panel), and the net is trained with these better examples, improving its quality (right panel).
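A minimal sketch of this loop, assuming a hypothetical helper `play_self_play_game` (which runs MCTS guided by the current net to produce improved move probabilities) and a two-headed `net` with a `train` method; it illustrates the structure, not the actual AlphaGo Zero code:

```python
def training_iteration(net, replay_buffer, num_games=100):
    # Left panel: MCTS self-play produces improved training examples.
    # Each example is (state, pi, z): pi are the MCTS visit-count
    # probabilities, z is the final game outcome seen from that state.
    for _ in range(num_games):
        examples = play_self_play_game(net)   # hypothetical helper
        replay_buffer.extend(examples)

    # Right panel: the network is trained on these better examples,
    # the policy head towards pi and the value head towards z.
    net.train(replay_buffer.sample())         # hypothetical method
    return net
```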
How was stability achieved in AlphaGo Zero?
coverage of the state space: by playing a large number of games and by MCTS look-ahead
correlation between training examples: reduced through an experience replay buffer (a minimal sketch follows below)
convergence of training: improved by using on-policy MCTS and by taking small training steps (a small learning rate)
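A minimal sketch of such a replay buffer (a generic implementation with assumed names and capacity, not the original code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, pi, z) examples from recent self-play games.

    Sampling random minibatches breaks the correlation between
    consecutive positions of the same game."""

    def __init__(self, capacity=500_000):          # capacity is illustrative
        self.examples = deque(maxlen=capacity)     # oldest examples fall out

    def extend(self, new_examples):
        self.examples.extend(new_examples)

    def sample(self, batch_size=32):
        pool = list(self.examples)
        return random.sample(pool, min(batch_size, len(pool)))
```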
What is MCTS? Why is MCTS more useful than minimax?
Monte Carlo Tree Search
MCTS is a variable-depth, adaptive, planning-based search algorithm that does not need a heuristic evaluation function; instead it uses random playouts to estimate board strength.
MCTS has 2 advantages:
1. First, MCTS is based on averaging single lines of play, instead of recursively traversing subtrees. The computational complexity of a path from the root to a leaf is polynomial in the search depth, whereas minimax's full subtree traversal is exponential in the search depth (see the illustration after this list).
2. Second, MCTS does not need a heuristic evaluation function. It plays out a line of play from the root to an end position and uses the actual game outcome instead.
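A back-of-the-envelope illustration of advantage 1, with assumed Go-like numbers (branching factor b = 250, depth d = 10); the figures are only for intuition:

```python
# Minimax must traverse a full-width subtree; one MCTS simulation only
# walks a single path from the root to a leaf.
b, d = 250, 10                 # assumed branching factor and search depth
minimax_nodes = b ** d         # ~9.5e23 nodes for the full subtree
mcts_path_nodes = d            # 10 nodes per simulated line of play
print(f"minimax subtree: {minimax_nodes:.1e} nodes, one MCTS path: {mcts_path_nodes} nodes")
```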
Is MCTS on-policy or off-policy?
MCTS is on-policy: the values that are backed up are those of the nodes that were selected.
What are the four steps of MCTS?
select: traverse the tree from root to a leaf using UCT
expand: add a new child to the tree (usually one per iteration)
playout/rollout/sampling: play random moves until the end of the game (self-play)
backpropagation: propagate the reward back upwards in the tree
What does UCT do?
it calculates the value of a child a, to decide which child to select next (and eventually expand)
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
second term is for exploration
Give the UCT formula. How is P-UCT different?
DRL book Page 168
it calculates the value of a child a, to decide which child to select next (and eventually expand)
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
balance between winrate (exploitation) and newness (exploration): the second term is for exploration
predictor-UCT (P-UCT): it additionally uses input from the policy head of the deep network; besides winrate and newness, the policy prior pi(a|s) is added to the exploration term
P-UCT(a) = wins_a / visits_a + C_p * pi(a|s) * sqrt(visits_parent) / (1 + visits_a)
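The two rules as a small sketch (variable names follow the formulas above; C_p = 1.4 is just an example value, and the handling of unvisited children is an assumption):

```python
import math

def uct(wins_a, visits_a, visits_parent, c_p=1.4):
    # exploitation: winrate of child a; exploration: newness of child a
    if visits_a == 0:
        return float("inf")            # unvisited children are tried first
    return wins_a / visits_a + c_p * math.sqrt(math.log(visits_parent) / visits_a)

def p_uct(wins_a, visits_a, visits_parent, prior_a, c_p=1.4):
    # prior_a = pi(a|s) from the policy head scales the exploration term
    winrate = wins_a / visits_a if visits_a > 0 else 0.0
    return winrate + c_p * prior_a * math.sqrt(visits_parent) / (1 + visits_a)
```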
What is zero-sum? How to implement it?
In two-agent RL, my win is your loss and vice versa. The two agents work against each other.
A popular way to do so is to implement the environment’s actions with self-play:
we replace the environment by a copy of ourselves. In this way we let ourselves play against an opponent that has all the knowledge that we currently have (both agents know the transition function, i.e. the rules of the game), and the agents learn from each other.
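A minimal sketch of this idea, assuming a hypothetical game environment (`reset`, `terminal`, `step`, `outcome`) and an `agent.choose` method; the same agent plays both sides:

```python
def self_play_episode(agent, env):
    """Zero-sum self-play: the opponent is a copy of ourselves."""
    state = env.reset()
    player = +1                                # +1 = first player, -1 = opponent
    while not env.terminal(state):
        action = agent.choose(state, player)   # both sides use the same knowledge
        state = env.step(state, action)        # both sides know the rules (transition function)
        player = -player                       # swap sides: we are our own opponent
    return env.outcome(state)                  # +1 win, 0 draw, -1 loss for the first player
```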
3 levels of self-play?
- Move-level: in the MCTS playouts the opponent is a copy of ourselves; we use the agent to play against itself, as its own opponent. Hence, self-play at the level of game moves.
- Example-level: the training examples for the policy and value approximator are generated by our own games. Hence, self-play at the level of the value/policy network.
- Tournament-level: the self-play loop creates a training curriculum that starts tabula rasa and ends at world-champion level. The system trains at the level of the player against itself. Hence, self-play of the third kind.
Describe the function of each of the four operations of MCTS.
- Select: in the selection step the tree is traversed from the root node down until a leaf of the MCTS search tree is reached, where a new child is selected that is not yet part of the tree. At each internal state the selection rule is followed to determine which action to take, and thus which state to go to next. The UCT rule works well in many applications. The selections at these states are part of the policy π(s) of actions of the state.
- Expand: then, in the expansion step, a child is added to the tree. In most cases only one child is added. In some MCTS versions all successors of a leaf are added to the tree.
- Playout: subsequently, during the playout step, random moves are played in a form of self-play until the end of the game is reached. (These nodes are not added to the MCTS tree, but their search result is, in the backpropagation step.) The reward r of this simulated game is +1 in case of a win for the first player, 0 in case of a draw, and −1 in case of a win for the opponent.
- Backpropagation: in the backpropagation step, reward r is propagated back upwards in the tree, through the nodes that were traversed down previously. Two counts are updated: the visit count, for all nodes, and the win count, depending on the reward value. Note that in a two-agent game, nodes in the MCTS tree alternate color. If white has won, then only white-to-play nodes are incremented; if black has won, then only the black-to-play nodes.
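A compact sketch tying the four operations together. It assumes a generic two-player game state with `legal_actions()`, `apply(action)`, `is_terminal()`, `winner()` and a `player_just_moved` attribute (all hypothetical); it illustrates the structure, not the book's code:

```python
import math
import random

class Node:
    """One node of the MCTS tree for a two-player, alternating-move game."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried = list(state.legal_actions())  # actions without a child yet
        self.visits = 0
        self.wins = 0.0   # wins from the view of state.player_just_moved

    def uct_select(self, c_p=1.4):
        # Select: pick the child with the highest UCT value.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c_p * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, n_simulations=1000, c_p=1.4):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root

        # 1. Select: walk down with UCT until a node with untried actions
        #    (or a terminal node) is reached.
        while not node.untried and node.children:
            node = node.uct_select(c_p)

        # 2. Expand: add one child for a not-yet-tried action.
        if node.untried:
            action = node.untried.pop(random.randrange(len(node.untried)))
            node.children.append(Node(node.state.apply(action), parent=node))
            node = node.children[-1]

        # 3. Playout: random moves (self-play) until the end of the game.
        #    These playout positions are not added to the tree.
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_actions()))
        winner = state.winner()   # e.g. 'white', 'black', or None for a draw

        # 4. Backpropagation: visit counts for all traversed nodes,
        #    win counts only for the nodes of the winning color (none on a draw).
        while node is not None:
            node.visits += 1
            if winner == node.state.player_just_moved:
                node.wins += 1
            node = node.parent

    # The move leading to the most-visited child is usually played.
    return max(root.children, key=lambda ch: ch.visits)
```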
How does UCT achieve trading off exploration and exploitation? Which inputs does it use?
By tuning Cp. The formula uses the number of wins in child a, the number of times child a has been visited, and the number of times the parent node has been visited.
When Cp is small, does MCTS explore more or exploit more?
Larger Cp means more exploration.
Smaller Cp means more exploitation.
What's the state-space complexity of chess and of Go?
10^47 for chess, 10^170 for Go
For small numbers of node expansions, would you prefer more exploration or more exploitation?
More exploitation (a smaller Cp): with only a few node expansions there is too little budget for exploration to pay off, so it is better to deepen the most promising lines.