C6 Flashcards
what is zero sum?
competitive games in which the win of player A is the loss of player B (the players' payoffs sum to zero)
what is decision complexity?
the number of end positions that define the value of the initial game position; the larger the number of actions in a position, the larger the decision complexity
what is state space complexity?
the number of legal positions reachable from the initial position of a game
what is AlphaGo?
it is a program that beat the human Go champion; it combines supervised learning from grandmaster games with reinforcement learning from self-play games
name the 3 categories of programs that play Go
minimax-style programs, MCTS-based programs and the AlphaGo programs (MCTS combined with deep self-play)
what is AlphaGo Zero?
it performs tabula rasa learning of Go, based solely on self-play. It plays stronger than AlphaGo
how does a self-learning system work?
- the searcher uses the evaluation network to estimate reward values and policy actions, and the search results are used to play games against the opponent in self-play
- the game results are collected in a buffer, which is used to train the evaluation network (self-learning)
- by playing a tournament against a copy of ourselves, a virtuous cycle of ever-increasing improvement is created (see the sketch below)
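a minimal sketch of this loop, with hypothetical placeholder names (`play_game`, `train_step`), not AlphaGo Zero's actual code:

```python
import random

def play_game(net):
    """Hypothetical self-play game: a real version would run MCTS guided by
    `net` at every move, with a copy of ourselves as the opponent."""
    return [(("state", i), [0.5, 0.5], random.choice([1, -1])) for i in range(10)]

def train_step(net, batch):
    """Hypothetical network update: one small gradient step on the batch."""
    pass

def self_play_training(net, iterations=3, games_per_iter=5, batch_size=16):
    buffer = []                                    # buffer of game results
    for _ in range(iterations):
        for _ in range(games_per_iter):
            buffer.extend(play_game(net))          # searcher plays; results buffered
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        train_step(net, batch)                     # self-learning on buffered examples
        # each pass through this loop is one turn of the virtuous cycle

self_play_training(net=None)
```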
what are the 3 levels of self-play?
playing against a copy of yourself at:
1. move-level: in MCTS playouts our opponent is a copy of ourselves
2. example-level: the training input for the approximator of the policy and reward functions is generated by our own games
3. tournament-level: create a training curriculum that starts tabula rasa and ends at world champion level
2 advantages of MCTS over minimax and alpha-beta
- it is based on averaging single lines of play instead of traversing subtrees recursively
- it does not need a heuristic evaluation function: a playout reaches the end of the game, where we know whether we have a win or a loss
what are the 4 operations of MCTS?
- select: traverse the tree from root to a leaf using UCT
- expand: add a new child of the selected leaf to the tree
- playout: play random moves until the end of the game (self-play)
- backpropagation: propagate the reward from the playout back up the tree (see the sketch below)
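a sketch of the four operations on a toy subtraction game (take 1 or 2 stones; whoever takes the last stone wins); the game and all names are illustrative, not any program's actual code:

```python
import math, random

class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones          # game state: stones left in the pile
        self.parent = parent
        self.children = []
        self.wins = 0.0               # reward total, seen from the player who moved here
        self.visits = 0

    def untried_moves(self):
        tried = {child.stones for child in self.children}
        return [m for m in (1, 2) if m <= self.stones and self.stones - m not in tried]

def select(node, c_p=1.4):
    # select: walk down from the root with UCT until we can expand
    while node.stones > 0 and not node.untried_moves():
        node = max(node.children, key=lambda c: c.wins / c.visits
                   + c_p * math.sqrt(math.log(node.visits) / c.visits))
    return node

def expand(node):
    # expand: add one new child of the selected node to the tree
    moves = node.untried_moves()
    if not moves:
        return node                   # terminal position, nothing to add
    child = Node(node.stones - random.choice(moves), parent=node)
    node.children.append(child)
    return child

def playout(node):
    # playout: random self-play moves until the end of the game
    stones, to_move = node.stones, 0
    while stones > 0:
        stones -= random.choice([m for m in (1, 2) if m <= stones])
        to_move = 1 - to_move
    # reward 1 if the player who moved into `node` ends up winning
    return 1.0 if to_move == 0 else 0.0

def backpropagate(node, reward):
    # backpropagation: push the reward back up, flipping sides at each level
    while node is not None:
        node.visits += 1
        node.wins += reward
        reward = 1.0 - reward
        node = node.parent

root = Node(stones=7)
for _ in range(2000):
    leaf = expand(select(root))
    backpropagate(leaf, playout(leaf))
best = max(root.children, key=lambda c: c.visits)
print("take", root.stones - best.stones, "stone(s)")
```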
what does UCT do?
it computes the selection value of a child a, to decide which child to follow during the selection step
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
second term is for exploration
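the formula as code; the value C_p = 1.4 (about sqrt 2) is a common default, an assumption here:

```python
import math

def uct(wins_a, visits_a, visits_parent, c_p=1.4):
    exploitation = wins_a / visits_a                               # average reward
    exploration = c_p * math.sqrt(math.log(visits_parent) / visits_a)
    return exploitation + exploration
```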
what is P-UCT?
predictor-UCT: it uses input from the policy head of the deep network
P-UCT(a) = wins_a / visits_a + C_p * pi(a|s) * sqrt(visits_parent) / (1+ visits_a)
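the P-UCT formula as code; the +1 gives unvisited children a finite exploration term, and Q = 0 for unvisited children is an assumed convention here:

```python
import math

def p_uct(wins_a, visits_a, visits_parent, prior_a, c_p=1.4):
    q = wins_a / visits_a if visits_a > 0 else 0.0   # assumed: Q = 0 when unvisited
    # prior_a is pi(a|s) from the policy head; it scales the exploration term
    return q + c_p * prior_a * math.sqrt(visits_parent) / (1 + visits_a)
```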
what is tabula rasa learning?
it is when you start to learn with zero knowledge: only self-play and a single neural network
what is curriculum learning?
the network is trained in many small steps, starting against a very weak opponent. As our level of play increases, so does the difficulty of the moves that our teacher proposes to us.
what are the 2 architectural elements of AlphaGo Zero?
A neural network and MCTS
what is minimax?
the root node chooses the child with the maximum value, the next level down chooses the child with the minimum value (the opponent's best reply), and so on, alternating down the tree
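a sketch of minimax over a toy game tree (nested lists as an illustrative tree representation; leaves are end-position values):

```python
def minimax(node, maximizing=True):
    """Minimax over a toy tree: inner nodes are lists of children."""
    if not isinstance(node, list):          # leaf: the value of an end position
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# root is MAX, its children are MIN nodes, and so on down the tree
tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(tree))  # MIN picks 3, 2, 0 in each subtree; MAX picks 3
```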
what is the estimated state space of Go?
10^170
(chess is 10^47)
what are the 2 architectural elements of conventional chess programs?
alpha-beta and a heuristic evaluation function
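a sketch of alpha-beta on the same toy list-based tree as the minimax sketch above: the minimax value, but subtrees that cannot change the result are cut off (in a real chess program a heuristic evaluation function would supply the leaf values):

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if not isinstance(node, list):          # leaf value
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:               # cutoff: MIN will never allow this line
                break
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True))
        beta = min(beta, best)
        if alpha >= beta:                   # cutoff: MAX has a better option elsewhere
            break
    return best

print(alphabeta([[3, 5], [2, 9], [0, 7]]))  # 3, with pruned branches
```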
what is the biggest problem that was overcome by AlphaGo Zero?
instability
how was stability achieved in AlphaGo Zero?
- coverage of the state space: playing a large number of games, combined with MCTS look-ahead
- correlation between training examples is reduced by the experience replay buffer (see the sketch below)
- convergence of training is improved by using on-policy MCTS and taking small training steps (a small learning rate)
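an illustrative sketch of why the buffer helps: consecutive positions from one game are strongly correlated, so training samples are drawn uniformly from a buffer spanning many games (all names here are made up):

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)           # keeps only the most recent examples
for game_id in range(100):               # pretend examples: (game, move) pairs
    buffer.extend((game_id, move_no) for move_no in range(50))
batch = random.sample(list(buffer), 32)  # a decorrelated training mini-batch
print(batch[:3])
```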
why is AlphaGo Zero faster than AlphaGo?
it uses curriculum learning