C6 Flashcards
what is zero sum?
competitive games in which the win of player A is the loss of player B (the players' payoffs sum to zero)
what is decision complexity?
the number of end positions that define the value of the initial game position; the larger the number of actions in a position, the larger the decision complexity
what is state space complexity?
the number of legal positions reachable from the initial position of a game
what is AlphaGo?
it is a program that beat the human Go champion; it combines supervised learning from grandmaster games with reinforcement learning from self-play games
name the 3 categories of programs that play Go
minimax-style programs, MCTS-based programs and the AlphaGo programs (MCTS combined with deep self-play)
what is AlphaGo Zero?
it performs tabula rasa learning of Go, based solely on self-play. It plays stronger than AlphaGo
how does a self-learning system work?
- the searcher uses the evaluation network to estimate reward values and policy actions, and the search results are used to play games against the opponent in self-play
- the game results are collected in a buffer, which is used to train the evaluation network (self-learning)
- by playing a tournament against a copy of ourselves, a virtuous cycle of ever-increasing improvement is created (see the sketch below)
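a minimal sketch of this loop, with hypothetical placeholder names (`play_game`, `train_step`), not AlphaGo Zero's actual code:

```python
import random

def play_game(net):
    """Hypothetical self-play game: a real version would run MCTS guided by
    `net` at every move, with a copy of ourselves as the opponent."""
    return [(("state", i), [0.5, 0.5], random.choice([1, -1])) for i in range(10)]

def train_step(net, batch):
    """Hypothetical network update: one small gradient step on the batch."""
    pass

def self_play_training(net, iterations=3, games_per_iter=5, batch_size=16):
    buffer = []                                    # buffer of game results
    for _ in range(iterations):
        for _ in range(games_per_iter):
            buffer.extend(play_game(net))          # searcher plays; results buffered
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        train_step(net, batch)                     # self-learning on buffered examples
        # each pass through this loop is one turn of the virtuous cycle

self_play_training(net=None)
```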
what are the 3 levels of self-play?
playing against a copy of yourself at:
1. move-level: in MCTS playouts our opponent is a copy of ourselves
2. example-level: the training input for the approximator of the policy and reward functions is generated by our own games
3. tournament-level: create a training curriculum that starts tabula rasa and ends at world champion level
2 advantages of MCTS over minimax and alpha-beta
- it is based on averaging single lines of play instead of traversing subtrees recursively
- it does not need a heuristic evaluation function: a playout reaches the end of the game, where we know whether we have a win or a loss
what are the 4 operations of MCTS?
- select: traverse the tree from root to a leaf using UCT
- expand: add a new child of the selected leaf to the tree
- playout: play random moves until the end of the game (self-play)
- backpropagation: propagate the reward from the playout back up the tree (see the sketch below)
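a sketch of the four operations on a toy subtraction game (take 1 or 2 stones; whoever takes the last stone wins); the game and all names are illustrative, not any program's actual code:

```python
import math, random

class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones          # game state: stones left in the pile
        self.parent = parent
        self.children = []
        self.wins = 0.0               # reward total, seen from the player who moved here
        self.visits = 0

    def untried_moves(self):
        tried = {child.stones for child in self.children}
        return [m for m in (1, 2) if m <= self.stones and self.stones - m not in tried]

def select(node, c_p=1.4):
    # select: walk down from the root with UCT until we can expand
    while node.stones > 0 and not node.untried_moves():
        node = max(node.children, key=lambda c: c.wins / c.visits
                   + c_p * math.sqrt(math.log(node.visits) / c.visits))
    return node

def expand(node):
    # expand: add one new child of the selected node to the tree
    moves = node.untried_moves()
    if not moves:
        return node                   # terminal position, nothing to add
    child = Node(node.stones - random.choice(moves), parent=node)
    node.children.append(child)
    return child

def playout(node):
    # playout: random self-play moves until the end of the game
    stones, to_move = node.stones, 0
    while stones > 0:
        stones -= random.choice([m for m in (1, 2) if m <= stones])
        to_move = 1 - to_move
    # reward 1 if the player who moved into `node` ends up winning
    return 1.0 if to_move == 0 else 0.0

def backpropagate(node, reward):
    # backpropagation: push the reward back up, flipping sides at each level
    while node is not None:
        node.visits += 1
        node.wins += reward
        reward = 1.0 - reward
        node = node.parent

root = Node(stones=7)
for _ in range(2000):
    leaf = expand(select(root))
    backpropagate(leaf, playout(leaf))
best = max(root.children, key=lambda c: c.visits)
print("take", root.stones - best.stones, "stone(s)")
```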
what does UCT do?
it computes the selection value of a child a, to decide which child to follow during the selection step
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
second term is for exploration
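the formula as code; the value C_p = 1.4 (about sqrt 2) is a common default, an assumption here:

```python
import math

def uct(wins_a, visits_a, visits_parent, c_p=1.4):
    exploitation = wins_a / visits_a                               # average reward
    exploration = c_p * math.sqrt(math.log(visits_parent) / visits_a)
    return exploitation + exploration
```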
what is P-UCT?
predictor-UCT: it uses input from the policy head of the deep network
P-UCT(a) = wins_a / visits_a + C_p * pi(a|s) * sqrt(visits_parent) / (1+ visits_a)
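the P-UCT formula as code; the +1 gives unvisited children a finite exploration term, and Q = 0 for unvisited children is an assumed convention here:

```python
import math

def p_uct(wins_a, visits_a, visits_parent, prior_a, c_p=1.4):
    q = wins_a / visits_a if visits_a > 0 else 0.0   # assumed: Q = 0 when unvisited
    # prior_a is pi(a|s) from the policy head; it scales the exploration term
    return q + c_p * prior_a * math.sqrt(visits_parent) / (1 + visits_a)
```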
what is tabula rasa learning?
it is when you start to learn with zero knowledge: only self-play and a single neural network
what is curriculum learning?
the network is trained in many small steps, starting against a very weak opponent. As our level of play increases, so does the difficulty of the moves that our teacher proposes to us.
what are the 2 architectural elements of AlphaGo Zero?
A neural network and MCTS
what is minimax?
the root node chooses the child with the maximum value, the next level down chooses the child with the minimum value (the opponent's best reply), and so on, alternating down the tree
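a sketch of minimax over a toy game tree (nested lists as an illustrative tree representation; leaves are end-position values):

```python
def minimax(node, maximizing=True):
    """Minimax over a toy tree: inner nodes are lists of children."""
    if not isinstance(node, list):          # leaf: the value of an end position
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# root is MAX, its children are MIN nodes, and so on down the tree
tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(tree))  # MIN picks 3, 2, 0 in each subtree; MAX picks 3
```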
what is the estimated state space of Go?
10^170
(chess is 10^47)
what are the 2 architectural elements of conventional chess programs?
alpha-beta and a heuristic evaluation function
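a sketch of alpha-beta on the same toy list-based tree as the minimax sketch above: the minimax value, but subtrees that cannot change the result are cut off (in a real chess program a heuristic evaluation function would supply the leaf values):

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if not isinstance(node, list):          # leaf value
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:               # cutoff: MIN will never allow this line
                break
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True))
        beta = min(beta, best)
        if alpha >= beta:                   # cutoff: MAX has a better option elsewhere
            break
    return best

print(alphabeta([[3, 5], [2, 9], [0, 7]]))  # 3, with pruned branches
```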
what is the biggest problem that was overcome by AlphaGo Zero?
instability
how was stability achieved in AlphaGo Zero?
- coverage of the state space: playing a large number of games, combined with MCTS look-ahead
- correlation between training examples is reduced by the experience replay buffer (see the sketch below)
- convergence of training is improved by using on-policy MCTS and taking small training steps (a small learning rate)
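an illustrative sketch of why the buffer helps: consecutive positions from one game are strongly correlated, so training samples are drawn uniformly from a buffer spanning many games (all names here are made up):

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)           # keeps only the most recent examples
for game_id in range(100):               # pretend examples: (game, move) pairs
    buffer.extend((game_id, move_no) for move_no in range(50))
batch = random.sample(list(buffer), 32)  # a decorrelated training mini-batch
print(batch[:3])
```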
why is AlphaGo Zero faster than AlphaGo?
it uses curriculum learning