C6 Flashcards
what is zero sum?
a property of competitive games: the win of player A is the loss of player B, so the players' payoffs sum to zero
what is decision complexity?
the number of end (leaf) positions that must be evaluated to establish the value of the initial game position. The more actions available in a position, the larger the decision complexity
what is state space complexity?
the number of legal positions reachable from the initial position of a game
what is AlphaGo?
it is the program that beat the human world champion at Go; it combines supervised learning from grandmaster games with reinforcement learning from self-play games
name the 3 categories of programs that play Go
minimax-style programs, MCTS-based programs and the AlphaGo programs (MCTS combined with deep self-play)
what is AlphaGo Zero?
it performs tabula rasa learning of Go, based solely on self-play, and it plays stronger than the original AlphaGo
how does a self-learning system work?
- the searcher uses the evaluation network to estimate reward values and policy actions; the search results are then used to play games against ourselves (self-play)
- the game results are collected in a buffer, which is used to train the evaluation network (self-learning)
- by playing a tournament against a copy of ourselves, a virtuous cycle of ever-increasing playing strength is created
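The loop above (search, collect results in a buffer, retrain the evaluation) can be sketched on a toy game. Here a simple win-rate table over first moves stands in for the evaluation network, and the game (3-stone Nim, take 1 or 2 stones, last stone wins) is an assumption for illustration, not the chapter's setup:

```python
import random

def play_out_nim(first_move, stones=3):
    """Toy game: 3-stone Nim, take 1 or 2 stones, taking the last stone wins.
    Our first move is fixed; both sides then move randomly. Returns our reward."""
    stones -= first_move
    if stones == 0:
        return 1.0                      # we took the last stone and won
    our_turn = False                    # the opponent moves next
    while True:
        stones -= random.randint(1, min(2, stones))
        if stones == 0:
            return 1.0 if our_turn else 0.0
        our_turn = not our_turn

def self_play_training(iterations=30, games_per_iter=20):
    """Structural sketch of the self-play loop; the win-rate table 'value'
    plays the role of the evaluation network (an assumption, not the real net)."""
    value = {1: 0.5, 2: 0.5}            # 'evaluation network': value per first move
    buffer = []                         # replay buffer of (move, outcome) examples
    for _ in range(iterations):
        for _ in range(games_per_iter):
            # 'searcher': prefer the move the current evaluation rates highest,
            # with a little noise for exploration
            move = max(value, key=lambda m: value[m] + random.uniform(0.0, 0.2))
            buffer.append((move, play_out_nim(move)))
        # 'self-learning': refit the evaluation from the buffer
        for m in value:
            results = [r for mv, r in buffer if mv == m]
            if results:
                value[m] = sum(results) / len(results)
    return value
```

After a few iterations the table rates taking 1 stone (which can still win against a random opponent) above taking 2 (which always loses here), illustrating the virtuous cycle: better evaluations produce better games, which produce better training data.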
what are the 3 levels of self-play?
playing against a copy of yourself at:
1. move-level: in MCTS playouts our opponent is a copy of ourselves
2. example-level: the training examples for the approximators of the policy and reward functions are generated by our own games
3. tournament-level: create a training curriculum that starts tabula rasa and ends at world champion level
2 advantages of MCTS over minimax and alpha-beta
- it is based on averaging single lines of play instead of traversing subtrees recursively
- it does not need a heuristic evaluation function: at the end of a playout we know whether we have a win or a loss
what are the 4 operations of MCTS?
- select: traverse the tree from root to a leaf using UCT
- expand: add one new child node to the tree (one per iteration)
- playout: play random moves until the end of the game (self-play)
- backpropagation: propagate the reward back upwards in the tree
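The four operations can be sketched end to end on a toy game; the game (Nim: take 1 or 2 stones, taking the last stone wins) and the constant C_p = 1.4 are assumptions for illustration, not the chapter's implementation:

```python
import math
import random

def actions(stones):
    return [1, 2] if stones >= 2 else [1]

class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones        # stones left after the move into this node
        self.parent = parent
        self.children = {}          # action -> child Node
        self.wins = 0.0             # wins, seen by the player who moved here
        self.visits = 0

def uct(child, parent, c_p=1.4):
    if child.visits == 0:
        return math.inf             # always try unvisited children first
    return (child.wins / child.visits
            + c_p * math.sqrt(math.log(parent.visits) / child.visits))

def playout(stones):
    """Random self-play to the end; returns 1.0 if the player who just
    moved into this position wins, 0.0 otherwise."""
    mover_wins = True               # no stones left means the last mover won
    turn = 0                        # 0 = the player to move now (the opponent)
    while stones > 0:
        stones -= random.randint(1, min(2, stones))
        if stones == 0:
            mover_wins = (turn != 0)
        turn ^= 1
    return 1.0 if mover_wins else 0.0

def mcts(root_stones, iterations=3000):
    root = Node(root_stones)
    for _ in range(iterations):
        node = root
        # 1. select: descend with UCT while fully expanded and non-terminal
        while node.stones > 0 and len(node.children) == len(actions(node.stones)):
            node = max(node.children.values(), key=lambda c: uct(c, node))
        # 2. expand: add one untried child to the tree
        if node.stones > 0:
            a = random.choice([x for x in actions(node.stones)
                               if x not in node.children])
            node.children[a] = Node(node.stones - a, parent=node)
            node = node.children[a]
        # 3. playout: random moves until the end of the game
        reward = playout(node.stones)
        # 4. backpropagation: flip the perspective at every level going up
        while node is not None:
            node.visits += 1
            node.wins += reward
            reward = 1.0 - reward
            node = node.parent
    # the best move is the most visited child of the root
    return max(root.children, key=lambda a: root.children[a].visits)
```

With 4 stones the search settles on taking 1 (leaving the opponent the losing count of 3), showing how averaged playouts plus UCT selection find the winning line without any heuristic evaluation.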
what does UCT do?
it computes a selection value for each child a, to decide which child to traverse next, balancing exploitation against exploration
UCT(a) = wins_a / visits_a + C_p * sqrt(ln visits_parent / visits_a)
the first term exploits (win rate of a), the second term explores (bonus for rarely visited children)
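The formula translates directly to code; the value C_p = 1.4 (close to sqrt(2)) is a common choice but an assumption here:

```python
import math

def uct(wins_a, visits_a, visits_parent, c_p=1.4):
    """UCT selection value of child a: win rate plus exploration bonus."""
    if visits_a == 0:
        return math.inf  # unvisited children are selected first
    return wins_a / visits_a + c_p * math.sqrt(math.log(visits_parent) / visits_a)
```

For wins_a = 5, visits_a = 10, visits_parent = 20 this gives about 0.5 + 0.77 = 1.27; as visits_a grows relative to the parent, the exploration term shrinks and the win rate dominates.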
what is P-UCT?
predictor-UCT: it uses input from the policy head of the deep network
P-UCT(a) = wins_a / visits_a + C_p * pi(a|s) * sqrt(visits_parent) / (1+ visits_a)
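A direct transcription of the P-UCT formula; the constant c_p = 1.5 and the convention of scoring unvisited children as pure prior are assumptions for illustration:

```python
import math

def p_uct(wins_a, visits_a, visits_parent, prior_a, c_p=1.5):
    """P-UCT: the exploration bonus is weighted by the policy head's
    prior pi(a|s), so promising moves are explored earlier."""
    q = wins_a / visits_a if visits_a > 0 else 0.0  # win rate (0 if unvisited)
    u = c_p * prior_a * math.sqrt(visits_parent) / (1 + visits_a)
    return q + u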
what is tabula rasa learning?
it is when you start to learn with zero knowledge: only self-play and a single neural network
what is curriculum learning?
the network is trained in many small steps, starting against a very weak opponent. As our level of play increases, so does the difficulty of the moves that our teacher proposes to us.
what are the 2 architectural elements of AlphaGo Zero?
a single neural network (with a policy head and a value head) combined with MCTS