Chapter 9 Flashcards
deep Q-learning
The DeepMind group combined Q-learning with deep neural networks.
Deep Q-learning is essentially the same as Q-learning, except that a convolutional neural network takes the place of the Q-table.
the Deep Q-Network (DQN)
The input:
the state of the system at a given time
The outputs:
the estimated values for each possible action, given the input state.
The network itself is a ConvNet
> Instead of the values in a Q-table, it is the weights of this network that are learned.
the state of a DQN
the current “frame”: the pixels of the current screen
plus three prior frames
This definition of state provides the system with a small amount of memory
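A minimal sketch of what such a network could look like, assuming PyTorch (the flashcards don't name a framework): the input is the stack of four frames, the output is one estimated value per action. The 84x84 frame size and layer sizes follow the published DQN Atari setup; `num_actions` is a placeholder.

```python
# Minimal sketch of a DQN-style ConvNet (not DeepMind's code), assuming PyTorch
# and an Atari-like state of 4 stacked 84x84 grayscale frames.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # input: 4 frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),        # one estimated value per possible action
        )

    def forward(self, state):                   # state: (batch, 4, 84, 84)
        return self.head(self.conv(state))      # output: (batch, num_actions)

q_net = DQN(num_actions=6)
values = q_net(torch.zeros(1, 4, 84, 84))       # value estimates for each action
```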
updating weights with supervised learning
back-propagation works by changing a neural network’s weights so as to reduce the error in the network’s outputs.
learning to match its outputs to human-given labels
updating weights with reinforcement learning
temporal difference learning
the network learns to make its outputs consistent from one iteration to the next, assuming that later iterations give better estimates of value than earlier iterations
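As an illustration (with made-up numbers), the "error" being reduced is the gap between the network's earlier value estimate and the later, presumably better estimate formed from the observed reward plus the discounted value of the next state; `gamma` is an assumed discount factor.

```python
# Illustration of the temporal-difference idea (made-up numbers): the later
# estimate (reward plus discounted best next-state value) is treated as the
# target the earlier estimate should have matched. gamma is an assumed discount.
def td_error(q_values, action, reward, next_q_values, gamma=0.99):
    old_estimate = q_values[action]                      # value predicted before acting
    new_estimate = reward + gamma * max(next_q_values)   # later, better-informed estimate
    return new_estimate - old_estimate                   # the gap the network learns to shrink

# The value of action 1 (3.0) should move toward 0 + 0.99 * 5.0 = 4.95.
print(td_error(q_values=[2.0, 3.0], action=1, reward=0.0, next_q_values=[5.0, 1.0]))  # 1.95
```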
deep Q-learning process
- The system gives its current state as input to the Deep Q-Network.
- The Deep Q-Network outputs a value for each possible action.
- The system chooses and performs an action, resulting in a new state.
- Now the learning step takes place: the system inputs its new state to the network, which outputs a new set of values for each action.
- The difference between the new, presumably better estimate and the previous estimate of the chosen action's value is treated as the "error" of the network; back-propagation uses this error to change the weights of the network (see the sketch after this list).
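Putting the steps above together, here is a loose sketch of one learning step, assuming the DQN class sketched earlier, a PyTorch optimizer, and a hypothetical `env.step(action)` that returns the new state and a reward; the epsilon-greedy exploration is an added assumption the flashcards don't spell out.

```python
# Loose sketch of one deep Q-learning step (not DeepMind's code), assuming the
# DQN class above, a PyTorch optimizer, and a hypothetical env.step(action)
# returning (next_state, reward). Epsilon-greedy exploration is an assumption.
import random
import torch

def deep_q_learning_step(q_net, optimizer, env, state, gamma=0.99, epsilon=0.1):
    # 1-2. Give the current state to the network; get a value for each action.
    q_values = q_net(state)
    # 3. Choose an action (mostly greedy, occasionally random) and perform it.
    if random.random() < epsilon:
        action = random.randrange(q_values.shape[1])
    else:
        action = int(q_values.argmax(dim=1))
    next_state, reward = env.step(action)
    # 4. Give the new state to the network; get a new set of values.
    with torch.no_grad():
        next_q_values = q_net(next_state)
    # 5. Treat the temporal-difference gap as the network's "error" and let
    #    back-propagation adjust the weights to shrink it.
    target = reward + gamma * next_q_values.max()
    loss = (target - q_values[0, action]) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return next_state
```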
minimax
Once each of the bottom-row board positions has been evaluated, minimax uses these values from the end of the look-ahead process to rate the program's immediate possible moves from its current board position.
The program then chooses the highest-rated move.
> after an evaluation function assigns values to the board positions at the deepest level of the look-ahead tree, minimax propagates those evaluations back up the tree so that the highest-rated move can be chosen
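A bare-bones sketch of the look-ahead-plus-minimax idea, with hypothetical helpers `possible_moves`, `apply_move`, and `evaluate` (the evaluation function) standing in for game-specific details:

```python
# Bare-bones minimax sketch with hypothetical helpers possible_moves(board, maximizing),
# apply_move(board, move), and evaluate(board) standing in for game-specific details.
def minimax(board, depth, maximizing):
    if depth == 0:
        return evaluate(board)          # evaluation function scores the frontier positions
    values = (minimax(apply_move(board, m), depth - 1, not maximizing)
              for m in possible_moves(board, maximizing))
    return max(values) if maximizing else min(values)   # program's turn vs. opponent's turn

def choose_move(board, depth):
    # Rate each immediate move by the value minimax propagates up from the
    # look-ahead, then pick the highest-rated one.
    return max(possible_moves(board, True),
               key=lambda m: minimax(apply_move(board, m), depth - 1, False))
```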
Deep Blue used much the same method as Samuel’s checkers player:
at a given turn, it created a partial game tree using the current board position as the root
it applied its evaluation function to the furthest layer in the tree
then used the minimax algorithm to propagate the values up the tree in order to determine which move it should make.
The major differences between Samuel's program and Deep Blue:
- Deep Blue’s deeper look- ahead in its game tree
- its more complex (chess-specific) evaluation function,
- hand-programmed chess knowledge
- specialized parallel hardware to make it run very fast.
- unlike Samuel’s checkers- playing program, Deep Blue did not use machine learning in any central way.
Go programs are not able to use this strategy, for two reasons:
- the size of a look-ahead tree in Go is dramatically larger than that in chess.
- no one has succeeded in creating a good evaluation function for Go board positions.
> AI researchers haven’t figured out how to encode intuition into an evaluation function.
AlphaGo
Unlike Deep Blue, AlphaGo acquired its abilities by reinforcement learning via self-play.
used an intricate mix of deep Q-learning, "Monte Carlo tree search," supervised learning, and specialized Go knowledge.
Monte Carlo Tree Search (MCTS)
The roulette-wheel-inspired randomness is used in deciding how to do the look-ahead.
the program imagines (that is, simulates) several possible ways the game could play out, counts how many wins and losses those hypothetical sequences lead to, and uses those counts to give a score to each of its possible moves.
the program chooses moves probabilistically based on any scores that those moves might have from previous rounds of Monte Carlo tree search.
At first, the program’s choice of moves from a given board position is quite random (the program is doing the equivalent of spinning a roulette wheel to choose a move), but as the program performs additional roll-outs, generating additional statistics, it is increasingly biased to choose those moves that in past roll-outs led to the most wins.
In this way, Monte Carlo tree search doesn’t have to guess from just looking at a board position which move is most likely to lead to a win; it uses its roll-outs to collect statistics on how many times a given move actually leads to a win or loss.
MCTS rollout
Such a simulation, starting from a given board position, is called a roll-out from that position.
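A rough sketch of roll-outs and the win/loss counts they produce, with hypothetical helpers (`possible_moves`, `apply_move`, `game_over`, `program_won`) and purely random move choice, which later rounds would bias toward previously successful moves:

```python
# Rough sketch of roll-outs and win/loss statistics (not AlphaGo's code), with
# hypothetical helpers possible_moves, apply_move, game_over, and program_won.
import random
from collections import defaultdict

wins = defaultdict(int)    # wins[move]: roll-outs starting with this move that the program won
plays = defaultdict(int)   # plays[move]: roll-outs tried starting with this move

def rollout(board):
    # Simulate one complete game from this position, choosing moves at random.
    while not game_over(board):
        board = apply_move(board, random.choice(possible_moves(board)))
    return 1 if program_won(board) else 0

def mcts_choose(board, num_rollouts=1000):
    for _ in range(num_rollouts):
        move = random.choice(possible_moves(board))   # later rounds bias this choice
        plays[move] += 1
        wins[move] += rollout(apply_move(board, move))
    # Score each move by how often its roll-outs actually led to a win.
    return max(possible_moves(board), key=lambda m: wins[m] / max(plays[m], 1))
```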
The DeepMind group realized that they could improve their system by complementing Monte Carlo tree search with a deep convolutional neural network.
> Given the current board position as input, AlphaGo uses a trained deep convolutional neural network to assign a rough value to all possible moves from the current position. Monte Carlo tree search then uses those values to kick-start its search: rather than initially choosing moves at random, it prefers the moves the ConvNet's output rates highly.
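A loose sketch of that kick-start (not AlphaGo's actual selection rule): a hypothetical `policy_net(board)` supplies a rough prior score per move, which guides the choice until roll-out statistics accumulate.

```python
# Loose sketch of ConvNet-guided move selection (not AlphaGo's actual formula):
# a hypothetical policy_net(board) returns a rough prior score for each legal move,
# which dominates early on; accumulated roll-out statistics take over later.
def choose_move_to_explore(board, wins, plays, policy_net):
    priors = policy_net(board)                       # ConvNet's rough value per move
    def score(move):
        observed = wins[move] / plays[move] if plays[move] > 0 else 0.0
        weight = 1.0 / (1.0 + plays[move])           # trust the prior less as statistics grow
        return weight * priors[move] + (1.0 - weight) * observed
    return max(possible_moves(board), key=score)
```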
evaluation function
used to assign a score to a given board position; minimax then uses these scores to rate the possible moves from the current position
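A toy example (checkers-flavored, with a made-up board encoding): the simplest evaluation function just counts piece advantage for the position being scored.

```python
# Toy evaluation function (made-up board encoding): +1 for a program piece,
# -1 for an opponent piece, 0 for an empty square. Real evaluation functions,
# like Deep Blue's, combine many hand-crafted, game-specific features.
def evaluate(board):
    return sum(board)   # positive favors the program, negative favors the opponent

print(evaluate([+1, +1, -1, 0, +1]))   # 2: the program is two pieces ahead
```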