Week 6 Flashcards
What class of algorithm is Q learning part of?
Temporal Difference (TD) Learning or TD(0) Learning
What is the goal of temporal difference learning?
Predict the expected (possibly discounted) future reward:
- Given a particular policy for choosing subsequent actions.
What is a policy in temporal difference learning?
The choice of action in a given state:
a_t = π(s_t)
The policy is conventionally denoted π in reinforcement learning; it is called a strategy in game theory.
Does Q-Learning have to get to the terminal node to get information?
Yes: in games the reward typically arrives only at terminal states, so value information has to propagate back from the end of the game over repeated episodes.
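A minimal sketch of the tabular Q-learning update (a TD(0)-style bootstrap), assuming a simple environment with reset(), step(action) and actions(state) methods; the interface and the hyperparameter values are illustrative, not from the notes:

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of tabular Q-learning (a TD(0)-style method).

    Q maps state -> action -> estimated value; env is a hypothetical
    environment with reset(), step(action) and actions(state) methods.
    """
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy policy: explore with probability epsilon.
        if random.random() < epsilon:
            action = random.choice(env.actions(state))
        else:
            action = max(env.actions(state), key=lambda a: Q[state][a])

        next_state, reward, done = env.step(action)

        # TD update: bootstrap on the estimated value of the best next action.
        best_next = 0.0 if done else max(Q[next_state][a] for a in env.actions(next_state))
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state
    return Q

# Usage (with a hypothetical env):
# Q = defaultdict(lambda: defaultdict(float))
# for _ in range(10_000):
#     q_learning_episode(env, Q)
```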
What is Deep-Q Learning?
Deep-Q Learning uses a neural network to approximate the Q-function.
Often a convolutional neural network
Convolutional neural networks are very effective at processing large images
- Treats the input as an image rather than a collection of features
- e.g. the board
Used by DeepMind to create agents that can play video games, and later to learn to play Go.
Convolutional Neural Networks
- Learn local feature detectors
- Weights constrained to learn the same feature at different parts of the image
- Higher-level layers build features of features
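A minimal PyTorch sketch of a convolutional Q-network that treats the board as an image of feature planes; the layer sizes, the 8x8 board and the 3 input planes are illustrative assumptions, not from the notes:

```python
import torch
import torch.nn as nn

class BoardQNetwork(nn.Module):
    """Approximates the Q-function: input is a board image, output is one Q-value per action."""

    def __init__(self, in_planes=3, board_size=8, num_actions=64):
        super().__init__()
        self.net = nn.Sequential(
            # Local feature detectors: the same 3x3 filters are applied at
            # every board position (weights shared across the image).
            nn.Conv2d(in_planes, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            # Higher layer builds features of features.
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * board_size * board_size, num_actions),
        )

    def forward(self, board):
        # board: (batch, in_planes, board_size, board_size)
        return self.net(board)

# Example: Q-values for a single empty 8x8 board encoded as 3 planes.
q_values = BoardQNetwork()(torch.zeros(1, 3, 8, 8))  # shape (1, 64)
```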
What assumption in statistics, machine learning, data science, etc. makes learning in games less effective?
That the data are independent and stationary.
But in games, subsequent data are not independent.
And in video games, subsequent data are frames, which are highly correlated.
What is the difference between the TD-Learning and Monte Carlo learning styles?
In TD-Learning you learn, then you play
In Monte-Carlo learning you learn while you play. You learn between moves. (Like minimax search).
Monte Carlo Board Evaluation
A simple way to generate a heuristic evaluation function; the random games are often called “playouts”.
For position v:
- Do the following N times: play the game to the end, with all players choosing random moves, and record the payoff.
- Return the average payoff over the N playouts
Estimates the strength of the board position by the average payoff over the terminal positions reachable from it.
The downside is the time cost of performing N playouts for every board position evaluated. Although the opponent will not actually play randomly, the estimate can still be very useful.
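A sketch of Monte Carlo board evaluation by random playouts; the position interface (copy / legal_moves / play / is_terminal / payoff) is a hypothetical game API used only for illustration:

```python
import random

def monte_carlo_evaluate(position, n_playouts=100):
    """Estimate the value of a board position by N random playouts.

    Hypothetical interface: position.copy(), .is_terminal(), .legal_moves(),
    .play(move) and .payoff() for the player being evaluated.
    """
    total = 0.0
    for _ in range(n_playouts):
        game = position.copy()
        # Play the game to the end with all players choosing random moves.
        while not game.is_terminal():
            game.play(random.choice(game.legal_moves()))
        total += game.payoff()
    # The average payoff is the heuristic estimate of the position's strength.
    return total / n_playouts
```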
What is MCTS?
Monte Carlo Tree Search
Elements of MCTS
The game tree is built incrementally but asymmetrically.
For each iteration, a tree policy is used to find the most urgent node to expand, using a strategy which balances exploration and exploitation.
A simulation is run from the selected node using a default policy (often random choice of action)
Update the search tree according to the result. This updates the value of the selected node and all its ancestors.
Return the best child of the current root node
What are the four phases of MCTS?
Selection
- Starting at the root, a child selection policy is recursively applied to traverse the game tree, looking for the most urgent expandable node. A node is expandable if it is non-terminal and has unvisited children
Expansion
- Increase the size of the current partial tree (typically by one node) by expanding the selected node (e.g. visiting an unvisited child)
Play-out (simulation)
Run a simulation from the added node by random self-play.
Back-up
The statistics of the added node(s) and its ancestors are updated
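Putting the four phases together, a minimal sketch of one MCTS iteration, reusing the hypothetical game interface from the playout sketch above; payoff sign handling for alternating players is omitted for brevity:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []                        # expanded children
        self.untried = list(state.legal_moves())  # moves not yet expanded
        self.N = 0                                # visit count
        self.Q = 0.0                              # sum of payoffs through this node

def mcts_iteration(root, c=1.4):
    # 1. Selection: descend with UCT until an expandable node is reached.
    node = root
    while not node.untried and node.children:
        node = max(node.children,
                   key=lambda ch: ch.Q / ch.N
                   + c * math.sqrt(2 * math.log(node.N) / ch.N))
    # 2. Expansion: add one unvisited child (if the node is non-terminal).
    if node.untried:
        move = node.untried.pop()
        child_state = node.state.copy()
        child_state.play(move)
        child = Node(child_state, parent=node)
        node.children.append(child)
        node = child
    # 3. Play-out: random self-play from the added node to a terminal state.
    sim = node.state.copy()
    while not sim.is_terminal():
        sim.play(random.choice(sim.legal_moves()))
    payoff = sim.payoff()
    # 4. Back-up: update the added node and all of its ancestors.
    while node is not None:
        node.N += 1
        node.Q += payoff
        node = node.parent
```

After enough iterations, a move is chosen from the root's children using one of the criteria described in the cards below.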
What is UCT?
Upper Confidence Trees (UCT) is a preferred selection method
- An exploration strategy, like epsilon greedy
- Always chooses unselected (unvisited) children first
- Between two children of the same parent with the same average value, chooses child selected less often.
- In the long run, children with lower true values will be chosen less often. The best child will be chosen exponentially more often.
In UCT what does each node store?
Two quantities:
Q(v) – the sum of all payoffs received over all playouts that passed through node v
N(v) – the number of times node v has been visited
So Q(v) / N(v) is an estimate of the value of node v
How does the explore term change in UCB?
Gets bigger when child is not explored but parent is;
Explore term gets smaller when child is explored
C determines exploration-exploitation trade-off (tuned experimentally)
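A sketch of the resulting UCT score for a child node, assuming the usual UCB1-style explore term (the constant c = 1.4 and the factor of 2 under the square root are illustrative assumptions):

```python
import math

def uct_score(q_child, n_child, n_parent, c=1.4):
    """UCT value of a child node: exploit term plus explore term."""
    if n_child == 0:
        return float("inf")  # unselected children are always chosen first
    exploit = q_child / n_child  # average payoff Q(v) / N(v)
    # The explore term grows as the parent is visited (n_parent up) while the
    # child is not, and shrinks as the child itself is visited (n_child up).
    explore = c * math.sqrt(2 * math.log(n_parent) / n_child)
    return exploit + explore
```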
Possibilities when ending the search and making a move in MCTS
Max child (default): Choose the child with the highest average reward (highest Q/N)
Robust child: Choose the child most often visited (highest N)
Max-Robust child: Choose the child which is maximal in both reward and visits. If none exists, run longer until one does exist.
Secure child: Choose the child which maximises a lower confidence bound, e.g. the UCB value but with a minus sign instead of a plus sign
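A short sketch of the default (max child) and robust child choices, assuming child nodes that carry the Q and N statistics described above:

```python
def max_child(children):
    # Default: the child with the highest average reward Q/N.
    return max(children, key=lambda ch: ch.Q / ch.N)

def robust_child(children):
    # The child most often visited (highest N).
    return max(children, key=lambda ch: ch.N)
```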
Main differences between TD-Learning and MCTS
TD-Learning
- Needs to learn in self play before used as a game playing agent.
- Learns board value-function or Q table
Monte Carlo Tree Search
- Learns while playing the game
- Uses a form of exploration-exploitation strategy (UCT) to select nodes to evaluate.
- Uses random play-out to evaluate them
MCTS summary
Actions from the current board position are evaluated by:
- building a partial subtree from the current position
Selection: to select the part of the tree to expand
Expansion: to expand from the selected node
Simulation: to simulate the game from that point (e.g. random roll-outs) to reach terminal nodes and estimate the values from the current position
A method for choosing the best action
How can you vary MCTS with a heuristic?
Could use heuristics for selection
- Select for expansion the node with the highest value of the heuristic
Could use heuristics for roll-outs
- During roll-outs, use the heuristic to select the action for each player
- But the heuristic must be quick to calculate
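A sketch of a heuristic-guided roll-out step, assuming a cheap, hypothetical heuristic(state) function and the same hypothetical game interface as above:

```python
def heuristic_rollout_move(state, heuristic):
    """Choose the roll-out move whose resulting position scores best under the heuristic.

    Replaces the random move choice during play-outs; the heuristic must be
    quick, since it is evaluated for every candidate move at every step.
    """
    def score(move):
        nxt = state.copy()
        nxt.play(move)
        return heuristic(nxt)
    return max(state.legal_moves(), key=score)
```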
Do MCTS algorithms rely on repeated random sampling?
Yes
They use randomness to find answers which could, in principle, be computed deterministically.