Machine Learning in Games (part of exam 2/3) Flashcards

1
Q

What applications for Reinforcement Learning are there in Games?

A
  • Tic-Tac-Toe: MENACE (Michie 1963)
  • Backgammon: TD-Gammon (Tesauro 1995)
  • Chess: KnightCap (Baxter et al. 2000)
2
Q

Apart from in Games, in what other areas is Reinforcement Learning being applied?

A
  • Elevator Dispatching
  • Robot Control
  • Job Scheduling
3
Q

How did MENACE (Michie 1963) learn to play Tic-Tac-Toe?

A

287 matchboxes (one for each game position) and beads in nine different colors (one color per square) were used to represent the positions and the possible moves.
Each turn, the matchbox for the current position was selected and one of its beads was drawn at random. The bead indicated where the next symbol was to be placed.
If the game was lost, the beads that had been drawn during the game were removed from the setup (negative reinforcement).
If the game was a draw, the beads were simply put back into their matchboxes.
If the game was won, each drawn bead was put back together with a second bead of the same color (positive reinforcement).

This increases the likelihood that a successful move will be tried again.
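
A minimal sketch of this update rule in code, with the matchboxes represented as per-position bead counts (the initial bead count and the position/move encodings are illustrative assumptions, not Michie's exact setup):

```python
import random

# beads[position][move] = number of beads of the color for that move in that position's matchbox
# (illustrative encoding: a position is a board string, a move is a square index 0-8)
beads = {}

def choose_move(position, legal_moves):
    """Draw a bead at random from the matchbox for the current position."""
    box = beads.setdefault(position, {m: 3 for m in legal_moves})  # assumed initial bead count
    moves = list(box)
    return random.choices(moves, weights=[box[m] for m in moves])[0]

def reinforce(history, result):
    """history: list of (position, move) pairs chosen during one game."""
    for position, move in history:
        if result == "loss":
            beads[position][move] = max(0, beads[position][move] - 1)  # drawn bead is not returned
        elif result == "win":
            beads[position][move] += 1  # bead returned plus an extra bead of the same color
        # on a draw the drawn bead is simply returned, so the counts stay unchanged
```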

4
Q

What is the Credit Assignment Problem and how can it be solved?

A

In games that use reinforcement learning, the learner only finds out at the end of the game whether it has won or lost, and it does not know which move(s) were responsible for the win/loss (Delayed Reward).

This can be solved by rewarding/penalizing all moves of the game. Over many games, bad moves will rarely receive positive feedback.

5
Q

What were TD-Gammon's improvements over MENACE?

A
  • faster convergence because of Temporal Difference Learning
  • uses neural networks: allows generalization
  • use of positional characteristics as features
6
Q

What were KnightCap's improvements over TD-Gammon?

A
  • integration of Temporal Difference Learning with deep searches
  • self-play training was replaced with training by playing against various partners on the internet
7
Q

Why is Minimax or Alpha-Beta not applicable to games of chance?

A

The players can't directly maximize their gains because they don't know what the opponent's chance element will yield. They don't know which moves the opponent can legally make until after they have finished their own move.

8
Q

What search method is used for games of chance?

A

standard game trees extended with chance nodes
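
A minimal sketch of such a search (often called expectiminimax), assuming a hypothetical game interface with `node_type`, `outcomes()` (probability/child pairs at chance nodes), `legal_moves()`, `apply()`, `is_terminal()` and a terminal `value()`:

```python
def expectiminimax(state, depth):
    """Game-tree value with MAX, MIN and CHANCE nodes (hypothetical game interface)."""
    if depth == 0 or state.is_terminal():
        return state.value()
    if state.node_type == "chance":
        # chance node: probability-weighted average over the possible outcomes
        return sum(p * expectiminimax(child, depth - 1) for p, child in state.outcomes())
    values = [expectiminimax(state.apply(move), depth - 1) for move in state.legal_moves()]
    return max(values) if state.node_type == "max" else min(values)
```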

9
Q

In games of chance, what is the expected value and how is it calculated?

A

At each chance node, the expected value is calculated as the probability-weighted average over the possible outcomes of the chance element.

expected value(n) = sum over all outcomes of (probability P of the outcome * value of the outcome)

Example: roll of a die
* every outcome has a probability P of 1/6
* the number rolled is the outcome's value

So the expected value of a die roll is:
1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

10
Q

In games of chance, how can the expected value be calculated if the exact probability of the outcomes is not known?

A

The computation of the expected value is then replaced by the average outcome over a large number of random games.
The more random games are played, the better the approximation of the value.
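
A minimal sketch of this sampling idea, reusing the die example (whose true expected value, 3.5, is known) to show how the sample average improves with more random games:

```python
import random

def sample_average(num_rolls):
    """Estimate the expected value of a die roll as the average of random samples."""
    return sum(random.randint(1, 6) for _ in range(num_rolls)) / num_rolls

for n in (10, 1_000, 100_000):
    print(n, sample_average(n))  # approaches the true expected value 3.5 as n grows
```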

11
Q

What is Simulation Search?

A

When the complete tree is not searchable, it’s possible to limit the breadth of the search tree: Only a few samples are searched but those are searched to the full depth.

12
Q

What are some examples of Simulation Search in games?

A
  • roll-out analysis in Backgammon: play a large number of games from the same position, each with different dice rolls
  • Scrabble: draw from the bag of remaining tiles, put the tiles back, draw again; repeat until a big enough sample size has been reached
13
Q

In a Simulation Search, what is an Algorithm Sketch?

A

Estimate the expected value of each move by counting the number of wins in a series of completed games.

14
Q

What is the Monte-Carlo Search?

A

It's an extreme case of Simulation Search in which a large number of games is played completely randomly: both players choose their moves at random, and the average score of these games is calculated.
After this sampling phase, the machine plays the move with the highest average score.
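
A minimal sketch of this procedure, assuming a hypothetical game interface with `legal_moves()`, `apply()`, `is_terminal()` and a `score()` from the searching player's point of view:

```python
import random

def random_playout(state):
    """Both players play random moves until the game ends; return the final score."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.score()  # e.g. +1 win, 0 draw, -1 loss for the searching player

def monte_carlo_move(state, playouts_per_move=100):
    """Play the move whose random playouts achieve the highest average score."""
    def average_score(move):
        child = state.apply(move)
        return sum(random_playout(child) for _ in range(playouts_per_move)) / playouts_per_move
    return max(state.legal_moves(), key=average_score)
```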

15
Q

True or False: the Monte-Carlo Search can’t be used in deterministic games, only in games of chance.

A

False. It has been used with some success in Go.

16
Q

What is the Multi-Armed Bandit Problem?

A

A one-armed bandit is a gambling machine where an action (pulling a lever) yields a reward/loss with a fixed long-term expected value.
A multi-armed bandit problem consists of several such machines, which adds the problem of choosing which machine to operate in order to gain the most profit (we don't know which machine is the most profitable, or whether there even is a most profitable machine).

17
Q

In regards to Machine Learning Problems, what is regret?

A

The difference between the actual winnings and the winnings you would have gotten had you always made the unknown optimal move.
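
A minimal worked sketch, assuming the (normally unknown) expected reward of the optimal move is given so the regret can actually be computed:

```python
def regret(actual_rewards, optimal_expected_reward):
    """Regret after n plays = n * expected reward of the optimal move - rewards actually won."""
    return len(actual_rewards) * optimal_expected_reward - sum(actual_rewards)

# Example: the best arm pays 0.5 per play on average, but our five plays earned only 2.0 in total
print(regret([0.0, 1.0, 0.0, 0.0, 1.0], optimal_expected_reward=0.5))  # 5 * 0.5 - 2.0 = 0.5
```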

18
Q

What problem does the Upper Confidence Bound Algorithm address?

A

Exploitation vs Exploration

Exploitation: only play the move that is known to pay off.
Exploration: try other moves to see if they are more promising.

19
Q

Explain the Upper Confidence Bound Algorithm.

A

It's an algorithm that combines the two strategies: do I play a move that has won before (exploitation), or do I try another move to see if it is even more promising (exploration)?

UCB = average winnings of the plays made with a specific move + an exploration bonus that shrinks the more often that move has been played (in UCB1: c * sqrt(ln N / n), with N total plays and n plays of that move)

Calculate the UCB for all possible moves and choose the one with the highest value.
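
A minimal sketch of this selection rule using the standard UCB1 formula (the constant c = sqrt(2) and the per-move win/count bookkeeping are the usual textbook formulation, shown here as an assumption about the exact variant meant):

```python
import math

def ucb1_choice(stats, c=math.sqrt(2)):
    """stats: dict mapping move -> (total_winnings, times_played). Returns the move with the highest UCB."""
    total_plays = sum(n for _, n in stats.values())
    def ucb(move):
        winnings, n = stats[move]
        if n == 0:
            return float("inf")  # untried moves are explored first
        return winnings / n + c * math.sqrt(math.log(total_plays) / n)  # exploitation + exploration
    return max(stats, key=ucb)

# Example: "a" has paid off well so far, "b" has barely been tried, so the bonus favors exploring "b"
print(ucb1_choice({"a": (6.0, 10), "b": (1.0, 2)}))  # prints "b"
```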

20
Q

What is the Monte-Carlo Tree Search?

A

It's a combination of game-tree search and Monte-Carlo Search: a search tree is built incrementally (the tree grows with every simulated game), and leaf nodes are evaluated via roll-outs (played many times, with the results averaged).
Good moves are evaluated more often than bad ones, so the tree grows deeper in regions of good moves.

21
Q

What two policies/strategies does the Monte-Carlo Tree Search use?

A

It uses two policies:
* Tree Policy: select the moves inside the tree (e.g. via UCB) that lead to the most promising leaf node.
* Rollout Policy (or Default Policy): play out games from that node many times and average the results.

22
Q

What does UCT Search stand for?

A

Upper Confidence Tree Search

23
Q

What is UCT Search?

A

It's a form of MCTS that combines a UCB-based tree policy with random roll-outs.

24
Q

What are the steps of a UCT Search?

A
  1. Selection
  2. Expansion
  3. Simulation
  4. Backpropagation
  5. Move Choice
25
Q

In regards to UCT Search, explain the step of Selection.

A

Determine which node to descend to next: starting from the root, repeatedly choose the child with the highest UCB value until a leaf of the current tree is reached.

The UCB algorithm combines Exploitation + Exploration:
UCB = average winnings of the plays made through a node + an exploration bonus that shrinks the more often that node has been visited

26
Q

In regards to UCT Search, explain the step of Expansion.

A

Add one randomly selected child of the selected node to the game tree.

27
Q

In regards to UCT Search, explain the step of Simulation.

A

Perform one random play-out (one iteration of a Monte-Carlo Search) from the newly expanded node until a terminal position, i.e. the end of the game.

28
Q

In regards to UCT Search, explain the step of Backpropagation.

A

Update the value of every node on the path from the expanded node back to the root. The value of a node is the average result of all games that pass through it.

29
Q

In regards to UCT Search, explain the step of Move Choice.

A

Play the move whose subtree has grown the deepest / that has been visited most often. It isn't necessarily the one with the highest value.
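
Putting the five steps together, a minimal UCT sketch, assuming a hypothetical game interface (`legal_moves()`, `apply()`, `is_terminal()`, `result()`) and ignoring the sign handling needed for alternating players:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = list(state.legal_moves())
        self.visits, self.total = 0, 0.0

    def ucb(self, c=math.sqrt(2)):
        # average result so far + exploration bonus (UCB1)
        return self.total / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def uct_search(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCB while the current node is fully expanded
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: add one randomly selected untried move as a new child
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.apply(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random play-out from the new node to the end of the game
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_moves()))
        result = state.result()  # e.g. +1 / 0 / -1 from the root player's point of view
        # 4. Backpropagation: update the running averages on the path back to the root
        while node is not None:
            node.visits += 1
            node.total += result
            node = node.parent
    # 5. Move choice: play the most visited child, not necessarily the highest-valued one
    return max(root.children, key=lambda n: n.visits).move
```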

30
Q

What is AlphaGo? What techniques did it use?

A

A Go-playing machine using
* deep learning
* reinforcement learning through self-play

It used a learned roll-out policy instead of purely random sampling, and a depth limit in MCTS below which a learned evaluation function was used instead of real game outcomes.

31
Q

What are the two types of learned networks?

A
  • Policy Networks: with what probability should a certain move be played in a certain position?
  • Value Networks: what is the probability of winning from a given position?
32
Q

What were the four learning steps of AlphaGo?

A
  1. learn a fast but inaccurate roll-out policy from expert games (for guidance in an MCTS algorithm)
  2. learn an accurate expert policy from expert games (to compute prior probabilities in new nodes)
  3. refine the expert policy into a more accurate selection policy via reinforcement learning from self-play
  4. use self-play to train a utility function (to evaluate a given position)
33
Q

What is AlphaGo Zero?

A

An improved version of AlphaGo.
It didn’t learn from expert games but only from self-play.

34
Q

What is Opponent Modeling?

A

It’s the ability to use prior knowledge and observations in order to predict the behavior of an opponent.

35
Q

What is an example for Opponent Modeling being part of a game?

A

Poker: bluffing, calling a bluff

36
Q

What is the difference between optimal play and maximal play, using Rock-Paper-Scissors as an example?

A

A maximal player will deviate from the optimal strategy to exploit the observed weak points in the opponent’s play.

Rock-Paper-Scissors:
The optimal strategy would be to always choose randomly.
But if I notice that my opponent favors one move (e.g. Paper), the maximal strategy would be to favor Scissors in order to win.
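
A minimal sketch of such a maximal player, which simply counts the opponent's observed moves and plays the counter to the most frequent one (a deliberately naive opponent model, not a specific algorithm from the lecture):

```python
from collections import Counter

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # value beats key

def maximal_move(opponent_history):
    """Counter the opponent's most frequent move; with no observations yet, fall back to a fixed guess."""
    if not opponent_history:
        return "rock"
    most_common = Counter(opponent_history).most_common(1)[0][0]
    return BEATS[most_common]

print(maximal_move(["paper", "paper", "rock", "paper"]))  # opponent favors paper -> play scissors
```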