Machine Learning in Games (part of exam 2/3) Flashcards

1
Q

What applications for Reinforcement Learning are there in Games?

A
  • Tic-Tac-Toe: MENACE (Michie 1963)
  • Backgammon: TD-Gammon (Tesauro 1995)
  • Chess: KnightCap (Baxter et al. 2000)
2
Q

Apart from in Games, in what other areas is Reinforcement Learning being applied?

A
  • Elevator Dispatching
  • Robot Control
  • Job Scheduling
3
Q

How did MENACE (Michie 1963) learn to play Tic-Tac-Toe?

A

287 matchboxes (one for each game position) and beads in nine different colors (one color per square) were used to represent the positions and the possible moves.
Each turn, the matchbox for the current position was selected and one of its beads was drawn at random. The bead indicated where the next symbol was to be placed.
If the game was lost, the beads that had been drawn during the game were removed from the setup (negative reinforcement).
If the game was a draw, the beads were simply put back into their matchboxes.
If the game was won, each drawn bead was put back together with a second bead of the same color (positive reinforcement).

This increases the likelihood that a successful move will be tried again.
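
A minimal sketch of this update rule in code, with the matchboxes represented as per-position bead counts (the initial bead count and the position/move encodings are illustrative assumptions, not Michie's exact setup):

```python
import random

# beads[position][move] = number of beads of the color for that move in that position's matchbox
# (illustrative encoding: a position is a board string, a move is a square index 0-8)
beads = {}

def choose_move(position, legal_moves):
    """Draw a bead at random from the matchbox for the current position."""
    box = beads.setdefault(position, {m: 3 for m in legal_moves})  # assumed initial bead count
    moves = list(box)
    return random.choices(moves, weights=[box[m] for m in moves])[0]

def reinforce(history, result):
    """history: list of (position, move) pairs chosen during one game."""
    for position, move in history:
        if result == "loss":
            beads[position][move] = max(0, beads[position][move] - 1)  # drawn bead is not returned
        elif result == "win":
            beads[position][move] += 1  # bead returned plus an extra bead of the same color
        # on a draw the drawn bead is simply returned, so the counts stay unchanged
```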

4
Q

What is the Credit Assignment Problem and how can it be solved?

A

In games that use reinforcement learning, the learner only finds out at the end of the game whether it has won or lost, and it does not know which move(s) were responsible for the win/loss (Delayed Reward).

This can be solved by rewarding/penalizing all moves of the game. Over many games, bad moves will rarely receive positive feedback.

5
Q

What were TD-Gammon's improvements over MENACE?

A
  • faster convergence because of Temporal Difference Learning
  • uses neural networks: allows generalization
  • use of positional characteristics as features
6
Q

What were KnightCap's improvements over TD-Gammon?

A
  • integration of Temporal Difference Learning with deep searches
  • self-play training was replaced with training by playing against various partners on the internet
7
Q

Why is Minimax or Alpha-Beta not applicable to games of chance?

A

The players can't directly maximize their gains because they don't know what the opponent's chance element will yield. They don't know which moves the opponent can legally make until after they have finished their own move.

8
Q

What search method is used for games of chance?

A

standard game trees extended with chance nodes
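
A minimal sketch of such a search (often called expectiminimax), assuming a hypothetical game interface with `node_type`, `outcomes()` (probability/child pairs at chance nodes), `legal_moves()`, `apply()`, `is_terminal()` and a terminal `value()`:

```python
def expectiminimax(state, depth):
    """Game-tree value with MAX, MIN and CHANCE nodes (hypothetical game interface)."""
    if depth == 0 or state.is_terminal():
        return state.value()
    if state.node_type == "chance":
        # chance node: probability-weighted average over the possible outcomes
        return sum(p * expectiminimax(child, depth - 1) for p, child in state.outcomes())
    values = [expectiminimax(state.apply(move), depth - 1) for move in state.legal_moves()]
    return max(values) if state.node_type == "max" else min(values)
```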

9
Q

In games of chance, what is the expected value and how is it calculated?

A

At each chance node, the expected value is calculated as the probability-weighted average over the possible outcomes of the chance element.

expected value(n) = sum over all outcomes of (probability P of the outcome * value of the outcome)

Example: roll of a die
* every outcome has a probability P of 1/6
* the number rolled is the outcome's value

So the expected value of a die roll is:
1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

10
Q

In games of chance, how can the expected value be calculated if the exact probability of the outcomes is not known?

A

The computation of the expected value is then replaced by the average outcome over a large number of random games.
The more random games are played, the better the approximation of the value.
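
A minimal sketch of this sampling idea, reusing the die example (whose true expected value, 3.5, is known) to show how the sample average improves with more random games:

```python
import random

def sample_average(num_rolls):
    """Estimate the expected value of a die roll as the average of random samples."""
    return sum(random.randint(1, 6) for _ in range(num_rolls)) / num_rolls

for n in (10, 1_000, 100_000):
    print(n, sample_average(n))  # approaches the true expected value 3.5 as n grows
```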

11
Q

What is Simulation Search?

A

When the complete tree is not searchable, it’s possible to limit the breadth of the search tree: Only a few samples are searched but those are searched to the full depth.

12
Q

What are some examples of Simulation Search in games?

A
  • roll-out analysis in Backgammon: play a large number of games from the same position, each with different dice rolls
  • Scrabble: draw from the bag of remaining tiles, put the tiles back, draw again; repeat until a big enough sample size has been reached
13
Q

In a Simulation Search, what is an Algorithm Sketch?

A

Estimate the expected value of each move by counting the number of wins in a series of completed games.

14
Q

What is the Monte-Carlo Search?

A

It's an extreme case of Simulation Search in which a large number of games is played completely randomly: both players choose their moves at random, and the average score of these games is calculated.
After this sampling phase, the machine plays the move with the highest average score.
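
A minimal sketch of this procedure, assuming a hypothetical game interface with `legal_moves()`, `apply()`, `is_terminal()` and a `score()` from the searching player's point of view:

```python
import random

def random_playout(state):
    """Both players play random moves until the game ends; return the final score."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.score()  # e.g. +1 win, 0 draw, -1 loss for the searching player

def monte_carlo_move(state, playouts_per_move=100):
    """Play the move whose random playouts achieve the highest average score."""
    def average_score(move):
        child = state.apply(move)
        return sum(random_playout(child) for _ in range(playouts_per_move)) / playouts_per_move
    return max(state.legal_moves(), key=average_score)
```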

15
Q

True or False: the Monte-Carlo Search can’t be used in deterministic games, only in games of chance.

A

False. It has been used with some success in Go.

16
Q

What is the Multi-Armed Bandit Problem?

A

A one-armed bandit is a gambling machine where an action (pulling a lever) yields a reward/loss with a fixed long-term expected value.
A multi-armed bandit problem consists of several such machines, which adds the problem of choosing which machine to operate in order to gain the most profit (we don't know which machine is the most profitable, or whether there even is a most profitable machine).

17
Q

In regards to Machine Learning Problems, what is regret?

A

The difference between the actual winnings and the winnings you would have gotten had you always made the unknown optimal move.
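
A minimal worked sketch, assuming the (normally unknown) expected reward of the optimal move is given so the regret can actually be computed:

```python
def regret(actual_rewards, optimal_expected_reward):
    """Regret after n plays = n * expected reward of the optimal move - rewards actually won."""
    return len(actual_rewards) * optimal_expected_reward - sum(actual_rewards)

# Example: the best arm pays 0.5 per play on average, but our five plays earned only 2.0 in total
print(regret([0.0, 1.0, 0.0, 0.0, 1.0], optimal_expected_reward=0.5))  # 5 * 0.5 - 2.0 = 0.5
```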

18
Q

What problem does the Upper Confidence Bound Algorithm address?

A

Exploitation vs Exploration

Exploitation: only play the move that is known to pay off.
Exploration: try other moves to see if they are more promising.

19
Q

Explain the Upper Confidence Bound Algorithm.

A

It's an algorithm that combines the two strategies: do I play a move that has won before (exploitation), or do I try another move to see if it is even more promising (exploration)?

UCB = average winnings of the plays made with a specific move + an exploration bonus that shrinks the more often that move has been played (in UCB1: c * sqrt(ln N / n), with N total plays and n plays of that move)

Calculate the UCB for all possible moves and choose the one with the highest value.
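
A minimal sketch of this selection rule using the standard UCB1 formula (the constant c = sqrt(2) and the per-move win/count bookkeeping are the usual textbook formulation, shown here as an assumption about the exact variant meant):

```python
import math

def ucb1_choice(stats, c=math.sqrt(2)):
    """stats: dict mapping move -> (total_winnings, times_played). Returns the move with the highest UCB."""
    total_plays = sum(n for _, n in stats.values())
    def ucb(move):
        winnings, n = stats[move]
        if n == 0:
            return float("inf")  # untried moves are explored first
        return winnings / n + c * math.sqrt(math.log(total_plays) / n)  # exploitation + exploration
    return max(stats, key=ucb)

# Example: "a" has paid off well so far, "b" has barely been tried, so the bonus favors exploring "b"
print(ucb1_choice({"a": (6.0, 10), "b": (1.0, 2)}))  # prints "b"
```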

20
Q

What is the Monte-Carlo Tree Search?

A

It's a combination of game-tree search and Monte-Carlo Search: a search tree is built incrementally (the tree grows with every simulated game), and leaf nodes are evaluated via roll-outs (played many times, with the results averaged).
Good moves are evaluated more often than bad ones, so the tree grows deeper in regions of good moves.

21
Q

What two policies/strategies does the Monte-Carlo Tree Search use?

A

It uses two policies:
* Tree Policy: select the moves inside the tree (e.g. via UCB) that lead to the most promising leaf node.
* Rollout Policy (or Default Policy): play out games from that node many times and average the results.

22
Q

What does UCT Search stand for?

A

Upper Confidence Tree Search

23
Q

What is UCT Search?

A

It's a form of MCTS that combines a UCB-based tree policy with random roll-outs.

24
Q

What are the steps of a UCT Search?

A
  1. Selection
  2. Expansion
  3. Simulation
  4. Backpropagation
  5. Move Choice
25
Q

In regards to UCT Search, explain the step of Selection.

A

Determine which node to descend to next: starting from the root, repeatedly choose the child with the highest UCB value until a leaf of the current tree is reached.

The UCB algorithm combines Exploitation + Exploration:
UCB = average winnings of the plays made through a node + an exploration bonus that shrinks the more often that node has been visited

26
Q

In regards to UCT Search, explain the step of Expansion.

A

Add one randomly selected child of the selected node to the game tree.

27
Q

In regards to UCT Search, explain the step of Simulation.

A

Perform one random play-out (one iteration of a Monte-Carlo Search) from the newly expanded node until a terminal position, i.e. the end of the game.

28
Q

In regards to UCT Search, explain the step of Backpropagation.

A

Update the value of every node on the path from the expanded node back to the root. The value of a node is the average result of all games that pass through it.

29
Q

In regards to UCT Search, explain the step of Move Choice.

A

Play the move whose subtree has grown the deepest / that has been visited most often. It isn't necessarily the one with the highest value.
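
Putting the five steps together, a minimal UCT sketch, assuming a hypothetical game interface (`legal_moves()`, `apply()`, `is_terminal()`, `result()`) and ignoring the sign handling needed for alternating players:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = list(state.legal_moves())
        self.visits, self.total = 0, 0.0

    def ucb(self, c=math.sqrt(2)):
        # average result so far + exploration bonus (UCB1)
        return self.total / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def uct_search(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCB while the current node is fully expanded
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: add one randomly selected untried move as a new child
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.apply(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random play-out from the new node to the end of the game
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_moves()))
        result = state.result()  # e.g. +1 / 0 / -1 from the root player's point of view
        # 4. Backpropagation: update the running averages on the path back to the root
        while node is not None:
            node.visits += 1
            node.total += result
            node = node.parent
    # 5. Move choice: play the most visited child, not necessarily the highest-valued one
    return max(root.children, key=lambda n: n.visits).move
```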

30
Q

What is AlphaGo? What techniques did it use?

A

A Go-playing machine using
* deep learning
* reinforcement learning through self-play

It used a learned roll-out policy instead of purely random sampling, and a depth limit in MCTS below which a learned evaluation function was used instead of real game outcomes.

31
Q

What are the two types of learned networks?

A
  • Policy Networks: with what probability should a certain move be played in a certain position?
  • Value Networks: what is the probability of winning from a given position?
32
Q

What were the four learning steps of AlphaGo?

A
  1. learn a fast but inaccurate roll-out policy from expert games (for guidance in an MCTS algorithm)
  2. learn an accurate expert policy from expert games (to compute prior probabilities in new nodes)
  3. refine the expert policy into a more accurate selection policy via reinforcement learning from self-play
  4. use self-play to train a utility function (to evaluate a given position)
33
Q

What is AlphaGo Zero?

A

An improved version of AlphaGo.
It didn’t learn from expert games but only from self-play.

34
Q

What is Opponent Modeling?

A

It’s the ability to use prior knowledge and observations in order to predict the behavior of an opponent.

35
Q

What is an example for Opponent Modeling being part of a game?

A

Poker: bluffing, calling a bluff

36
Q

What is the difference between optimal play and maximal play, using Rock-Paper-Scissors as an example?

A

A maximal player will deviate from the optimal strategy to exploit the observed weak points in the opponent’s play.

Rock-Paper-Scissors:
The optimal strategy would be to always choose randomly.
But if I notice that my opponent favors one move (e.g. Paper), the maximal strategy would be to favor Scissors in order to win.
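
A minimal sketch of such a maximal player, which simply counts the opponent's observed moves and plays the counter to the most frequent one (a deliberately naive opponent model, not a specific algorithm from the lecture):

```python
from collections import Counter

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # value beats key

def maximal_move(opponent_history):
    """Counter the opponent's most frequent move; with no observations yet, fall back to a fixed guess."""
    if not opponent_history:
        return "rock"
    most_common = Counter(opponent_history).most_common(1)[0][0]
    return BEATS[most_common]

print(maximal_move(["paper", "paper", "rock", "paper"]))  # opponent favors paper -> play scissors
```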