ml-games-flashcards
What is the key idea of reinforcement learning?
Reinforcement learning involves learning through interaction with an environment:
1. Agent takes actions in environment
2. Receives rewards/penalties based on outcomes
3. Learns optimal policy to maximize long-term rewards
4. Balances immediate vs future rewards
5. Updates behavior based on experience
6. No explicit training data needed
7. Learning through trial and error
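A minimal sketch of how these ideas look in practice, using the tabular Q-learning update (one standard reinforcement-learning algorithm; the state/action names are illustrative assumptions, and `gamma` weights future against immediate rewards):

```python
from collections import defaultdict

# Q-values start at 0 and are shaped purely by interaction, not by labelled training data.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward the observed reward plus the discounted best future value."""
    best_future = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_future - Q[(state, action)])
```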
How does MENACE work?
MENACE (Matchbox Educable Noughts And Crosses Engine):
1. Uses physical beads to represent move probabilities
2. Adjusts bead quantities based on game outcomes
3. Early example of reinforcement learning.
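A rough sketch of the bead mechanism in code (the starting bead count and reinforcement amounts are illustrative assumptions, not MENACE's exact numbers):

```python
import random

beads = {}  # board position (e.g. a string key) -> {move: bead count}

def choose_move(position, legal_moves):
    counts = beads.setdefault(position, {m: 4 for m in legal_moves})  # assumed starting beads
    moves, weights = zip(*counts.items())
    return random.choices(moves, weights=weights)[0]  # probability proportional to bead count

def reinforce(history, won):
    # history: list of (position, move) pairs played in one game.
    for position, move in history:
        delta = 3 if won else -1                      # add beads after a win, remove one after a loss
        beads[position][move] = max(1, beads[position][move] + delta)
```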
How does TD-Gammon work?
TD-Gammon:
1. Uses temporal difference learning
2. Neural network evaluates positions
3. Learns by playing against itself
4. Updates predictions based on subsequent positions
5. Achieved expert-level backgammon play
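A simplified TD(0)-style update to illustrate temporal difference learning; TD-Gammon itself uses TD(λ) with a neural network rather than a lookup table, so this is only a sketch:

```python
from collections import defaultdict

V = defaultdict(float)  # position -> predicted outcome (a table stands in for TD-Gammon's network)

def td0_update(state, next_state, reward=0.0, alpha=0.1):
    """Shift the prediction for `state` toward what the following position predicts (plus any reward)."""
    V[state] += alpha * (reward + V[next_state] - V[state])
```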
What is the exploration vs. exploitation trade-off?
The balance between trying new strategies (exploration) and using known effective strategies (exploitation).
Key aspects:
1. Exploration finds potentially better strategies
2. Exploitation maximizes immediate rewards
3. Too much exploration wastes resources
4. Too much exploitation may miss optimal solutions
5. Balance needed for optimal learning
6. Various algorithms (ε-greedy, UCB) manage this trade-off
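For example, ε-greedy handles the trade-off with a single parameter. A minimal sketch, assuming `q_values` maps each action to its current value estimate:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try a random action
    return max(q_values, key=q_values.get)     # exploit: use the best-known action
```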
Why does deep search not work with chance games?
Deep search struggles with chance games because:
- Branching factor increases dramatically with chance elements
- Must consider all possible random outcomes
- Cannot prune branches as effectively, because a chance node's value depends on all of its children
- Computational complexity becomes overwhelming
- Expected values must be calculated at chance nodes
- Traditional alpha-beta pruning less effective
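A sketch of why chance nodes blow up the search: every random outcome must be expanded and its value averaged. `minimax_value` and `apply_outcome` are game-specific helpers passed in as placeholders:

```python
def chance_node_value(position, outcomes, depth, minimax_value, apply_outcome):
    """outcomes: list of (random_event, probability) pairs, e.g. all possible dice rolls."""
    return sum(prob * minimax_value(apply_outcome(position, event), depth - 1)
               for event, prob in outcomes)
```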
How can you approximate an expected value at a position?
Methods to approximate expected values:
- Monte Carlo sampling of possible outcomes
- Statistical averaging of sample results
- Value function approximation using neural networks
- Heuristic evaluation of position features
- Combining multiple evaluation methods
- Learning from self-play outcomes
- Using historical data for initial estimates
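The simplest of these, Monte Carlo sampling, fits in a few lines. A sketch, where `simulate_random_game` stands in for a fast random playout that returns a final score:

```python
def estimate_value(position, simulate_random_game, n_samples=1000):
    total = sum(simulate_random_game(position) for _ in range(n_samples))
    return total / n_samples   # more samples -> lower variance of the estimate
```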
What is the Multi-Armed Bandit problem?
The Multi-Armed Bandit problem involves:
- Multiple choices (arms) with unknown reward distributions
- Need to maximize total reward over time
- Classic exploration vs exploitation dilemma
- Each pull gives information and reward
- Must balance learning arm properties with maximizing returns
- Various solution algorithms (UCB, Thompson Sampling)
- Applications in game AI for move selection
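A minimal UCB1 sketch, one standard bandit strategy (the exploration constant `c` is a tunable assumption):

```python
import math

def ucb1_select(counts, means, total_pulls, c=2.0):
    """counts[i]: times arm i was pulled; means[i]: its average reward so far."""
    for i, n in enumerate(counts):
        if n == 0:
            return i                                   # pull every arm at least once
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(c * math.log(total_pulls) / counts[i]))
```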
What is Monte-Carlo Search?
Monte-Carlo Search:
- Uses random sampling to evaluate positions
- Plays many random games (playouts) from current position
- Aggregates results to estimate position value
- More samples give better estimates
- Can handle large search spaces
- No need for a hand-crafted position evaluation function
- Especially effective in games with high branching factor
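A flat Monte-Carlo search sketch under these assumptions: `legal_moves`, `play`, and `random_playout` are game-specific helpers, and `random_playout` returns 1 for a win and 0 for a loss:

```python
def monte_carlo_move(position, legal_moves, play, random_playout, playouts=100):
    def average_result(move):
        next_pos = play(position, move)
        return sum(random_playout(next_pos) for _ in range(playouts)) / playouts
    return max(legal_moves(position), key=average_result)   # move with the best sampled win rate
```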
How can Monte-Carlo Techniques be integrated with Game Tree Search?
Integration methods:
- Monte-Carlo Tree Search (MCTS) combines both approaches
- UCT algorithm balances exploration/exploitation in tree
- Use Monte-Carlo sampling at leaf nodes
- Progressive widening for high branching factors
- Rapid Action Value Estimation (RAVE) for move urgency
- Virtual loss for parallel search
- Combination with neural networks for evaluation
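The UCT selection step at the heart of MCTS can be sketched as follows, assuming tree nodes with `wins` and `visits` counters (unvisited children are expanded before this formula applies):

```python
import math

def uct_select(children, parent_visits, c=1.4):
    """Pick the child that balances exploitation (win rate) and exploration (rarely visited)."""
    return max(children,
               key=lambda ch: ch.wins / ch.visits
                              + c * math.sqrt(math.log(parent_visits) / ch.visits))
```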
How does AlphaGo work?
AlphaGo components:
- Policy networks learn from human games
- Value networks evaluate positions
- Monte-Carlo Tree Search guides search
- Combines supervised and reinforcement learning
- Uses rollouts for position evaluation
- Multiple neural networks working together
- Trained initially on human games then through self-play
What is the key (name giving) difference between AlphaGo and AlphaZero?
Key difference:
AlphaZero learns completely from scratch (zero human knowledge):
- No human game data used
- Pure self-play learning
- Single neural network for both policy and value
- More general learning approach
- Can learn multiple games
- Simpler but more powerful architecture
- Demonstrates possibility of pure reinforcement learning for complex games
What is the difference between optimal and maximal play?
Optimal vs Maximal play distinctions:
- Optimal play guarantees best possible outcome
- Maximal play maximizes winning chances against imperfect opponent
- Optimal play may be too conservative
- Maximal play takes advantage of opponent weaknesses
- Optimal play assumes perfect opponent
- Maximal play may take calculated risks
- Different strategies needed for each approach
Explain the perceptron activation function
The activation function in question is the threshold (“step”) function of a perceptron, one of the simplest types of artificial neurons:
a = -1 if Σ Wj·aj ≤ 0, and a = 1 if Σ Wj·aj > 0
Here is what it means:
- First, let’s understand what a perceptron does:
- It takes some inputs (the aj values)
- Each input has a weight (the Wj values)
- It multiplies each input by its weight and adds them all up (that’s what the Σ symbol means)
- The equation shows two cases:
- If the sum (Σ Wj·aj) is less than or equal to 0, then a = -1
- If the sum is greater than 0, then a = 1
In everyday language, it’s like a simple decision maker:
- It looks at all inputs, considering how important each one is (weights)
- If the weighted sum is positive, it outputs 1 (you can think of this as “yes” or “activate”)
- If the weighted sum is zero or negative, it outputs -1 (you can think of this as “no” or “don’t activate”)
A real-world analogy might help:
Imagine you’re deciding whether to go outside (output 1) or stay inside (-1) based on:
- Temperature (input 1)
- Rain (input 2)
- Wind (input 3)
Each factor has a different importance (weight) to you. You consider all these factors together to make a yes/no decision.
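The same threshold function in code, as a direct sketch of the description above:

```python
def perceptron_output(inputs, weights):
    total = sum(w * a for w, a in zip(weights, inputs))   # weighted sum: sum_j W_j * a_j
    return 1 if total > 0 else -1                         # step function: 1 if positive, else -1
```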
Explain the perceptron learning rule
This equation shows how a perceptron’s weights (Wj) are updated during supervised learning. Breaking it down:
- The full rule is Wj ← Wj + α·(f(x) − h(x))·xj; the arrow means the old weight is replaced by the new value
- In words: New Weight = Old Weight + (Learning Rate × Error × Input)
- Learning rate (α): Controls how big each learning step should be (like how fast it learns)
- Error (f(x) - h(x)): The difference between:
- f(x): The correct/desired output
- h(x): The actual output the perceptron gave
- Input (xj): The input value for this particular weight
Think of it like learning from mistakes:
- If the error is big, the weight change will be bigger
- If the error is small, the weight change will be smaller
- The learning rate controls how drastically we adjust our weights each time
- We multiply by the input because we want to adjust weights more for inputs that contributed more to the error
A real-world analogy:
Imagine you’re learning to cook soup. Each ingredient is an input, and how much of each ingredient you use is like a weight:
- If the soup is too salty (error), you’ll reduce the weight (amount) of salt more than other ingredients
- How much you adjust each time (learning rate) depends on how cautious you want to be
- You learn by comparing the taste you got (h(x)) with the taste you wanted (f(x))
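The rule as code, a one-step sketch where `target` is f(x) and `predicted` is h(x):

```python
def update_weights(weights, x, target, predicted, alpha=0.1):
    error = target - predicted                                  # f(x) - h(x)
    return [w + alpha * error * xj for w, xj in zip(weights, x)]
```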
Explain this error function for the single perceptron
The error function (also called a loss or cost function) for a single perceptron is E(x) = ½·(f(x) − g(ΣWj·xj))².
E(x) measures how much the network’s output differs from what we wanted, i.e. the size of the network’s mistake. Breaking it down from the inside out:
- Inside the parentheses:
- f(x) is the correct/desired output we wanted
- g(ΣWj·xj) is what the network actually output
- ΣWj·xj is the weighted sum of all inputs
- g() is the activation function that converts this sum into the final output
- We square the difference (that’s what the ^2 means) to:
- Make all errors positive (since we don’t care if we guessed too high or too low)
- Penalize bigger errors more heavily
- The ½ at the front simplifies the math when we take derivatives: it cancels the 2 that comes from differentiating the square (a detail you can safely skip for a basic understanding)
A simple real-world example:
Imagine you’re teaching a network to predict house prices:
- If the actual price (f(x)) is $200,000
- And your network predicts (g(ΣWj·xj)) $180,000
- Then the error would be: ½(200,000 - 180,000)² = ½(20,000)² = 200,000,000
The bigger the mistake, the bigger the error value gets. This helps the network understand how badly it’s performing and adjust accordingly.
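In code, the error for one example looks like this (a sketch; `g` is the activation function, passed in explicitly):

```python
def squared_error(weights, x, target, g):
    prediction = g(sum(w * xj for w, xj in zip(weights, x)))   # g(sum_j W_j * x_j)
    return 0.5 * (target - prediction) ** 2                     # E(x) = 1/2 (f(x) - g(...))^2

# House-price example from above, using the identity activation:
# squared_error([1.0], [180_000], 200_000, lambda s: s) == 0.5 * 20_000**2 == 200_000_000
```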