ml-games-flashcards

1
Q

What is the key idea of reinforcement learning?

A

Reinforcement learning involves learning through interaction with an environment:
1. Agent takes actions in environment
2. Receives rewards/penalties based on outcomes
3. Learns optimal policy to maximize long-term rewards
4. Balances immediate vs future rewards
5. Updates behavior based on experience
6. No labeled training data needed (feedback comes only as rewards)
7. Learning through trial and error
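
A minimal sketch of this loop as tabular Q-learning in Python; the env object with reset(), step(), and actions() methods is a hypothetical interface, not a specific library:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Learn action values through trial and error, with no labeled data."""
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # explore with probability eps, otherwise exploit current estimates
            if random.random() < eps:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # move the estimate towards the reward plus discounted future value
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```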

2
Q

How does MENACE work?

A

MENACE (Matchbox Educable Noughts And Crosses Engine):
1. Uses physical beads to represent move probabilities
2. Adjusts bead quantities based on game outcomes
3. Early example of reinforcement learning.
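
A rough sketch of the bead mechanism in Python; the initial bead count and the win/draw/loss adjustments are illustrative assumptions rather than Michie's exact values:

```python
import random

matchboxes = {}   # one "matchbox" per board position: {move: bead_count}

def choose_move(position, legal_moves):
    """Draw a bead at random: moves with more beads are proportionally more likely."""
    box = matchboxes.setdefault(position, {m: 3 for m in legal_moves})
    moves, counts = zip(*box.items())
    return random.choices(moves, weights=counts)[0]

def reinforce(moves_played, outcome):
    """After the game, adjust bead counts for every (position, move) that was used."""
    delta = {"win": +3, "draw": +1, "loss": -1}[outcome]
    for position, move in moves_played:
        box = matchboxes[position]
        box[move] = max(1, box[move] + delta)   # keep at least one bead per move
```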

3
Q

How does TD-Gammon work?

A

TD-Gammon:
1. Uses temporal difference learning
2. Neural network evaluates positions
3. Learns by playing against itself
4. Updates predictions based on subsequent positions
5. Achieved expert-level backgammon play
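
A minimal sketch of the underlying TD(0) update in Python, with a plain lookup table standing in for TD-Gammon's neural network evaluator (a simplifying assumption):

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=1.0):
    """TD(0): nudge V(state) towards reward + discounted V(next_state),
    i.e. update the prediction based on the position that followed."""
    v_s = values.get(state, 0.0)
    v_next = values.get(next_state, 0.0)
    values[state] = v_s + alpha * (reward + gamma * v_next - v_s)
```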

4
Q

What is the exploration vs. exploitation trade-off?

A

The balance between trying new strategies (exploration) and using known effective strategies (exploitation).

Key aspects:
1. Exploration finds potentially better strategies
2. Exploitation maximizes immediate rewards
3. Too much exploration wastes resources
4. Too much exploitation may miss optimal solutions
5. Balance needed for optimal learning
6. Various algorithms (ε-greedy, UCB) manage this trade-off
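
As an illustration of point 6, ε-greedy fits in a few lines of Python; the dictionary of current value estimates is an assumed input:

```python
import random

def epsilon_greedy(estimated_values, epsilon=0.1):
    """Explore (random action) with probability epsilon,
    otherwise exploit the action with the best current estimate."""
    actions = list(estimated_values)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=estimated_values.get)
```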

5
Q

Why does deep search not work with chance games?

A

Deep search struggles with chance games because:

  1. Branching factor increases dramatically with chance elements
  2. Must consider all possible random outcomes
  3. Cannot prune branches effectively due to probability
  4. Computational complexity becomes overwhelming
  5. Expected values must be calculated at chance nodes
  6. Traditional alpha-beta pruning less effective
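
The standard remedy, expectiminimax, makes the cost visible: chance nodes require an expectation over every random outcome. A sketch in Python, where the node interface (kind, children(), chance_children(), evaluate()) is a hypothetical API:

```python
def expectiminimax(node, depth):
    """Minimax extended with chance nodes that average over outcomes."""
    if depth == 0 or node.is_terminal():
        return node.evaluate()
    if node.kind == "chance":
        # every random outcome is expanded and weighted by its probability,
        # which is what inflates the effective branching factor
        return sum(p * expectiminimax(child, depth - 1)
                   for p, child in node.chance_children())
    if node.kind == "max":
        return max(expectiminimax(c, depth - 1) for c in node.children())
    return min(expectiminimax(c, depth - 1) for c in node.children())
```
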
6
Q

How can you approximate an expected value at a position?

A

Methods to approximate expected values:

  1. Monte Carlo sampling of possible outcomes
  2. Statistical averaging of sample results
  3. Value function approximation using neural networks
  4. Heuristic evaluation of position features
  5. Combining multiple evaluation methods
  6. Learning from self-play outcomes
  7. Using historical data for initial estimates
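
A minimal sketch of points 1 and 2 in Python; sample_outcome is an assumed callback that resolves the randomness once (e.g. rolls the dice, then plays out or scores the result):

```python
def estimate_expected_value(position, sample_outcome, n_samples=1000):
    """Monte Carlo estimate: average the values of many sampled outcomes."""
    return sum(sample_outcome(position) for _ in range(n_samples)) / n_samples
```
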
7
Q

What is the Multi-Armed Bandit problem?

A

The Multi-Armed Bandit problem involves:

  1. Multiple choices (arms) with unknown reward distributions
  2. Need to maximize total reward over time
  3. Classic exploration vs exploitation dilemma
  4. Each pull gives information and reward
  5. Must balance learning arm properties with maximizing returns
  6. Various solution algorithms (UCB, Thompson Sampling)
  7. Applications in game AI for move selection
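
A sketch of UCB1 in Python; the dictionary-based bookkeeping (pull counts and reward sums per arm) is an assumed representation:

```python
import math

def ucb1_select(counts, reward_sums, total_pulls, c=math.sqrt(2)):
    """Pick the arm maximizing 'average reward + exploration bonus'."""
    def score(arm):
        if counts[arm] == 0:
            return float("inf")              # try every arm at least once
        bonus = c * math.sqrt(math.log(total_pulls) / counts[arm])
        return reward_sums[arm] / counts[arm] + bonus
    return max(counts, key=score)
```
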
8
Q

What is Monte-Carlo Search?

A

Monte-Carlo Search:

  1. Uses random sampling to evaluate positions
  2. Plays many random games (playouts) from current position
  3. Aggregates results to estimate position value
  4. More samples give better estimates
  5. Can handle large search spaces
  6. No need for position evaluation function
  7. Especially effective in games with high branching factor
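
A sketch of flat Monte-Carlo evaluation in Python; the game-state interface (is_terminal(), legal_moves(), play(), result()) is a hypothetical API:

```python
import random

def monte_carlo_value(state, n_playouts=200):
    """Estimate a position by playing random games to the end and averaging results."""
    total = 0.0
    for _ in range(n_playouts):
        s = state
        while not s.is_terminal():
            s = s.play(random.choice(s.legal_moves()))
        total += s.result()          # e.g. 1 = win, 0.5 = draw, 0 = loss
    return total / n_playouts
```
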
9
Q

How can Monte-Carlo Techniques be integrated with Game Tree Search?

A

Integration methods:

  1. Monte-Carlo Tree Search (MCTS) combines both approaches
  2. UCT algorithm balances exploration/exploitation in tree
  3. Use Monte-Carlo sampling at leaf nodes
  4. Progressive widening for high branching factors
  5. Rapid Action Value Estimation (RAVE) for move urgency
  6. Virtual loss for parallel search
  7. Combination with neural networks for evaluation
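
The UCT selection rule from point 2 can be sketched in Python; the node attributes (visits, total_reward, children) are assumed bookkeeping:

```python
import math

def uct_score(child, parent_visits, c=1.4):
    """Exploitation term (average reward) plus exploration term (visit bonus)."""
    if child.visits == 0:
        return float("inf")
    return (child.total_reward / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))

def select_child(node):
    """Descend the tree by repeatedly picking the child with the best UCT score."""
    return max(node.children, key=lambda ch: uct_score(ch, node.visits))
```
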
10
Q

How does AlphaGo work?

A

AlphaGo components:

  1. Policy networks learn from human games
  2. Value networks evaluate positions
  3. Monte-Carlo Tree Search guides search
  4. Combines supervised and reinforcement learning
  5. Uses rollouts for position evaluation
  6. Multiple neural networks working together
  7. Trained initially on human games then through self-play
11
Q

What is the key (name-giving) difference between AlphaGo and AlphaZero?

A

Key difference:

AlphaZero learns completely from scratch (zero human knowledge):

  1. No human game data used
  2. Pure self-play learning
  3. Single neural network for both policy and value
  4. More general learning approach
  5. Can learn multiple games
  6. Simpler but more powerful architecture
  7. Demonstrates possibility of pure reinforcement learning for complex games
12
Q

What is the difference between optimal and maximal play?

A

Optimal vs Maximal play distinctions:

  1. Optimal play guarantees best possible outcome
  2. Maximal play maximizes winning chances against imperfect opponent
  3. Optimal play may be too conservative
  4. Maximal play takes advantage of opponent weaknesses
  5. Optimal play assumes perfect opponent
  6. Maximal play may take calculated risks
  7. Different strategies needed for each approach
13
Q

Explain the perceptron activation function

A

I’ll explain this perceptron activation function in simple terms, breaking it down step by step.

This is what’s called a “threshold activation function” or “step function” for a perceptron, which is one of the simplest types of artificial neurons. Let me explain what it means:
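
Reconstructed in LaTeX from the two cases described below:

```latex
a =
\begin{cases}
-1 & \text{if } \sum_j W_j\, a_j \le 0 \\
\phantom{-}1 & \text{if } \sum_j W_j\, a_j > 0
\end{cases}
```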

  1. First, let’s understand what a perceptron does:
    • It takes some inputs (the aj values)
    • Each input has a weight (the Wj values)
    • It multiplies each input by its weight and adds them all up (that’s what the Σ symbol means)
  2. The equation shows two cases:
    • If the sum (Σ Wj·aj) is less than or equal to 0, then a = -1
    • If the sum is greater than 0, then a = 1

In everyday language, it’s like a simple decision maker:
- It looks at all inputs, considering how important each one is (weights)
- If the weighted sum is positive, it outputs 1 (you can think of this as “yes” or “activate”)
- If the weighted sum is zero or negative, it outputs -1 (you can think of this as “no” or “don’t activate”)

A real-world analogy might help:
Imagine you’re deciding whether to go outside (output 1) or stay inside (-1) based on:
- Temperature (input 1)
- Rain (input 2)
- Wind (input 3)
Each factor has a different importance (weight) to you. You consider all these factors together to make a yes/no decision.

14
Q

Explain the perceptron learning rule

A

Let me explain this perceptron learning rule in simple terms.
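
The rule, reconstructed in LaTeX from the breakdown that follows:

```latex
W_j \;\leftarrow\; W_j + \alpha \,\bigl(f(x) - h(x)\bigr)\, x_j
```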

This equation shows how a perceptron’s weights (Wj) are updated during supervised learning. Let’s break it down:

  1. The arrow (Wj ← ...) means we are assigning the weight Wj a new, updated value
  2. The formula is: New Weight = Old Weight + (Learning Rate × Error × Input)
    • Learning rate (α): Controls how big each learning step should be (like how fast it learns)
    • Error (f(x) - h(x)): The difference between:
      • f(x): The correct/desired output
      • h(x): The actual output the perceptron gave
    • Input (xj): The input value for this particular weight

Think of it like learning from mistakes:
- If the error is big, the weight change will be bigger
- If the error is small, the weight change will be smaller
- The learning rate controls how drastically we adjust our weights each time
- We multiply by the input because we want to adjust weights more for inputs that contributed more to the error

A real-world analogy:
Imagine you’re learning to cook soup. Each ingredient is an input, and how much of each ingredient you use is like a weight:
- If the soup is too salty (error), you’ll reduce the weight (amount) of salt more than other ingredients
- How much you adjust each time (learning rate) depends on how cautious you want to be
- You learn by comparing the taste you got (h(x)) with the taste you wanted (f(x))

15
Q

Explain this error function for the single perceptron

A

Let me explain this error function (also called loss or cost function) step by step:
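
The function, reconstructed in LaTeX from the description below:

```latex
E(x) = \tfrac{1}{2}\,\Bigl(f(x) - g\bigl(\textstyle\sum_j W_j\, x_j\bigr)\Bigr)^{2}
```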

E(x) represents how much the network’s output differs from what we wanted. It’s measuring the network’s mistake. Let’s break it down from right to left:

  1. Inside the parentheses:
    • f(x) is the correct/desired output we wanted
    • g(ΣWj·xj) is what the network actually output
      • ΣWj·xj is the weighted sum of all inputs
      • g() is the activation function that converts this sum into the final output
  2. We square the difference (that’s what the ^2 means) to:
    • Make all errors positive (since we don’t care if we guessed too high or too low)
    • Penalize bigger errors more heavily
  3. The ½ at the front is just to make the math simpler when we calculate derivatives (you don’t need to worry about this detail for basic understanding)

A simple real-world example:
Imagine you’re teaching a network to predict house prices:
- If the actual price (f(x)) is $200,000
- And your network predicts (g(ΣWj·xj)) $180,000
- Then the error would be: ½(200,000 - 180,000)² = ½(20,000)² = 200,000,000

The bigger the mistake, the bigger the error value gets. This helps the network understand how badly it’s performing and adjust accordingly.

16
Q

Explain the error function for the entire network

and how it differs from the error function for the single perceptron

A

Let me explain the key difference between these two error functions:

The previous function E(x) measured the error for a SINGLE input/prediction, while this function E(Network) measures the TOTAL error across ALL training examples in your dataset.
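
Reconstructed in LaTeX:

```latex
E(\text{Network}) = \sum_{i=1}^{N} E(x_i)
```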

Let me break it down:
- E(xi) is the error for one training example (using the formula we just discussed)
- The Σ symbol means we’re adding up these errors for all N training examples
- N is the total number of training examples in your dataset

For example, if you’re predicting house prices:
- The previous function E(x) would tell you how wrong your prediction was for ONE house
- This function E(Network) adds up the errors for ALL houses in your dataset to tell you how well your network performs overall

It’s like the difference between:
- Checking how far off you were on one test question (previous function)
- Calculating your total score across all questions on the test (this function)

This overall error helps you understand if your network is generally doing well or poorly across all your training data, not just on a single example.

17
Q

Explain the function to calculate the gradient of the error with respect to a specific weight in the network

A

This equation shows how to calculate the gradient (rate of change) of the error with respect to a specific weight in the network. Let me break it down in simpler terms:
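
Reconstructed in LaTeX from the three equivalent forms listed below, writing $in = \sum_j W_j x_j$ and $Err = f(x) - g(in)$ (so $E = \tfrac{1}{2} Err^2$):

```latex
\frac{\partial E}{\partial W_j}
  = Err \cdot \frac{\partial\, Err}{\partial W_j}
  = Err \cdot \frac{\partial}{\partial W_j}\Bigl(f(x) - g\bigl(\textstyle\sum_j W_j\, x_j\bigr)\Bigr)
  = -\,Err \cdot g'(in)\, x_j
```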

  1. ∂E/∂Wj represents how much the error changes when we slightly change a specific weight (Wj)
  2. The equation shows three equivalent ways to write this:
    • First part: Error times the derivative of Error with respect to the weight
    • Middle part: Error times the derivative of the difference between desired and actual output
    • Final result: -Err·g’(in)·xj where:
      • Err is the error term (difference between desired and actual output)
      • g’(in) is the derivative of the activation function
      • xj is the input value for this weight

In practical terms, this tells us:
- How much and in which direction we should adjust each weight
- Bigger errors lead to bigger adjustments
- The adjustment also depends on:
* The input value (xj)
* How sensitive the activation function is at that point (g’(in))

This is a key formula in backpropagation, which is how neural networks learn from their mistakes. It helps determine how to adjust weights to reduce errors.

18
Q

Explain the Sigmoid activation function

A

This shows the sigmoid activation function and its derivative. Let me explain both:

  1. The sigmoid function g(x):
    • It’s an S-shaped curve that squishes any input into a value between 0 and 1
    • The formula is g(x) = 1/(1 + e^(-x))
    • Where e is the mathematical constant (approximately 2.71828)

Key properties of the sigmoid:
- Always outputs between 0 and 1
- Smooth and continuous (no sudden jumps)
- When input is very negative → output close to 0
- When input is very positive → output close to 1
- When input is 0 → output is 0.5

  1. The derivative g’(x):
    • This tells us how fast the sigmoid function is changing at any point
    • The formula g’(x) = g(x)(1 - g(x)) is a nice, simple form
    • It’s used in backpropagation for learning
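
Both in display form:

```latex
g(x) = \frac{1}{1 + e^{-x}},
\qquad
g'(x) = g(x)\,\bigl(1 - g(x)\bigr)
```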

Why it’s useful:
- The smooth output between 0 and 1 makes it good for probability-like outputs
- Its derivative is simple and related to its output
- It was historically very popular (though modern networks often use other functions like ReLU)

Think of it like a “smoothed out” version of the step function we saw earlier - instead of jumping directly from -1 to 1, it makes a smooth S-shaped transition between 0 and 1.


19
Q

What characterises deep networks

A

Deep networks are networks with multiple layers

E.g. image classification:
- 1st layer -> edges
- 2nd layer -> corners, etc.

Key ingredients:
- lots of data and compute
- unsupervised pre-training of layers

20
Q

Convolutional Neural Networks

What is convolution

A

For each pixel of an image, a new feature is computed as a weighted combination of its n×n neighborhood; the same weights (the convolution kernel) are applied at every pixel.
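
A naive NumPy sketch of this idea, using the deep-learning convention of not flipping the kernel; the kernel size and zero padding are illustrative assumptions:

```python
import numpy as np

def convolve2d(image, kernel):
    """Each output pixel is a weighted sum of its n x n neighborhood,
    with the same kernel (weights) applied at every position."""
    n = kernel.shape[0]                       # assume a square n x n kernel, n odd
    pad = n // 2
    padded = np.pad(image, pad, mode="constant")
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + n, j:j + n] * kernel)
    return out

# e.g. a simple vertical-edge kernel:
# convolve2d(img, np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]))
```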

21
Q

Explain training procedure for minimizing the network error

A

Training procedure:
- try all examples in turn
- make small adjustments for each example
- repeat until convergence
- One Epoch = One iteration through all examples
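
A minimal sketch of that loop in Python; update_weights is an assumed callback that applies one small adjustment (e.g. the perceptron learning rule) for a single example:

```python
def train(examples, update_weights, n_epochs=100):
    """One epoch = one pass through all examples; repeat for many epochs."""
    for epoch in range(n_epochs):
        for x, target in examples:       # try all examples in turn
            update_weights(x, target)    # small adjustment per example
```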

22
Q

What do Recurrent Neural Networks (RNNs) allow you to do?

A

RNNs allow processing of sequential data by feeding the network's output back in as part of the next input.
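
A minimal NumPy sketch of one recurrent step (the Elman-style hidden-state feedback and the tanh nonlinearity are common choices assumed here):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """The previous hidden state is fed back in alongside the current input."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def run_rnn(sequence, h0, W_x, W_h, b):
    h = h0
    for x_t in sequence:                 # process the sequence element by element
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h
```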

23
Q
A