ml-games-flashcards
What is the key idea of reinforcement learning?
Reinforcement learning involves learning through interaction with an environment:
1. Agent takes actions in environment
2. Receives rewards/penalties based on outcomes
3. Learns optimal policy to maximize long-term rewards
4. Balances immediate vs future rewards
5. Updates behavior based on experience
6. No explicit training data needed
7. Learning through trial and error
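A minimal sketch of how these ideas look in practice, using the tabular Q-learning update (one standard reinforcement-learning algorithm; the state/action names are illustrative assumptions, and `gamma` weights future against immediate rewards):

```python
from collections import defaultdict

# Q-values start at 0 and are shaped purely by interaction, not by labelled training data.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward the observed reward plus the discounted best future value."""
    best_future = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_future - Q[(state, action)])
```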
How does MENACE work?
MENACE (Matchbox Educable Noughts And Crosses Engine):
1. Uses physical beads to represent move probabilities
2. Adjusts bead quantities based on game outcomes
3. Early example of reinforcement learning.
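A rough sketch of the bead mechanism in code (the starting bead count and reinforcement amounts are illustrative assumptions, not MENACE's exact numbers):

```python
import random

beads = {}  # board position (e.g. a string key) -> {move: bead count}

def choose_move(position, legal_moves):
    counts = beads.setdefault(position, {m: 4 for m in legal_moves})  # assumed starting beads
    moves, weights = zip(*counts.items())
    return random.choices(moves, weights=weights)[0]  # probability proportional to bead count

def reinforce(history, won):
    # history: list of (position, move) pairs played in one game.
    for position, move in history:
        delta = 3 if won else -1                      # add beads after a win, remove one after a loss
        beads[position][move] = max(1, beads[position][move] + delta)
```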
How does TD-Gammon work?
TD-Gammon:
1. Uses temporal difference learning
2. Neural network evaluates positions
3. Learns by playing against itself
4. Updates predictions based on subsequent positions
5. Achieved expert-level backgammon play
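A simplified TD(0)-style update to illustrate temporal difference learning; TD-Gammon itself uses TD(λ) with a neural network rather than a lookup table, so this is only a sketch:

```python
from collections import defaultdict

V = defaultdict(float)  # position -> predicted outcome (a table stands in for TD-Gammon's network)

def td0_update(state, next_state, reward=0.0, alpha=0.1):
    """Shift the prediction for `state` toward what the following position predicts (plus any reward)."""
    V[state] += alpha * (reward + V[next_state] - V[state])
```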
What is the exploration vs. exploitation trade-off?
The balance between trying new strategies (exploration) and using known effective strategies (exploitation).
Key aspects:
1. Exploration finds potentially better strategies
2. Exploitation maximizes immediate rewards
3. Too much exploration wastes resources
4. Too much exploitation may miss optimal solutions
5. Balance needed for optimal learning
6. Various algorithms (ε-greedy, UCB) manage this trade-off
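For example, ε-greedy handles the trade-off with a single parameter. A minimal sketch, assuming `q_values` maps each action to its current value estimate:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try a random action
    return max(q_values, key=q_values.get)     # exploit: use the best-known action
```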
Why does deep search not work with chance games?
Deep search struggles with chance games because:
- Branching factor increases dramatically with chance elements
- Must consider all possible random outcomes
- Cannot prune branches as effectively, because a chance node's value depends on all of its children
- Computational complexity becomes overwhelming
- Expected values must be calculated at chance nodes
- Traditional alpha-beta pruning less effective
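A sketch of why chance nodes blow up the search: every random outcome must be expanded and its value averaged. `minimax_value` and `apply_outcome` are game-specific helpers passed in as placeholders:

```python
def chance_node_value(position, outcomes, depth, minimax_value, apply_outcome):
    """outcomes: list of (random_event, probability) pairs, e.g. all possible dice rolls."""
    return sum(prob * minimax_value(apply_outcome(position, event), depth - 1)
               for event, prob in outcomes)
```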
How can you approximate an expected value at a position?
Methods to approximate expected values:
- Monte Carlo sampling of possible outcomes
- Statistical averaging of sample results
- Value function approximation using neural networks
- Heuristic evaluation of position features
- Combining multiple evaluation methods
- Learning from self-play outcomes
- Using historical data for initial estimates
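The simplest of these, Monte Carlo sampling, fits in a few lines. A sketch, where `simulate_random_game` stands in for a fast random playout that returns a final score:

```python
def estimate_value(position, simulate_random_game, n_samples=1000):
    total = sum(simulate_random_game(position) for _ in range(n_samples))
    return total / n_samples   # more samples -> lower variance of the estimate
```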
What is the Multi-Armed Bandit problem?
The Multi-Armed Bandit problem involves:
- Multiple choices (arms) with unknown reward distributions
- Need to maximize total reward over time
- Classic exploration vs exploitation dilemma
- Each pull gives information and reward
- Must balance learning arm properties with maximizing returns
- Various solution algorithms (UCB, Thompson Sampling)
- Applications in game AI for move selection
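A minimal UCB1 sketch, one standard bandit strategy (the exploration constant `c` is a tunable assumption):

```python
import math

def ucb1_select(counts, means, total_pulls, c=2.0):
    """counts[i]: times arm i was pulled; means[i]: its average reward so far."""
    for i, n in enumerate(counts):
        if n == 0:
            return i                                   # pull every arm at least once
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(c * math.log(total_pulls) / counts[i]))
```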
What is Monte-Carlo Search?
Monte-Carlo Search:
- Uses random sampling to evaluate positions
- Plays many random games (playouts) from current position
- Aggregates results to estimate position value
- More samples give better estimates
- Can handle large search spaces
- No need for a hand-crafted position evaluation function
- Especially effective in games with high branching factor
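A flat Monte-Carlo search sketch under these assumptions: `legal_moves`, `play`, and `random_playout` are game-specific helpers, and `random_playout` returns 1 for a win and 0 for a loss:

```python
def monte_carlo_move(position, legal_moves, play, random_playout, playouts=100):
    def average_result(move):
        next_pos = play(position, move)
        return sum(random_playout(next_pos) for _ in range(playouts)) / playouts
    return max(legal_moves(position), key=average_result)   # move with the best sampled win rate
```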
How can Monte-Carlo Techniques be integrated with Game Tree Search?
Integration methods:
- Monte-Carlo Tree Search (MCTS) combines both approaches
- UCT algorithm balances exploration/exploitation in tree
- Use Monte-Carlo sampling at leaf nodes
- Progressive widening for high branching factors
- Rapid Action Value Estimation (RAVE) for move urgency
- Virtual loss for parallel search
- Combination with neural networks for evaluation
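The UCT selection step at the heart of MCTS can be sketched as follows, assuming tree nodes with `wins` and `visits` counters (unvisited children are expanded before this formula applies):

```python
import math

def uct_select(children, parent_visits, c=1.4):
    """Pick the child that balances exploitation (win rate) and exploration (rarely visited)."""
    return max(children,
               key=lambda ch: ch.wins / ch.visits
                              + c * math.sqrt(math.log(parent_visits) / ch.visits))
```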
How does AlphaGo work?
AlphaGo components:
- Policy networks learn from human games
- Value networks evaluate positions
- Monte-Carlo Tree Search guides search
- Combines supervised and reinforcement learning
- Uses rollouts for position evaluation
- Multiple neural networks working together
- Trained initially on human games then through self-play
What is the key (name giving) difference between AlphaGo and AlphaZero?
Key difference:
AlphaZero learns completely from scratch (zero human knowledge):
- No human game data used
- Pure self-play learning
- Single neural network for both policy and value
- More general learning approach
- Can learn multiple games
- Simpler but more powerful architecture
- Demonstrates possibility of pure reinforcement learning for complex games
What is the difference between optimal and maximal play?
Optimal vs Maximal play distinctions:
- Optimal play guarantees best possible outcome
- Maximal play maximizes winning chances against imperfect opponent
- Optimal play may be too conservative
- Maximal play takes advantage of opponent weaknesses
- Optimal play assumes perfect opponent
- Maximal play may take calculated risks
- Different strategies needed for each approach
Explain the perceptron activation function
The activation function in question is the threshold (“step”) function of a perceptron, one of the simplest types of artificial neurons:
a = -1 if Σ Wj·aj ≤ 0, and a = 1 if Σ Wj·aj > 0
Here is what it means:
- First, let’s understand what a perceptron does:
- It takes some inputs (the aj values)
- Each input has a weight (the Wj values)
- It multiplies each input by its weight and adds them all up (that’s what the Σ symbol means)
- The equation shows two cases:
- If the sum (Σ Wj·aj) is less than or equal to 0, then a = -1
- If the sum is greater than 0, then a = 1
In everyday language, it’s like a simple decision maker:
- It looks at all inputs, considering how important each one is (weights)
- If the weighted sum is positive, it outputs 1 (you can think of this as “yes” or “activate”)
- If the weighted sum is zero or negative, it outputs -1 (you can think of this as “no” or “don’t activate”)
A real-world analogy might help:
Imagine you’re deciding whether to go outside (output 1) or stay inside (-1) based on:
- Temperature (input 1)
- Rain (input 2)
- Wind (input 3)
Each factor has a different importance (weight) to you. You consider all these factors together to make a yes/no decision.
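The same threshold function in code, as a direct sketch of the description above:

```python
def perceptron_output(inputs, weights):
    total = sum(w * a for w, a in zip(weights, inputs))   # weighted sum: sum_j W_j * a_j
    return 1 if total > 0 else -1                         # step function: 1 if positive, else -1
```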
Explain the perceptron learning rule
This equation shows how a perceptron’s weights (Wj) are updated during supervised learning. Breaking it down:
- The full rule is Wj ← Wj + α·(f(x) − h(x))·xj; the arrow means the old weight is replaced by the new value
- In words: New Weight = Old Weight + (Learning Rate × Error × Input)
- Learning rate (α): Controls how big each learning step should be (like how fast it learns)
- Error (f(x) - h(x)): The difference between:
- f(x): The correct/desired output
- h(x): The actual output the perceptron gave
- Input (xj): The input value for this particular weight
Think of it like learning from mistakes:
- If the error is big, the weight change will be bigger
- If the error is small, the weight change will be smaller
- The learning rate controls how drastically we adjust our weights each time
- We multiply by the input because we want to adjust weights more for inputs that contributed more to the error
A real-world analogy:
Imagine you’re learning to cook soup. Each ingredient is an input, and how much of each ingredient you use is like a weight:
- If the soup is too salty (error), you’ll reduce the weight (amount) of salt more than other ingredients
- How much you adjust each time (learning rate) depends on how cautious you want to be
- You learn by comparing the taste you got (h(x)) with the taste you wanted (f(x))
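The rule as code, a one-step sketch where `target` is f(x) and `predicted` is h(x):

```python
def update_weights(weights, x, target, predicted, alpha=0.1):
    error = target - predicted                                  # f(x) - h(x)
    return [w + alpha * error * xj for w, xj in zip(weights, x)]
```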
Explain this error function for the single perceptron
The error function (also called a loss or cost function) for a single perceptron is E(x) = ½·(f(x) − g(ΣWj·xj))².
E(x) measures how much the network’s output differs from what we wanted, i.e. the size of the network’s mistake. Breaking it down from the inside out:
- Inside the parentheses:
- f(x) is the correct/desired output we wanted
- g(ΣWj·xj) is what the network actually output
- ΣWj·xj is the weighted sum of all inputs
- g() is the activation function that converts this sum into the final output
- We square the difference (that’s what the ^2 means) to:
- Make all errors positive (since we don’t care if we guessed too high or too low)
- Penalize bigger errors more heavily
- The ½ at the front simplifies the math when we take derivatives: it cancels the 2 that comes from differentiating the square (a detail you can safely skip for a basic understanding)
A simple real-world example:
Imagine you’re teaching a network to predict house prices:
- If the actual price (f(x)) is $200,000
- And your network predicts (g(ΣWj·xj)) $180,000
- Then the error would be: ½(200,000 - 180,000)² = ½(20,000)² = 200,000,000
The bigger the mistake, the bigger the error value gets. This helps the network understand how badly it’s performing and adjust accordingly.
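In code, the error for one example looks like this (a sketch; `g` is the activation function, passed in explicitly):

```python
def squared_error(weights, x, target, g):
    prediction = g(sum(w * xj for w, xj in zip(weights, x)))   # g(sum_j W_j * x_j)
    return 0.5 * (target - prediction) ** 2                     # E(x) = 1/2 (f(x) - g(...))^2

# House-price example from above, using the identity activation:
# squared_error([1.0], [180_000], 200_000, lambda s: s) == 0.5 * 20_000**2 == 200_000_000
```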