Lecture 1 Flashcards
What is the main objective of a deep reinforcement learning agent?
To learn a sequential decision-making task from experience of interacting with an environment, in order to achieve specific goals
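As an illustration (not from the lecture notes): a minimal agent-environment interaction loop using the Gymnasium API, with a random policy standing in for a learned one.

```python
import gymnasium as gym

# Minimal interaction loop: the agent acts, the environment returns an
# observation and a reward, and the agent's objective is to maximize the
# cumulative reward over the episode.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"Episode return: {total_reward}")
```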
Transitions in reinforcement learning
- Usually stochastic, because the next state in the environment may depend on various factors beyond the agent's control
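A toy illustration of stochastic transitions (made-up numbers, not from the lecture): even for a fixed state-action pair, the next state is drawn from a probability distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP with 3 states and 2 actions: P[s, a, s'] is the probability of
# landing in state s' after taking action a in state s.
P = np.array([
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.0, 0.5, 0.5], [0.1, 0.1, 0.8]],
])

s, a = 0, 1
# The same (s, a) can lead to different next states on different trials.
next_state = rng.choice(3, p=P[s, a])
print(next_state)
```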
Observations (ω) and actions (a)
- May be high-dimensional
- Observations may not provide full knowledge of the underlying state, because the agent only receives partial information about the environment (ω ≠ x)
Experience
- May be constrained: no access to an accurate simulator, and/or limited data
What is the ‘reality gap’ in reinforcement learning, and why is it a challenge?
- The mismatch between the environment the agent experiences in a simulator and the real-world environment
- It is a challenge because, if the simulator is not accurate enough, policies learned in simulation may not perform well in the real world
Why might an agent have limited access to data in reinforcement learning?
- Safety constraints: Real-world exploration can be risky in fields like robotics and healthcare, limiting data collection.
- Compute constraints: High computational costs may limit the ability to run simulations in some environments.
- Exogenous factors: In areas like weather prediction or financial markets, data is inherently limited due to reliance on external conditions beyond control.
How can the reality gap and limited data challenges be addressed in reinforcement learning?
- Develop an accurate simulator: A more accurate simulator reduces the reality gap, enabling better policy transfer to real-world environments.
- Design the learning algorithm for generalization: Algorithms should be designed to improve generalization, allowing agents to perform well in unseen states or environments, even with limited training data.
What does generalization refer to in reinforcement learning?
- The capacity to achieve good performance in an environment where limited data has been gathered.
- The capacity to obtain good performance in a related but different environment by transferring learned knowledge.
How can an agent achieve generalization with limited data?
- Regularization: Prevents overfitting to specific training scenarios.
- Experience replay: Reuses past experiences during training (see the sketch after this list).
- Exploration strategies: Ensures the agent gathers diverse experiences, improving its ability to generalize to unseen states.
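To make the experience-replay and exploration points above concrete, here is a minimal sketch (an illustrative implementation, not the lecture's code) of a uniform replay buffer and an ε-greedy action selector.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so they can be reused during training."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)


def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore a random action, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```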
What is transfer learning in reinforcement learning, and why is it important?
- Transfer learning in reinforcement learning involves training an agent in one environment and adapting it to a related but different environment.
- It is important because it allows the agent to reuse knowledge, improving its performance in new tasks.
What are common methods for transfer learning in reinforcement learning?
- Fine-tuning: Retrains the agent in the new environment while retaining previously learned knowledge (see the sketch after this list).
- Multi-task learning: Trains the agent on multiple tasks simultaneously to encourage it to learn generalizable features.
- Domain adaptation: Modifies the agent’s policy or input representation to handle differences between the source and target environments.
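As an example of the fine-tuning approach above, a minimal PyTorch sketch (the network architecture and sizes are hypothetical): the feature-extraction layers pretrained on the source environment are frozen, and only the final layer is updated on the target environment.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical policy network pretrained on the source environment:
# a small MLP whose final layer outputs action logits.
policy = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),   # action logits
)
# ... assume `policy` has already been trained on the source environment ...

# Freeze the pretrained feature-extraction layers.
for layer in list(policy)[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the final layer is updated when training on target-environment data.
optimizer = optim.Adam(policy[-1].parameters(), lr=1e-4)
```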
What is a supervised learning algorithm?
- Maps a dataset of training examples into a predictive model
- Assumes the examples are drawn independently and identically distributed (i.i.d.) from, and are representative of, the true underlying data distribution
What are bias and variance in supervised learning?
- Bias: How much the model’s average prediction (over different training sets) differs from the true function. High bias leads to underfitting, where the model is too simple to capture the underlying patterns.
- Variance: How much the model’s predictions vary across different subsets of the training data. High variance leads to overfitting, where the model fits the noise in the data.
What is the ideal model in terms of bias and variance?
- low bias and low variance.
- The goal is to balance bias and variance by tuning model complexity and using techniques like regularization and data augmentation.
How can variance (overfitting) be reduced in supervised learning?
Increasing the size of the dataset can help reduce variance by improving the model’s generalization to new data.
What is bias-variance decomposition?
Bias-variance decomposition describes how the total error of a model can be broken down into:
- Bias: The error due to incorrect assumptions in the model.
- Variance: The error due to sensitivity to the specific training dataset used (also called parametric variance; the source of overfitting).
- Irreducible error: The inherent noise in the output that no model can eliminate (sometimes called internal variance).
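For the squared (L2) loss this decomposition can be written explicitly (standard textbook form; the lecture’s notation may differ). With y = f(x) + ε, noise variance σ², and \hat{f}_D the model fit on a random training set D:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```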
What does the bias-variance decomposition highlight in reinforcement learning?
highlights a tradeoff between:
- Bias: Error directly introduced by the learning algorithm, leading to underfitting.
- Parametric variance: Error due to the limited amount of data available, leading to overfitting.
Why is direct bias-variance decomposition less straightforward for loss functions other than L2 loss in reinforcement learning?
- Overfitting arises from the sensitivity of the model’s predictions to variations in the training data.
- For loss functions other than L2, the error does not split cleanly into a bias term plus a statistical variance term, because the loss applies a non-linear transformation to the prediction errors.
How can prediction error be decomposed when using non-L2 loss functions?
prediction error can be decomposed into:
- A bias-related term representing the lack of expressivity of the model.
- A variance-related term representing the sensitivity to the limited amount of data.
Replacement for the bias-variance tradeoff in reinforcement learning
tradeoff between:
- A sufficiently rich learning algorithm to reduce model bias.
- A learning algorithm that is not too complex, to avoid overfitting to the limited amount of data.
What is a batch or offline reinforcement learning algorithm?
- Mapping a dataset D into a policy π_D
- This can be done independently of whether the policy is derived from a model-based or model-free approach.
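A toy sketch of this mapping (illustrative only; tabular, with made-up details): a fixed dataset of transitions is repeatedly swept to fit Q-values, and the resulting greedy policy is π_D.

```python
import numpy as np

def batch_q_learning(dataset, n_states, n_actions, gamma=0.99, n_sweeps=200):
    """Toy batch/offline RL: map a fixed dataset D of transitions
    (s, a, r, s_next, done) into a policy pi_D, with no further
    interaction with the environment."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        # Bootstrapped targets r + gamma * max_a' Q(s_next, a'),
        # computed only from transitions present in the dataset.
        targets = {}
        for (s, a, r, s_next, done) in dataset:
            target = r + (0.0 if done else gamma * Q[s_next].max())
            targets.setdefault((s, a), []).append(target)
        Q_new = Q.copy()
        for (s, a), ts in targets.items():
            Q_new[s, a] = np.mean(ts)
        Q = Q_new
    return Q.argmax(axis=1)  # pi_D: greedy policy w.r.t. the learned Q

# Usage on a tiny hand-made dataset:
D = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, False), (1, 1, 5.0, 2, True)]
pi_D = batch_q_learning(D, n_states=3, n_actions=2)
```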
How can the suboptimality of the expected return in an MDP be decomposed?
- Asymptotic bias: Bias in the policy due to limitations of the learning algorithm, even when given an infinite amount of data.
- Overfitting error: Error due to the finite size of the dataset, leading to overfitting.
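One common way to write this decomposition (a sketch; the notation may differ from the lecture): let π* be an optimal policy, π_D the policy learned from the finite dataset D, and π_{D_∞} the policy the same algorithm would learn with unlimited data. Then

```latex
\underbrace{\mathbb{E}_{D}\bigl[V^{\pi^*}(s) - V^{\pi_D}(s)\bigr]}_{\text{suboptimality}}
  = \underbrace{\bigl(V^{\pi^*}(s) - V^{\pi_{D_\infty}}(s)\bigr)}_{\text{asymptotic bias}}
  + \underbrace{\mathbb{E}_{D}\bigl[V^{\pi_{D_\infty}}(s) - V^{\pi_D}(s)\bigr]}_{\text{overfitting error}}
```

The first term does not vanish with more data, while the second term shrinks as the dataset grows.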
What is the bias-overfitting tradeoff in reinforcement learning?
- On one side of the scale is the amount of data: the share of the error due to overfitting decreases as more data becomes available.
- On the other side is policy class complexity: increasing the complexity of the policy class reduces asymptotic bias but increases the risk of overfitting when data is limited.
How can the best policy be obtained in reinforcement learning?
The best policy can be obtained by balancing bias and overfitting through:
- Using a sufficiently expressive policy class to reduce bias.
- Ensuring the dataset is large and diverse enough to prevent overfitting.