Lecture 3 Flashcards
What are policy-based methods in reinforcement learning?
- a class of reinforcement learning algorithms that learn a policy directly, instead of learning a value function like in value-based methods
- A policy maps states to actions and can be either deterministic or stochastic.
What is the difference between deterministic and stochastic policies in policy-based methods?
- Deterministic policy: Always chooses the same action given a specific state.
- Stochastic policy: Defines a probability distribution over actions for each state, allowing for exploration.
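In notation (symbols assumed here, not spelled out on the card), the distinction is:

```latex
% Deterministic policy: a single action per state
a = \pi(s)
% Stochastic policy: a conditional distribution over actions per state
a \sim \pi(\cdot \mid s), \qquad \sum_{a} \pi(a \mid s) = 1
```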
Why are policy-based methods useful for stochastic policies?
- Useful for building policies that can explicitly explore.
- Particularly effective in multi-agent systems where the Nash equilibrium is a stochastic policy.
In which environments are policy-based methods advantageous?
environments with continuous action spaces, where maximizing a value function over actions would be impractical
What is a stationary policy in reinforcement learning?
- a policy that does not change over time, meaning it remains the same across different time steps
- suitable for infinite-horizon problems where the agent seeks to maximize long-term rewards over an indefinite period
What is a non-stationary policy in reinforcement learning?
- the policy depends on the time step
- is useful for finite-horizon problems, where the agent seeks to optimize cumulative rewards over a limited number of future time steps.
What is the key difference between off-policy and on-policy methods?
- Off-policy methods: Evaluate or improve a policy that is different from the one used to generate the data. They are more sample efficient as they can reuse data from different policies. Example: Q-learning.
- On-policy methods: Evaluate or improve the policy that is currently being used by the agent. They may suffer from bias if a replay buffer is used, as stored trajectories may not correspond to the current policy. Example: SARSA.
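The standard one-step updates make the distinction concrete (α is the learning rate, γ the discount factor; notation assumed):

```latex
% Q-learning (off-policy): bootstraps with the greedy action
Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big]
% SARSA (on-policy): bootstraps with the action a' actually chosen by the current policy
Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma Q(s',a') - Q(s,a) \big]
```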
Why are off-policy methods considered more general than on-policy methods?
because they can learn from data generated by any behavior policy, not just the current policy being used by the agent
What are policy gradient methods in reinforcement learning?
- Policy gradient methods optimize a performance objective (typically the expected cumulative reward) by finding a good policy, often parameterized by a neural network.
- These methods use stochastic gradient ascent to improve the policy.
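One common way to write the objective and the update (notation assumed; π_θ is the policy parameterized by θ):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t \ge 0} \gamma^{t} r_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```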
What is the goal of policy gradient methods in reinforcement learning?
to find a good policy by optimizing a performance objective, typically the expected cumulative reward.
Why is the policy gradient approach useful in reinforcement learning?
It allows the gradient to be estimated using experience, and the policy improvement step increases the probability of actions proportionally to their expected return.
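A standard form of this gradient estimate (the policy gradient theorem; notation assumed):

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s) \; Q^{\pi_\theta}(s,a) \right]
```

Actions with higher expected return receive larger increases in log-probability, which is the "proportional to expected return" improvement step described above.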
What is the REINFORCE algorithm?
The REINFORCE algorithm is an on-policy gradient-based method that estimates the return for each action using Monte Carlo rollouts.
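A minimal NumPy sketch of the REINFORCE update on an invented toy tabular task (the environment, hyperparameters, and variable names are illustrative assumptions, not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # softmax policy parameters
alpha, gamma = 0.1, 0.99                  # learning rate, discount factor

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout():
    """One Monte Carlo episode in a hand-crafted 2-state toy environment."""
    s, traj = 0, []
    for _ in range(10):                            # fixed horizon
        a = rng.choice(n_actions, p=softmax(theta[s]))
        r = 1.0 if a == s else 0.0                 # reward for matching action to state
        traj.append((s, a, r))
        s = rng.integers(n_states)                 # next state is random in this toy task
    return traj

for episode in range(500):
    traj = rollout()
    # Discounted return G_t for every step, computed backwards over the episode.
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # REINFORCE: raise the log-probability of each taken action in proportion to its return.
    for (s, a, _), G in zip(traj, returns):
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0                         # gradient of log softmax w.r.t. theta[s]
        theta[s] += alpha * G * grad_log

print(softmax(theta[0]), softmax(theta[1]))        # each state should now favor its matching action
```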
What are the main advantages and disadvantages of the REINFORCE algorithm?
- Advantage: It provides an unbiased estimate of the gradient.
- Disadvantage: It can have high variance and requires many rollouts to converge.
How do actor-critic methods differ from the REINFORCE algorithm?
Actor-critic methods use a value-based approach to estimate the return, which makes them more efficient by reducing variance compared to pure Monte Carlo methods.
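A common one-step actor-critic update uses the TD error δ as a low-variance stand-in for the Monte Carlo return (notation assumed; V_w is the critic, π_θ the actor):

```latex
\delta = r + \gamma V_w(s') - V_w(s),
\qquad
w \leftarrow w + \alpha_w \, \delta \, \nabla_w V_w(s),
\qquad
\theta \leftarrow \theta + \alpha_\theta \, \delta \, \nabla_\theta \log \pi_\theta(a \mid s)
```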
What is the role of the entropy regularizer in policy gradient methods?
encourages the policy to remain stochastic, preventing it from becoming deterministic too quickly and ensuring better exploration.
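For example, an entropy bonus weighted by a coefficient β can be added to the objective (notation assumed):

```latex
H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_{a} \pi_\theta(a \mid s) \log \pi_\theta(a \mid s),
\qquad
J_{\text{reg}}(\theta) = J(\theta) + \beta \, \mathbb{E}_{s}\!\left[ H\big(\pi_\theta(\cdot \mid s)\big) \right]
```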
Why is a baseline used in policy gradient methods, and how does it improve efficiency?
reduces the variance of the gradient estimator without introducing bias, leading to more stable and efficient updates.
What is the key concept behind using a baseline in policy gradient updates?
subtracting a baseline from the return reduces the variance of the gradient estimate, allowing more stable updates and faster convergence.
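In formula form (a learned state-value estimate b(s) ≈ V(s) is a common choice of baseline; notation assumed):

```latex
\nabla_\theta J(\theta)
= \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\big( G_t - b(s) \big) \right]
```

The estimate stays unbiased because the baseline depends only on the state, so the subtracted term has zero expectation under the policy.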
What are the two main components of the policy gradient theorem?
- Estimating the gradient using sampled trajectories.
- Updating the policy by increasing the probability of actions that lead to higher rewards.
Why is the softmax function used in policy optimization with a finite number of actions?
The softmax function ensures that the policy outputs a probability distribution over actions by transforming the output of the neural network into probabilities.
At what stage of the neural network is the softmax function applied in policy optimization?
at the last layer of the neural network to generate a valid probability distribution over actions.
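A minimal sketch of such a softmax output layer (NumPy; names and numbers are illustrative):

```python
import numpy as np

def softmax(logits):
    """Map unnormalized network outputs (logits) to a probability distribution."""
    z = logits - np.max(logits)        # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Example: the last layer of the policy network produced these logits for 3 actions.
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())              # non-negative entries that sum to 1
```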
How can value-based methods and policy-based methods be related in specific settings?
In certain settings, depending on the loss function and entropy regularization, value-based methods and policy-based methods can become equivalent, with both aiming to find the optimal policy.
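One way this connection is often sketched (an assumption-laden summary, not a precise statement from the lecture): with entropy regularization at temperature τ, the optimal policy takes a softmax-over-Q form, so a method that learns the soft Q-values and a method that learns the policy directly target the same object.

```latex
\pi^{*}(a \mid s) \;\propto\; \exp\!\big( Q^{*}_{\text{soft}}(s,a) / \tau \big)
```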
What is the main goal of benchmarking deep reinforcement learning algorithms?
to evaluate their performance under controlled and reproducible conditions
How do we test the effectiveness of a reinforcement learning algorithm during benchmarking?
By averaging its performance across multiple learning trials and, if possible, applying significance testing techniques to the results.
Why is it necessary to conduct multiple learning trials in benchmarking RL algorithms?
Because stochasticity, such as randomness in neural network initialization and environment variability, can affect the results significantly.
Why should benchmarking results not be over-interpreted?
A hypothesis may hold for specific environments and hyperparameter settings but fail in other scenarios, so results may not generalize well.
What steps are important to ensure a fair comparison between RL algorithms?
Ensuring the use of the same random seeds and applying identical hyperparameter tuning procedures across algorithms.
Why is using top-K trials inadequate for fair benchmarking of RL algorithms?
It can exaggerate the performance of certain algorithms, as it only reflects the best outcomes rather than the average robustness.
What is a more reliable metric than top-K trials for reporting the performance of RL algorithms?
The average performance across all trials, as it better reflects the true robustness of an algorithm.
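A minimal sketch of how such a report might look (the arrays and numbers are placeholders; Welch's t-test via SciPy is one possible significance test):

```python
import numpy as np
from scipy import stats

# Final returns of two algorithms over independent trials (different random seeds).
algo_a = np.array([210.0, 195.0, 230.0, 205.0, 220.0])   # placeholder values
algo_b = np.array([180.0, 240.0, 175.0, 260.0, 190.0])   # placeholder values

for name, runs in [("A", algo_a), ("B", algo_b)]:
    # Report mean and spread over all trials, not just the top-K runs.
    print(f"algo {name}: mean={runs.mean():.1f}  std={runs.std(ddof=1):.1f}  n={len(runs)}")

# Welch's t-test (no equal-variance assumption) as a simple significance check.
t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.3f}")
```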
What are the three main components an RL agent may include (one or more of the following)?
- A representation of a value function for predicting how good a state or state-action pair is
- A direct representation of the policy that maps states to actions
- A model of the environment for planning.
How do model-free RL methods differ from model-based methods?
Model-free methods learn directly from interactions with the environment without building a model, while model-based methods involve learning a model of the environment and using it for planning actions.
What are the respective strengths of model-based and model-free RL methods?
- Model-free methods are preferable when the agent lacks access to a generative model of the environment, avoiding inaccuracies in learned models.
- Model-based methods work well with planning algorithms but can be computationally demanding.
- For certain tasks (e.g., structured tasks like mazes), model-based methods can learn the environment more efficiently because the learned model can exploit the task's structure.