Lecture 3 Flashcards
What are policy-based methods in reinforcement learning?
- a class of reinforcement learning algorithms that learn a policy directly, instead of learning a value function like in value-based methods
- A policy maps states to actions and can be either deterministic or stochastic.
What is the difference between deterministic and stochastic policies in policy-based methods?
- Deterministic policy: Always chooses the same action given a specific state.
- Stochastic policy: Defines a probability distribution over actions for each state, allowing for exploration.
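In notation (symbols assumed here, not spelled out on the card), the distinction is:

```latex
% Deterministic policy: a single action per state
a = \pi(s)
% Stochastic policy: a conditional distribution over actions per state
a \sim \pi(\cdot \mid s), \qquad \sum_{a} \pi(a \mid s) = 1
```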
Why are policy-based methods useful for stochastic policies?
- Useful for building policies that can explicitly explore.
- Particularly effective in multi-agent systems where the Nash equilibrium is a stochastic policy.
In which environments are policy-based methods advantageous?
environments with continuous action spaces, where maximizing a value function over actions would be impractical
What is a stationary policy in reinforcement learning?
- a policy that does not change over time, meaning it remains the same across different time steps
- suitable for infinite-horizon problems where the agent seeks to maximize long-term rewards over an indefinite period
What is a non-stationary policy in reinforcement learning?
- the policy depends on the time step
- is useful for finite-horizon problems, where the agent seeks to optimize cumulative rewards over a limited number of future time steps.
What is the key difference between off-policy and on-policy methods?
- Off-policy methods: Evaluate or improve a policy that is different from the one used to generate the data. They are more sample efficient as they can reuse data from different policies. Example: Q-learning.
- On-policy methods: Evaluate or improve the policy that is currently being used by the agent. They may suffer from bias if a replay buffer is used, as stored trajectories may not correspond to the current policy. Example: SARSA.
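The standard one-step updates make the distinction concrete (α is the learning rate, γ the discount factor; notation assumed):

```latex
% Q-learning (off-policy): bootstraps with the greedy action
Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big]
% SARSA (on-policy): bootstraps with the action a' actually chosen by the current policy
Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma Q(s',a') - Q(s,a) \big]
```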
Why are off-policy methods considered more general than on-policy methods?
because they can learn from data generated by any behavior policy, not just the current policy being used by the agent
What are policy gradient methods in reinforcement learning?
- Policy gradient methods optimize a performance objective (typically the expected cumulative reward) by finding a good policy, often parameterized by a neural network.
- These methods use stochastic gradient ascent to improve the policy.
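One common way to write the objective and the update (notation assumed; π_θ is the policy parameterized by θ):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t \ge 0} \gamma^{t} r_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```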
What is the goal of policy gradient methods in reinforcement learning?
to find a good policy by optimizing a performance objective, typically the expected cumulative reward.
Why is the policy gradient approach useful in reinforcement learning?
It allows the gradient to be estimated using experience, and the policy improvement step increases the probability of actions proportionally to their expected return.
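A standard form of this gradient estimate (the policy gradient theorem; notation assumed):

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s) \; Q^{\pi_\theta}(s,a) \right]
```

Actions with higher expected return receive larger increases in log-probability, which is the "proportional to expected return" improvement step described above.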
What is the REINFORCE algorithm?
The REINFORCE algorithm is an on-policy gradient-based method that estimates the return for each action using Monte Carlo rollouts.
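A minimal NumPy sketch of the REINFORCE update on an invented toy tabular task (the environment, hyperparameters, and variable names are illustrative assumptions, not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # softmax policy parameters
alpha, gamma = 0.1, 0.99                  # learning rate, discount factor

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout():
    """One Monte Carlo episode in a hand-crafted 2-state toy environment."""
    s, traj = 0, []
    for _ in range(10):                            # fixed horizon
        a = rng.choice(n_actions, p=softmax(theta[s]))
        r = 1.0 if a == s else 0.0                 # reward for matching action to state
        traj.append((s, a, r))
        s = rng.integers(n_states)                 # next state is random in this toy task
    return traj

for episode in range(500):
    traj = rollout()
    # Discounted return G_t for every step, computed backwards over the episode.
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # REINFORCE: raise the log-probability of each taken action in proportion to its return.
    for (s, a, _), G in zip(traj, returns):
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0                         # gradient of log softmax w.r.t. theta[s]
        theta[s] += alpha * G * grad_log

print(softmax(theta[0]), softmax(theta[1]))        # each state should now favor its matching action
```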
What are the main advantages and disadvantages of the REINFORCE algorithm?
- Advantage: It provides an unbiased estimate of the gradient.
- Disadvantage: It can have high variance and requires many rollouts to converge.
How do actor-critic methods differ from the REINFORCE algorithm?
Actor-critic methods use a value-based approach to estimate the return, which makes them more efficient by reducing variance compared to pure Monte Carlo methods.
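A common one-step actor-critic update uses the TD error δ as a low-variance stand-in for the Monte Carlo return (notation assumed; V_w is the critic, π_θ the actor):

```latex
\delta = r + \gamma V_w(s') - V_w(s),
\qquad
w \leftarrow w + \alpha_w \, \delta \, \nabla_w V_w(s),
\qquad
\theta \leftarrow \theta + \alpha_\theta \, \delta \, \nabla_\theta \log \pi_\theta(a \mid s)
```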
What is the role of the entropy regularizer in policy gradient methods?
encourages the policy to remain stochastic, preventing it from becoming deterministic too quickly and ensuring better exploration.
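For example, an entropy bonus weighted by a coefficient β can be added to the objective (notation assumed):

```latex
H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_{a} \pi_\theta(a \mid s) \log \pi_\theta(a \mid s),
\qquad
J_{\text{reg}}(\theta) = J(\theta) + \beta \, \mathbb{E}_{s}\!\left[ H\big(\pi_\theta(\cdot \mid s)\big) \right]
```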
Why is a baseline used in policy gradient methods, and how does it improve efficiency?
reduces the variance of the gradient estimator without introducing bias, leading to more stable and efficient updates.
What is the key concept behind using a baseline in policy gradient updates?
subtracting a baseline from the return reduces the variance of the gradient estimate, allowing more stable updates and faster convergence.
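In formula form (a learned state-value estimate b(s) ≈ V(s) is a common choice of baseline; notation assumed):

```latex
\nabla_\theta J(\theta)
= \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\big( G_t - b(s) \big) \right]
```

The estimate stays unbiased because the baseline depends only on the state, so the subtracted term has zero expectation under the policy.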
What are the two main components of the policy gradient theorem?
- Estimating the gradient using sampled trajectories.
- Updating the policy by increasing the probability of actions that lead to higher rewards.
Why is the softmax function used in policy optimization with a finite number of actions?
The softmax function ensures that the policy outputs a probability distribution over actions by transforming the output of the neural network into probabilities.
At what stage of the neural network is the softmax function applied in policy optimization?
at the last layer of the neural network to generate a valid probability distribution over actions.
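A minimal sketch of such a softmax output layer (NumPy; names and numbers are illustrative):

```python
import numpy as np

def softmax(logits):
    """Map unnormalized network outputs (logits) to a probability distribution."""
    z = logits - np.max(logits)        # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Example: the last layer of the policy network produced these logits for 3 actions.
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())              # non-negative entries that sum to 1
```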
How can value-based methods and policy-based methods be related in specific settings?
In certain settings, depending on the loss function and entropy regularization, value-based methods and policy-based methods can become equivalent, with both aiming to find the optimal policy.
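One way this connection is often sketched (an assumption-laden summary, not a precise statement from the lecture): with entropy regularization at temperature τ, the optimal policy takes a softmax-over-Q form, so a method that learns the soft Q-values and a method that learns the policy directly target the same object.

```latex
\pi^{*}(a \mid s) \;\propto\; \exp\!\big( Q^{*}_{\text{soft}}(s,a) / \tau \big)
```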
What is the main goal of benchmarking deep reinforcement learning algorithms?
to evaluate their performance under controlled and reproducible conditions
How do we test the effectiveness of a reinforcement learning algorithm during benchmarking?
By averaging its performance across multiple learning trials and, if possible, applying significance testing techniques to the results.
Why is it necessary to conduct multiple learning trials in benchmarking RL algorithms?
Because stochasticity, such as randomness in neural network initialization and environment variability, can affect the results significantly.
Why should benchmarking results not be over-interpreted?
A hypothesis may hold for specific environments and hyperparameter settings but fail in other scenarios, so results may not generalize well.
What steps are important to ensure a fair comparison between RL algorithms?
Ensuring the use of the same random seeds and applying identical hyperparameter tuning procedures across algorithms.
Why is using top-K trials inadequate for fair benchmarking of RL algorithms?
It can exaggerate the performance of certain algorithms, as it only reflects the best outcomes rather than the average robustness.
What is a more reliable metric than top-K trials for reporting the performance of RL algorithms?
The average performance across all trials, as it better reflects the true robustness of an algorithm.
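A minimal sketch of how such a report might look (the arrays and numbers are placeholders; Welch's t-test via SciPy is one possible significance test):

```python
import numpy as np
from scipy import stats

# Final returns of two algorithms over independent trials (different random seeds).
algo_a = np.array([210.0, 195.0, 230.0, 205.0, 220.0])   # placeholder values
algo_b = np.array([180.0, 240.0, 175.0, 260.0, 190.0])   # placeholder values

for name, runs in [("A", algo_a), ("B", algo_b)]:
    # Report mean and spread over all trials, not just the top-K runs.
    print(f"algo {name}: mean={runs.mean():.1f}  std={runs.std(ddof=1):.1f}  n={len(runs)}")

# Welch's t-test (no equal-variance assumption) as a simple significance check.
t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.3f}")
```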
What are the three main components an RL agent may include (one or more of the following)?
- A representation of a value function for predicting how good a state or state-action pair is
- A direct representation of the policy that maps states to actions
- A model of the environment for planning.
How do model-free RL methods differ from model-based methods?
Model-free methods learn directly from interactions with the environment without building a model, while model-based methods involve learning a model of the environment and using it for planning actions.
What are the respective strengths of model-based and model-free RL methods?
- Model-free methods are preferable when the agent lacks access to a generative model of the environment, avoiding inaccuracies in learned models.
- Model-based methods work well with planning algorithms but can be computationally demanding.
- For certain tasks (e.g., structured tasks like mazes), model-based methods can learn the environment more efficiently because the learned model can exploit the task's structure.