Lecture 3 Flashcards

1
Q

What are policy-based methods in reinforcement learning?

A
  • A class of reinforcement learning algorithms that learn a policy directly, rather than learning a value function as value-based methods do
  • A policy maps states to actions and can be either deterministic or stochastic.
2
Q

What is the difference between deterministic and stochastic policies in policy-based methods?

A
  1. Deterministic policy: Always chooses the same action given a specific state.
  2. Stochastic policy: Defines a probability distribution over actions for each state, allowing for exploration (a minimal sketch of both follows below).
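A minimal Python sketch of the distinction, assuming a single state with three discrete actions; the preference scores are invented for illustration:

    import numpy as np

    preferences = np.array([2.0, 0.5, -1.0])   # hypothetical per-action scores for one state

    # Deterministic policy: always pick the highest-scoring action in this state.
    deterministic_action = int(np.argmax(preferences))

    # Stochastic policy: map scores to a probability distribution and sample from it,
    # so lower-scoring actions still get tried occasionally (exploration).
    probs = np.exp(preferences) / np.exp(preferences).sum()
    stochastic_action = int(np.random.choice(len(probs), p=probs))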
3
Q

Why are policy-based methods useful for stochastic policies?

A
  • They make it possible to build policies that explore explicitly.
  • They are particularly effective in multi-agent settings where the Nash equilibrium is a stochastic policy.
4
Q

In what kinds of environments are policy-based methods advantageous?

A

Environments with continuous action spaces, where value-based methods would need to maximize over an infinite set of actions

5
Q

What is a stationary policy in reinforcement learning?

A
  • a policy that does not change over time, meaning it remains the same across different time steps
  • suitable for infinite-horizon problems where the agent seeks to maximize long-term rewards over an indefinite period
6
Q

What is a non-stationary policy in reinforcement learning?

A
  • A policy that depends on the time step
  • Useful for finite-horizon problems, where the agent seeks to optimize cumulative rewards over a limited number of future time steps.
7
Q

What is the key difference between off-policy and on-policy methods?

A
  1. Off-policy methods: Evaluate or improve a policy that is different from the one used to generate the data. They are more sample efficient as they can reuse data from different policies. Example: Q-learning.
  2. On-policy methods: Evaluate or improve the policy that is currently being used by the agent. They may suffer from bias if a replay buffer is used, as stored trajectories may not correspond to the current policy. Example: SARSA.
8
Q

Why are off-policy methods considered more general than on-policy methods?

A

because they can learn from data generated by any behavior policy, not just the current policy being used by the agent

9
Q

What are policy gradient methods in reinforcement learning?

A
  • Policy gradient methods optimize a performance objective (typically the expected cumulative reward) by finding a good policy, often parameterized by a neural network.
  • These methods use stochastic gradient ascent to improve the policy (a common formulation is sketched below).
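A common way to write what these bullets describe, in notation assumed here rather than taken from the lecture: the policy \pi_\theta has parameters \theta, the objective is the expected (discounted) return, and the parameters are updated by gradient ascent,

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big],
    \qquad
    \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta).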
10
Q

What is the goal of policy gradient methods in reinforcement learning?

A

to find a good policy by optimizing a performance objective, typically the expected cumulative reward.

11
Q

Why is the policy gradient approach useful in reinforcement learning?

A

It allows the gradient to be estimated using experience, and the policy improvement step increases the probability of actions proportionally to their expected return.
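In standard notation (an assumption about the lecture's exact symbols), this is the likelihood-ratio form of the policy gradient, which can be estimated from sampled experience:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\big],

where R_t is the return obtained after taking a_t, so actions with higher expected return have their probability increased proportionally.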

12
Q

What is the REINFORCE algorithm?

A

The REINFORCE algorithm is an on-policy gradient-based method that estimates the return for each action using Monte Carlo rollouts.
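A minimal sketch of a REINFORCE-style update in Python, assuming a toy state-independent softmax policy over three discrete actions; the rollout data and step size are invented, and this is not the lecture's code:

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()
        e = np.exp(z)
        return e / e.sum()

    theta = np.zeros(3)                 # logits of the toy policy
    alpha = 0.1                         # learning rate

    # One Monte Carlo rollout: actions taken and the return observed after each of them.
    actions = [0, 2, 1]
    returns = [1.0, 0.5, -0.2]

    # REINFORCE: push up the log-probability of each action in proportion to its return.
    for a, G in zip(actions, returns):
        probs = softmax(theta)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0           # gradient of log softmax(theta)[a] w.r.t. theta
        theta += alpha * G * grad_log_pi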

13
Q

What are the main advantages and disadvantages of the REINFORCE algorithm?

A
  1. Advantage: It provides an unbiased estimate of the gradient.
  2. Disadvantage: It can have high variance and requires many rollouts to converge.
14
Q

How do actor-critic methods differ from the REINFORCE algorithm?

A

Actor-critic methods use a value-based approach to estimate the return, which makes them more efficient by reducing variance compared to pure Monte Carlo methods.
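A hedged sketch of the difference, reusing the same kind of toy softmax policy: the Monte Carlo return is replaced by an estimate from a learned critic (here a small state-value table), which lowers variance. All quantities are illustrative:

    import numpy as np

    n_states, n_actions = 2, 2
    V = np.zeros(n_states)                   # critic: state-value estimates
    theta = np.zeros((n_states, n_actions))  # actor: per-state logits
    gamma, alpha_actor, alpha_critic = 0.99, 0.1, 0.5

    # One hypothetical transition (s, a, r, s_next):
    s, a, r, s_next = 0, 1, 1.0, 1

    # The critic supplies the return estimate through the TD error.
    delta = r + gamma * V[s_next] - V[s]

    # Actor update: same grad-log-pi direction as REINFORCE, but scaled by delta
    # instead of a full Monte Carlo rollout return.
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log_pi

    # Critic update: move V[s] toward the bootstrapped target.
    V[s] += alpha_critic * delta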

15
Q

What is the role of the entropy regularizer in policy gradient methods?

A

encourages the policy to remain stochastic, preventing it from becoming deterministic too quickly and ensuring better exploration.
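A small sketch of how an entropy bonus is typically added to the objective; the coefficient beta, the placeholder objective value, and the probabilities are all assumptions for illustration:

    import numpy as np

    def entropy(probs):
        # Shannon entropy of the action distribution; larger means a more stochastic policy.
        return -np.sum(probs * np.log(probs + 1e-8))

    probs = np.array([0.7, 0.2, 0.1])    # current action probabilities in some state
    pg_objective = 0.42                  # placeholder for the usual policy-gradient term
    beta = 0.01                          # entropy coefficient (hyperparameter)

    # Maximizing this keeps the policy from collapsing to a deterministic one too early.
    regularized_objective = pg_objective + beta * entropy(probs)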

16
Q

Why is a baseline used in policy gradient methods, and how does it improve efficiency?

A

reduces the variance of the gradient estimator without introducing bias, leading to more stable and efficient updates.

17
Q

What is the key concept behind using a baseline in policy gradient updates?

A

subtracting a baseline from the return reduces the gradient’s variance, allowing for more efficient numerical updates and faster convergence.
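In the usual notation (assumed here), the baseline b(s_t) is subtracted from the return inside the gradient estimator; the second identity is why this leaves the estimate unbiased:

    \nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R_t - b(s_t)\big)\big],
    \qquad
    \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = 0.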

18
Q

What are the two main components of the policy gradient theorem?

A
  1. Estimating the gradient using sampled trajectories.
  2. Updating the policy by increasing the probability of actions that lead to higher rewards.
19
Q

Why is the softmax function used in policy optimization with a finite number of actions?

A

The softmax function ensures that the policy outputs a probability distribution over actions by transforming the output of the neural network into probabilities.
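A minimal, numerically stable softmax applied to network outputs (logits); the logit values are made up:

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()        # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()               # non-negative and sums to 1: a valid distribution

    action_probs = softmax(np.array([1.2, -0.3, 0.4]))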

20
Q

At what stage of the neural network is the softmax function applied in policy optimization?

A

at the last layer of the neural network to generate a valid probability distribution over actions.

21
Q

How can value-based methods and policy-based methods be related in specific settings?

A

In certain settings, depending on the loss function and entropy regularization, value-based methods and policy-based methods can become equivalent, with both aiming to find the optimal policy.
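One well-known instance of this connection (stated as a general max-entropy RL fact, not necessarily the lecture's derivation): with an entropy bonus weighted by a temperature \tau, the optimal policy is a softmax over the corresponding soft Q-values,

    \pi^{*}(a \mid s) \;\propto\; \exp\!\big(Q^{*}_{\mathrm{soft}}(s, a)/\tau\big),

so a value-based method that learns the soft Q-values and an entropy-regularized policy-based method target the same policy.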

22
Q

What is the main goal of benchmarking deep reinforcement learning algorithms?

A

to evaluate their performance under controlled and reproducible conditions

23
Q

How do we test the effectiveness of a reinforcement learning algorithm during benchmarking?

A

By averaging its performance across multiple learning trials and, if possible, applying significance testing techniques to the results.
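A small sketch of this evaluation step, assuming final scores from five independent trials per algorithm; the numbers are invented and Welch's t-test is just one possible significance test:

    import numpy as np
    from scipy import stats

    # Final returns of two algorithms over 5 independent trials each (illustrative numbers).
    algo_a = np.array([210.0, 195.0, 220.0, 205.0, 199.0])
    algo_b = np.array([188.0, 230.0, 175.0, 240.0, 182.0])

    print(algo_a.mean(), algo_a.std())   # average performance and spread per algorithm
    print(algo_b.mean(), algo_b.std())

    # Optional significance test on the difference in means (Welch's t-test).
    t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=False)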

24
Q

Why is it necessary to conduct multiple learning trials in benchmarking RL algorithms?

A

Because stochasticity, such as randomness in neural network initialization and environment variability, can affect the results significantly.

25
Q

Why should benchmarking results not be over-interpreted?

A

A hypothesis may hold for specific environments and hyperparameter settings but fail in other scenarios, so results may not generalize well.

26
Q

What steps are important to ensure a fair comparison between RL algorithms?

A

Ensuring the use of the same random seeds and applying identical hyperparameter tuning procedures across algorithms.

27
Q

Why is using top-K trials inadequate for fair benchmarking of RL algorithms?

A

It can exaggerate the performance of certain algorithms, as it only reflects the best outcomes rather than the average robustness.

28
Q

What is a more reliable metric than top-K trials for reporting the performance of RL algorithms?

A

The average performance across all trials, as it better reflects the true robustness of an algorithm.
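A tiny illustration of the difference, with invented scores from five seeds:

    import numpy as np

    trials = np.array([90.0, 55.0, 60.0, 95.0, 40.0])   # returns from 5 seeds (made up)

    top2_mean = np.sort(trials)[-2:].mean()   # 92.5 -- reflects only the luckiest runs
    overall_mean = trials.mean()              # 68.0 -- reflects robustness across all seeds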

29
Q

What are the three main components that an RL agent may include (it may include one or more of them)?

A
  1. A representation of a value function for predicting how good a state or state-action pair is
  2. A direct representation of the policy that maps states to actions
  3. A model of the environment for planning.
30
Q

How do model-free RL methods differ from model-based methods?

A

Model-free methods learn directly from interactions with the environment without building a model, while model-based methods involve learning a model of the environment and using it for planning actions.

31
Q

What are the respective strengths of model-based and model-free RL methods?

A
  1. Model-free methods are preferable when the agent lacks access to a generative model of the environment, avoiding inaccuracies in learned models.
  2. Model-based methods work well with planning algorithms but can be computationally demanding.
  3. For certain tasks (e.g., structured tasks like mazes), model-based methods can learn the environment more efficiently due to the specific nature of the task.