Lecture 2 Flashcards

1
Q

What is the primary focus of value-based reinforcement learning methods?

A
  • estimating the expected return (cumulative discounted reward) for each state (or state-action pair) in an environment to determine how “good” it is to be in a particular state
  • This estimate informs the agent’s decision-making process.
2
Q

How do value-based reinforcement learning methods work?

A
  • they rely on learning value functions that provide the expected return (reward) from a given state or state-action pair
  • The agent uses these value functions to decide on the best actions indirectly.
3
Q

What is the difference between model-free, model-based, value-based, and policy-based reinforcement learning?

A
  1. Model-free RL: The agent learns directly from experience without a model of the environment, focusing on learning a policy or a value function.
  2. Model-based RL: The agent learns a model of the environment and uses it for planning future actions.
  3. Value-based RL: Focuses on estimating value functions to indirectly determine an optimal policy.
  4. Policy-based RL: Directly optimizes the policy itself without learning a value function first.
4
Q

Which common algorithm is an example of a value-based method in reinforcement learning?

A

Q-learning

5
Q

What is the Bellman operator, and what does it define?

A
  • defines a recursive relationship for the value function and Q-value function
  • expresses the value of a state (or state-action pair) as the expected reward plus the discounted value of the future state, assuming the agent follows a certain policy
  • this equation is key to iteratively updating the value of a state in value-based methods.
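A sketch of this recursion in LaTeX, using the lecture's notation (x for state, a for action, x′ for the next state drawn from the transition function T, γ for the discount factor); the exact form on the slides may differ slightly:

Q^{\pi}(x,a) = \mathbb{E}_{x' \sim T(\cdot \mid x,a)} \big[ R(x,a) + \gamma \, Q^{\pi}(x', \pi(x')) \big]

and, for the optimal Q-values targeted by Q-learning,

Q^{*}(x,a) = \mathbb{E}_{x' \sim T(\cdot \mid x,a)} \big[ R(x,a) + \gamma \max_{a'} Q^{*}(x',a') \big]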
6
Q

What are the key components of a Markov Decision Process (MDP)?

A
  1. X: state space
  2. A: action space
  3. T: transition function, specifying the probability distribution of next states given the current state and action
  4. R: reward function, specifying the immediate reward the agent receives after performing an action in a state
  5. γ: discount factor, indicating how much future rewards are valued relative to immediate rewards
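As a concrete illustration of these five components, a tiny MDP can be written out directly in Python; the two states, two actions, and all numbers below are invented for illustration, not taken from the lecture:

# A toy MDP spelled out as the tuple (X, A, T, R, gamma) from the card above.
X = ["x0", "x1"]                       # state space
A = ["stay", "move"]                   # action space
T = {                                  # T[x][a] -> {next state: probability}
    "x0": {"stay": {"x0": 1.0}, "move": {"x1": 0.9, "x0": 0.1}},
    "x1": {"stay": {"x1": 1.0}, "move": {"x0": 0.9, "x1": 0.1}},
}
R = {                                  # R[x][a] -> immediate reward
    "x0": {"stay": 0.0, "move": 1.0},
    "x1": {"stay": 2.0, "move": 0.0},
}
gamma = 0.9                            # discount factor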
7
Q

What is the value function in reinforcement learning?

A
  • V(x)
  • represents the expected future return when the agent starts in state 𝑥 and follows policy 𝜋
  • It is the expected sum of discounted rewards starting from state x
8
Q

What is the Q-value function, and how does it differ from the value function?

A
  • The Q-value function Q(x,a) represents the expected return of taking action a in state 𝑥 and then following policy 𝜋.
  • Unlike the value function, the Q-value function provides a measure for specific state-action pairs, allowing for direct policy optimization by choosing the action that maximizes the Q-value.
9
Q

How is the optimal policy derived using the Q-value function?

A
  • π∗(x) = argmax_a Q(x,a)
  • derived by selecting the action 𝑎 that maximizes the Q-value for a given state 𝑥
  • In Q-learning, the agent updates its Q-values iteratively using the Bellman equation until they converge to the optimal Q-values.
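A minimal tabular sketch of this card in Python; the learning rate alpha, the dictionary-based Q-table, and the update toward r + γ·max_a′ Q(x′,a′) are standard Q-learning choices assumed here, not copied from the lecture:

from collections import defaultdict

alpha, gamma = 0.1, 0.9                       # learning rate and discount factor (illustrative values)
Q = defaultdict(lambda: defaultdict(float))   # tabular Q-values Q[x][a], initialized to 0

def q_learning_update(x, a, r, x_next, actions):
    # One iterative update: move Q(x, a) toward the Bellman target r + gamma * max_a' Q(x', a').
    target = r + gamma * max(Q[x_next][a2] for a2 in actions)
    Q[x][a] += alpha * (target - Q[x][a])

def greedy_policy(x, actions):
    # pi*(x) = argmax_a Q(x, a)
    return max(actions, key=lambda a: Q[x][a])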
10
Q

What does the Bellman equation for the Q-function express?

A
  • The Bellman equation expresses the Q-value of a state-action pair as the sum of the immediate reward and the expected future value of the next state.
  • It provides a recursive way to compute the expected return for a given policy.
11
Q

What is the key takeaway of the Bellman equation for the Q-function?

A

the Q-function recursively estimates the value of state-action pairs based on the rewards received from taking actions in the environment and the values of future states

12
Q

What are the three main methods to obtain Q-values in reinforcement learning?

A
  1. Solving the system of Bellman equations: If the transition function T and reward function R are known, solve the equations directly.
  2. Dynamic programming: Initialize the Q-values and repeatedly apply Bellman iterations until convergence, assuming T and R are known.
  3. Reinforcement learning: Perform Bellman iterations from data (trials and errors in the environment) without prior knowledge of T and R.
13
Q

How does reinforcement learning differ from dynamic programming in obtaining Q-values?

A
  • Unlike dynamic programming, reinforcement learning does not require prior knowledge of the environment’s transition function T and reward function R
  • Instead, it learns Q-values through interactions with the environment alone.
14
Q

Which method is used in Q-learning to obtain Q-values, and why is it significant?

A
  • Reinforcement learning
  • This is significant because it is a model-free approach where the agent can learn optimal policies through interaction and experience without needing prior knowledge of the environment.
15
Q

What is dynamic programming, and how is it used in reinforcement learning?

A
  • method for solving problems by breaking them into simpler subproblems, solving each subproblem once, and storing the solutions to avoid redundant calculations.
  • In reinforcement learning, it is used when the model (transition dynamics T and reward function R) is known and involves repeatedly applying Bellman equations to find value functions and optimal policies.
16
Q

What is the chain problem in reinforcement learning?

A
  • a simple environment where an agent can move between states by taking actions
  • represented as a chain of states with possible transitions between states and actions that yield rewards.
17
Q

How is the Bellman equation used to solve the chain problem?

A
  • The Bellman equation is applied iteratively to update Q-values for each state-action pair.
  • At each iteration, the previously updated Q-values are used to compute new ones until convergence is reached.
18
Q

What is the process for solving the chain problem using tabular Q-values and dynamic programming?

A
  1. Initialize all Q-values to zero.
  2. Apply the Bellman iterations to update the Q-values for each state-action pair.
  3. Continue iterating until the Q-values converge to a fixed point.
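A hedged Python sketch of exactly this process on a toy chain; the 5-state layout, the two actions, the rewards, and the discount factor are invented for illustration and need not match the lecture's chain:

# Q-value iteration (dynamic programming) on a toy 5-state chain with known T and R.
# Action 0 = "forward" (move right, large reward only at the last state),
# action 1 = "back"    (jump back to state 0 for a small reward).
n_states, gamma = 5, 0.9
Q = [[0.0, 0.0] for _ in range(n_states)]     # step 1: initialize all Q-values to zero

def step(x, a):
    # Deterministic toy dynamics: return (next_state, reward).
    if a == 1:                                # "back"
        return 0, 2.0
    if x == n_states - 1:                     # "forward" from the last state
        return x, 10.0
    return x + 1, 0.0                         # "forward" elsewhere

for _ in range(200):                          # steps 2-3: Bellman iterations until (near) convergence
    new_Q = [[0.0, 0.0] for _ in range(n_states)]
    for x in range(n_states):
        for a in (0, 1):
            x_next, r = step(x, a)
            new_Q[x][a] = r + gamma * max(Q[x_next])
    Q = new_Q

policy = [max((0, 1), key=lambda a: Q[x][a]) for x in range(n_states)]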
19
Q

What is the goal of Q-learning in the context of a grid-world MDP?

A
  • to determine the value function V and the optimal policy 𝜋.
  • The value function represents the maximum expected return (reward) the agent can obtain from each state
  • the optimal policy maps each state to the action that maximizes the Q-value.
20
Q

How is the value function computed in a grid-world MDP using Q-learning?

A
  • V(x) = max_a Q(x,a)
  • the maximum Q-value is taken over all possible actions from the given state
21
Q

How is the optimal policy determined in a grid-world MDP using Q-learning?

A
  • π(x) = argmax_a Q(x,a)
  • selecting the action that maximizes the Q-value for each state
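A small illustration of the two formulas in this card and the previous one, assuming a tabular Q stored as a NumPy array (the 3 states, 4 actions, and all numbers are made up):

import numpy as np

Q = np.array([[0.1, 0.5, 0.2, 0.0],   # rows: states, columns: actions (illustrative values)
              [0.7, 0.3, 0.0, 0.1],
              [0.0, 0.0, 0.9, 0.4]])

V  = Q.max(axis=1)       # V(x)  = max_a Q(x, a)
pi = Q.argmax(axis=1)    # pi(x) = argmax_a Q(x, a)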
22
Q

How does the agent propagate the value backward in a grid-world MDP?

A

The agent starts propagating the value backward from the terminal state

23
Q

Why are function approximators used in Q-learning with deep learning?

A
  • Function approximators are used to deal with the curse of dimensionality that arises when the state space or action space is large or continuous.
  • A tabular approach fails in such cases due to the exponential growth in the number of states or actions.
24
Q

When are function approximators needed in reinforcement learning?

A
  1. when the state space is large and/or continuous (as in DQN).
  2. when the action space is large and/or continuous.
25
Q

How does Q-learning with a function approximator work in continuous spaces?

A
  • the Q-value function is represented using a function approximator with parameters
  • the parameters are updated using gradient descent to minimize the error between predicted Q-values and target Q-values
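A minimal sketch with a linear function approximator Q_θ(x, a) = θ_a · φ(x), updated by gradient descent on the squared Bellman error; the feature map φ, the sizes, and the learning rate are assumptions made for illustration:

import numpy as np

n_features, n_actions = 4, 2
theta = np.zeros((n_actions, n_features))     # one weight vector per action
alpha, gamma = 0.01, 0.9                      # learning rate and discount (illustrative)

def q_value(phi_x, a):
    # Linear approximation: Q_theta(x, a) = theta[a] . phi(x)
    return theta[a] @ phi_x

def update(phi_x, a, r, phi_x_next):
    # Gradient-descent step on (target - Q_theta(x, a))^2; for a linear Q the gradient is phi(x).
    target = r + gamma * max(q_value(phi_x_next, b) for b in range(n_actions))
    error = target - q_value(phi_x, a)
    theta[a] += alpha * error * phi_x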
26
Q

What are the key components of the DQN algorithm?

A
  1. Q-network: A neural network that approximates the Q-values with parameters θ.
  2. Replay memory: Stores past experiences to break temporal correlations and improve learning efficiency.
  3. Target network: A separate network that stabilizes training by being updated periodically with the weights from the current Q-network, preventing instability from frequent updates.
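A compact PyTorch skeleton of these three components; the network sizes, the memory capacity, and the sync schedule are placeholder choices, not the lecture's:

from collections import deque
import torch
import torch.nn as nn

n_state_features, n_actions = 4, 2                 # placeholder problem sizes

# 1. Q-network with parameters theta.
q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))

# 3. Target network: starts as a copy of the Q-network and is only synced periodically.
target_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())

# 2. Replay memory: stores (x, a, r, x_next, done) transitions.
replay_memory = deque(maxlen=10_000)

def sync_target():
    # Called every fixed number of training steps to stabilize the targets.
    target_net.load_state_dict(q_net.state_dict())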
27
Q

How does the target network help stabilize training in the DQN algorithm?

A
  • by holding its weights fixed for a certain number of iterations and updating them periodically from the Q-network
  • This reduces oscillations and instability caused by constantly changing Q-values during training.
28
Q

What is the DQN update process during training?

A
  1. Samples a mini-batch of experiences from the replay memory.
  2. Updates the Q-network parameters using the Bellman update rule, with target values computed from the target network.
  3. Derives the policy by selecting the action that maximizes the Q-value at each state.
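Continuing the placeholder skeleton from card 26, one training step might look roughly like this (the optimizer, batch size, and MSE loss are assumptions; q_net, target_net, and replay_memory are reused from that sketch):

import random
import torch
import torch.nn as nn
import torch.optim as optim

optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
gamma, batch_size = 0.99, 32

def train_step():
    # 1. Sample a mini-batch of experiences from the replay memory.
    batch = random.sample(replay_memory, batch_size)
    x      = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    a      = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    r      = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    x_next = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    done   = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # 2. Bellman update: regress Q(x, a) toward r + gamma * max_a' Q_target(x', a').
    q_pred = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(x_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def act(x):
    # 3. Greedy policy: pick the action with the highest predicted Q-value in state x.
    with torch.no_grad():
        return int(q_net(torch.as_tensor(x, dtype=torch.float32)).argmax())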
29
Q

What is the primary aim of Distributional DQN?

A

aims to model the distribution of possible cumulative returns, rather than just the expectation (mean) of the returns, providing a richer representation of future rewards.

30
Q

How does Distributional DQN represent the state-action value function differently from standard DQN?

A
  • in Distributional DQN, the state-action value function is derived from a return distribution Z(x,a)
  • the expected value of this distribution gives the Q-value: Q(x,a) = E[Z(x,a)]
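A small NumPy sketch of the idea using a categorical (C51-style) distribution over a fixed set of return "atoms"; the atom range, the number of atoms, and the uniform probabilities are illustrative assumptions:

import numpy as np

# Z(x, a) is represented as a probability vector over fixed return values ("atoms").
atoms = np.linspace(-10.0, 10.0, 51)     # support of possible returns (illustrative range)
p = np.full(51, 1.0 / 51)                # predicted probabilities for one (x, a) pair

q_value = float(np.sum(p * atoms))       # Q(x, a) = E[Z(x, a)]: the expectation recovers the usual Q-value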
31
Q

What are the advantages of using Distributional DQN?

A
  1. Risk-aware behavior: By considering the distribution of returns, it can learn policies that account for risk and uncertainty.
  2. Improved performance: The richer representation leads to more informative learning signals, enhancing learning efficiency.
32
Q

What is multi-step learning in the context of DQN?

A
  • extends DQN by considering multiple time steps to compute the target value, rather than relying solely on the immediate next state
  • this approach improves the learning process by using more information from future states
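A hedged sketch of the n-step target that such a method might use (n = 3 and the function signature are arbitrary illustration choices):

gamma, n = 0.9, 3

def n_step_target(rewards, x_n, q_values, actions):
    # Target = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_{n-1} + gamma^n * max_a Q(x_n, a),
    # where rewards are the n rewards observed after (x_0, a_0), x_n is the state reached
    # after n steps, and q_values[x][a] holds the current (bootstrapped) estimates.
    target = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    target += gamma**n * max(q_values[x_n][a] for a in actions)
    return target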
33
Q

How does the learning algorithm work in multi-step learning?

A

The learning algorithm in multi-step learning bootstraps by recursively using its own value estimates across multiple time steps to compute a more accurate target value.

34
Q

How is the discount factor related to the concept of delayed gratification?

A

A higher discount factor represents the agent’s ability to wait for bigger rewards, similar to how humans develop the ability to delay gratification as they age.

35
Q

What is the advantage of using an adaptive discount factor in deep reinforcement learning?

A
  • An adaptive discount factor allows the agent to dynamically adjust how it values future rewards during the learning process.
  • This makes the agent more flexible and potentially more effective compared to using a fixed discount factor, as it can improve performance by better balancing immediate and future rewards.