Lecture 2 Flashcards
What is the primary focus of value-based reinforcement learning methods?
- estimating the expected return (cumulative discounted reward) for each state (or state-action pair) in an environment to determine how “good” it is to be in a particular state
- This information informs the agent’s decision-making process.
How do value-based reinforcement learning methods work?
- they rely on learning value functions that provide the expected return (reward) from a given state or state-action pair
- The agent uses these value functions to decide on the best actions indirectly.
What is the difference between model-free, model-based, value-based, and policy-based reinforcement learning?
- Model-free RL: The agent learns directly from experience without a model of the environment, focusing on learning a policy or a value function.
- Model-based RL: The agent learns a model of the environment and uses it for planning future actions.
- Value-based RL: Focuses on estimating value functions to indirectly determine an optimal policy.
- Policy-based RL: Directly optimizes the policy itself without learning a value function first.
Which common algorithm is an example of a value-based method in reinforcement learning?
Q-learning
What is the Bellman operator, and what does it define?
- defines a recursive relationship for the value function and Q-value function
- expresses the value of a state (or state-action pair) as the expected immediate reward plus the discounted value of the next state, assuming the agent follows a certain policy
- this equation is key to iteratively updating the value of a state in value-based methods.
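Written out for a fixed policy π, the Bellman equations take the following form (a sketch in the X, A, T, R, γ notation of the next card, assuming finite state and action spaces and a reward R(x,a) that depends only on the current state and action):

```latex
V^{\pi}(x)   = \sum_{a} \pi(a \mid x)\Big[ R(x,a) + \gamma \sum_{x'} T(x' \mid x, a)\, V^{\pi}(x') \Big]

Q^{\pi}(x,a) = R(x,a) + \gamma \sum_{x'} T(x' \mid x, a) \sum_{a'} \pi(a' \mid x')\, Q^{\pi}(x', a')
```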
What are the key components of a Markov Decision Process (MDP)?
- X: state space
- A: action space
- T: transition function, specifying the probability distribution of next states given the current state and action
- R: reward function, specifying the immediate reward the agent receives after performing an action in a state
- γ: discount factor, indicating how much future rewards are valued relative to immediate rewards
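As a concrete illustration, a small MDP can be written down directly as plain data structures; the two-state example below is made up, but it mirrors the (X, A, T, R, γ) components listed above.

```python
# Hypothetical two-state MDP, spelled out as plain Python data structures.
X = ["x0", "x1"]                      # state space
A = ["stay", "move"]                  # action space

# T[(x, a)] -> {next_state: probability}: the transition function
T = {
    ("x0", "stay"): {"x0": 1.0},
    ("x0", "move"): {"x1": 0.9, "x0": 0.1},
    ("x1", "stay"): {"x1": 1.0},
    ("x1", "move"): {"x0": 0.9, "x1": 0.1},
}

# R[(x, a)] -> immediate reward: the reward function
R = {
    ("x0", "stay"): 0.0,
    ("x0", "move"): 0.0,
    ("x1", "stay"): 1.0,
    ("x1", "move"): 0.0,
}

gamma = 0.9                           # discount factor
```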
What is the value function in reinforcement learning?
- V(x)
- represents the expected future return when the agent starts in state 𝑥 and follows policy 𝜋
- It is the expected sum of discounted rewards starting from state x
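In symbols, one common way to write this (assuming rewards r_t are collected while following π from x_0 = x):

```latex
V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; x_{0} = x \right]
```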
What is the Q-value function, and how does it differ from the value function?
- The Q-value function Q(x,a) represents the expected return of taking action a in state 𝑥 and then following policy 𝜋.
- Unlike the value function, the Q-value function provides a measure for specific state-action pairs, allowing for direct policy optimization by choosing the action that maximizes the Q-value.
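The corresponding definition conditions on the first action as well (same assumptions as for V above):

```latex
Q^{\pi}(x,a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; x_{0} = x,\; a_{0} = a \right]
```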
How is the optimal policy derived using the Q-value function?
- π∗(x) = argmax_a Q(x,a)
- derived by selecting the action 𝑎 that maximizes the Q-value for a given state 𝑥
- In Q-learning, the agent updates its Q-values iteratively using the Bellman equation until they converge to the optimal Q-values.
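A minimal sketch of that iterative update for a tabular Q-function is shown below; the learning rate alpha and the helper names are assumptions made for illustration, not notation from the lecture.

```python
from collections import defaultdict

# One tabular Q-learning update toward the Bellman target r + gamma * max_a' Q(x', a').
def q_learning_update(Q, x, a, r, x_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(x_next, a2)] for a2 in actions)
    Q[(x, a)] += alpha * (target - Q[(x, a)])

# Greedy policy extraction once the Q-values have (approximately) converged:
# pi*(x) = argmax_a Q(x, a)
def greedy_action(Q, x, actions):
    return max(actions, key=lambda a: Q[(x, a)])

Q = defaultdict(float)   # unseen state-action pairs default to a Q-value of zero
```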
What does the Bellman equation for the Q-function express?
- The Bellman equation expresses the Q-value of a state-action pair as the sum of the immediate reward and the expected future value of the next state.
- It provides a recursive way to compute the expected return for a given policy.
What is the key takeaway of the Bellman equation for the Q-function?
the Q-function recursively estimates the value of state-action pairs based on the rewards received from taking actions in the environment and the values of future states
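For the optimal Q-function that Q-learning converges to, the recursion replaces the expectation over the policy's next action with a max (same finite-MDP assumptions as before):

```latex
Q^{*}(x,a) = R(x,a) + \gamma \sum_{x'} T(x' \mid x, a)\, \max_{a'} Q^{*}(x', a')
```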
What are the three main methods to obtain Q-values in reinforcement learning?
- Solving the system of Bellman equations: if the transition function T and reward function R are known, the equations can be solved directly (for a fixed policy they form a linear system; see the sketch after this list).
- Dynamic programming: Initialize the Q-values and repeatedly apply Bellman iterations until convergence, assuming T and R are known.
- Reinforcement learning: Perform Bellman iterations from data (trials and errors in the environment) without prior knowledge of T and R.
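For the first method, the Bellman equations of a fixed policy are linear and can be solved in closed form. The sketch below uses numpy on a made-up two-state example, with P_pi the transition matrix under the policy and R_pi the expected immediate rewards (both assumed known).

```python
import numpy as np

# Policy evaluation by solving the linear system (I - gamma * P_pi) V = R_pi.
def solve_bellman_linear(P_pi, R_pi, gamma):
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Hypothetical two-state example (rows of P_pi sum to 1).
P_pi = np.array([[0.1, 0.9],
                 [0.0, 1.0]])
R_pi = np.array([0.0, 1.0])
print(solve_bellman_linear(P_pi, R_pi, gamma=0.9))   # V^pi for each state
```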
How does reinforcement learning differ from dynamic programming in obtaining Q-values?
- Unlike dynamic programming, reinforcement learning does not require prior knowledge of the environment’s transition function T and reward function R
- Instead, it learns Q-values through interactions with the environment alone.
Which method is used in Q-learning to obtain Q-values, and why is it significant?
- Reinforcement learning
- This is significant because it is a model-free approach where the agent can learn optimal policies through interaction and experience without needing prior knowledge of the environment.
What is dynamic programming, and how is it used in reinforcement learning?
- method for solving problems by breaking them into simpler subproblems, solving each subproblem once, and storing the solutions to avoid redundant calculations.
- In reinforcement learning, it is used when the model (transition dynamics T and reward function R) is known and involves repeatedly applying Bellman equations to find value functions and optimal policies.
What is the chain problem in reinforcement learning?
- a simple environment where an agent can move between states by taking actions
- it is represented as a chain of states, with actions that move the agent between states and yield rewards
How is the Bellman equation used to solve the chain problem?
- The Bellman equation is applied iteratively to update Q-values for each state-action pair.
- At each iteration, the previously updated Q-values are used to compute new ones until convergence is reached.
What is the process for solving the chain problem using tabular Q-values and dynamic programming?
- Initialize all Q-values to zero.
- Apply the Bellman iterations to update the Q-values for each state-action pair.
- Continue iterating until the Q-values converge to a fixed point.
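A rough end-to-end sketch of this procedure on a small chain is given below; the four-state deterministic chain and its rewards are made-up numbers chosen only to illustrate the Bellman iterations.

```python
# Dynamic programming on a hypothetical 4-state chain with two actions.
n_states, gamma = 4, 0.9
actions = ["forward", "reset"]

def step(x, a):
    """Deterministic chain dynamics: 'forward' moves right, 'reset' jumps back to state 0."""
    if a == "forward":
        x_next = min(x + 1, n_states - 1)
        r = 10.0 if x == n_states - 1 else 0.0    # large reward at the end of the chain
    else:
        x_next, r = 0, 1.0                        # small immediate reward for resetting
    return x_next, r

Q = {(x, a): 0.0 for x in range(n_states) for a in actions}    # 1. initialise to zero
for _ in range(500):                                           # 2. Bellman iterations
    Q_new = {}
    for x in range(n_states):
        for a in actions:
            x_next, r = step(x, a)
            Q_new[(x, a)] = r + gamma * max(Q[(x_next, a2)] for a2 in actions)
    converged = max(abs(Q_new[k] - Q[k]) for k in Q) < 1e-8
    Q = Q_new
    if converged:                                              # 3. stop at the fixed point
        break
```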
What is the goal of Q-learning in the context of a grid-world MDP?
- to determine the value function V and the optimal policy 𝜋.
- The value function represents the maximum expected return (reward) the agent can obtain from each state
- the optimal policy maps each state to the action that maximizes the Q-value.
How is the value function computed in a grid-world MDP using Q-learning?
- V(x) = max_a Q(x,a)
- the maximum Q-value is taken over all possible actions from the given state
How is the optimal policy determined in a grid-world MDP using Q-learning?
- π(x) = argmax_a Q(x,a)
- selecting the action that maximizes the Q-value for each state
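Given a tabular Q-function, both the value function and the policy fall out with a single max/argmax per state; the three-state Q-table below is a made-up example.

```python
import numpy as np

# Rows are states, columns are actions: Q[x, a].
Q = np.array([[0.0, 0.5],
              [0.2, 0.9],
              [1.0, 0.3]])

V = Q.max(axis=1)       # V(x)  = max_a Q(x, a)      -> [0.5 0.9 1. ]
pi = Q.argmax(axis=1)   # pi(x) = argmax_a Q(x, a)   -> [1 1 0]
```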
How does the agent propagate the value backward in a grid-world MDP?
Value estimates propagate backward from the terminal state: states next to the terminal reward are updated first, and successive Bellman updates push those values back toward earlier states.
Why are function approximators used in Q-learning with deep learning?
- Function approximators are used to deal with the curse of dimensionality that arises when the state space or action space is large or continuous.
- A tabular approach fails in such cases due to the exponential growth in the number of states or actions.
When are function approximators needed in reinforcement learning?
- when the state space is large and/or continuous (as in DQN).
- when the action space is large and/or continuous.
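As an illustration, a neural-network Q-function in the spirit of DQN can be sketched as below; the PyTorch dependency, the layer sizes, and the CartPole-like dimensions are assumptions for the example, not details from the lecture.

```python
import torch
import torch.nn as nn

# A small Q-network: maps a continuous state vector to one Q-value per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)                       # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                        # e.g. a CartPole-like observation
action = q_net(state).argmax(dim=1).item()       # greedy action from the approximated Q-values
```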