Week 5 Flashcards

Question

What does exploration mean in decision-making strategies?

Answer 1

Exploration refers to taking different actions to gather more information about the environment, which can potentially lead to discovering more valuable actions.

Answer 2

The drawback of exploration is that it does not necessarily maximize immediate reward, whereas the goal of reinforcement learning (RL) is to maximize cumulative reward over time.

Answer 3

A balance between exploitation and exploration is necessary to achieve optimal decision-making, ensuring both the use of the best-known options and the discovery of potentially better choices.

Answer 4

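With probability epsilon (that is, on a fraction epsilon of trials) we choose a random action (exploration); otherwise we choose the best-known action (exploitation).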
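
A minimal sketch of the epsilon rule above, assuming action values are kept in a plain list. The function name `epsilon_greedy` and its parameters are illustrative, not taken from the course material.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly at random among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the rule never explores, and with `epsilon=1.0` every choice is random.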

Answer 5

Random exploration is completely random; directed exploration tries to pick actions that have not been chosen yet or have not been chosen recently.
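
Two simple ways directed exploration could be realized, sketched under the assumption that the agent keeps a per-action count and a per-action "last chosen" step. Both helper names are hypothetical.

```python
def least_tried_action(counts):
    """Directed exploration: return the action chosen least often so far."""
    return min(range(len(counts)), key=lambda a: counts[a])

def least_recent_action(last_chosen_step):
    """Directed exploration: return the action chosen longest ago.

    last_chosen_step[a] is the step at which action a was last taken,
    or -1 if it has never been taken.
    """
    return min(range(len(last_chosen_step)), key=lambda a: last_chosen_step[a])
```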

Answer 6

A fundamental concept in reinforcement learning. The equation is used to find the optimal policy, which tells an agent the best action to take in every state.

Answer 7

Used to find the optimal policy by iteratively improving the value function of each state. It is a form of dynamic programming that solves the Bellman Optimality Equation.
1. Initialization: You start by initializing the value V(s) for all states s in your state space to arbitrary values, except for the terminal states, which might be initialized to the final reward or zero.
2. Iteration: For each state s, you update the value V(s) by using the Bellman Optimality Equation, repeating until the values stop changing.
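
The initialization and iteration steps above match what is usually called value iteration. Here is a rough sketch on a made-up three-state chain. The MDP, the state names, the discount factor `gamma`, and the stopping tolerance `tol` are all assumptions for illustration.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """transitions[s][a] is a list of (prob, next_state, reward) tuples.

    States with no actions are treated as terminal and keep value 0.
    """
    V = {s: 0.0 for s in transitions}            # 1. Initialization
    while True:
        delta = 0.0
        for s, actions in transitions.items():   # 2. Iteration over states
            if not actions:
                continue                         # terminal state
            # Bellman optimality backup: best expected one-step return.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical deterministic chain: s0 -> s1 -> goal, reward 1 on reaching goal.
chain = {
    "s0": {"right": [(1.0, "s1", 0.0)], "stay": [(1.0, "s0", 0.0)]},
    "s1": {"right": [(1.0, "goal", 1.0)], "stay": [(1.0, "s1", 0.0)]},
    "goal": {},
}
V = value_iteration(chain)
```

The optimal policy is then read off by picking, in each state, the action whose backup equals V(s).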

Answer 8

DP is computationally expensive, especially as the size of the state space grows, which can make it impractical for large or complex environments.

Answer 9

DP assumes a perfect model of the environment, meaning it requires perfect knowledge of all state transitions and rewards.

Answer 10

Without perfect knowledge, DP cannot accurately predict the state transitions and rewards, which are necessary for finding the optimal policy.

Answer 11

Monte Carlo RL is characterized by learning directly from experience, without requiring a model of the environment's dynamics. The episode has to finish before values are updated. It does not bootstrap: a state's value is not updated from the estimated values of other states.

Answer 12

It is important to notice that you start at the reward (15) and go backwards to the first state (0).
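
The backward pass can be sketched as follows: returns are computed from the last step of a finished episode back to the first. The rewards and discount factor below are made up for illustration and are not the numbers from the card's example.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return G_t for every step of one finished episode.

    rewards[t] is the reward received after step t. The loop runs from the
    last step back to the first, using G_t = r_t + gamma * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical episode: no reward until the final step.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
```

Each state's value estimate would then be moved toward the return observed from that state onward.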

Answer 13

The convergence to the optimum can be slow, requiring a lot of sampling, and sufficient exploration must be maintained throughout the learning process.

Answer 14

This means that the learning process follows the policy that is currently being improved upon. In other words, the agent learns about the policy it is using to make decisions, as opposed to "off-policy" methods like Q-learning, where the agent learns about a potentially different policy from the one it follows.

Answer 15

On-policy methods, such as SARSA, ensure that the policy being evaluated and improved is the same as the policy being used to make decisions, which often leads to more stable and consistent learning.

Answer 16

On-policy methods can be less efficient than off-policy methods because they can only learn from the current policy, potentially leading to slower convergence to the optimal policy.

Answer 17

Off-policy methods, like Q-learning, can learn from data generated by any policy (exploratory or even suboptimal), making them more flexible and often faster at finding the optimal policy.

Answer 18

Off-policy methods can be less stable and more complex to implement because they must correctly account for the difference between the policy being evaluated and the policy used to generate the data.
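
The on-policy versus off-policy distinction shows up in a single line of the update rule. Below is a simplified sketch that assumes a tabular `Q` stored as a list of lists indexed `Q[state][action]`. The learning rate `alpha` and discount `gamma` are illustrative defaults, and terminal states are not handled.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses a_next, the action the agent actually takes next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the greedy action in s_next, whatever is taken."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

If the agent's next action is exploratory rather than greedy, SARSA's target reflects that choice while Q-learning's target ignores it.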

Answer 19

The curse of dimensionality refers to the phenomenon where the volume of the state space increases exponentially with each additional dimension, making computational problems much more complex.
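
A quick arithmetic illustration of that exponential growth, assuming each dimension is discretized into the same number of values. The resolution of 10 is an arbitrary choice.

```python
def state_count(values_per_dimension, dimensions):
    """Number of distinct states in a grid with the given resolution."""
    return values_per_dimension ** dimensions

# Each added dimension multiplies the number of states by 10.
for d in (1, 2, 5, 10):
    print(d, state_count(10, d))
```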

Answer 20

The credit assignment problem is the challenge of determining which actions or decisions led to a particular outcome, especially when many decisions are involved over time.

Answer 21

The first solution is Standard Temporal Difference (TD) Learning, which updates the value of a state-action pair based on the TD error: the difference between the predicted value and the reward actually received plus the estimated value of what follows.

Answer 22

The second solution is the Monte Carlo method, which uses the "long-term" memory of actions and rewards to update values, relying on complete episodes to make updates.

Answer 23

The third solution is Eligibility Traces, which provide a "short-term" memory of actions to bridge the temporal gap, allowing for credit to be assigned more accurately to actions that lead to a reward.
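
A rough sketch of how a trace bridges the temporal gap, using state values and accumulating traces. This is one common variant, assumed here for illustration. The episode, state names, and parameters `alpha`, `gamma`, and `lam` are made up. Setting `lam=0.0` reduces the update to one-step TD.

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Update state values V in place from one episode using accumulating traces.

    episode is a list of (state, reward, next_state) steps; next_state is None
    at the end of the episode.
    """
    trace = {s: 0.0 for s in V}
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]       # TD error for this step
        trace[s] += 1.0                         # mark the state just visited
        for x in V:
            V[x] += alpha * delta * trace[x]    # credit recently visited states
            trace[x] *= gamma * lam             # short-term memory fades
    return V

# Hypothetical episode: A -> B -> C, reward only at the end.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
V_td = td_lambda_episode({"A": 0.0, "B": 0.0, "C": 0.0}, episode, lam=0.0)
V_trace = td_lambda_episode({"A": 0.0, "B": 0.0, "C": 0.0}, episode, lam=0.8)
```

After one episode, one-step TD has only updated the last state, while the trace version has already passed some credit back to A and B, with less credit the further a state is from the reward.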

Answer 24

Model-free RL is simple and efficient, making it accessible and straightforward to implement without the need for a model of the environment.

Answer 25

Model-free RL can be slow and rigid. It often leads to outcome insensitivity, where the learning process doesn't adjust adequately to changes in the environment.

Answer 26

Model-based RL is fast and flexible. It allows for behavioral adjustments through planning, as it involves a model of the environment which can simulate future states.

Answer 27

Model-based RL is complex and costly. It requires a significant amount of computational resources to model the environment and update the model based on new information.

Answer 28

Human behavior is often seen as a mixture of both model-free and model-based RL, utilizing the simplicity of model-free methods and the strategic planning of model-based methods.

(53 cards)