Deep Reinforcement Learning Flashcards
What is reinforcement learning in contrast to supervised learning?
Reinforcement learning is about learning optimal behavior from reward signals, whereas supervised learning is about imitating behavior from labeled examples.
What is the agent state, and is it the same as the environment state? (s_t)
The information the agent uses to decide its next action. In a fully observable environment it coincides with the environment state.
What is the policy function with respect to the state? (pi(a|s_t))
A probability distribution over actions, conditioned on the current state.
What does “return” mean in the context of reinforcement learning? (G_t)
The discounted sum of future rewards: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... It is the quantity the reinforcement learning algorithm is trained to maximize.
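For concreteness, a minimal Python sketch of computing G_t from a list of rewards (the reward values and gamma below are made-up examples):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99*0.0 + 0.99**2 * 2.0
```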
What does the state-value function (v_pi(s)) tell us?
The expected return when starting in state s and following policy pi thereafter; in short, how good it is to be in state s.
What does the action-value function (q_pi(s,a)) tell us?
The expected return when starting in state s, taking action a, and following policy pi thereafter. It tells us how good it is to take action a from state s (and relates to the state value via v_pi(s) = sum_a pi(a|s) * q_pi(s,a)).
What is actor-critic?
- Combines policy-based and value-based control.
- In the policy gradient, the gradient is scaled by something smarter than the observed return G_t.
- A critic judges how good each action is (see the sketch under the last card below).
How does Monte-Carlo policy evaluation work?
To estimate, e.g., the state-value function, average the returns observed after visits to state s over N episodes.
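A minimal first-visit Monte-Carlo evaluation sketch; it assumes episodes arrive as (state, reward) pairs where the reward is the one received after leaving that state (the data format and gamma are assumptions for illustration):

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99):
    """episodes: list of trajectories, each a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):   # walk backwards to accumulate G_t
            g = reward + gamma * g
            first_visit_return[state] = g         # overwritten until the first visit remains
        for state, g in first_visit_return.items():
            returns[state].append(g)
    # Average the observed returns per state to estimate v_pi(s).
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```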
What is the intuition behind temporal-difference (TD) learning?
Value estimates are not independent of each other: the value of state s_t should, on average, equal the immediate reward plus the discounted value of s_t+1. The difference between these two estimates (the TD error) is used to update the value estimate.
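A tabular TD(0) update sketch; V is assumed to be a dict (or array) of value estimates initialized elsewhere, and alpha/gamma are assumed hyperparameters:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # TD error: how much the bootstrapped target disagrees with the current estimate.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error   # move V(s) a small step toward the target
    return td_error
```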
What is the Bellman expectation equation?
It expresses the value of a state under policy pi recursively, v_pi(s) = E_pi[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s], and leads to an algorithm for policy evaluation through TD-learning.
What is the Bellman optimality equation?
It characterizes the optimal value function recursively, v_*(s) = max_a E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a], and leads to an algorithm for policy optimization that estimates the optimal value function.
What is the main idea behind training in policy gradient methods?
Making actions that led to high returns more likely in the future, and actions that led to low returns less likely.
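A runnable REINFORCE-style sketch on a made-up two-armed bandit (arm 1 pays more on average), showing how scaling the log-probability gradient by the observed reward makes good actions more likely:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # one preference per action
alpha = 0.1           # learning rate (assumed)

for step in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(2, p=probs)
    reward = rng.normal(loc=[0.0, 1.0][a])        # hypothetical reward model
    # Gradient of log pi(a) for a softmax policy: one-hot(a) - probs.
    theta += alpha * reward * (np.eye(2)[a] - probs)

probs = np.exp(theta) / np.exp(theta).sum()
print(probs)   # the probability of the better arm should be close to 1
```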
What are some of the main obstacles in RL?
Trial-and-error learning is computationally expensive.
It is hard to create a suitable environment for the RL agent to learn in; a realistic enough simulator is often too expensive to build.
Designing an appropriate reward function for a real-world task is difficult.
What is the difference between policy iteration and value iteration?
Policy iteration: 1. Start with a random policy → 2. find the value function of that policy (policy evaluation) → 3. find an improved policy based on that value function (policy improvement) → 4. repeat from 2.
Value iteration: Start with a random value function → repeatedly improve it with the Bellman optimality backup until it is optimal → extract the optimal policy from the optimal value function, as in the sketch below.
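A value-iteration sketch on a tiny two-state, two-action MDP (the transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

# P[a][s][s'] = transition probability, R[a][s] = expected reward (both invented).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma, V = 0.9, np.zeros(2)

for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    Q = R + gamma * (P @ V)        # shape (actions, states)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)          # greedy policy extracted from the optimal values
print(V, policy)
```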
What are the main steps in actor-critic methods?
Alternation between policy evaluation (the critic estimates how good actions are) and policy improvement (the actor updates the policy), combining policy-based and value-based control.
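A tabular one-step actor-critic update sketch; theta (per-state action preferences) and V (state-value estimates) are assumed to be initialized elsewhere, and the hyperparameters are illustrative:

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # Critic: the TD error judges whether the outcome was better than expected.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error
    # Actor: policy-gradient step scaled by the critic's TD error instead of G_t.
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()   # softmax over preferences
    theta[s] += alpha_actor * td_error * (np.eye(len(probs))[a] - probs)
    return td_error
```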