Deep Reinforcement Learning Flashcards
What is reinforcement learning in contrast to supervised learning?
Reinforcement learning is about learning optimal behavior from reward signals, whereas supervised learning is about imitating behavior from labeled examples.
What is the agent state, and is it the same as the environment state? (s_t)
The information the agent uses to decide its next action. In a fully observable environment it coincides with the environment state.
What is the policy function with respect to the state? (pi(a|s_t))
A probability distribution over actions, conditioned on the current state.
What does “return” mean in the context of reinforcement learning? (G_t)
The discounted sum of future rewards: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... It is the quantity the reinforcement learning algorithm is trained to maximize.
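For concreteness, a minimal Python sketch of computing G_t from a list of rewards (the reward values and gamma below are made-up examples):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99*0.0 + 0.99**2 * 2.0
```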
What does the state-value function (v_pi(s)) tell us?
The expected return when starting in state s and following policy pi thereafter; in short, how good it is to be in state s.
What does the action-value function (q_pi(s,a)) tell us?
The expected return when starting in state s, taking action a, and following policy pi thereafter. It tells us how good it is to take action a from state s (and relates to the state value via v_pi(s) = sum_a pi(a|s) * q_pi(s,a)).
What is actor-critic?
- Combines policy-based and value-based control.
- In the policy gradient, the gradient is scaled by something smarter than the observed return G_t.
- A critic judges how good each action is (see the sketch under the last card below).
How does Monte-Carlo policy evaluation work?
To estimate, e.g., the state-value function, average the returns observed after visits to state s over N episodes.
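A minimal first-visit Monte-Carlo evaluation sketch; it assumes episodes arrive as (state, reward) pairs where the reward is the one received after leaving that state (the data format and gamma are assumptions for illustration):

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99):
    """episodes: list of trajectories, each a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):   # walk backwards to accumulate G_t
            g = reward + gamma * g
            first_visit_return[state] = g         # overwritten until the first visit remains
        for state, g in first_visit_return.items():
            returns[state].append(g)
    # Average the observed returns per state to estimate v_pi(s).
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```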
What is the intuition behind temporal-difference (TD) learning?
Value estimates are not independent of each other: the value of state s_t should, on average, equal the immediate reward plus the discounted value of s_t+1. The difference between these two estimates (the TD error) is used to update the value estimate.
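A tabular TD(0) update sketch; V is assumed to be a dict (or array) of value estimates initialized elsewhere, and alpha/gamma are assumed hyperparameters:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # TD error: how much the bootstrapped target disagrees with the current estimate.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error   # move V(s) a small step toward the target
    return td_error
```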
What is the Bellman expectation equation?
It expresses the value of a state under policy pi recursively, v_pi(s) = E_pi[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s], and leads to an algorithm for policy evaluation through TD-learning.
What is the Bellman optimality equation?
It characterizes the optimal value function recursively, v_*(s) = max_a E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a], and leads to an algorithm for policy optimization that estimates the optimal value function.
What is the main idea behind training in policy gradient methods?
Making actions that led to high returns more likely in the future, and actions that led to low returns less likely.
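A runnable REINFORCE-style sketch on a made-up two-armed bandit (arm 1 pays more on average), showing how scaling the log-probability gradient by the observed reward makes good actions more likely:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # one preference per action
alpha = 0.1           # learning rate (assumed)

for step in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(2, p=probs)
    reward = rng.normal(loc=[0.0, 1.0][a])        # hypothetical reward model
    # Gradient of log pi(a) for a softmax policy: one-hot(a) - probs.
    theta += alpha * reward * (np.eye(2)[a] - probs)

probs = np.exp(theta) / np.exp(theta).sum()
print(probs)   # the probability of the better arm should be close to 1
```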
What are some of the main obstacles in RL?
Trial-and-error learning is computationally expensive.
It is hard to create a suitable environment for the RL agent to learn in; a realistic enough simulator is often too expensive to build.
Designing an appropriate reward function for a real-world task is difficult.
What is the difference between policy iteration and value iteration?
Policy iteration: 1. Start with a random policy → 2. find the value function of that policy (policy evaluation) → 3. find an improved policy based on that value function (policy improvement) → 4. repeat from 2.
Value iteration: Start with a random value function → repeatedly improve it with the Bellman optimality backup until it is optimal → extract the optimal policy from the optimal value function, as in the sketch below.
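A value-iteration sketch on a tiny two-state, two-action MDP (the transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

# P[a][s][s'] = transition probability, R[a][s] = expected reward (both invented).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma, V = 0.9, np.zeros(2)

for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    Q = R + gamma * (P @ V)        # shape (actions, states)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)          # greedy policy extracted from the optimal values
print(V, policy)
```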
What are the main steps in actor-critic methods?
Alternation between policy evaluation (the critic estimates how good actions are) and policy improvement (the actor updates the policy), combining policy-based and value-based control.
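A tabular one-step actor-critic update sketch; theta (per-state action preferences) and V (state-value estimates) are assumed to be initialized elsewhere, and the hyperparameters are illustrative:

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # Critic: the TD error judges whether the outcome was better than expected.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error
    # Actor: policy-gradient step scaled by the critic's TD error instead of G_t.
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()   # softmax over preferences
    theta[s] += alpha_actor * td_error * (np.eye(len(probs))[a] - probs)
    return td_error
```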