Deep Reinforcement Learning Flashcards

1
Q

What is reinforcement learning in contrast to supervised learning?

A

Reinforcement learning is about learning optimal behavior from reward signals, whilst supervised learning is about imitating behavior given in labeled examples.

2
Q

What is the agent state (s_t) used for, and is it the same as the environment state?

A

It is the information the agent uses to select its next action. In a fully observable environment it can equal the environment state, but in general it is the agent's own summary of the history.

3
Q

What is the policy function with respect to the state? (pi(s_t))

A

A probability distribution over actions, conditioned on the current state s_t; the agent samples its next action from it.

4
Q

What does “return” mean in the context of reinforcement learning? (G_t)

A

The discounted sum of future rewards, G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... . It is the signal used to train the reinforcement learning algorithm.
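
A minimal sketch of how the return can be computed from one episode's recorded rewards (the function name and values are illustrative):

    # Compute the discounted return G_t for every step of one episode.
    def discounted_returns(rewards, gamma=0.99):
        returns = [0.0] * len(rewards)
        g = 0.0
        # Walk backwards using G_t = r_{t+1} + gamma * G_{t+1}.
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        return returns

    # Example: rewards 1, 0, 2 with gamma = 0.9 give [2.62, 1.8, 2.0].
    print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))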

5
Q

What does the state-value function (v_pi(s)) tell us?

A

How good it is to be in state s when following policy pi: the expected return starting from s, v_pi(s) = E_pi[G_t | S_t = s].

6
Q

What does the action-value function (q_pi(s,a)) tell us?

A

The expected return from state s when taking action a and following policy pi afterwards: q_pi(s, a) = E_pi[G_t | S_t = s, A_t = a]. It tells us how good it is to take a given action from state s.

7
Q

What is actor-critic?

A
  • Combines policy-based and value-based control:
    - In policy gradient, scale the gradient by something "smarter" than the observed return G_t.
    - A critic judges how good each action is.

8
Q

How does Monte-Carlo policy evaluation work?

A

To estimate, e.g., the state-value function, average the returns observed after visits to state s over N episodes.
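
A first-visit Monte-Carlo evaluation sketch in Python, assuming episodes are recorded as lists of (state, reward) pairs gathered while following the policy (all names here are illustrative):

    from collections import defaultdict

    def mc_evaluate(episodes, gamma=0.99):
        totals = defaultdict(float)   # sum of observed returns per state
        counts = defaultdict(int)     # number of first visits per state
        for episode in episodes:
            g = 0.0
            returns = []
            # Compute the return following each step, walking backwards.
            for state, reward in reversed(episode):
                g = reward + gamma * g
                returns.append((state, g))
            returns.reverse()
            seen = set()
            for state, g in returns:
                if state not in seen:   # count the first visit to each state only
                    seen.add(state)
                    totals[state] += g
                    counts[state] += 1
        # The estimate of v(s) is the average return observed after visiting s.
        return {s: totals[s] / counts[s] for s in totals}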

9
Q

What is the intuition behind temporal-difference learning?

A

Value estimates are not independent of each other: the value of state s_t should, on average, equal the immediate reward plus the discounted value of state s_{t+1}. The difference between the two (the TD error) is used to update the value estimate.
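
A minimal tabular TD(0) update in Python, assuming V is a dict of value estimates and (s, r, s_next) is one observed transition (names and step sizes are illustrative):

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        # TD error: how far the current estimate is from the bootstrapped target.
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error   # nudge the estimate towards the target
        return td_error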

10
Q

What is the Bellman expectation equation?

A

It expresses the value of a state under policy pi in terms of the expected immediate reward plus the discounted value of the successor state: v_pi(s) = E_pi[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s]. Turning this equation into an update rule leads to an algorithm for policy evaluation through TD learning.

11
Q

What is the Bellman optimality equation?

A

It expresses the optimal value of a state via the best action: v_*(s) = max_a E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a]. Turning it into an update rule leads to algorithms for policy optimization (e.g., value iteration) that estimate the optimal value function.

12
Q

What is the main idea behind training in policy gradient methods?

A

Making actions that led to high returns more likely in the future, and actions that led to low returns less likely.
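
A REINFORCE-style update sketch for a tabular softmax policy, assuming theta is an (n_states, n_actions) array of logits and returns holds G_t for each step (all names and the step size are illustrative):

    import numpy as np

    def reinforce_update(theta, episode, returns, alpha=0.01):
        # episode: list of (state, action) pairs taken under the current policy.
        for (s, a), g in zip(episode, returns):
            probs = np.exp(theta[s] - theta[s].max())
            probs /= probs.sum()
            # Gradient of log pi(a|s) w.r.t. the logits: one_hot(a) - probs.
            grad_logp = -probs
            grad_logp[a] += 1.0
            # Scale the step by the return: actions that led to high returns
            # become more likely, and vice versa.
            theta[s] += alpha * g * grad_logp
        return theta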

13
Q

What are some of the main obstacles in RL?

A

The trial-and-error learning approach is computationally expensive.
It is hard to create a suitable environment for an RL agent to learn in; a realistic enough environment is often too expensive to build.
Designing an appropriate reward function for a real-world scenario is difficult.

14
Q

What is the difference between policy iteration and value iteration?

A

Policy iteration: 1. start with a random policy → 2. compute the value function of that policy (policy evaluation) → 3. derive an improved policy from that value function (policy improvement) → 4. repeat from 2.
Value iteration: start with a random value function → repeatedly improve it, using the Bellman optimality equation as an update, until it converges to the optimal value function → extract the optimal policy by acting greedily with respect to it.
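
A value-iteration sketch for a small known MDP, where P[s][a] is assumed to be a list of (prob, next_state, reward) triples (a toy interface, not any particular library's):

    import numpy as np

    def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
        V = np.zeros(n_states)
        while True:
            # Bellman optimality backup: best expected one-step lookahead.
            V_new = np.array([
                max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(n_actions))
                for s in range(n_states)
            ])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Extract the optimal policy by acting greedily w.r.t. the optimal values.
        policy = [
            max(range(n_actions),
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            for s in range(n_states)
        ]
        return V, policy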

15
Q

What are the main steps in actor-critic methods?

A

An alternation between policy evaluation (the critic estimates how good the current policy's actions are) and policy improvement (the actor updates the policy in the direction the critic suggests), combining policy-based and value-based control.
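
A one-step actor-critic sketch combining the TD and policy-gradient ideas from the earlier cards: the critic's TD error replaces the raw return G_t as the signal scaling the actor's update (tabular, with illustrative names and step sizes):

    import numpy as np

    def actor_critic_step(theta, V, s, a, r, s_next,
                          alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
        # Critic (policy evaluation): the TD error judges the taken action.
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha_critic * td_error
        # Actor (policy update): softmax policy-gradient step scaled by the TD error.
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        grad_logp = -probs
        grad_logp[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_logp
        return td_error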
