Module 9: Reinforcement Learning Flashcards

1
Q

Which of the following are given to the agent in passive reinforcement learning? Check all that apply.

The agent’s policy π.
All of the states that can be reached by the agent.
The transition model P(s’ | s, a).
The reward function R(s) that specifies the reward for each state.

A

The agent’s policy π.
All of the states that can be reached by the agent.

The goal of passive reinforcement learning is to compute the value of each state under a fixed policy π. The transition model and the reward function are not given to the agent, and they are not required for direct evaluation.

2
Q

T/F
Direct utility estimation computes a utility function U by considering the connection between the utility of a state and the utility of its successor states.

A

False

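Direct utility estimation uses only the sampled trials: it averages the observed reward-to-go for each state and ignores the connection between the utility of a state and the utilities of its successors.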
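
A minimal sketch of direct utility estimation may make this concrete. It assumes each trial is a list of (state, reward) pairs in visiting order; that trial format, the function name, and the toy states below are illustrative assumptions, not part of the original card.

```python
from collections import defaultdict


def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average observed reward-to-go from s.

    `trials` is a hypothetical format: each trial is a list of
    (state, reward) pairs in the order the states were visited.
    """
    totals = defaultdict(float)  # sum of reward-to-go samples per state
    counts = defaultdict(int)    # number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go accumulates step by step.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Two toy trials (made-up data for illustration only).
trials = [
    [("A", -1.0), ("B", -1.0), ("goal", 10.0)],
    [("A", -1.0), ("goal", 10.0)],
]
utilities = direct_utility_estimation(trials)
```

Note that no state's estimate is ever computed from another state's estimate: each one is a plain average of samples, which is why the statement on the card is false.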

3
Q

Which of the following is true of both adaptive dynamic programming (ADP) learning and temporal-difference (TD) learning?

A

The utility of a state is adjusted locally to agree with the utility of at least one successor state.

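
As a rough illustration of the TD side of this answer, the sketch below applies one passive TD update, which nudges U(s) toward agreement with the single successor that was actually observed. ADP would instead use a learned transition model to agree with all successors, weighted by their probabilities. The function name, the dictionary-based utility table, and the default `alpha` and `gamma` values are assumptions made for the example.

```python
def td_update(U, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one passive TD update to the utility table U in place.

    U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
    Unseen states are treated as having utility 0.0 (an assumption).
    """
    u_s = U.get(s, 0.0)
    u_next = U.get(s_next, 0.0)
    U[s] = u_s + alpha * (reward + gamma * u_next - u_s)
    return U[s]


# Toy utility table (made-up values for illustration only).
U = {"A": 0.0, "B": 1.0}
new_value = td_update(U, "A", reward=0.5, s_next="B", alpha=0.5, gamma=1.0)
```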

4
Q

T/F
Q-learning with linear function approximation (weighted linear function of a set of features) will always converge to the optimal policy.

A

False

It is possible for Q-learning with linear function approximation to converge to the optimal policy, but there is no guarantee that it always does. For example, suppose the true Q-function is nonlinear in s: a is 0 or 1, s can be any real number, and Q(s, 0) = s^2 and Q(s, 1) = 3s^4 for all s. If the function approximator is linear, Q(s, a) = ks + la + b, then no choice of k, l, and b can represent the true Q-function, because we are trying to fit a linear function to a nonlinear one.
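
The representational gap can be checked numerically. The sketch below fixes a = 0, so the approximator reduces to ks + b, and fits it to Q(s, 0) = s^2 by ordinary least squares. The sample states and the helper name `fit_line` are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ k*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


# Sample states (arbitrary choice) and the true values Q(s, 0) = s^2.
states = [-2.0, -1.0, 0.0, 1.0, 2.0]
true_q = [s ** 2 for s in states]

k, b = fit_line(states, true_q)
# Largest gap between the best linear fit and the true Q-values.
max_error = max(abs(k * s + b - q) for s, q in zip(states, true_q))
```

Even the best-fitting line leaves a nonzero error on these points, so any policy derived from the linear approximation is built on Q-values that cannot match the true ones.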

5
Q

In reinforcement learning, a deterministic policy is

A

A mapping from states to actions
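
In code, a deterministic policy over a small discrete state space can be sketched as a plain lookup table. The state and action names below are made up for illustration, and the stochastic policy is included only for contrast.

```python
# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "up", "s2": "left"}

# For contrast, a stochastic policy maps each state to a
# probability distribution over actions.
stochastic_policy = {"s0": {"right": 0.8, "up": 0.2}}


def act(policy, state):
    """Return the single action a deterministic policy prescribes."""
    return policy[state]


chosen = act(deterministic_policy, "s1")
```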

6
Q

Which of the following equations are updates for TD Q-learning? Check all that apply.

Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a’} Q(s’, a’) − Q(s, a))
Q(s, a) ← Q(s, a) + α (R(s) + γ Q(s’, a’) − Q(s, a))

A

Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a’} Q(s’, a’) − Q(s, a))
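
The difference between the two candidate updates is easier to see side by side. In the sketch below, the Q-learning update backs up the maximum over next actions, while the other option, which is the SARSA update, backs up the value of the next action actually taken. The tabular dictionary representation, function names, and toy numbers are assumptions for the example.

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: back up the best next action's value (off-policy)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (reward + gamma * best_next - q_sa)
    return Q[(s, a)]


def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: back up the value of the next action actually taken."""
    q_sa = Q.get((s, a), 0.0)
    q_next = Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (reward + gamma * q_next - q_sa)
    return Q[(s, a)]


# Same starting table for both updates (made-up values).
start = {("s1", 0): 1.0, ("s1", 1): 3.0}

q_table = dict(start)
q_value = q_learning_update(q_table, "s0", 0, reward=0.0, s_next="s1",
                            actions=[0, 1], alpha=0.5, gamma=1.0)

sarsa_table = dict(start)
sarsa_value = sarsa_update(sarsa_table, "s0", 0, reward=0.0, s_next="s1",
                           a_next=0, alpha=0.5, gamma=1.0)
```

With these numbers the Q-learning update uses the value 3.0 of the better next action, while the SARSA update uses the value 1.0 of the action that was taken, so the two results differ.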
