W2 MDP & Tabular Value-based Flashcards
What is the 5-tuple of a Markov Decision Process?
(S, A, T_a, R_a, gamma)
S: state space
A: action space
T_a: transition function of the environment
R_a: reward function
gamma: discount factor that weighs future rewards relative to immediate rewards.
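A minimal sketch of how this 5-tuple could be stored for a small tabular problem; the class and field names below are illustrative, not from any particular library.

from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    n_states: int      # |S|
    n_actions: int     # |A|
    T: np.ndarray      # T[a, s, s'] = probability of reaching s' after action a in state s
    R: np.ndarray      # R[a, s] = expected reward for taking action a in state s
    gamma: float       # discount factor in [0, 1)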
Do value-based and policy-based methods work well for discrete or continuous action spaces?
Value-based: Discrete
Policy-based: Discrete & Continuous
In a tree, in which direction do action selection and reward learning go?
Actions: downward to the leaves
Rewards: upward, backpropagation to the root
What is a sequential decision problem?
the agent has to make a sequence of decisions in order to solve a problem
What is the Markov property?
the next state depends only on the current state and the actions available in it (no influence of historical memory of previous states)
What is a policy pi(a|s)?
a conditional probability distribution that for each possible state specifies the probability of each possible action.
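As an illustration (hypothetical names, not from the source), a tabular stochastic policy can be stored as a |S| x |A| matrix of probabilities, and acting means sampling from the row of the current state.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform policy: pi(a|s) = 1/|A|

def sample_action(pi, s):
    # draw action a with probability pi(a|s)
    return rng.choice(n_actions, p=pi[s])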
What's on-policy?
The learning takes place by consistently backing up the value of the selected action into the same behavior policy function that was used to select that action.
SARSA is on-policy
What's off-policy?
the learning takes place by backing up values of another action, not the one selected by the behavior policy
Q-learning is off-policy and it is greedy: backup the value of the best action
Convergence of off-policy learning can be slower, since older, non-current values are used.
The behavior policy and the target policy are different in Q-learning, but they are the same in SARSA.
SARSA update formula? (α = learning rate, γ = discount factor)
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ·Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
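A minimal sketch of a single tabular SARSA update, assuming Q is a NumPy array of shape [n_states, n_actions]; the function name and default hyperparameters are illustrative.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy backup: uses the action a_next that the behavior policy actually selected
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# example use: Q = np.zeros((n_states, n_actions)), then call sarsa_update once per step of an episode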
Q-learning update formula?
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
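The corresponding off-policy sketch (same assumptions as the SARSA sketch above): the backup uses the greedy max over next actions rather than the action the behavior policy selected.

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy backup: uses the value of the best next action, regardless of which action is executed next
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])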
In reinforcement learning the agent can choose which training examples are generated. Why is this beneficial? What is a potential problem?
We can generate an endless dataset ourselves through simulation. On the other hand, we don't have "gold standard" actions for a given state; nothing is labeled, so we have to derive the correct policy ourselves.
What is Grid world?
Grid worlds are among the simplest environments: a rectangular grid of squares with a start square and a goal square.
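A tiny sketch of a deterministic grid-world step function under assumed conventions (3x4 grid, goal in the top-right corner, reward 1 on reaching the goal); the layout and names are hypothetical.

N_ROWS, N_COLS = 3, 4
GOAL = (0, 3)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # 0=up, 1=right, 2=down, 3=left

def step(state, action):
    # moves that would leave the grid keep the agent in place
    r, c = state
    dr, dc = MOVES[action]
    r2 = min(max(r + dr, 0), N_ROWS - 1)
    c2 = min(max(c + dc, 0), N_COLS - 1)
    reward = 1.0 if (r2, c2) == GOAL else 0.0
    return (r2, c2), reward, (r2, c2) == GOAL   # next state, reward, done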
In a tree diagram, is successor selection of behavior up or down?
In a tree diagram, is learning values through backpropagation up or down?
Selection: down
Learning: up
What is τ?
What is π(s)?
What is V(s)?
What is Q(s, a)?
τ: a trace, a full rollout of a simulation.
π(s): the policy function; it answers the question of how the different actions a in state s should be chosen.
V(s): the expected cumulative discounted future reward of a state.
Q(s, a): the Q-value estimate, the estimated value of taking action a in state s.
What is dynamic programming?
In the context of RL, dynamic programming methods recursively traverse the state space, backing up the values of successor states to the current state. An example algorithm is value iteration.
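A compact value-iteration sketch for a tabular MDP, assuming a transition tensor T[a, s, s'] and reward array R[a, s] as in the 5-tuple card above; the names, shapes, and defaults are assumptions for illustration.

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    # T: [n_actions, n_states, n_states], R: [n_actions, n_states]
    n_states = T.shape[1]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) = max_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V(s') ]
        Q = R + gamma * (T @ V)      # shape [n_actions, n_states]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new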