RL: Chapter 3: Finite Markov Decision Processes Flashcards
Agent-Environment Interface
Agent: the learner and decision maker.
Environment: everything outside the agent.
They interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.
The environment also gives rise to rewards, special numerical values that the agent seeks to maximise over time through its choice of actions.
State, Action, Reward
At each time step t:
- the agent receives some representation of the environment’s state,
- and on that basis selects an action.
- One step later, in part as a consequence of its action, the agent receives a numerical reward.
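In the book's notation (not on the original card), this interaction gives rise to a trajectory S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, …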
Markov decision process
One in which the distribution of the next state and reward depends only on the immediately preceding state and action.
Given the previous state and action, it does not depend at all on earlier states and actions.
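Stated with the chapter's four-argument dynamics function (standard notation, added here rather than taken from the original card):
p(s′, r | s, a) = Pr{S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a}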
Reward Hypothesis
All we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
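In symbols (standard notation from the chapter, not stated on the original card), the quantity to be maximised is the expected return, which in the simplest case is
G_t = R_{t+1} + R_{t+2} + R_{t+3} + ⋯ + R_T, with T the final time step of an episode.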
Episodic tasks
Tasks that break naturally into separate episodes.
An episode is a sequence of states ending in a special state, called the terminal state, followed by a reset to a standard starting state.
Continuing tasks
Tasks where the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
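For continuing tasks the simple sum of rewards can diverge, so the chapter uses the discounted return (the discount factor γ, 0 ≤ γ ≤ 1, is standard in the book but not defined on these cards):
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ = Σ_{k=0}^{∞} γ^k R_{t+k+1}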
Value functions
Functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
The notion of “how good” is defined in terms of expected future rewards.
Policy
A policy is a mapping from states to probabilities of selecting each possible action.
If the agent is following policy π at time t, then π(a|s) is the probability of selecting A_t = a if S_t = s.
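Since π(·|s) is a probability distribution over actions, Σ_a π(a|s) = 1 for every state s (an implication spelled out here; not stated on the original card).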
State-value function for a policy
The value of a state under a policy.
v_π(s) is the expected return when starting in s and following π thereafter.
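Written out (standard book notation, not on the original card): v_π(s) = E_π[G_t | S_t = s], where G_t is the return defined above.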
Action-value function for a policy
The value of taking an action in a state under a policy.
q_π(s, a) is the expected return starting from s, taking action a, and thereafter following policy π.
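Written out (standard book notation, not on the original card): q_π(s, a) = E_π[G_t | S_t = s, A_t = a].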
Optimal policy
A policy that is better than or equal to all other policies.
A policy π is defined to be better than or equal to a policy π’ if its expected return is greater than or equal to that of π’ for all states.
I.e. π ≥ π’ iff v_π(s) ≥ v_π’(s) for all s ∈ S.
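All optimal policies share the same optimal value functions (standard book notation, not on the original card): v_*(s) = max_π v_π(s) and q_*(s, a) = max_π q_π(s, a).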
Greedy
The term greedy is used in Computer Science to describe:
Any search or decision procedure that selects alternatives based only on local or immediate considerations, without considering the possibility that such a selection may prevent future access to even better alternatives.
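A standard result from the chapter (not on the original card): a policy that is greedy with respect to the optimal value function, i.e. one that picks argmax_a q_*(s, a) in each state, is nevertheless optimal, because v_* and q_* already account for the long-term consequences of each action.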
The solution to the Bellman optimality equation depends on 3 assumptions
- The dynamics of the environment are accurately known
- Computational resources are sufficient to complete the calculation
- The states have the Markov property.
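For reference, the Bellman optimality equation these assumptions refer to, in the book's standard form (not on the original card):
v_*(s) = max_a Σ_{s′, r} p(s′, r | s, a) [ r + γ v_*(s′) ]
Below is a minimal value-iteration sketch that solves this equation when the three assumptions hold. The tiny two-state MDP, the variable names, and the constants are hypothetical, chosen only for illustration.

```python
# Minimal value-iteration sketch for the Bellman optimality equation.
# The MDP below (two states, two actions), gamma, and theta are all
# hypothetical illustration values, not taken from the book or the cards.

# dynamics[s][a] is a list of (probability, next_state, reward) triples,
# i.e. a tabular form of p(s', r | s, a).
dynamics = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)],
        1: [(1.0, 1, 0.0)]},
}
gamma = 0.9    # discount factor (assumed)
theta = 1e-8   # convergence threshold (assumed)

V = {s: 0.0 for s in dynamics}   # value estimates, initialised to zero
while True:
    delta = 0.0
    for s in dynamics:
        # Bellman optimality backup:
        #   v(s) <- max_a sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
            for a in dynamics[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

print(V)  # approximate optimal state values v_*(s)
```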