RL: Chapter 3: Finite Markov Decision Processes Flashcards

1
Q

Markov Decision Process

Agent-Environment Interface

A

Agent: the learner and decision maker.
Environment: everything outside the agent.

They interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.

The environment also gives rise to rewards, special numerical values that the agent seeks to maximise over time through its choice of actions.
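
A minimal sketch of this interaction loop in Python, using an illustrative toy environment and a random agent (the class and method names are assumptions for the example, not from the text):

import random

class ToyEnv:
    # A tiny illustrative environment: states are integers, two actions.
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment responds to the action with a new state and a reward.
        self.state += 1 if action == "right" else -1
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward

class RandomAgent:
    def select_action(self, state):
        # The agent selects an action on the basis of the current state.
        return random.choice(["left", "right"])

env, agent = ToyEnv(), RandomAgent()
state = env.reset()
for t in range(10):
    action = agent.select_action(state)    # A_t
    next_state, reward = env.step(action)  # S_{t+1}, R_{t+1}
    state = next_state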

2
Q

State, Action, Reward

A

At each time step t:
- the agent receives some representation of the environment’s state,
- and on that basis selects an action.
- One step later, in part as a consequence of its action, the agent receives a numerical reward.
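
Together these give rise to a trajectory that begins S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...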

3
Q

Markov decision process

A

One in which the distribution of the next state and reward depends only on the immediately preceding state and action.

Given that state and action, it does not depend at all on earlier states and actions.
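
For a finite MDP this dependence can be written as a four-argument dynamics function:

p(s', r | s, a) = Pr{ S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a }

for all states s, s', actions a, and rewards r.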

4
Q

Reward Hypothesis

A

All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called the reward).
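
More formally, the agent seeks to maximise the expected return E[G_t], where the return G_t is some specific function (a sum, or a discounted sum) of the reward sequence R_{t+1}, R_{t+2}, R_{t+3}, ...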

5
Q

Episodic tasks

A

Tasks in which the agent-environment interaction breaks naturally into subsequences called episodes.

An episode is a sequence of states ending in a special state, called the terminal state, followed by a reset to a standard starting state.
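
In the episodic case the return is the sum of rewards up to the terminal time step T:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T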

6
Q

Continuing tasks

A

Tasks where the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
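
Since there is no final time step, the return is defined with discounting, using a discount rate 0 ≤ γ < 1 so that the infinite sum stays finite:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}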

7
Q

Value functions

A

Functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).

The notion of “how good” is defined in terms of expected future rewards.

8
Q

Policy

A

A policy is a mapping from states to probabilities of selecting each possible action.

If the agent is following policy π at time t, then π(a|s) is the probability of selecting A_t = a if S_t = s.
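
A minimal sketch of a stochastic policy for a small finite MDP, represented as a table of action probabilities (the state and action names are illustrative assumptions, not from the text):

import random

# π(a|s) as a nested dict: state -> {action: probability}.
policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, state):
    # Draw A_t = a with probability π(a|s) given S_t = s.
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

action = sample_action(policy, "s1")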

9
Q

State-value function for a policy

A

The value of a state under a policy.

v_π(s) is the expected return when starting in s and following π thereafter.
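
Formally, v_π(s) = E_π[ G_t | S_t = s ]: the expectation of the return G_t when S_t = s and the agent follows π thereafter.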

10
Q

Action-value function for a policy

A

The value of taking an action in a state under a policy.

q_π(s, a) is the expected return starting from s, taking action a, and thereafter following policy π.
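
Formally, q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ].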

11
Q

Optimal policy

A

A policy that is better than or equal to all other policies.

A policy π is defined to be better than or equal to a policy π’ if its expected return is greater than or equal to that of π’ for all states.

I.e. π ≥ π’ iff v_π(s) ≥ v_π’(s) for all s ∈ S.
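
All optimal policies share the same optimal value functions:

v_*(s) = max_π v_π(s) for all s ∈ S
q_*(s, a) = max_π q_π(s, a) for all s ∈ S and all actions a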

12
Q

Greedy

A

The term greedy is used in Computer Science to describe:

Any search or decision procedure that selects alternatives based only on local or immediate considerations, without considering the possibility that such a selection may prevent future access to even better alternatives.
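
A minimal sketch of greedy selection over estimated action values, assuming the estimates are stored in a dict from actions to values (an illustrative data structure, not from the text):

# Greedy selection: pick the action whose current value estimate is highest,
# considering only this immediate, local information.
def greedy_action(q):
    return max(q, key=q.get)

q_estimates = {"left": 0.1, "right": 0.7, "stay": 0.4}
best = greedy_action(q_estimates)  # "right"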

13
Q

The solution to the Bellman optimality equation depends on 3 assumptions

A
1. The dynamics of the environment are accurately known.
2. Computational resources are sufficient to complete the calculation.
3. The states have the Markov property.
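
For reference, the Bellman optimality equation for the optimal state-value function is

v_*(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_*(s') ]

Under assumptions 1 and 2 it can be solved approximately by repeated sweeps of this update (value iteration). A minimal sketch, assuming the dynamics are supplied as a dict p[(s, a)] = list of (probability, next_state, reward) triples (an illustrative encoding, not from the text):

def solve_bellman_optimality(states, actions, p, gamma=0.9, tol=1e-8):
    # Repeatedly apply the Bellman optimality update until the values stop changing.
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

# Tiny two-state example with deterministic transitions.
states = ["s1", "s2"]
actions = ["a", "b"]
p = {
    ("s1", "a"): [(1.0, "s1", 0.0)],
    ("s1", "b"): [(1.0, "s2", 1.0)],
    ("s2", "a"): [(1.0, "s1", 0.0)],
    ("s2", "b"): [(1.0, "s2", 0.5)],
}
v_star = solve_bellman_optimality(states, actions, p)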