Course Session 13 (Reinforcement Learning) Flashcards

1
Q

What is reinforcement learning?

A

Given inputs from an environment, take actions that affect the environment and produce action sequences that maximize the expected scalar reward. Optimal outputs are discovered by a process of trial and error.

2
Q

Mathematically define reinforcement learning

A

Reinforcement learning is concerned with how an agent takes actions a_t in an environment with state s_t, based on its history and observations o_t, in order to maximize the cumulative reward r_t. The agent must balance exploration (of uncharted territory) against exploitation (of current knowledge). The environment is typically formulated as a Markov decision process.
Two classes of methods exist: dynamic programming and reinforcement learning.
* Main difference: reinforcement learning does not assume knowledge of an exact mathematical model of the Markov decision process, and it targets large Markov decision processes where exact methods become infeasible.
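
A minimal sketch of the interaction loop described above, using a hypothetical two-state environment and an ε-greedy action rule (all names and numbers are assumptions for illustration, not from the course material):

import random

def step(s, a):                       # environment: returns next state and reward
    s_next = a                        # next state is simply the chosen action
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return s_next, r

def policy(s, epsilon=0.1):           # balance exploration against exploitation
    if random.random() < epsilon:
        return random.choice([0, 1])  # exploration of uncharted territory
    return 1                          # exploitation of current knowledge

s, total_reward = 0, 0.0
for t in range(100):                  # interact with the environment over time
    a = policy(s)
    s, r = step(s, a)
    total_reward += r
print(total_reward)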

3
Q

What is a Markov decision process?

A

A Markov decision process is an extension of Markov chains with the addition of actions (of an agent, external to the process/environment) and rewards.

4
Q

Define a deterministic Markov decision process

A

A deterministic MDP is defined by:
The state space of the process, S
* s_t: state of the agent at time t
The action space of the agent, A
* a_t: action taken at time t
The transition function of the process, describing how the state changes as a result of agent actions (deterministic)
* s_{t+1} = f(s_t, a_t)
The reward function, evaluating the immediate action performance
* Reward probability: p(r_{t+1} | s_t, a_t).
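
A minimal sketch of how S, A, f, and the reward fit together, assuming a tiny hypothetical 3-state chain (the environment is made up for illustration):

S = [0, 1, 2]                 # state space S
A = [-1, +1]                  # action space A: move left or right

def f(s, a):                  # deterministic transition s_{t+1} = f(s_t, a_t)
    return min(max(s + a, 0), 2)

def reward(s, a):             # immediate reward for taking a in s (assumed)
    return 1.0 if f(s, a) == 2 else 0.0

s = 0
for a in [+1, +1]:            # a fixed action sequence
    print(s, a, reward(s, a))
    s = f(s, a)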

5
Q

Define a stochastic Markov decision process

A

A stochastic MDP is defined by:
The state space of the process, S
* s_t: state of the agent at time t
The action space of the agent, A
* a_t: action taken at time t
The transition probability function of the process
* P(s_{t+1} | s_t, a_t)
The reward probability
* p(r_{t+1} | s_t, a_t).
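
The same toy chain as in the deterministic sketch, but now the next state is sampled from P(s_{t+1} | s_t, a_t) rather than computed by a function (the probabilities are made up for illustration):

import random

S = [0, 1, 2]
A = [-1, +1]

def sample_next_state(s, a):      # draw s_{t+1} from P(s_{t+1} | s_t, a_t)
    intended = min(max(s + a, 0), 2)
    return intended if random.random() < 0.8 else s   # intended move succeeds w.p. 0.8

def sample_reward(s, a):          # draw r_{t+1} from p(r_{t+1} | s_t, a_t)
    return 1.0 if a == +1 and random.random() < 0.5 else 0.0

s = 0
for t in range(5):
    a = random.choice(A)
    r = sample_reward(s, a)
    s = sample_next_state(s, a)
    print(t, a, r, s)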

6
Q

What is a value function?

A

Value functions of states, V(s_t), estimate how good it is (in terms of expected return) for the agent to be in a given state.
Value functions of state-action pairs, Q(s_t, a_t), estimate how good it is to perform a given action in a given state.
Value functions are defined with respect to policies.
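
In standard notation (consistent with the return formulas on the next card), both value functions can be written as conditional expectations under a policy π; this is textbook notation, not copied from the slides:
V^π(s) = E_π[ \sum_{i=1}^∞ ς^{i-1} r_{t+i} | s_t = s ]
Q^π(s, a) = E_π[ \sum_{i=1}^∞ ς^{i-1} r_{t+i} | s_t = s, a_t = a ]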

7
Q

What is a policy?

A

A policy is a mapping from states to probabilities of selecting each possible action:
π(a | s) is the probability that A_t = a if S_t = s.
In a finite-horizon model, the agent maximizes the expected reward over the next T steps,
V^π(s_t) = E[ \sum_{i=1}^T r_{t+i} ]
In an infinite-horizon model, it maximizes the expected discounted return, with discount rate 0 ≤ ς < 1,
V^π(s_t) = E[ \sum_{i=1}^∞ ς^{i-1} r_{t+i} ]
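
A small numeric illustration of the discounted return (the discount rate and the reward sequence are made up for the example):

discount = 0.9                       # the rate written as ς above
rewards = [1.0, 0.0, 2.0, 1.0]       # hypothetical r_{t+i}, i = 1..4

ret = sum(discount**(i - 1) * r for i, r in enumerate(rewards, start=1))
print(ret)                           # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349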

8
Q

What is Bellman’s equation?

A

It describes the optimal value of a decision-making process in terms of the immediate reward and the value of subsequent states.
* For the state value function:
V*(s_t) = max_{a_t} ( E[r_{t+1}] + ς \sum_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )
* For the action value function:
Q*(s_t, a_t) = E[r_{t+1}] + ς \sum_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}).
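
A minimal value-iteration sketch that repeatedly applies the state-value Bellman update above to a hypothetical 3-state MDP (transition probabilities and rewards are made up for illustration):

# P[s][a] is a list of (probability, next state); R[s][a] is E[r_{t+1}].
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 0)], 1: [(0.8, 2), (0.2, 1)]},
     2: {0: [(1.0, 2)], 1: [(1.0, 2)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}
gamma = 0.9                           # the discount rate written as ς above

V = {s: 0.0 for s in P}               # initialize V(s) = 0
for _ in range(100):                  # repeated Bellman backups
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s])
         for s in P}
print(V)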

9
Q

What is Q-learning?

A

Q-learning is a model-free reinforcement learning algorithm that is used to find the optimal action-selection policy for an agent interacting with an environment. The key idea is for the agent to learn a policy that maximizes the expected cumulative reward by interacting with the environment over time.

Initialize all Q(s, a) arbitrarily
For all episodes
Initialize s
Repeat
Choose a using a policy derived from Q (e.g. ε-greedy)
Take action a, observe r and s'
Update Q(s, a):
Q(s, a) ← Q(s, a) + η(r + ς * max_{a'} Q(s', a') - Q(s, a))
s ← s'
Until s is a terminal state (i.e. a certain condition or threshold is met).
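
A runnable Python sketch of the update rule above on a hypothetical 1-D corridor environment (the environment and the η, ς, ε values are assumptions for illustration, not from the slides):

import random

N_STATES = 5                           # corridor of states 0..4; state 4 is terminal
ACTIONS = [+1, -1]                     # move right or left
eta, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate η, discount ς, exploration ε

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):                        # hypothetical deterministic environment
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

for episode in range(200):
    s = 0                              # Initialize s
    for t in range(100):               # Repeat ...
        if random.random() < epsilon:  # Choose a using an ε-greedy policy from Q
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, r = step(s, a)             # Take action a, observe r and s'
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])   # update rule
        s = s2                         # s ← s'
        if s == N_STATES - 1:          # ... until s is a terminal state
            break

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})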

10
Q

What is the key idea of dynamic programming and reinforcement learning?

A

The key idea of dynamic programming and reinforcement learning is to use value functions to search for good policies. Once the optimal value function is found, either the value of a state under the optimal policy or the value of taking an action in a state under the optimal policy, the optimal policy is easily obtained by acting greedily with respect to it.
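
A short illustration of that last step: given a learned Q table (here a made-up one), the greedy policy simply picks argmax_a Q(s, a) in every state:

Q = {(0, 'left'): 0.1, (0, 'right'): 0.9,
     (1, 'left'): 0.2, (1, 'right'): 1.0}
actions = ['left', 'right']

policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in (0, 1)}
print(policy)   # {0: 'right', 1: 'right'}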

11
Q

What is a grid world?

A

A grid world consists of a grid of cells (typically a 2D matrix), where each cell represents a state, and the agent can navigate between these states by taking actions. Grid worlds are often used to illustrate and test reinforcement learning algorithms because they provide a simple, intuitive environment with clearly defined states, actions, and rewards.
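
A minimal sketch of a hypothetical 3x3 grid world environment (cell layout, goal, and rewards are assumptions for illustration):

ROWS, COLS = 3, 3
GOAL = (2, 2)                         # reaching this cell gives reward 1
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def step(state, action):              # states are (row, col) cells
    dr, dc = ACTIONS[action]
    r, c = state
    nr = min(max(r + dr, 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    reward = 1.0 if (nr, nc) == GOAL else 0.0
    return (nr, nc), reward

print(step((2, 1), 'right'))          # ((2, 2), 1.0): the goal is reached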

12
Q

How can Q values be shown in a grid world?

A

Each cell with l neighbors is divided into l blocks, one toward each neighbor. The value of the state-action pair function Q(s_t, a_t) is then shown in the block pointing toward the corresponding neighbor, making it clear which direction from a given state gives the highest expected value.
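
A small sketch of this idea in code: for each cell, list the Q value of the block toward each neighbor and mark the best direction (the Q table is a made-up example):

# Hypothetical Q values for a 2x2 grid world, keyed by (cell, direction).
Q = {((0, 0), 'right'): 0.8, ((0, 0), 'down'): 0.5,
     ((0, 1), 'left'): 0.4,  ((0, 1), 'down'): 0.9,
     ((1, 0), 'up'): 0.3,    ((1, 0), 'right'): 0.9,
     ((1, 1), 'up'): 0.2,    ((1, 1), 'left'): 0.6}

for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    blocks = {d: q for (c, d), q in Q.items() if c == cell}  # one block per neighbor
    best = max(blocks, key=blocks.get)                       # highest expected value
    print(cell, blocks, '-> best:', best)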
