Kursusgang 13 (Reinforcement learning) Flashcards
What is reinforcement learning?
Given inputs from an environment, take actions that affect the environment and produce action sequences that maximize the expected scalar reward. Optimal outputs are discovered by a process of trial and error.
Mathematically define reinforcement learning
Concerned with how an agent takes actions a_t in an environment with state s_t, based on the history of observations o_t, in order to maximize the cumulative reward r_t. A balance must be found between exploration (of uncharted territory) and exploitation (of current knowledge). The environment is typically formulated as a Markov decision process.
Two classes of methods: dynamic programming and reinforcement learning.
* Main difference: reinforcement learning does not assume knowledge of an exact mathematical model of the Markov decision process, and it targets large Markov decision processes where exact methods become infeasible.
What is a Markov decision process?
A Markov decision process is an extension of Markov chains with the addition of actions (of an agent, external to the process/environment) and rewards.
Define a deterministic Markov decision process
A deterministic MDP is defined by:
The state space of the process, S
* s_t: State of agent at time t
The action space of the agent, A
* a_t: Action taken at time t
The transition function of the process, describing how the state changes as a result of agent actions (deterministic)
* s_{t+1} = f(s_t, a_t)
The reward function, evaluating the immediate performance of the action (deterministic)
* r_{t+1} = r(s_t, a_t)
Define a stochastic Markov decision process
A stochastic MDP is defined by:
The state space of the process, S
* s_t: State of agent at time t
The action space of the agent, A
* a_t: Action taken at time t
The transition probability function of the process
* s_{t+1} ~ P(s_{t+1} | s_t, a_t)
The reward probability
* r_{t+1} ~ p(r_{t+1} | s_t, a_t)
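As a concrete illustration, here is a minimal Python sketch of a stochastic MDP; the two states, transition probabilities, and rewards are invented purely for this example:

```python
import random

# Hypothetical two-state stochastic MDP (state names, probabilities, and
# rewards are made up for illustration).
P = {                                        # transition probabilities P(s' | s, a)
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.2, "s1": 0.8},
}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,   # expected immediate rewards R(s, a)
     ("s1", "a0"): 2.0, ("s1", "a1"): -1.0}

def step(s, a):
    """Sample the next state s' from P(s' | s, a) and return (s', r)."""
    next_states, probs = zip(*P[(s, a)].items())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[(s, a)]
```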
What is a value function?
Value functions of states, V(s_t), estimate how good (i.e. expected return) it is for the agent to be in a given state.
Value functions of state-action pairs, Q(s_t, a_t), estimate how good it is to perform a given action in a given state.
Value functions are defined with respect to policies.
What is a policy?
A policy is a mapping from states to probabilities of selecting each possible action:
π (a | s) is the probability that A_t = a if S_t = s.
In a finite-horizon model, the agent maximizes the expected reward for the next T steps,
V^π(s_t) = E[ \sum_{i=1}^{T} r_{t+i} ]
In an infinite-horizon model, it maximizes the expected discounted return, with discount rate 0 ≤ γ < 1,
V^π(s_t) = E[ \sum_{i=1}^{∞} γ^{i-1} r_{t+i} ]
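A quick numerical illustration of the discounted return; the reward sequence and discount rate below are made up for the example:

```python
# Discounted return for an illustrative reward sequence with gamma = 0.9.
rewards = [1.0, 0.0, 2.0, 1.0]       # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
gamma = 0.9

# G = sum_i gamma^(i-1) * r_{t+i}
G = sum(gamma ** i * r for i, r in enumerate(rewards))
print(G)                              # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 ≈ 3.349
```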
What is Bellman’s equation?
It describes the optimal value of a decision-making process in terms of the immediate reward and the value of subsequent states.
For the state value function:
* V*(s_t) = max_{a_t} ( E[r_{t+1}] + γ \sum_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )
For the action value function:
* Q*(s_t, a_t) = E[r_{t+1}] + γ \sum_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})
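A minimal value-iteration sketch in Python, applying the Bellman optimality equation repeatedly; the toy transition probabilities and rewards are the same illustrative assumptions as in the MDP sketch above:

```python
# Value iteration: repeatedly apply the Bellman optimality backup.
gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]
# Hypothetical P(s' | s, a) and R(s, a), for illustration only.
P = {("s0", "a0"): {"s0": 0.7, "s1": 0.3}, ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
     ("s1", "a0"): {"s0": 0.5, "s1": 0.5}, ("s1", "a1"): {"s0": 0.2, "s1": 0.8}}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0, ("s1", "a0"): 2.0, ("s1", "a1"): -1.0}

V = {s: 0.0 for s in states}
for _ in range(100):                  # repeated Bellman backups until (near) convergence
    V = {s: max(R[(s, a)] + gamma * sum(p * V[sn] for sn, p in P[(s, a)].items())
                for a in actions)
         for s in states}
print(V)                              # approximate optimal state values V*(s)
```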
What is Q-learning?
Q-learning is a model-free reinforcement learning algorithm that is used to find the optimal action-selection policy for an agent interacting with an environment. The key idea is for the agent to learn a policy that maximizes the expected cumulative reward by interacting with the environment over time.
Initialize all Q(s, a) arbitrarily
For all episodes:
    Initialize s
    Repeat:
        Choose a using a policy derived from Q
        Take action a, observe r and s'
        Update Q(s, a):
            Q(s, a) ← Q(s, a) + η (r + γ max_{a'} Q(s', a') − Q(s, a))
        s ← s'
    Until s is a terminal state
(η is the learning rate, γ the discount rate.)
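A compact Python sketch of tabular Q-learning that mirrors the pseudocode; the environment interface (`actions`, `reset()`, `step()`) is an assumption for illustration, not a specific library API:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose `actions`,
    `reset() -> s` and `step(a) -> (s', r, done)`, mirroring the pseudocode."""
    Q = defaultdict(float)                          # Q(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += eta * (target - Q[(s, a)])
            s = s_next
    return Q
```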
What is the key idea of dynamic programming and reinforcement learning?
The key idea of dynamic programming and reinforcement learning is to use value functions to search for good policies. Once the optimal value functions are found, which can be either the value of a state under the optimal policy or the value of taking an action in a state under the optimal policy, then the optimal policies are easily obtained.
What is a grid world?
A grid world consists of a grid of cells (typically a 2D matrix), where each cell represents a state, and the agent navigates between these states by taking actions. Grid worlds are often used to illustrate and test reinforcement learning algorithms because they provide a simple, intuitive environment with clearly defined states, actions, and rewards.
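A minimal grid-world environment sketch that could be plugged into the Q-learning sketch above; the grid size, goal cell, and reward scheme are illustrative assumptions:

```python
class GridWorld:
    """A tiny 3x3 grid world: states are (row, col) cells, actions move the
    agent up/down/left/right, and reaching the goal cell gives reward +1."""
    actions = ["up", "down", "left", "right"]
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=3, goal=(2, 2)):
        self.size, self.goal = size, goal

    def reset(self):
        self.s = (0, 0)
        return self.s

    def step(self, a):
        dr, dc = self.moves[a]
        r, c = self.s
        # clamp to the grid so the agent stays in place at the borders
        self.s = (min(max(r + dr, 0), self.size - 1),
                  min(max(c + dc, 0), self.size - 1))
        done = self.s == self.goal
        return self.s, (1.0 if done else 0.0), done
```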
How can Q values be shown in a grid world?
Each cell with l neighbors is divided into l segments, one toward each neighbor. The value of the state-action pair, Q(s, a), is then shown in the corresponding segment, making it clear which direction from a given state has the highest expected value.
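For example, building on the hypothetical GridWorld and q_learning sketches above, the learned Q value for each direction of each cell could be printed like this:

```python
# Train on the toy grid world and print the Q value of each direction per cell,
# i.e. the per-neighbor segments described above.
env = GridWorld()
Q = q_learning(env)

for r in range(env.size):
    for c in range(env.size):
        print((r, c), {a: round(Q[((r, c), a)], 2) for a in env.actions})
```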