Week 3 Flashcards
Sequential Decision Problem
A problem where the agent’s utility depends on a sequence of decisions. This problem is not episodic, but sequential, meaning the current decision could affect all future decisions. Sequential does not mean that the current decision is made based upon all past decisions.
Markov Decision Process (MDP)
A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards.
MDP provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.
MDPs are usually solved with dynamic programming techniques such as value iteration and policy iteration, or with other methods such as linear programming and Monte Carlo planning.
Multi-armed Bandit Problem
Consider an agent taking many possible actions, one after another, each with an unknown reward. The agent wants to maximize the reward it gets from taking all of these actions by balancing exploration (trying to understand which action is the best) and exploitation (selecting the action that at the moment has the highest reward). This is a type of reinforcement learning problem.
Famous example: Suppose we have x slot machines, each with a different reward and a different probability of paying out that reward. We get to play the machines y times in total and want to end up with the greatest amount of money. The agent has to decide which machines to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one.
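As a concrete (made-up) illustration, a k-armed bandit can be simulated with one reward distribution per arm; the arm means and the Gaussian noise below are purely hypothetical:

```python
import random

class Bandit:
    """A hypothetical k-armed bandit: each arm pays a noisy reward around its own mean."""

    def __init__(self, arm_means):
        self.arm_means = arm_means  # true (unknown to the agent) value of each arm

    def pull(self, arm):
        # Reward = the arm's true mean plus Gaussian noise
        return random.gauss(self.arm_means[arm], 1.0)

# Example: 3 slot machines; the agent does not know these means.
bandit = Bandit([1.0, 2.5, 1.8])
print(bandit.pull(1))
```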
Action Value Function (Sample-Averaging)
The action value function, Q(a), maps an action ‘a’ to a real number representing the value we obtain from taking this action. In the multi-armed bandit problem, we do not know the true value of taking the action and so we simply estimate the value of an action with sample-averaging.
Sample-averaging means we define Q(a) as:
Q(a) = E[ r_t | a_t = a ]
where r_t is the reward obtained at time t and a_t is the action taken at time t. (We take one action and obtain one reward at each time step, and the same action a may be taken at several different time steps, which is why the notation is indexed by t.) Conceptually, sample-averaging is the average reward we have received from taking action a so far. This estimate is subject to change; as we take action 'a' more and more, we get a better estimate of its true value, and so Q(a) will change.
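A minimal sketch of sample-averaging, assuming a small fixed set of actions (the q_estimates and counts names are my own); it keeps a running average of the rewards observed for each action:

```python
# Sample-averaging: Q(a) is the mean of all rewards observed so far when taking action a.
num_actions = 3
q_estimates = [0.0] * num_actions   # Q(a) estimates
counts = [0] * num_actions          # how many times each action has been taken

def update_estimate(action, reward):
    """Incremental form of the sample average: Q(a) <- Q(a) + (r - Q(a)) / N(a)."""
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]

# e.g. after observing rewards 1.0 and 3.0 for action 0, Q(0) becomes 2.0
update_estimate(0, 1.0)
update_estimate(0, 3.0)
print(q_estimates[0])  # 2.0
```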
Greedy Action Selection Algorithm
This algorithm selects the action (which slot machine to play) that currently has the highest estimated value, i.e., it always chooses the action with the greatest expected immediate reward. This is all exploitation and no exploration.
a* = argmax(a in A) Q(a)
where Q(a) is the sample-average estimate of E[ r_t | a_t = a ]
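A minimal sketch of greedy selection over a list of estimated action values (the list contents and function name are my own):

```python
def greedy_action(q_estimates):
    """Return the index of the action with the highest current estimated value."""
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])

print(greedy_action([1.2, 0.4, 2.7]))  # 2
```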
ε-Greedy Action Selection Algorithm
Select the action that currently yields the greatest expected reward most of the time, but with probability ε choose an action uniformly at random to explore instead. This typically gives a better long-term cumulative reward than pure greedy selection.
a* = { argmax(a in A) Q(a)     with prob 1 - ε
     { random action in A      with prob ε
where Q(a) is the sample-average estimate of
Q(a) = E[ r_t | a_t = a ]
This is mainly exploitation with a bit of exploration.
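A minimal sketch of ε-greedy selection (the q_estimates list, the function name, and ε = 0.1 are just example choices):

```python
import random

def epsilon_greedy_action(q_estimates, epsilon=0.1):
    """With probability epsilon explore (uniform random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))                          # exploration
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])      # exploitation

print(epsilon_greedy_action([1.2, 0.4, 2.7], epsilon=0.1))
```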
Issues with ε-Greedy Action Selection Algorithm
- Imagine you've explored all the actions and sampled them enough times that your action values Q(a) have converged, i.e., you have a good sense of the values of all the actions. At this point it would make sense to always select the action with the greatest value, i.e., pure exploitation with no more exploration. However, the ε-Greedy Action Selection Algorithm would still randomly choose worse actions even though we no longer need to explore.
- When we explore with probability ε, the non-greedy action is chosen uniformly at random. Ideally, we would rather direct that exploration toward a specific non-greedy action that still seems promising, e.g., one whose estimated value is nearly as high as the best action's, instead of picking among all the worse actions at random.
Upper Confidence Bound (UCB) Algorithm
An action-selection algorithm that aims for a high long-term reward sum by adding an exploration bonus, UCB(a), to each action's estimated value, so that an action is favored when it is either high-value or has rarely been tried.
a* = argmax(a in A) [ Q(a) + UCB(a) ] = argmax(a in A) [ Q(a) + c * sqrt( ln(t) / N_t(a) ) ]
where
Q(a) = E[ r_t | a_t = a ], estimated with sample averaging
N_t(a) = the number of times that action 'a' has been selected prior to time 't'.
c = a confidence value that controls the level of exploration.
The Q(a) term is responsible for exploitation and the UCB(a) term is responsible for exploration.
If an action hasn't been tried very often, or not at all, then N_t(a) will be small, making the UCB(a) exploration term big and encouraging the agent to select this rarely-tried action. The more often we've taken an action, the larger N_t(a) becomes, shrinking the exploration term and making it less likely that this action will be selected for exploration (although it may still be selected as the action with the highest value, thanks to the exploitation term).
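A minimal sketch of UCB selection following the formula above (variable names are my own; untried actions are returned first so that N_t(a) is never zero inside the square root):

```python
import math

def ucb_action(q_estimates, counts, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln(t) / N_t(a)) ]."""
    # Try every action at least once before applying the formula.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q_estimates[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(q_estimates))]
    return max(range(len(scores)), key=lambda a: scores[a])

print(ucb_action([1.2, 0.4, 2.7], counts=[10, 2, 5], t=17, c=2.0))
```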
Policy Function π(s)
π(s) tells the agent what action to take.
π is a function that maps a state to an action, π: s -> a.
π*(s) = optimal policy function
Value Function V(s)
V(s) represents how good a state is
V is a function that maps a state to a utility (a real number), V: s -> real number.
V*(s) = optimal value function
V^π(s) = value function under a given policy π
Dynamic Programming (DP)
DP can solve MDPs when we know the transition function and reward function of the environment.
Dynamic programming is a method for solving complex problems by breaking them down into sub-problems and updating values recursively.
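As an illustration, here is a minimal value-iteration sketch (value iteration is one of the DP methods mentioned earlier); the tiny two-state MDP, its transition probabilities, and its rewards are entirely made up:

```python
# Value iteration on a tiny hypothetical MDP with a known transition and reward model.
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in transitions}

for _ in range(100):  # repeat the Bellman optimality update until it (approximately) converges
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values())
         for s in transitions}

print(V)
```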
Monte Carlo Prediction
Estimate V(s) by sampling/trying out actions over many different episodes. More specifically, compute the estimated utility from each episode and then take the average utility. More formally,
V(s) = Σ G_i(s) * 1/N
N = number of episodes
G_i(s) = the estimated utility from the ith episode
An episode is where we have an initial state and repeatedly choose an action, do it, and get a reward, over and over again until we reach a terminal state.
Monte Carlo Prediction is used when the environment's transition model and reward function are either unknown or too computationally expensive to use. So instead of doing computations with the reward function and the transition model in order to select a single action, we just take an action and record the observed reward. In Monte Carlo Prediction we actually try out the actions instead of sitting still and trying to compute the best action. The underlying problem is still a Markov Decision Process; we simply estimate its values from experience rather than from its model.
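A minimal first-visit Monte Carlo prediction sketch, assuming each episode is recorded as a list of (state, reward received at that step) pairs; the episode data and γ value below are made up:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Estimate V(s) as the average return observed from the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:                      # episode = [(state, reward), ...]
        g = 0.0
        first_visit_return = {}
        # Walk backwards so g accumulates the discounted return from each step onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g        # ends up holding the earliest (first-visit) return
        for state, g_s in first_visit_return.items():
            returns[state].append(g_s)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two made-up episodes.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)],
            [("B", 1.0), ("A", 0.0)]]
print(mc_prediction(episodes))
```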
Utility Estimate
Used in Monte Carlo Prediction and defined as
G_i(s_t) = Σ γ^(j-t) * r_j, where Σ ranges from j=t to T
= the discounted sum of the rewards in episode i, starting at the first occurrence of state s (at time t) and running to the terminal time T
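As a quick made-up example: with γ = 0.9, if the rewards from the first occurrence of s to the end of the episode are 1, 0, 2, then G_i(s_t) = 1 + 0.9*0 + 0.9^2*2 = 2.62.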
Continuous Updating in Monte Carlo Prediction
In normal Monte Carlo Prediction, we compute the utility estimate for all episodes and then compute a prediction for the value function V(s). In this version of Monte Carlo Prediction, we perform Monte Carlo Prediction but with one change: we update the V(s) estimate after computing the utility of each episode.
So if we already have an estimate for the value function, V(s), and just computed the utility of an additional episode, g_t, then the new estimate for V(s) is:
V(s) = (N * V(s) + g_t) / (N + 1)
N = number of episodes sampled before this new episode (so N + 1 includes it)
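A quick sketch of this running-average update (the function and variable names are my own):

```python
def update_value(v_s, n, g):
    """Fold one new episode return g into the running average over n previous episodes."""
    return (n * v_s + g) / (n + 1)

# e.g. V(s) = 1.5 from 3 previous episodes; a new episode yields return 2.62
print(update_value(1.5, 3, 2.62))  # (3*1.5 + 2.62) / 4 = 1.78
```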
Derive Constant-𝛼 Monte Carlo Prediction
Continuous Updating in Monte Carlo Prediction introduced the idea of updating the value function estimate after computing the utility estimate of every episode. Let’s rearrange the equation from there:
V(s) = (N * V(s) + g_t) / (N + 1)
= (N * V(s) + g_t + V(s) - V(s)) / (N + 1)
= (V(s) * [N+1] + g_t - V(s)) / (N + 1)
= (V(s) * [N+1]) / (N + 1) + (g_t - V(s)) / (N + 1)
= V(s) + (g_t - V(s)) / (N + 1)
= V(s) + 1/(N + 1) * (g_t - V(s))
= V(s) + 𝛼 * (g_t - V(s)), where 𝛼 = 1/(N + 1)
= current_estimate + step_size * (new_sample - current_estimate)
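In constant-𝛼 Monte Carlo Prediction we hold 𝛼 fixed at a constant instead of letting it shrink as 1/(N + 1), which weights recent episodes more heavily. A minimal sketch of the update (the 𝛼 value is just an example):

```python
def constant_alpha_update(v_s, g, alpha=0.1):
    """V(s) <- V(s) + alpha * (g - V(s)): move the estimate a fixed fraction toward the new return."""
    return v_s + alpha * (g - v_s)

# e.g. current estimate 1.5, new episode return 2.62
print(constant_alpha_update(1.5, 2.62, alpha=0.1))  # 1.5 + 0.1 * 1.12 = 1.612
```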