Week 3 Flashcards

1
Q

Sequential Decision Problem

A

A problem where the agent’s utility depends on a sequence of decisions. The problem is not episodic but sequential, meaning the current decision can affect all future decisions. Sequential does not mean that the current decision is made based on all past decisions.

2
Q

Markov Decision Process (MDP)

A

A sequential decision problem that is fully observable and stochastic, with additive rewards.

MDP provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.

MDPs are usually solved with dynamic programming techniques such as value iteration and policy iteration, or with other methods such as linear programming and Monte Carlo planning.

3
Q

Multi-armed Bandit Problem

A

Consider an agent taking many possible actions, one after another, each with an unknown reward. The agent wants to maximize the total reward it gets from taking all of these actions by balancing exploration (trying to figure out which action is best) and exploitation (selecting the action that currently has the highest estimated reward). This is a type of reinforcement learning problem.

Famous ex: Suppose we have x slot machines, each with a different payout and a different probability of paying out. We get to play y times in total and want to end up with the greatest amount of money. The agent has to decide which machines to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one.
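A minimal sketch of this setup in Python (the number of machines, the win probabilities, and the payouts are made-up illustrative values):

import random

# A hypothetical 3-armed bandit: machine i pays payout[i] with probability win_prob[i].
win_prob = [0.2, 0.5, 0.7]
payout = [10.0, 3.0, 1.0]

def pull(machine):
    # Play one machine once and return the (random) reward observed.
    return payout[machine] if random.random() < win_prob[machine] else 0.0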

4
Q

Action Value Function (Sample-Averaging)

A

The action value function, Q(a), maps an action ‘a’ to a real number representing the value we expect to obtain from taking that action. In the multi-armed bandit problem, we do not know the true value of taking an action, so we simply estimate it with sample-averaging.

The quantity being estimated is
Q(a) = E[ r_t | a_t = a ]
where r_t is the reward obtained at time t and a_t is the action taken at time t. (We take one action and receive one reward at each time step; the same action ‘a’ can be taken at many different times, which is why the t notation appears.) Sample-averaging estimates this expectation as the average reward we’ve received from taking action ‘a’ so far. This estimate is subject to change; as we take action ‘a’ more and more, we get a better estimate of its true value, and so Q(a) will change.
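As a sketch, the sample average can be maintained incrementally instead of re-averaging all past rewards (the variable names here are illustrative, not from the card):

# Q[a] = current sample-average estimate for action a; N[a] = times a has been taken.
Q = [0.0, 0.0, 0.0]
N = [0, 0, 0]

def update_estimate(a, reward):
    # Fold one new observed reward into the running average for action a.
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]  # equivalent to the mean of all rewards seen for a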

5
Q

Greedy Action Selection Algorithm

A

This algorithm always selects the action (which slot machine to play) that currently has the highest estimated value, i.e. it chooses the action that currently yields the greatest expected reward. This is all exploitation and no exploration.

a* = argmax(a in A) Q(a)

where Q(a) is the sample-average estimate of E[ r_t | a_t = a ]
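A one-function sketch of greedy selection over the current estimates (assuming Q is a list indexed by action):

def greedy_action(Q):
    # Pick the action whose current sample-average estimate is largest.
    return max(range(len(Q)), key=lambda a: Q[a])
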
6
Q

ε-Greedy Action Selection Algorithm

A

Most of the time, select the action that currently yields the greatest expected reward, but with probability ε choose a random action to explore instead. This is supposed to give a better long-term cumulative reward than pure greedy selection.

a* = { argmax(a in A) Q(a)       with probability 1 - ε
      { a random action in A     with probability ε

where Q(a) is the sample-average estimate of E[ r_t | a_t = a ]
This is mainly exploitation with a bit of exploration.
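A sketch of the rule (epsilon = 0.1 is just an illustrative default):

import random

def epsilon_greedy_action(Q, epsilon=0.1):
    # With probability epsilon explore a random action, otherwise exploit the best estimate.
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit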

7
Q

Issues with ε-Greedy Action Selection Algorithm

A
  1. Imagine you’ve explored all actions and sampled them enough times that your action-values Q(a) have converged, i.e. you have a good sense of the value of every action. At this point it would make sense to just select the action with the greatest value (pure exploitation, no more exploration). However, the ε-Greedy Action Selection Algorithm would still pick a random, possibly worse action a fraction ε of the time, even though we no longer need to explore.
  2. When we do explore, we choose the action at random. Ideally, we would like to explore a specific action that still looks promising, i.e. one whose estimated value is close to the current best, rather than any action chosen uniformly.
8
Q

Upper Confidence Bound (UCB) Algorithm

A

An action selection algorithm that aims for the highest long-term cumulative reward by adding an exploration bonus, UCB(a), to each action’s estimated value.

a* = argmax(a in A) [ Q(a) + UCB(a) ]
    = argmax(a in A) [ Q(a) + c ( ln(t)/Nt(a) )^.5 ]

where
Q(a) = E[ r_t | a_t = a ], estimated with sample-averaging
Nt(a) = the number of times that action ‘a’ has been selected, prior to time ‘t’.
c = a confidence value that controls the level of exploration.

The Q(a) term is responsible for exploitation and the UCB(a) term is responsible for exploration.

If an action hasn’t been tried very often, or not at all, then Nt(a) will be small, making the UCB(a) exploration term big and encouraging the agent to select this rarely tried action. The more often we’ve taken an action, the larger Nt(a) is, which shrinks the exploration term and makes it less likely that this action will be selected for exploration (although it may still be selected as the action with the highest value, due to the exploitation term).
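A sketch of the selection rule (c = 2.0 is an arbitrary illustrative choice; untried actions are returned first, since Nt(a) = 0 would make the bonus infinite):

import math

def ucb_action(Q, N, t, c=2.0):
    # argmax over a of Q[a] + c * sqrt(ln(t) / N[a]).
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))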

9
Q

Policy Function π(s)

A

π(s) tells the agent what action to take.

π(s) is a function to map state to action, π:s->a.

π*(s) = optimal policy function

10
Q

Value Function V(s)

A

V(s) represents how good a state is

V(s) is a function to map state to utility, V:s->real #

V*(s) = optimal value function
V^π(s) = value function given policy function π
11
Q

Dynamic Programming (DP)

A

DP can solve MDPs when we know the transition function and reward function of the environment.

Dynamic programming is a method for solving complex problems by breaking them down into sub-problems and updating values recursively.
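As one concrete example, here is a minimal value-iteration sketch; the state/action sets, the transition lists P[s][a] = [(prob, s'), ...], the reward function R(s, a, s'), and gamma are all assumed inputs:

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # Repeatedly apply the Bellman optimality update until the values stop changing.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V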

12
Q

Monte Carlo Prediction

A

Estimate V(s) by sampling/trying out actions over many different episodes. More specifically, compute the estimated utility from each episode and then take the average. More formally,
V(s) = (1/N) * Σ G_i(s)
N = number of episodes
G_i(s) = the estimated utility from the ith episode

An episode is where we have an initial state and repeatedly choose an action, do it, and get a reward, over and over again until we reach a terminal state.

Monte Carlo Prediction is used when the environment’s transition model and reward function are either unknown or too computationally expensive to use. So instead of doing computations with the reward function and the transition model in order to select a single action, we just take an action and record the observed reward. In Monte Carlo Prediction we actually try out the actions rather than sitting still and computing the best action. The underlying problem is still a Markov Decision Process; we are just estimating its values from sampled experience.
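A sketch of (first-visit) Monte Carlo prediction over a batch of recorded episodes; the data layout, a list of (state, reward) pairs per episode where the reward is the one received after leaving that state, is an assumption for illustration:

from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    # V(s) = average of the first-visit returns G_i(s) observed for s across episodes.
    returns = defaultdict(list)
    for episode in episodes:              # episode = [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        first_visit_G = {}
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G             # return following time t
            first_visit_G[s] = G          # backwards loop keeps the earliest visit of s
        for s, g in first_visit_G.items():
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}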

13
Q

Utility Estimate

A

Used in Monte Carlo Prediction and defined as
G_i(s_t) = Σ γ^(j-t-1) * r_j, where Σ ranges from j = t+1 to T
= the discounted sum of the rewards received in episode i, starting from the first occurrence of state s
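A sketch of the same computation, assuming rewards is the list [r_(t+1), r_(t+2), ..., r_T] observed after time t in one episode:

def utility_estimate(rewards, gamma=0.9):
    # Discounted sum r_(t+1) + gamma*r_(t+2) + gamma^2*r_(t+3) + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))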

14
Q

Continuous Updating in Monte Carlo Prediction

A

In normal Monte Carlo Prediction, we compute the utility estimate for all episodes and then compute a prediction for the value function V(s). In this version of Monte Carlo Prediction, we perform Monte Carlo Prediction but with one change: we update the V(s) estimate after computing the utility of each episode.

So if we already have an estimate for the value function, V(s), and just computed the utility of an additional episode, g_t, then the new estimate for V(s) is:
V(s) = (N * V(s) + g_t) / (N + 1)
N = number of episodes sampled so far, not including the new one (so the updated estimate averages N + 1 episodes)
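Worked example (made-up numbers): if V(s) = 2.0 was the average over N = 3 episodes and a new episode yields g_t = 4.0, the updated estimate is V(s) = (3 * 2.0 + 4.0) / (3 + 1) = 2.5.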

15
Q

Derive Constant-𝛼 Monte Carlo Prediction

A

Continuous Updating in Monte Carlo Prediction introduced the idea of updating the value function estimate after computing the utility estimate of every episode. Let’s rearrange the equation from there:
V(s) = (N * V(s) + g_t) / (N + 1)
= (N * V(s) + g_t + V(s) - V(s)) / (N + 1)
= (V(s) * [N+1] + g_t - V(s)) / (N + 1)
= (V(s) * [N+1]) / (N + 1) + (g_t - V(s)) / (N + 1)
= V(s) + (g_t - V(s)) / (N + 1)
= V(s) + 1/(N + 1) * (g_t - V(s))
= V(s) + 𝛼 * (g_t - V(s)), where constant-𝛼 means we fix the step size 𝛼 instead of letting it shrink as 1/(N + 1)
= current_estimate + step_size * (new_estimate - current_estimate)

16
Q

Constant-𝛼 Monte Carlo Prediction

A

Monte Carlo Prediction [estimate V(s) by sampling actions and computing the average estimated utility over episodes], but update the V(s) estimate after computing the utility estimate for each episode. When we update V(s), we weight the change in V(s) by a constant step size 𝛼. Mathematically,
V(s) = V(s) + 𝛼 * (g_t - V(s))
g_t = utility estimate of the new episode
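A one-line sketch of this update (alpha = 0.1 is an arbitrary illustrative step size):

def constant_alpha_update(V, s, g, alpha=0.1):
    # Move V[s] a fixed fraction alpha toward the new episode's utility estimate g.
    V[s] += alpha * (g - V[s])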

17
Q

Temporal Difference Update Derivation

A

Let’s further update Constant-𝛼 Monte Carlo Prediction which updates the estimate of the value function V(s) after every episode and weighs the change by 𝛼:
V(s_t) = V(s_t) + 𝛼 * (g_t - V(s_t))
However, in the Temporal Difference Update, we assume we cannot compute the full return g_t directly. Instead we bootstrap, as in policy evaluation: we pretend the remainder of the return after the first reward is already captured by the current value estimate of the next state, so that
g_t ≈ r_t+1 + γ * V(s_t+1)
so that the final Temporal Difference Update equation is:
V(s_t) = V(s_t) + 𝛼 * [r_t+1 + γ * V(s_t+1) - V(s_t)]

18
Q

Temporal Difference Update

A

A version of Constant-𝛼 Monte Carlo Prediction that tries to estimate V(s) by sampling actions but with sequential, not episodic, data. Mathematically:
V(s_t) = V(s_t) + 𝛼 * [r_t+1 + γ * V(s_t+1) - V(s_t)]

This is like DP but with sampling instead of sweeping through all states. This is like MC but with bootstrapping: we use the current estimate V(s_t+1) of the next state to update V(s_t).

19
Q

Temporal Difference Update Algorithm

A

paste later
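A minimal sketch of a TD(0) prediction loop built from the update in the previous card; the env.reset()/env.step() interface and the policy function are assumptions, not something fixed by the card:

from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    # Estimate V(s) for a fixed policy by applying the TD update after every step.
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed to return (next state, reward, done flag)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V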

20
Q

Q-Value

A

The Q in Q-value stands for quality, because the Q-value represents the quality (usefulness) of an action in gaining some future reward.