3 - Multi-Armed Bandits Flashcards
In multi-armed bandits, what is regret and why do we use it?
In MABs there are multiple candidate strategies for a problem, and regret is the usual way to rank them. Simply put, regret is the penalty we incur for not pulling the optimal arm. So to minimize regret we want to pull the arm with the highest probability of giving us a reward as often as possible.
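A minimal sketch of how cumulative regret could be tracked, assuming Bernoulli arms whose true probabilities are known to the experimenter (names like `true_probs` and the placeholder policy are illustrative only):

```python
import numpy as np

# Regret at each step = reward probability of the best arm
# minus reward probability of the arm actually pulled.
true_probs = np.array([0.1, 0.5, 0.7])   # assumed, unknown to the agent
best_prob = true_probs.max()

cumulative_regret = 0.0
for t in range(1000):
    chosen_arm = np.random.randint(len(true_probs))  # placeholder policy
    cumulative_regret += best_prob - true_probs[chosen_arm]
```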
What are we balancing in MAB problems?
The idea behind the multi-armed bandit problem is how to optimally balance exploration and exploitation.
How do we calculate Reward for MABs?
We simply count the total reward received from each arm and divide by the number of times we pulled that arm, giving the empirical (sample-average) reward rate per arm.
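A sketch of this sample-average estimate, assuming Bernoulli (0/1) rewards; variable and function names here are illustrative:

```python
import numpy as np

n_arms = 3
pulls = np.zeros(n_arms)    # how many times each arm was pulled
rewards = np.zeros(n_arms)  # total reward collected per arm

def update(arm, reward):
    pulls[arm] += 1
    rewards[arm] += reward

def estimates():
    # Mean reward per arm: total reward / number of pulls (0 if never pulled).
    return np.divide(rewards, pulls, out=np.zeros(n_arms), where=pulls > 0)
```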
What is a pure Greedy MAB? What do we initialize each arm’s probabilities to?
Here our strategy is simply to pull the arm that we currently estimate gives the most reward. The problem at the beginning is that we don't know how each arm's reward probability is distributed, so we give each arm a prior by initializing our estimates from a uniform distribution.
What is an epsilon-Greedy MAB? Why is it better than a regular pure Greedy MAB?
Similar to the greedy method, but with some probability epsilon we explore and select a random arm rather than the arm we currently believe has the highest reward probability. This adds exploration on top of pure exploitation.
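A minimal epsilon-greedy selection sketch, assuming `estimates` is the array of current mean-reward estimates (as in the snippet above); with epsilon set to 0 this reduces to the pure greedy method:

```python
import numpy as np

def epsilon_greedy(estimates, epsilon=0.1):
    # With probability epsilon explore (random arm),
    # otherwise exploit the arm with the highest estimated reward.
    if np.random.rand() < epsilon:
        return np.random.randint(len(estimates))
    return int(np.argmax(estimates))
```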
What is an epsilon-decay MAB, and why is it better than an epsilon-greedy MAB?
Exactly the same idea as the epsilon-greedy method, but here we slowly decrease epsilon (the probability with which we explore) over time, so eventually the method becomes identical to the greedy method.
This gives us plenty of exploration at the beginning and mostly exploitation at the end.
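One possible decay schedule; the exact form (here a 1/(1+t) decay with a floor) is an assumption for illustration:

```python
def decayed_epsilon(t, start=1.0, min_epsilon=0.01):
    # Near-pure exploration early on, approaching greedy behaviour as t grows.
    return max(min_epsilon, start / (1 + t))
```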
What is epsilon here, what range of values can it take, and what do we do with it?
Epsilon is the probability that we pick an arm at random instead of the arm with the highest estimated reward. It is a value between 0 and 1. At each step we sample a number uniformly between 0 and 1; if it is less than epsilon we pull a random arm, and if it is greater than epsilon we pull the arm with the greatest estimated probability of reward.
What is UCB1 algorithm? What does it stand for? Why is it better than other methods?
UCB1 stands for 'Upper Confidence Bound'. Its guiding principle is to 'act optimistically in the face of uncertainty': arms that have been pulled less frequently are the ones we are least certain about, so they receive a larger confidence bonus and get preferred. It strongly favors exploration.
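A sketch of the standard UCB1 score, Q(a) + sqrt(2 ln t / N(a)), assuming every arm has already been pulled at least once and t >= 1:

```python
import numpy as np

def ucb1_select(estimates, pulls, t):
    # The bonus term is larger for arms pulled less often,
    # pushing the algorithm to explore arms it is uncertain about.
    bonus = np.sqrt(2 * np.log(t) / pulls)
    return int(np.argmax(estimates + bonus))
```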
What are the pros/cons or tradeoffs between MABs and A/B tests?
A/B tests are pure exploration with no exploitation. Because of this they take the longest time and are not practical for real-time decision making. However, they can test for causality.
MABs, on the other hand, are not typically suited for causal inference (though they can be if run on causal graphs). They build exploitation into the process, so they can be used in real-time decision making. Unfortunately, this also makes them reliant on, and sensitive to, shorter-term metrics (like CTR) rather than longer-term goals.
What is Thompson sampling? What is another name for this? Why do we prefer this over other MAB designs?
Thompson Sampling, or 'Bayesian bandits', is an approach where we place a prior distribution over each arm, and as we observe how much reward each arm produces, we update the posterior. This means we can inject prior knowledge into our bandits and keep updating it as time goes on.
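A minimal Beta-Bernoulli Thompson sampling sketch; the Beta(1,1) priors and 0/1 rewards are assumed choices:

```python
import numpy as np

n_arms = 3
alpha = np.ones(n_arms)  # prior/posterior successes + 1
beta = np.ones(n_arms)   # prior/posterior failures + 1

def select_arm():
    # Draw one sample from each arm's posterior and pick the largest.
    samples = np.random.beta(alpha, beta)
    return int(np.argmax(samples))

def update(arm, reward):
    # Bernoulli reward (0 or 1) updates that arm's Beta posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward
```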
What is SoftMax and how is it different than other MABs?
The softmax method picks each arm with a probability that grows with its average reward, so an arm that gives larger rewards has a higher probability of being selected.
One interesting aspect of this method is the temperature parameter, which controls the degree of exploration.
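A sketch of softmax (Boltzmann) selection with a temperature parameter; the default temperature value is illustrative. Lower temperatures behave more greedily, higher temperatures spread probability more uniformly across arms:

```python
import numpy as np

def softmax_select(estimates, temperature=0.1):
    scaled = np.asarray(estimates) / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```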
What are Gradient Bandits and how are they different than other MABs?
A gradient bandit uses gradient ascent to find the optimal arm to pull. Simply put, we track a running mean of the reward up to time t; if the arm we pulled gives a higher reward than this mean, we increase the probability that the arm is chosen, and decrease it otherwise.
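A sketch of a gradient bandit update over per-arm preferences, with the running mean reward as baseline; the step size `alpha` and variable names are assumptions:

```python
import numpy as np

n_arms = 3
H = np.zeros(n_arms)   # per-arm preferences
mean_reward = 0.0
t = 0

def policy():
    # Softmax over preferences gives each arm's selection probability.
    e = np.exp(H - H.max())
    return e / e.sum()

def update(arm, reward, alpha=0.1):
    global mean_reward, t
    t += 1
    mean_reward += (reward - mean_reward) / t   # running baseline
    pi = policy()
    # Raise the pulled arm's preference if reward beats the baseline,
    # lower it otherwise; the other arms move the opposite way.
    H[arm] += alpha * (reward - mean_reward) * (1 - pi[arm])
    others = np.arange(n_arms) != arm
    H[others] -= alpha * (reward - mean_reward) * pi[others]
```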
What do we plot when we compare bandit performance? Rank the bandits in terms of performance.
We can plot the Regret over time.
A somewhat surprising result is that simpler methods such as the greedy and decaying epsilon-greedy methods can outperform more advanced solutions (the sampling methods also took much more time).
However, when it comes to estimating the reward probabilities of 1000 arms, the softmax and UCB1 methods gave the most accurate estimates.
Overall, the decaying epsilon-greedy method seems to be the optimal solution, in both the large and small experiment settings.
What is the context for contextual bandits?
Context is information about the situation or state of the environment. It is represented mathematically as x_t. From a statistical point of view, these are the covariates.
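A hedged sketch of how context changes arm selection: here each arm scores the context vector with its own linear model, so the chosen arm depends on x_t. The per-arm weights `W` and the linear scoring rule are assumptions for illustration (in practice the weights would be learned from observed rewards):

```python
import numpy as np

n_arms, d = 3, 5
W = np.random.randn(n_arms, d) * 0.01   # per-arm weights (learned in practice)

def select_arm(x_t):
    # The policy is now a function of the context x_t,
    # not a single fixed best arm for everyone.
    scores = W @ x_t
    return int(np.argmax(scores))
```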
Why is it unreasonable to assume a single arm is the best for the entire population?
It is unreasonable because there are often many possible actions, and assuming a single best arm tacitly assumes there is one treatment (or ad, or piece of content) that is right for everybody, which is not always the case.