3 - Multi-Armed Bandits Flashcards
In multi-armed bandits, what is regret and why do we use it?
In MABs there are multiple candidate strategies for a problem, and regret is the usual way to rank them. Simply put, regret is the penalty we incur for not pulling the optimal arm. So to minimize regret we want to pull the arm with the highest probability of giving us a reward as often as possible.
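A minimal sketch of how cumulative regret could be tracked, assuming Bernoulli arms whose true probabilities are known to the experimenter (names like `true_probs` and the placeholder policy are illustrative only):

```python
import numpy as np

# Regret at each step = reward probability of the best arm
# minus reward probability of the arm actually pulled.
true_probs = np.array([0.1, 0.5, 0.7])   # assumed, unknown to the agent
best_prob = true_probs.max()

cumulative_regret = 0.0
for t in range(1000):
    chosen_arm = np.random.randint(len(true_probs))  # placeholder policy
    cumulative_regret += best_prob - true_probs[chosen_arm]
```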
What are we balancing in MAB problems?
The idea behind the multi-armed bandit problem is how to optimally balance exploration and exploitation.
How do we calculate Reward for MABs?
We simply count the total reward received from each arm and divide by the number of times we pulled that arm, giving the empirical (sample-average) reward rate per arm.
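A sketch of this sample-average estimate, assuming Bernoulli (0/1) rewards; variable and function names here are illustrative:

```python
import numpy as np

n_arms = 3
pulls = np.zeros(n_arms)    # how many times each arm was pulled
rewards = np.zeros(n_arms)  # total reward collected per arm

def update(arm, reward):
    pulls[arm] += 1
    rewards[arm] += reward

def estimates():
    # Mean reward per arm: total reward / number of pulls (0 if never pulled).
    return np.divide(rewards, pulls, out=np.zeros(n_arms), where=pulls > 0)
```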
What is a pure Greedy MAB? What do we initialize each arm’s probabilities to?
Here our strategy is simply to pull the arm that we currently estimate gives the most reward. The problem at the beginning is that we don't know how each arm's reward probability is distributed, so we give each arm a prior by initializing our estimates from a uniform distribution.
What is an epsilon-Greedy MAB? Why is it better than a regular pure Greedy MAB?
Similar to the greedy method, but with some probability epsilon we explore and select a random arm rather than the arm we currently believe has the highest reward probability. This adds exploration on top of pure exploitation.
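A minimal epsilon-greedy selection sketch, assuming `estimates` is the array of current mean-reward estimates (as in the snippet above); with epsilon set to 0 this reduces to the pure greedy method:

```python
import numpy as np

def epsilon_greedy(estimates, epsilon=0.1):
    # With probability epsilon explore (random arm),
    # otherwise exploit the arm with the highest estimated reward.
    if np.random.rand() < epsilon:
        return np.random.randint(len(estimates))
    return int(np.argmax(estimates))
```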
What is an epsilon-decay MAB, and why is it better than an epsilon-greedy MAB?
Exactly the same idea as the epsilon-greedy method, but here we slowly decrease epsilon (the probability with which we explore) over time, so eventually the method becomes identical to the greedy method.
This gives us plenty of exploration at the beginning and mostly exploitation at the end.
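One possible decay schedule; the exact form (here a 1/(1+t) decay with a floor) is an assumption for illustration:

```python
def decayed_epsilon(t, start=1.0, min_epsilon=0.01):
    # Near-pure exploration early on, approaching greedy behaviour as t grows.
    return max(min_epsilon, start / (1 + t))
```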
What is epsilon here, what range of values can it take, and what do we do with it?
Epsilon is the probability that we pick an arm at random instead of the arm with the highest estimated reward. It is a value between 0 and 1. At each step we sample a number uniformly between 0 and 1; if it is less than epsilon we pull a random arm, and if it is greater than epsilon we pull the arm with the greatest estimated probability of reward.
What is UCB1 algorithm? What does it stand for? Why is it better than other methods?
UCB1 stands for 'Upper Confidence Bound'. Its guiding principle is to 'act optimistically in the face of uncertainty': arms that have been pulled less frequently are the ones we are least certain about, so they receive a larger confidence bonus and get preferred. It strongly favors exploration.
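A sketch of the standard UCB1 score, Q(a) + sqrt(2 ln t / N(a)), assuming every arm has already been pulled at least once and t >= 1:

```python
import numpy as np

def ucb1_select(estimates, pulls, t):
    # The bonus term is larger for arms pulled less often,
    # pushing the algorithm to explore arms it is uncertain about.
    bonus = np.sqrt(2 * np.log(t) / pulls)
    return int(np.argmax(estimates + bonus))
```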
What are the pros/cons or tradeoffs between MABs and A/B tests?
A/B tests are pure exploration with no exploitation. Because of this they take the longest time and are not practical for real-time decision making. However, they can test for causality.
MABs, on the other hand, are not typically suited for causal inference (though they can be if run on causal graphs). They build exploitation into the process, so they can be used in real-time decision making. Unfortunately, this also makes them reliant on, and sensitive to, shorter-term metrics (like CTR) rather than longer-term goals.
What is Thompson sampling? What is another name for this? Why do we prefer this over other MAB designs?
Thompson Sampling, or 'Bayesian bandits', is an approach where we place a prior distribution over each arm, and as we observe how much reward each arm produces, we update the posterior. This means we can inject prior knowledge into our bandits and keep updating it as time goes on.
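A minimal Beta-Bernoulli Thompson sampling sketch; the Beta(1,1) priors and 0/1 rewards are assumed choices:

```python
import numpy as np

n_arms = 3
alpha = np.ones(n_arms)  # prior/posterior successes + 1
beta = np.ones(n_arms)   # prior/posterior failures + 1

def select_arm():
    # Draw one sample from each arm's posterior and pick the largest.
    samples = np.random.beta(alpha, beta)
    return int(np.argmax(samples))

def update(arm, reward):
    # Bernoulli reward (0 or 1) updates that arm's Beta posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward
```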
What is SoftMax and how is it different than other MABs?
The softmax method picks each arm with a probability that grows with its average reward, so an arm that gives larger rewards has a higher probability of being selected.
One interesting aspect of this method is the temperature parameter, which controls the degree of exploration.
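A sketch of softmax (Boltzmann) selection with a temperature parameter; the default temperature value is illustrative. Lower temperatures behave more greedily, higher temperatures spread probability more uniformly across arms:

```python
import numpy as np

def softmax_select(estimates, temperature=0.1):
    scaled = np.asarray(estimates) / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```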
What are Gradient Bandits and how are they different than other MABs?
A gradient bandit uses gradient ascent to find the optimal arm to pull. Simply put, we track a running mean of the reward up to time t; if the arm we pulled gives a higher reward than this mean, we increase the probability that the arm is chosen, and decrease it otherwise.
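A sketch of a gradient bandit update over per-arm preferences, with the running mean reward as baseline; the step size `alpha` and variable names are assumptions:

```python
import numpy as np

n_arms = 3
H = np.zeros(n_arms)   # per-arm preferences
mean_reward = 0.0
t = 0

def policy():
    # Softmax over preferences gives each arm's selection probability.
    e = np.exp(H - H.max())
    return e / e.sum()

def update(arm, reward, alpha=0.1):
    global mean_reward, t
    t += 1
    mean_reward += (reward - mean_reward) / t   # running baseline
    pi = policy()
    # Raise the pulled arm's preference if reward beats the baseline,
    # lower it otherwise; the other arms move the opposite way.
    H[arm] += alpha * (reward - mean_reward) * (1 - pi[arm])
    others = np.arange(n_arms) != arm
    H[others] -= alpha * (reward - mean_reward) * pi[others]
```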
What do we plot when we compare bandit performance? Rank the bandits in terms of performance.
We can plot the Regret over time.
A somewhat surprising result is that simpler methods such as the greedy and decaying epsilon-greedy methods can outperform more advanced solutions (the sampling methods also took much more time).
However, when it comes to estimating the reward probabilities of 1000 arms, the softmax and UCB1 methods gave the most accurate estimates.
Overall, the decaying epsilon-greedy method seems to be the optimal solution, in both the large and small experiment settings.
What is the context for contextual bandits?
Context is information about the situation or state of the environment. It is represented mathematically as x_t. From a statistical point of view, these are the covariates.
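A hedged sketch of how context changes arm selection: here each arm scores the context vector with its own linear model, so the chosen arm depends on x_t. The per-arm weights `W` and the linear scoring rule are assumptions for illustration (in practice the weights would be learned from observed rewards):

```python
import numpy as np

n_arms, d = 3, 5
W = np.random.randn(n_arms, d) * 0.01   # per-arm weights (learned in practice)

def select_arm(x_t):
    # The policy is now a function of the context x_t,
    # not a single fixed best arm for everyone.
    scores = W @ x_t
    return int(np.argmax(scores))
```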
Why is it unreasonable to assume a single arm is the best for the entire population?
It is unreasonable because there are often many possible actions, and assuming a single best arm tacitly assumes there is one treatment (or ad, or piece of content) that is right for everybody, which is not always the case.