w 10 multi-armed bandit theory Flashcards
what does bandit mean in multi-armed bandit
the name comes from slot machines in casinos, which were nicknamed one-armed bandits
Each machine has a different payout rate, but you don’t know what those rates are. Your goal is to figure out which machine pays out the most over time while also winning as much as possible.
what is the multi-armed bandit theory that's used in reinforcement learning
used for problems where an agent repeatedly chooses from a set of actions to maximize the overall reward, without considering state transitions
what is the Markov decision process used in reinforcement learning
involves sequential decision making, where the agent interacts with an environment that transitions between states based on probabilistic dynamics and the chosen actions
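For reference, an MDP is usually written as a tuple (standard textbook notation, added here as a reminder, not from the original card):

```latex
% S = states, A = actions, P = transition probabilities, R = rewards, \gamma = discount factor
(S, A, P, R, \gamma), \qquad P(s' \mid s, a), \qquad R(s, a)
```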
what are the two main strategies you need to balance in the multi-armed bandit problem
exploration - collecting more info, but you miss out on consistently higher rewards
exploitation - capitalizing on your current knowledge, but you might miss a better reward
recall the standard normal distribution density
the standard normal distribution, where the mean is 0 and the standard deviation is 1
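For reference, the standard normal density (well-known formula, added here as a reminder):

```latex
% Standard normal density: mean 0, standard deviation 1
f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2}
```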
recall the exponential distribution
the density decays exponentially
mean = 1/lambda
variance = 1/lambda^2
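For reference, the exponential density with rate lambda (standard formula, added as a reminder):

```latex
% Exponential density with rate \lambda, defined for x >= 0
f(x) = \lambda e^{-\lambda x}, \quad x \ge 0, \qquad
\mathbb{E}[X] = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}
```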
recall the law of large numbers
as the sample size increases, the sample mean gets more accurate at representing the entire population
i.e. the average becomes stable
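In symbols (standard statement, added as a reminder): the sample mean converges to the true mean as the number of samples grows.

```latex
% Law of large numbers: the sample average approaches the true mean \mu
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\longrightarrow\; \mu \quad \text{as } n \to \infty
```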
what is the epsilon greedy approach
most of the time it exploits the best-known action,
but it occasionally explores a random action with probability epsilon
however, it gets less efficient over time because it keeps exploring all options forever, regardless of how well they pay out
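A minimal epsilon-greedy sketch in Python (illustrative only; the payout means, step count, and reward noise below are made-up assumptions):

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=1000):
    n_arms = len(true_means)
    counts = [0] * n_arms          # pulls per arm
    values = [0.0] * n_arms        # running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])   # exploit
        reward = random.gauss(true_means[arm], 1.0)             # simulated payout
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]     # incremental mean
        total_reward += reward
    return values, total_reward

# Example: three slot machines with payout means unknown to the agent.
print(epsilon_greedy([1.0, 1.5, 2.0]))
```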
what is the upper confidence bound algorithm
The UCB algorithm is an optimistic approach to decision-making in multi-armed bandit problems.
It selects the bandit with the highest score, where the score is determined by:
The average reward from past trials.
A confidence bound that reflects the uncertainty or variance in the estimates.
This balances exploration (trying less-sampled bandits) and exploitation (choosing the bandit with the best-known reward).
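A minimal UCB1-style sketch in Python (illustrative only; `true_means` and the simulated rewards are assumptions, and the score uses the Hoeffding-based bonus discussed below):

```python
import math
import random

def ucb1(true_means, steps=1000):
    n_arms = len(true_means)
    counts = [0] * n_arms      # N: pulls per arm
    values = [0.0] * n_arms    # average reward per arm
    total_reward = 0.0
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1        # pull each arm once to initialize
        else:
            # score = average reward + sqrt(2 * ln(t) / N)
            arm = max(range(n_arms),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = random.gauss(true_means[arm], 1.0)              # simulated payout
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]      # incremental mean
        total_reward += reward
    return values, total_reward

print(ucb1([1.0, 1.5, 2.0]))
```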
how do you determine the UCB function
it should be high when N is small, i.e. we only have a few samples of that arm
and small when N is large
it is important to find a meaningful uncertainty function U
what do we use to find a useful UCB function
We use Hoeffding’s Inequality to derive a confidence bound:
It provides an upper bound on the probability that the sum of bounded, independent random variables deviates from its expected value.
This helps measure how much the average reward of a bandit deviates from its true expected value, enabling the calculation of a reliable UCB score.
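Sketch of the derivation (the standard UCB1 form; rewards bounded in [0, 1] and the usual textbook confidence level are assumed here):

```latex
% Hoeffding's inequality for arm a after N_a pulls:
P\left(\mu_a > \bar{x}_a + u\right) \le e^{-2 N_a u^2}
% Setting the right-hand side to t^{-4} and solving for u gives the exploration bonus:
u = \sqrt{\frac{2 \ln t}{N_a}}
% UCB1 score for arm a at time t:
\mathrm{UCB}(a) = \bar{x}_a + \sqrt{\frac{2 \ln t}{N_a}}
```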
is UCB or epsilon-greedy better
generally UCB is better, but if you can tune epsilon well then epsilon-greedy can outperform UCB
still, it is more commonly best to use UCB:
it deals well with many alternatives
you can add new variants while it is running
and it dynamically adjusts the time spent on each bandit
what is thompson sampling
Thompson Sampling is a probabilistic algorithm for solving multi-armed bandit problems that balances exploration vs exploitation
Initialize: Set a prior distribution for each arm’s reward (e.g., Beta distribution for binary rewards).
Sample: Draw a random sample from each arm’s posterior distribution.
Select: Choose the arm with the highest sample value.
Update: Update the posterior distribution of the chosen arm based on the observed reward.
Key Features:
Uses a Bayesian framework to model uncertainty in rewards.
Automatically balances learning and reward maximization.
Simple, efficient, and widely used in applications like online ads, recommendation systems, and clinical trials.
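A minimal Thompson Sampling sketch in Python for binary (win/lose) rewards with Beta priors; the payout probabilities below are made up for illustration:

```python
import random

def thompson_sampling(true_probs, steps=1000):
    n_arms = len(true_probs)
    alpha = [1] * n_arms   # Beta posterior: successes + 1
    beta = [1] * n_arms    # Beta posterior: failures + 1
    total_reward = 0
    for _ in range(steps):
        # Sample one value from each arm's Beta posterior and pick the largest.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if random.random() < true_probs[arm] else 0   # simulated pull
        # Update the chosen arm's posterior with the observed reward.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return alpha, beta, total_reward

print(thompson_sampling([0.3, 0.5, 0.7]))
```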
how can we measure the performance of a bandit algorithm
we use regret to see how well an algorithm performed
it is the difference between the best possible performance (the potential) and the (expected) performance of the algorithm
e.g. the max you could have won is 5000
but you only won 3200
so your regret is 5000 - 3200 = 1800
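In symbols (standard definition, assuming mu* is the mean reward of the best arm and r_t the reward received at step t):

```latex
% Regret after T pulls: what the best arm would have earned minus what you earned
R_T = T \mu^{*} - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]
% In the card's example: R = 5000 - 3200 = 1800
```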
what is a no regret algorithm
if the average regret of the algorithm converges to 0 as T approaches infinity
e.g. if the average regret is 1/T, then as T goes to infinity the average regret goes to 0
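In symbols (standard no-regret condition; the square-root growth example is a common illustration, added here):

```latex
% No-regret: average regret vanishes as T grows
\lim_{T \to \infty} \frac{R_T}{T} = 0
% Example: if R_T = O(\sqrt{T}), then R_T / T = O(1/\sqrt{T}) \to 0
```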