w 10 multi-armed bandit theory Flashcards
what does bandit mean in multi-armed bandit
the name comes from slot machines in casinos, which were nicknamed one-armed bandits
Each machine has a different payout rate, but you don’t know what those rates are. Your goal is to figure out which machine pays out the most over time while also winning as much as possible.
what is the multi-armed bandit theory that's used in reinforcement learning
used for problems where an agent repeatedly chooses from a set of actions to maximize the overall reward, without considering state transitions
what is the Markov decision process used in reinforcement learning
involves sequential decision making, where the agent interacts with an environment that transitions between states based on probabilistic dynamics and the chosen actions
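For reference, an MDP is usually written as a tuple (standard textbook notation, added here as a reminder, not from the original card):

```latex
% S = states, A = actions, P = transition probabilities, R = rewards, \gamma = discount factor
(S, A, P, R, \gamma), \qquad P(s' \mid s, a), \qquad R(s, a)
```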
what are the two main strategies you need to balance in the multi-armed bandit problem
exploration - collecting more info, but you miss out on consistently higher rewards
exploitation - capitalizing on your current knowledge, but you might miss a better reward
recall the standard normal distribution density
the standard normal distribution, where the mean is 0 and the standard deviation is 1
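For reference, the standard normal density (well-known formula, added here as a reminder):

```latex
% Standard normal density: mean 0, standard deviation 1
f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2}
```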
recall the exponential distribution
the density decays exponentially
mean = 1/lambda
variance = 1/lambda^2
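For reference, the exponential density with rate lambda (standard formula, added as a reminder):

```latex
% Exponential density with rate \lambda, defined for x >= 0
f(x) = \lambda e^{-\lambda x}, \quad x \ge 0, \qquad
\mathbb{E}[X] = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}
```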
recall the law of large numbers
as the sample size increases, the sample mean gets more accurate at representing the entire population
i.e. the average becomes stable
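In symbols (standard statement, added as a reminder): the sample mean converges to the true mean as the number of samples grows.

```latex
% Law of large numbers: the sample average approaches the true mean \mu
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\longrightarrow\; \mu \quad \text{as } n \to \infty
```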
what is the epsilon greedy approach
most of the time it exploits the best-known action,
but it occasionally explores a random action with probability epsilon
however, it gets less efficient over time because it keeps exploring all options forever, regardless of how well they pay out
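A minimal epsilon-greedy sketch in Python (illustrative only; the payout means, step count, and reward noise below are made-up assumptions):

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=1000):
    n_arms = len(true_means)
    counts = [0] * n_arms          # pulls per arm
    values = [0.0] * n_arms        # running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])   # exploit
        reward = random.gauss(true_means[arm], 1.0)             # simulated payout
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]     # incremental mean
        total_reward += reward
    return values, total_reward

# Example: three slot machines with payout means unknown to the agent.
print(epsilon_greedy([1.0, 1.5, 2.0]))
```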
what is the upper confidence bound algorithm
The UCB algorithm is an optimistic approach to decision-making in multi-armed bandit problems.
It selects the bandit with the highest score, where the score is determined by:
The average reward from past trials.
A confidence bound that reflects the uncertainty or variance in the estimates.
This balances exploration (trying less-sampled bandits) and exploitation (choosing the bandit with the best-known reward).
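A minimal UCB1-style sketch in Python (illustrative only; `true_means` and the simulated rewards are assumptions, and the score uses the Hoeffding-based bonus discussed below):

```python
import math
import random

def ucb1(true_means, steps=1000):
    n_arms = len(true_means)
    counts = [0] * n_arms      # N: pulls per arm
    values = [0.0] * n_arms    # average reward per arm
    total_reward = 0.0
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1        # pull each arm once to initialize
        else:
            # score = average reward + sqrt(2 * ln(t) / N)
            arm = max(range(n_arms),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = random.gauss(true_means[arm], 1.0)              # simulated payout
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]      # incremental mean
        total_reward += reward
    return values, total_reward

print(ucb1([1.0, 1.5, 2.0]))
```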
how do you determine the UCB function
it should be high when N is small, i.e. we only have a few samples of that arm
and small when N is large
it is important to find a meaningful uncertainty function U
what do we use to find a useful UCB function
We use Hoeffding’s Inequality to derive a confidence bound:
It provides an upper bound on the probability that the sum of bounded, independent random variables deviates from its expected value.
This helps measure how much the average reward of a bandit deviates from its true expected value, enabling the calculation of a reliable UCB score.
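Sketch of the derivation (the standard UCB1 form; rewards bounded in [0, 1] and the usual textbook confidence level are assumed here):

```latex
% Hoeffding's inequality for arm a after N_a pulls:
P\left(\mu_a > \bar{x}_a + u\right) \le e^{-2 N_a u^2}
% Setting the right-hand side to t^{-4} and solving for u gives the exploration bonus:
u = \sqrt{\frac{2 \ln t}{N_a}}
% UCB1 score for arm a at time t:
\mathrm{UCB}(a) = \bar{x}_a + \sqrt{\frac{2 \ln t}{N_a}}
```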
is UCB or epsilon-greedy better
generally UCB is better, but if you can tune epsilon well then epsilon-greedy can outperform UCB
still, it is more commonly best to use UCB:
it deals well with many alternatives
you can add new variants while it is running
and it dynamically adjusts the time spent on each bandit
what is thompson sampling
Thompson Sampling is a probabilistic algorithm for solving multi-armed bandit problems that balances exploration vs exploitation
Initialize: Set a prior distribution for each arm’s reward (e.g., Beta distribution for binary rewards).
Sample: Draw a random sample from each arm’s posterior distribution.
Select: Choose the arm with the highest sample value.
Update: Update the posterior distribution of the chosen arm based on the observed reward.
Key Features:
Uses a Bayesian framework to model uncertainty in rewards.
Automatically balances learning and reward maximization.
Simple, efficient, and widely used in applications like online ads, recommendation systems, and clinical trials.
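A minimal Thompson Sampling sketch in Python for binary (win/lose) rewards with Beta priors; the payout probabilities below are made up for illustration:

```python
import random

def thompson_sampling(true_probs, steps=1000):
    n_arms = len(true_probs)
    alpha = [1] * n_arms   # Beta posterior: successes + 1
    beta = [1] * n_arms    # Beta posterior: failures + 1
    total_reward = 0
    for _ in range(steps):
        # Sample one value from each arm's Beta posterior and pick the largest.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if random.random() < true_probs[arm] else 0   # simulated pull
        # Update the chosen arm's posterior with the observed reward.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return alpha, beta, total_reward

print(thompson_sampling([0.3, 0.5, 0.7]))
```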
how can we measure the performance of a bandit algorithm
we use regret to see how well an algorithm performed
it is the difference between the best possible performance (the potential) and the (expected) performance of the algorithm
e.g. the max you could have won is 5000
but you only won 3200
so your regret is 5000 - 3200 = 1800
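In symbols (standard definition, assuming mu* is the mean reward of the best arm and r_t the reward received at step t):

```latex
% Regret after T pulls: what the best arm would have earned minus what you earned
R_T = T \mu^{*} - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]
% In the card's example: R = 5000 - 3200 = 1800
```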
what is a no regret algorithm
if the average regret of the algorithm converges to 0 as T approaches infinity
e.g. if the average regret is 1/T, then as T goes to infinity the average regret goes to 0
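In symbols (standard no-regret condition; the square-root growth example is a common illustration, added here):

```latex
% No-regret: average regret vanishes as T grows
\lim_{T \to \infty} \frac{R_T}{T} = 0
% Example: if R_T = O(\sqrt{T}), then R_T / T = O(1/\sqrt{T}) \to 0
```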