Week 10: Multi-Armed Bandit Theory Flashcards

1
Q

what does "bandit" mean in multi-armed bandit

A

The name comes from slot machines in casinos, which were nicknamed "one-armed bandits".

Each machine has a different payout rate, but you don’t know what those rates are. Your goal is to figure out which machine pays out the most over time while also winning as much as possible.

2
Q

what is the multi-armed bandit problem that's used in reinforcement learning

A

It is used for problems where an agent repeatedly chooses from a set of actions to maximize the overall reward, without considering state transitions.
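
A minimal sketch of the setting, assuming Bernoulli arms with hypothetical payout probabilities (the agent never sees them directly):

import random

payout_probs = [0.3, 0.5, 0.7]   # hypothetical true payout rates, unknown to the agent

def pull(arm):
    # The environment pays 1 with the arm's payout probability, otherwise 0.
    return 1 if random.random() < payout_probs[arm] else 0

Every algorithm below differs only in how it decides which arm to pull next; there is no state carried between pulls.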

3
Q

what is the Markov decision process used in reinforcement learning

A

It involves sequential decision making, where the agent interacts with an environment that transitions between states based on probabilistic dynamics and the chosen actions.

4
Q

what are the two main strategies you need to balance in the multi-armed bandit problem

A

exploration - collecting more information, at the cost of missing out on consistently higher rewards
exploitation - capitalizing on your current knowledge, at the risk of missing a better option

5
Q

recall the standard normal distribution's density function

A

The normal distribution where the mean is 0 and the standard deviation is 1; its density is f(x) = (1 / sqrt(2*pi)) * e^(-x^2 / 2).

6
Q

recall the exponential distribution

A

Its density decays exponentially: f(x) = lambda * e^(-lambda * x) for x >= 0.
mean = 1/lambda
variance = 1/lambda^2

7
Q

recall the law of large numbers

A

As the sample size increases, the sample mean becomes a more accurate estimate of the population mean, i.e. the average becomes stable.
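
A quick simulation sketch, assuming draws from a standard normal (true mean 0):

import random

for n in (10, 1000, 100000):
    samples = [random.gauss(0, 1) for _ in range(n)]
    print(n, sum(samples) / n)   # the sample mean drifts toward the true mean 0 as n grows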

8
Q

what is the epsilon greedy approach

A

Most of the time it exploits the best-known action, but it explores a random action with probability epsilon.

However, it becomes less effective over time because it keeps exploring all options forever, regardless of how profitable they are.
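
A minimal epsilon-greedy sketch, assuming Bernoulli arms with hypothetical payout probabilities and an illustrative epsilon of 0.1:

import random

epsilon = 0.1
payout_probs = [0.3, 0.5, 0.7]      # hypothetical true payout rates, unknown to the agent
counts = [0, 0, 0]                  # pulls per arm
values = [0.0, 0.0, 0.0]            # running average reward per arm

for t in range(1000):
    if random.random() < epsilon:                          # explore with probability epsilon
        arm = random.randrange(3)
    else:                                                  # otherwise exploit the best estimate
        arm = max(range(3), key=lambda a: values[a])
    reward = 1 if random.random() < payout_probs[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean update

Because epsilon stays fixed, a constant fraction of pulls is wasted on bad arms forever, which is the weakness described above.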

9
Q

what is the upper confidence bound algorithm

A

The UCB algorithm is an optimistic approach to decision-making in multi-armed bandit problems.

It selects the bandit with the highest score, where the score is determined by:
The average reward from past trials.
A confidence bound that reflects the uncertainty or variance in the estimates.

This balances exploration (trying less-sampled bandits) and exploitation (choosing the bandit with the best-known reward).
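
A minimal UCB1-style sketch, assuming rewards in [0, 1] and hypothetical Bernoulli arms; the score is the average reward plus an exploration bonus sqrt(2 * ln(t) / n):

import math
import random

payout_probs = [0.3, 0.5, 0.7]      # hypothetical true payout rates, unknown to the agent
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]

for t in range(1, 1001):
    if 0 in counts:                 # pull each arm once before applying the formula
        arm = counts.index(0)
    else:                           # UCB1 score: average reward + confidence bonus
        arm = max(range(3), key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1 if random.random() < payout_probs[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]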

10
Q

how do you determine the ucb function

A

It should be high when N is small, i.e. when we only have a few samples of that arm,

and small when N is large.

It is important to find a meaningful function U.
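
One standard choice (UCB1, assuming rewards bounded in [0, 1]) is U_t(a) = sqrt(2 * ln(t) / N_t(a)), where t is the total number of pulls so far and N_t(a) is how many times arm a has been pulled: the bonus shrinks as N_t(a) grows and grows only slowly with t.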

11
Q

what do we use to find a useful UCB function

A

We use Hoeffding’s Inequality to derive a confidence bound:

It provides an upper bound on the probability that the sum of bounded, independent random variables deviates from its expected value.

This helps measure how much the average reward of a bandit deviates from its true expected value, enabling the calculation of a reliable UCB score.
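
A derivation sketch, assuming rewards bounded in [0, 1]: for an arm with true mean mu and empirical mean mu_hat over N pulls, Hoeffding's inequality gives P(mu > mu_hat + u) <= e^(-2 * N * u^2). Setting this probability to a small value p and solving for u gives u = sqrt(-ln(p) / (2 * N)); the common choice p = t^(-4) yields the UCB1 bonus sqrt(2 * ln(t) / N).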

12
Q

is UCB or epsilon-greedy better

A

Generally UCB is better, but if you can tune epsilon well, epsilon-greedy can outperform UCB.

More commonly, though, it is best to use UCB:
- it deals well with many alternatives
- you can add new variants while it is running
- it dynamically adjusts how much time is spent on each bandit

13
Q

what is thompson sampling

A

Thompson Sampling is a probabilistic algorithm for solving multi-armed bandit problems that balances exploration vs exploitation

Initialize: Set a prior distribution for each arm’s reward (e.g., Beta distribution for binary rewards).
Sample: Draw a random sample from each arm’s posterior distribution.
Select: Choose the arm with the highest sample value.
Update: Update the posterior distribution of the chosen arm based on the observed reward.
Key Features:
Uses a Bayesian framework to model uncertainty in rewards.
Automatically balances learning and reward maximization.
Simple, efficient, and widely used in applications like online ads, recommendation systems, and clinical trials.
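
A minimal Beta-Bernoulli sketch of those steps, assuming hypothetical payout probabilities and Beta(1, 1) priors:

import random

payout_probs = [0.3, 0.5, 0.7]      # hypothetical true payout rates, unknown to the agent
alpha = [1, 1, 1]                   # Beta posterior: 1 + observed successes per arm
beta  = [1, 1, 1]                   # Beta posterior: 1 + observed failures per arm

for t in range(1000):
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(3)]   # sample each posterior
    arm = samples.index(max(samples))                                     # choose the highest sample
    reward = 1 if random.random() < payout_probs[arm] else 0
    if reward == 1:                                                       # update the chosen arm's posterior
        alpha[arm] += 1
    else:
        beta[arm] += 1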

14
Q

how can we measure the performance of a bandit algorithm

A

We use regret to see how well an algorithm performed:

the difference between the best possible performance (the potential) and the (expected) performance of the algorithm.

For example, the maximum you could have won is 5000, but you only won 3200,

so your regret is 5000 - 3200 = 1800.
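
In expectation this can be written as regret(T) = T * mu* - (sum over the T pulls of the expected reward of each chosen arm), where mu* is the expected reward of the best arm.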

15
Q

what is a no regret algorithm

A

An algorithm whose average regret converges to 0 as T approaches infinity.

For example, if the average regret is 1/T, then as T goes to infinity it tends to 0.

16
Q

how do you find the average regret

A

Divide the total regret by the number of time steps, i.e. measure the regret incurred on average at each step.

17
Q

pros of a no regret algorithm

A

On average, a no-regret algorithm will eventually always pull the best arm, so it quickly starts behaving like the best possible algorithm, which is our goal.

18
Q

what are the regrets of each algorithm for bandit problems

A

epsilon-greedy: regret grows linearly

decaying epsilon-greedy: sublinear regret

UCB: logarithmic regret

Thompson sampling: has a better regret bound than UCB
- it follows / assumes the underlying reward distribution