L8 - Exploring Exploration Flashcards
K-Armed Bandit
A single agent faces k arms (like k slot machines); each arm pays off with an unknown probability, and at each step the agent must choose which arm to pull.
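A minimal k-armed Bernoulli bandit sketch (the class name and payoff probabilities are illustrative assumptions):

```python
import random

class BernoulliBandit:
    """k arms; pulling arm i pays 1 with hidden probability probs[i], else 0."""

    def __init__(self, probs):
        self.probs = probs  # unknown to the agent, who only sees rewards

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

# Example: 3 arms with hidden payoff probabilities 0.2, 0.5, 0.8.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
print(sum(bandit.pull(2) for _ in range(100)))  # roughly 80 successes
```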
Maximum Confidence Strategy
What is the maximum confidence strategy and where does it fail?
The maximum confidence strategy always picks the arm that has been chosen the most, since that is the arm whose payoff estimate we are most confident in. It fails because it never explores: whichever arm happens to be pulled first keeps getting pulled forever, regardless of how it pays off.
What is the minimum confidence strategy and how does it fail?
The minimum confidence strategy is pure exploration: it always chooses the arm that has been pulled the least. It fails because it ignores the observed rewards entirely, so its long-run payoff is just the average over all arms rather than the best arm's payoff.
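A sketch of both pure strategies from the two cards above, run against a toy Bernoulli bandit (the arm probabilities, horizon, and tie-breaking are arbitrary assumptions):

```python
import random

def pull(arm, probs=(0.2, 0.5, 0.8)):
    """Bernoulli bandit: arm pays 1 with hidden probability probs[arm]."""
    return 1 if random.random() < probs[arm] else 0

def max_confidence_arm(counts):
    # Exploit-only: pull the arm pulled most so far (ties -> lowest index),
    # so whichever arm starts ahead is pulled forever.
    return max(range(len(counts)), key=lambda a: counts[a])

def min_confidence_arm(counts):
    # Explore-only: pull the arm pulled least so far; rewards never matter.
    return min(range(len(counts)), key=lambda a: counts[a])

k = 3
counts, rewards = [0] * k, [0] * k
for _ in range(1000):
    arm = min_confidence_arm(counts)  # swap in max_confidence_arm to compare
    rewards[arm] += pull(arm)
    counts[arm] += 1
print(counts, [r / max(c, 1) for r, c in zip(rewards, counts)])
```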
What are the metrics for bandits?
Bad metrics
- Identify the optimal arm in the limit (too weak: any strategy that never stops exploring achieves this).
- Maximize the (discounted) expected reward.
- Gittins index - solves the discounted case, but only works for bandit problems; it does not generalize to MDPs.
- Maximize expected reward over a finite horizon.
Good metrics
- Identify a near-optimal arm, with high probability (1 − δ), in time t(k, ε, δ) polynomial in k, 1/ε, 1/δ (PAC / PAO).
- Nearly maximize reward, with high probability (1 − δ), in time t(k, ε, δ) polynomial in k, 1/ε, 1/δ (PAC / PAO).
- Pull a non-near-optimal arm no more than t(k, ε, δ) times, with high probability (1 − δ). (A concrete per-arm sample count is sketched below.)
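To make the "poly time" requirement concrete, a naive per-arm sample count falls out of the Hoeffding bound plus a union bound (a sketch with the simple constants, not a tight PAC bound):

```python
import math

def pulls_per_arm(k, epsilon, delta):
    """Hoeffding + union bound: pulls per arm so that every arm's estimated
    mean is within epsilon of its true mean with probability >= 1 - delta."""
    return math.ceil(math.log(2 * k / delta) / (2 * epsilon ** 2))

# e.g. k = 10 arms, epsilon = 0.1, delta = 0.05:
print(pulls_per_arm(10, 0.1, 0.05))  # 300 pulls per arm -- polynomial in
                                     # k, 1/epsilon, 1/delta
```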
What is the R-Max algorithm?
- Keep track of the MDP (estimated transitions and rewards).
- Any unknown state-action pair is assumed to have reward R_max (the maximum possible reward).
- Solve the MDP.
- Take the action given by π*.
What is the general R-Max algorithm?
- Keep track of the MDP.
- A state-action pair is "unknown" if it has been tried fewer than c times; any unknown pair is assumed to have reward R_max (the maximum possible reward).
- Otherwise, use the maximum-likelihood estimates of its transitions and rewards.
- Solve the MDP.
- Take the action given by π* (see the sketch below).
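A compact sketch of the general R-Max planning step, assuming a tabular MDP with count, transition, and reward dictionaries and a small value-iteration solver (the function name, data layout, and parameters are illustrative assumptions, not a canonical implementation):

```python
def rmax_policy(S, A, counts, trans, rew, c, R_max, gamma=0.95, iters=500):
    """Plan in the optimistic model used by general R-Max.

    counts[s][a] : number of times (s, a) has been tried
    trans[s][a]  : dict mapping next state s2 -> observed count
    rew[s][a]    : sum of rewards observed for (s, a)
    Pairs tried fewer than c times are "unknown" and valued optimistically
    at R_max / (1 - gamma); known pairs use maximum-likelihood estimates.
    """
    def q(s, a, V):
        n = counts[s][a]
        if n < c:                      # unknown: optimism drives exploration
            return R_max / (1 - gamma)
        r = rew[s][a] / n              # ML reward estimate
        return r + gamma * sum(cnt / n * V[s2]  # ML transition model
                               for s2, cnt in trans[s][a].items())

    V = {s: 0.0 for s in S}
    for _ in range(iters):             # value iteration on the optimistic MDP
        V = {s: max(q(s, a, V) for a in A) for s in S}
    return {s: max(A, key=lambda a: q(s, a, V)) for s in S}

# Tiny 2-state example: only (s0, a0) is "known" (tried >= c times).
S, A, c = ["s0", "s1"], ["a0", "a1"], 5
counts = {"s0": {"a0": 6, "a1": 0}, "s1": {"a0": 0, "a1": 0}}
trans = {"s0": {"a0": {"s1": 6}, "a1": {}}, "s1": {"a0": {}, "a1": {}}}
rew = {"s0": {"a0": 3.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 0.0}}
print(rmax_policy(S, A, counts, trans, rew, c, R_max=1.0))
# -> unknown actions look best, so the policy steers toward unexplored pairs
```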
Hoeffding Bound
For n i.i.d. samples X_1, ..., X_n bounded in [0, 1], with empirical mean x̂ and true mean μ: P(|x̂ − μ| > ε) ≤ 2e^(−2nε²). Equivalently, with probability at least 1 − δ, the estimate is within ε = sqrt(ln(2/δ) / (2n)) of the true mean. R-Max uses this bound to pick the threshold c so that "known" state-action pairs are accurately estimated with high probability.
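A quick sketch that computes the Hoeffding confidence radius and checks the failure rate empirically (the sample size, δ, and the Bernoulli parameter are arbitrary choices):

```python
import math
import random

def hoeffding_radius(n, delta):
    """eps such that P(|empirical mean - true mean| > eps) <= delta
    for n i.i.d. samples bounded in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

n, delta, p = 500, 0.05, 0.3
eps = hoeffding_radius(n, delta)
misses = 0
for _ in range(2000):
    mean_hat = sum(random.random() < p for _ in range(n)) / n
    misses += abs(mean_hat - p) > eps
print(eps, misses / 2000)  # observed failure rate should be well below delta
```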
Simulation Lemma
If the estimated MDP is an α-approximation of the true MDP (all transition probabilities and rewards within α of the truth), then for any policy its value in the estimated MDP is within ε of its value in the true MDP, where the required accuracy α is polynomial in ε and in (1 − γ). In short: an accurate model gives accurate value estimates, so planning in the learned model is safe.
Explore or Exploit Lemma
If all transitions are either accurately estimated or unknown, then the optimal policy is either near-optimal or an unknown state is reached quickly.