L8 - Exploring Exploration Flashcards
K-Armed Bandit
A single agent faces k arms (like k slot machines); each arm pays off with an unknown probability, and at each step the agent must choose which arm to pull.
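A minimal k-armed Bernoulli bandit sketch (the class name and payoff probabilities are illustrative assumptions):

```python
import random

class BernoulliBandit:
    """k arms; pulling arm i pays 1 with hidden probability probs[i], else 0."""

    def __init__(self, probs):
        self.probs = probs  # unknown to the agent, who only sees rewards

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

# Example: 3 arms with hidden payoff probabilities 0.2, 0.5, 0.8.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
print(sum(bandit.pull(2) for _ in range(100)))  # roughly 80 successes
```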
Maximum Confidence Strategy
What is the maximum confidence strategy and where does it fail?
The maximum confidence strategy always picks the arm that has been chosen the most, since that is the arm whose payoff estimate we are most confident in. It fails because it never explores: whichever arm happens to be pulled first keeps getting pulled forever, regardless of how it pays off.
What is the minimum confidence strategy and how does it fail?
The minimum confidence strategy is pure exploration: it always chooses the arm that has been pulled the least. It fails because it ignores the observed rewards entirely, so its long-run payoff is just the average over all arms rather than the best arm's payoff.
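A sketch of both pure strategies from the two cards above, run against a toy Bernoulli bandit (the arm probabilities, horizon, and tie-breaking are arbitrary assumptions):

```python
import random

def pull(arm, probs=(0.2, 0.5, 0.8)):
    """Bernoulli bandit: arm pays 1 with hidden probability probs[arm]."""
    return 1 if random.random() < probs[arm] else 0

def max_confidence_arm(counts):
    # Exploit-only: pull the arm pulled most so far (ties -> lowest index),
    # so whichever arm starts ahead is pulled forever.
    return max(range(len(counts)), key=lambda a: counts[a])

def min_confidence_arm(counts):
    # Explore-only: pull the arm pulled least so far; rewards never matter.
    return min(range(len(counts)), key=lambda a: counts[a])

k = 3
counts, rewards = [0] * k, [0] * k
for _ in range(1000):
    arm = min_confidence_arm(counts)  # swap in max_confidence_arm to compare
    rewards[arm] += pull(arm)
    counts[arm] += 1
print(counts, [r / max(c, 1) for r, c in zip(rewards, counts)])
```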
What are the metrics for bandits?
Bad metrics
- Identify the optimal arm in the limit (too weak: any strategy that never stops exploring achieves this).
- Maximize the (discounted) expected reward.
- Gittins index - solves the discounted case, but only works for bandit problems; it does not generalize to MDPs.
- Maximize expected reward over a finite horizon.
Good metrics
- Identify a near-optimal arm, with high probability (1 − δ), in time t(k, ε, δ) polynomial in k, 1/ε, 1/δ (PAC / PAO).
- Nearly maximize reward, with high probability (1 − δ), in time t(k, ε, δ) polynomial in k, 1/ε, 1/δ (PAC / PAO).
- Pull a non-near-optimal arm no more than t(k, ε, δ) times, with high probability (1 − δ). (A concrete per-arm sample count is sketched below.)
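To make the "poly time" requirement concrete, a naive per-arm sample count falls out of the Hoeffding bound plus a union bound (a sketch with the simple constants, not a tight PAC bound):

```python
import math

def pulls_per_arm(k, epsilon, delta):
    """Hoeffding + union bound: pulls per arm so that every arm's estimated
    mean is within epsilon of its true mean with probability >= 1 - delta."""
    return math.ceil(math.log(2 * k / delta) / (2 * epsilon ** 2))

# e.g. k = 10 arms, epsilon = 0.1, delta = 0.05:
print(pulls_per_arm(10, 0.1, 0.05))  # 300 pulls per arm -- polynomial in
                                     # k, 1/epsilon, 1/delta
```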
What is the R-Max algorithm?
- Keep track of the MDP (estimated transitions and rewards).
- Any unknown state-action pair is assumed to have reward R_max (the maximum possible reward).
- Solve the MDP.
- Take the action given by π*.
What is the general R-Max algorithm?
- Keep track of the MDP.
- A state-action pair is "unknown" if it has been tried fewer than c times; any unknown pair is assumed to have reward R_max (the maximum possible reward).
- Otherwise, use the maximum-likelihood estimates of its transitions and rewards.
- Solve the MDP.
- Take the action given by π* (see the sketch below).
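A compact sketch of the general R-Max planning step, assuming a tabular MDP with count, transition, and reward dictionaries and a small value-iteration solver (the function name, data layout, and parameters are illustrative assumptions, not a canonical implementation):

```python
def rmax_policy(S, A, counts, trans, rew, c, R_max, gamma=0.95, iters=500):
    """Plan in the optimistic model used by general R-Max.

    counts[s][a] : number of times (s, a) has been tried
    trans[s][a]  : dict mapping next state s2 -> observed count
    rew[s][a]    : sum of rewards observed for (s, a)
    Pairs tried fewer than c times are "unknown" and valued optimistically
    at R_max / (1 - gamma); known pairs use maximum-likelihood estimates.
    """
    def q(s, a, V):
        n = counts[s][a]
        if n < c:                      # unknown: optimism drives exploration
            return R_max / (1 - gamma)
        r = rew[s][a] / n              # ML reward estimate
        return r + gamma * sum(cnt / n * V[s2]  # ML transition model
                               for s2, cnt in trans[s][a].items())

    V = {s: 0.0 for s in S}
    for _ in range(iters):             # value iteration on the optimistic MDP
        V = {s: max(q(s, a, V) for a in A) for s in S}
    return {s: max(A, key=lambda a: q(s, a, V)) for s in S}

# Tiny 2-state example: only (s0, a0) is "known" (tried >= c times).
S, A, c = ["s0", "s1"], ["a0", "a1"], 5
counts = {"s0": {"a0": 6, "a1": 0}, "s1": {"a0": 0, "a1": 0}}
trans = {"s0": {"a0": {"s1": 6}, "a1": {}}, "s1": {"a0": {}, "a1": {}}}
rew = {"s0": {"a0": 3.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 0.0}}
print(rmax_policy(S, A, counts, trans, rew, c, R_max=1.0))
# -> unknown actions look best, so the policy steers toward unexplored pairs
```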
Hoeffding Bound
For n i.i.d. samples X_1, ..., X_n bounded in [0, 1], with empirical mean x̂ and true mean μ: P(|x̂ − μ| > ε) ≤ 2e^(−2nε²). Equivalently, with probability at least 1 − δ, the estimate is within ε = sqrt(ln(2/δ) / (2n)) of the true mean. R-Max uses this bound to pick the threshold c so that "known" state-action pairs are accurately estimated with high probability.
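A quick sketch that computes the Hoeffding confidence radius and checks the failure rate empirically (the sample size, δ, and the Bernoulli parameter are arbitrary choices):

```python
import math
import random

def hoeffding_radius(n, delta):
    """eps such that P(|empirical mean - true mean| > eps) <= delta
    for n i.i.d. samples bounded in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

n, delta, p = 500, 0.05, 0.3
eps = hoeffding_radius(n, delta)
misses = 0
for _ in range(2000):
    mean_hat = sum(random.random() < p for _ in range(n)) / n
    misses += abs(mean_hat - p) > eps
print(eps, misses / 2000)  # observed failure rate should be well below delta
```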
Simulation Lemma
If the estimated MDP is an α-approximation of the true MDP (all transition probabilities and rewards within α of the truth), then for any policy its value in the estimated MDP is within ε of its value in the true MDP, where the required accuracy α is polynomial in ε and in (1 − γ). In short: an accurate model gives accurate value estimates, so planning in the learned model is safe.
Explore or Exploit Lemma
If all transitions are either accurately estimated or unknown, then the optimal policy is either near-optimal or an unknown state is reached quickly.