L8 - Exploring Exploration Flashcards

1
Q

K-Armed Bandit

A

A bandit with K arms, where each arm, when pulled, yields a reward with an unknown probability. A single agent must decide which arm to pull to maximize reward.
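
To make the setup concrete, here is a minimal sketch (my own illustration, not from the lecture) of a Bernoulli K-armed bandit; the class name and payoff probabilities are assumptions:

    import random

    class BernoulliBandit:
        """K arms; arm i pays reward 1 with unknown probability probs[i], else 0."""
        def __init__(self, probs):
            self.probs = probs  # true reward probabilities, hidden from the agent

        def pull(self, arm):
            return 1 if random.random() < self.probs[arm] else 0

    bandit = BernoulliBandit([0.2, 0.5, 0.9])  # K = 3 arms
    reward = bandit.pull(2)                    # the agent observes only the reward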

2
Q

Maximum likelihood Strategy

A

Always pick the arm with the highest maximum-likelihood estimate of expected payoff: pure exploitation. It fails because an arm whose first few pulls happen to pay poorly may never be chosen again, so the truly optimal arm can be ignored forever.
3
Q

What is the maximum confidence strategy, and where does it fail?

A

The maximum confidence strategy always picks the arm that has been pulled the most, since that is the estimate we are most confident about. It fails because it ignores reward entirely: whichever arm happens to lead in pull count gets pulled forever, even if it is clearly worse.

4
Q

What is the minimum confidence strategy, and how does it fail?

A

The minimum confidence strategy is pure exploration: it always chooses the arm that has been pulled the least. It fails because it never exploits what it learns; it just cycles through the arms round-robin, regardless of observed reward.
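
The three naive strategies from the last few cards can be sketched as arm-selection rules over empirical means and pull counts (my own illustration; the function names are not the lecture's):

    def maximum_likelihood_arm(means, counts):
        # Pure exploitation: the arm with the highest estimated payoff.
        return max(range(len(means)), key=lambda a: means[a])

    def maximum_confidence_arm(means, counts):
        # The arm pulled the most: the estimate we are most confident in.
        return max(range(len(counts)), key=lambda a: counts[a])

    def minimum_confidence_arm(means, counts):
        # Pure exploration: the arm pulled the least.
        return min(range(len(counts)), key=lambda a: counts[a])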

5
Q

What are the metrics for bandits?

A

Bad metrics

  1. Identify the optimal arm in the limit.
  2. Maximize the (discounted) expected reward.
    1. The Gittins index solves this, but it only works for bandit problems.
  3. Maximize the expected reward over a finite horizon.

Good metrics

  1. Identify a near-optimal arm (within epsilon) with high probability (1 - delta) in time t(k, epsilon, delta) that is polynomial in k, 1/epsilon, and 1/delta (PAC / PAO).
  2. Nearly maximize reward with high probability (1 - delta) in polynomial time t(k, epsilon, delta) (PAC / PAO).
  3. Pull a non-near-optimal arm no more than t(k, epsilon, delta) times, with high probability (1 - delta).
6
Q

What is the R-Max algorithm?

A
  1. Keep track of the MDP (estimate transitions and rewards from experience).
  2. Any unknown state-action pair is assumed to have R-max (the maximum possible reward).
  3. Solve the MDP.
  4. Take the action given by π*.
7
Q

What is the general R-Max algorithm?

A
  1. Keep track of the MDP.
  2. Any unknown state-action pair is assumed to have R-max (the maximum possible reward).
    1. A state-action pair (s, a) is unknown if it has been tried fewer than c times.
    2. Otherwise, use the maximum likelihood estimate of its transitions and reward.
  3. Solve the MDP.
  4. Take the action given by π*.
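
A minimal sketch of steps 1-3 as value iteration on the optimistic model; the data layout (counts, trans, total_rew) and the threshold name c are illustrative assumptions, not the lecture's notation:

    def rmax_values(S, A, counts, trans, total_rew, R_MAX, c, gamma=0.95, iters=200):
        # counts[(s, a)]    -> times (s, a) has been tried
        # trans[(s, a)]     -> {next_state: visit count}
        # total_rew[(s, a)] -> sum of rewards observed at (s, a)
        V = {s: 0.0 for s in S}
        for _ in range(iters):
            for s in S:
                q_values = []
                for a in A:
                    n = counts.get((s, a), 0)
                    if n < c:
                        # Unknown pair: optimistically worth R_MAX forever
                        # (a self-loop paying the maximum possible reward).
                        q_values.append(R_MAX / (1 - gamma))
                    else:
                        # Known pair: maximum likelihood estimates.
                        r_hat = total_rew[(s, a)] / n
                        ev = sum(k / n * V[s2] for s2, k in trans[(s, a)].items())
                        q_values.append(r_hat + gamma * ev)
                V[s] = max(q_values)
        return V

Acting greedily with respect to these optimistic values (step 4) drives the agent either to exploit a known good region or to reach an unknown pair quickly, which is exactly the explore-or-exploit lemma below.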

8
Q

Hoeffding Bound

A

For n i.i.d. samples of a random variable bounded in [0, 1], with empirical mean x̄ and true mean μ: P(|x̄ - μ| ≥ epsilon) ≤ 2 · exp(-2 · n · epsilon²). Equivalently, with probability at least 1 - delta, the true mean is within epsilon = sqrt(ln(2/delta) / (2n)) of the empirical mean. This bound tells us how many samples are needed before a state-action pair counts as "known."
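
As a quick worked example (my own, with illustrative numbers), the bound gives the number of pulls needed before an arm's estimate is epsilon-accurate with probability at least 1 - delta:

    import math

    def hoeffding_samples(epsilon, delta):
        # Smallest n with 2*exp(-2*n*eps^2) <= delta,
        # i.e. n >= ln(2/delta) / (2*eps^2).
        return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

    print(hoeffding_samples(0.1, 0.05))  # 185 pulls for epsilon=0.1, delta=0.05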

9
Q

Simulation Lemma

A

If an estimated MDP's transition probabilities and rewards are all close (within some α) to the true MDP's, then the value of any policy in the estimated MDP is close (within epsilon) to its value in the true MDP. So a sufficiently accurate model yields near-optimal behavior.

10
Q

Explore or Exploit Lemma

A

If all transitions are either accurately estimated or unknown, then the optimal policy is either near-optimal or an unknown state is reached quickly.
