Exploration Flashcards
What is said (by Littman) to be the thing that separates reinforcement learning from all other types of machine learning?
Exploration
In the k-armed bandits scenario, should we use maximum likelihood to make our decisions? Why or why not? What about maximum confidence? Minimum confidence?
We shouldn’t use any of these strategies exclusively:
Maximum likelihood is bad because an unlucky early sample can make a suboptimal lever look best, and we may then never discover the better option (illustrated in the sketch below)
Maximum confidence is bad because we will never break away from it (we always pull the arm we are most confident about, which only makes us more confident in it)
Minimum confidence is bad because we will learn a lot about the game but never use that information to earn reward
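A minimal simulation sketch (my own, not from the lectures) of the first point: pure maximum-likelihood (greedy) arm selection versus an epsilon-greedy strategy that keeps exploring, on a two-armed Bernoulli bandit. The arm means, epsilon, and run lengths are arbitrary illustration values.

```python
import random

def run(policy, true_means=(0.2, 0.9), pulls=500, epsilon=0.1, rng=None):
    rng = rng or random.Random()
    k = len(true_means)
    counts, means = [0] * k, [0.0] * k       # maximum-likelihood estimate per arm
    total = 0.0
    for t in range(pulls):
        if t < k:
            arm = t                           # pull each arm once to initialize
        elif policy == "eps-greedy" and rng.random() < epsilon:
            arm = rng.randrange(k)            # occasional random exploration
        else:
            arm = max(range(k), key=lambda a: means[a])  # pure max-likelihood pick
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total / pulls

rng = random.Random(0)
for policy in ("greedy", "eps-greedy"):
    avg = sum(run(policy, rng=rng) for _ in range(2000)) / 2000
    print(policy, "average reward per pull:", round(avg, 3))
```

When the good arm's first pull happens to return 0, the greedy learner locks onto the worse arm forever, which drags its average reward below the exploring strategy's.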
What is the Gittins index (intuitively)? Does it work for everything?
Given your current information, how high would a fixed, guaranteed payoff have to be to convince you to stop trying to gather more information about the arm?
No, it only works for k-armed bandits; it does not generalize to full MDPs
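A rough numerical sketch of that intuition (not from the lectures), assuming a Bernoulli arm with a Beta posterior: binary-search the fixed per-step payoff at which playing the arm, with the option to retire later, is exactly as attractive as retiring immediately. GAMMA, HORIZON, and the tolerance are made-up values, and the finite-horizon truncation makes this only an approximation of the true Gittins index.

```python
from functools import lru_cache

GAMMA = 0.9       # discount factor (assumed)
HORIZON = 60      # truncation depth for the dynamic program (approximation)

def value_of_playing(a, b, lam):
    """Value of a Bernoulli arm with Beta(a, b) posterior when, at every step,
    we may instead retire and collect lam per step forever (discounted)."""
    retire = lam / (1 - GAMMA)

    @lru_cache(maxsize=None)
    def v(a, b, depth):
        if depth == HORIZON:
            return retire                      # past the horizon: just retire
        p = a / (a + b)                        # posterior probability of a success
        cont = p * (1 + GAMMA * v(a + 1, b, depth + 1)) \
             + (1 - p) * GAMMA * v(a, b + 1, depth + 1)
        return max(retire, cont)               # keep playing or retire now

    return v(a, b, 0), retire

def gittins_index(a, b, tol=1e-4):
    """Binary-search the per-step payoff lam at which playing the arm and
    retiring immediately are equally attractive: that crossover is the index."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        play, retire = value_of_playing(a, b, lam)
        if play > retire + 1e-9:
            lo = lam        # the arm is still worth more than retiring
        else:
            hi = lam
    return (lo + hi) / 2

print(gittins_index(1, 1))   # uniform prior: index exceeds the posterior mean 0.5
print(gittins_index(5, 2))   # more observations: index sits closer to the posterior mean
```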
What are some example metrics for the k-bandits problem?
- Identify optimal arm in the limit
- Maximize discounted expected reward (Gittins Index)
- Maximize expected reward over finite horizon
- Identify a near-optimal arm (within ε) with high probability (1-δ) in time T(k, ε, δ) that is polynomial in k, 1/ε, and 1/δ
- Nearly maximize reward (within ε) with high probability (1-δ) in time T(k, ε, δ) that is polynomial in k, 1/ε, and 1/δ
- Pull a non-near-optimal arm (more than ε worse than the best) no more than T(k, ε, δ) times, with high probability (1-δ)
What is the Hoeffding bound?
It gives a confidence interval that contains the true mean μ with probability at least 1-δ (for a chosen δ in (0,1)), based on the empirical mean μ' of n bounded (e.g., [0,1]-valued) observations.
The interval is: [μ' - z_δ/sqrt(n), μ' + z_δ/sqrt(n)]
z_δ = sqrt(0.5 ln(2/δ))
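A small sketch of exactly that formula, with a Monte Carlo check that the interval covers the true mean at least a 1-δ fraction of the time. The Bernoulli sampling distribution and the parameter values are arbitrary.

```python
import math
import random

def hoeffding_interval(samples, delta):
    """Confidence interval for the true mean of [0, 1]-bounded samples that
    holds with probability at least 1 - delta (Hoeffding bound)."""
    n = len(samples)
    m_hat = sum(samples) / n
    z = math.sqrt(0.5 * math.log(2 / delta))
    return m_hat - z / math.sqrt(n), m_hat + z / math.sqrt(n)

# Empirical coverage check: the true mean should land inside the interval in
# well over a 1 - delta fraction of repeated experiments.
rng = random.Random(0)
true_mean, delta, n, trials = 0.3, 0.05, 100, 2000
hits = 0
for _ in range(trials):
    samples = [1.0 if rng.random() < true_mean else 0.0 for _ in range(n)]
    lo, hi = hoeffding_interval(samples, delta)
    hits += lo <= true_mean <= hi
print("coverage:", hits / trials, ">= 1 - delta =", 1 - delta)
```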
What are some issues with random exploration as an optimization criterion/approach in MDPs? What would be better?
- We may never reach a desired state
- There may be trap states, and our criterion should take this into account
Mistake bounding is better: the number of ε-suboptimal action choices is bounded by poly(1/ε, 1/δ, n, k) with probability 1-δ. (The sketch below shows how badly purely random exploration can do.)
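A toy "combination lock" MDP, a standard illustration rather than something from these flashcards: from state i one specific action advances to i+1 and every other action resets to the start, so purely random exploration needs on the order of k^n steps to reach the goal. Sizes and trial counts are arbitrary.

```python
import random

def random_walk_steps(n_states, k_actions, rng):
    """Steps a purely random policy needs to reach the goal of a 'combination
    lock' MDP: one correct action per state advances, anything else resets."""
    correct = [rng.randrange(k_actions) for _ in range(n_states)]
    state, steps = 0, 0
    while state < n_states:
        action = rng.randrange(k_actions)          # purely random exploration
        state = state + 1 if action == correct[state] else 0
        steps += 1
    return steps

rng = random.Random(0)
for n in (4, 6, 8):
    trials = 50
    avg = sum(random_walk_steps(n, 4, rng) for _ in range(trials)) / trials
    print(f"n={n}: ~{avg:.0f} random steps to reach the goal")  # grows roughly like k^n
```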
Describe the deterministic Rmax algorithm
- Keep track of the MDP (the transitions and rewards observed so far)
- Any unknown state-action pair is assumed to give Rmax (and to lead to a state worth Rmax forever)
- Solve the MDP
- Take the action given by π* (a code sketch of this loop follows below)
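A minimal sketch of the deterministic version, assuming rewards bounded by RMAX: unknown (s, a) pairs point to a fictitious absorbing state worth RMAX forever, and value iteration on that optimistic model gives the π* the agent follows. The names (init_model, solve, rmax_action) and constants are my own, not from the lectures.

```python
RMAX = 1.0        # assumed upper bound on one-step reward
GAMMA = 0.95
OPT = "<unknown>" # fictitious absorbing state worth RMAX forever

def init_model(states, actions):
    """Every state-action pair starts unknown: optimistically it pays RMAX and
    leads to the fictitious optimistic state."""
    return {(s, a): (OPT, RMAX) for s in states for a in actions}

def solve(states, actions, model, iters=500):
    """Value iteration on the learned deterministic model."""
    V = {s: 0.0 for s in states}
    V[OPT] = RMAX / (1 - GAMMA)
    for _ in range(iters):
        for s in states:
            V[s] = max(model[s, a][1] + GAMMA * V[model[s, a][0]] for a in actions)
    return V

def rmax_action(s, states, actions, model):
    """The greedy action of the optimistic model (the pi* in the flashcard)."""
    V = solve(states, actions, model)
    return max(actions, key=lambda a: model[s, a][1] + GAMMA * V[model[s, a][0]])

# Agent loop (informal): act with rmax_action, and whenever we actually
# experience (s, a) -> (s_next, r), record it once and for all:
#     model[s, a] = (s_next, r)
# In a deterministic MDP a single observation makes the pair known.
```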
How do we guarantee Rmax will make no more mistakes for a deterministic MDP?
Visit every state-action pair; once all of them are known the learned model is the true MDP, so π* makes no further mistakes
When should we stop visiting unknown states in Rmax for a deterministic MDP?
Once the discounted reward achievable from the known states is higher than any discounted reward the unknown states could possibly provide (even assuming they are worth Rmax)
How many mistakes could we make while trying to get more information in Rmax for a deterministic MDP?
n, where n is the number of states: each trip to an unknown state-action pair takes at most n steps, and every one of those steps could be a mistake
What is the maximum number of mistakes we can make with Rmax for a deterministic MDP?
k·n²
n is the number of states, k is the number of actions
(there are k·n state-action pairs to learn, and reaching each one while it is still unknown can cost up to n mistakes)
What is the lower bound for an algorithm solving a deterministic MDP?
k·n², which matches Rmax's mistake bound
Describe the GENERAL Rmax algorithm
- Keep track of the MDP (visit counts, empirical transitions, and average rewards for each state-action pair)
- Any unknown state-action pair is assumed to give Rmax. Define unknown as a state-action pair that has been visited fewer than c times (the count bookkeeping is sketched below)
- Solve the MDP
- Take the action given by π*
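A sketch of the extra bookkeeping the general (stochastic) case needs, assuming a visit threshold C: keep counts, empirical transitions, and average rewards per (s, a), and treat the pair as known once it has C visits. The class and method names (RmaxModel, record, known, estimate) are hypothetical.

```python
from collections import defaultdict

C = 20            # visits before a pair counts as "known" (assumed threshold)

class RmaxModel:
    """Bookkeeping for general Rmax: empirical transition and reward estimates
    per (s, a), with the pair treated as unknown until it has C visits."""

    def __init__(self):
        self.counts = defaultdict(int)                             # (s, a) -> visits
        self.next_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                       # (s, a) -> total reward

    def record(self, s, a, r, s_next):
        self.counts[s, a] += 1
        self.next_counts[s, a][s_next] += 1
        self.reward_sum[s, a] += r

    def known(self, s, a):
        return self.counts[s, a] >= C

    def estimate(self, s, a):
        """Empirical T(s' | s, a) and R(s, a); only trusted once known."""
        n = self.counts[s, a]
        T = {s2: c / n for s2, c in self.next_counts[s, a].items()}
        return T, self.reward_sum[s, a] / n
```

Planning then works as in the deterministic sketch: known pairs use their empirical estimates, while unknown pairs are treated as leading to the Rmax-valued fictitious state.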
What is the explore or exploit lemma?
If every transition is either accurately estimated or unknown, and we optimize the resulting (optimistic) model, then the "optimal" policy we get is either near-optimal in the real MDP or reaches an unknown state quickly.
What is the simulation lemma?
If we have an α-good estimate of the MDP's transitions and rewards, then optimizing in the estimate yields a policy that is also (approximately) α-good in the real MDP.
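A toy numerical illustration of the lemma, not a proof: build a small random MDP, perturb its transitions and rewards by roughly ALPHA, optimize in the perturbed estimate, and measure how much value the resulting policy loses in the true MDP. All sizes and the perturbation scheme are arbitrary; the point is that the loss shrinks as ALPHA shrinks.

```python
import random

GAMMA, N_S, N_A, ALPHA = 0.9, 5, 3, 0.02    # small random MDP; ALPHA = estimate error
rng = random.Random(1)

def random_mdp():
    R = [[rng.random() for _ in range(N_A)] for _ in range(N_S)]
    T = [[None] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for a in range(N_A):
            w = [rng.random() for _ in range(N_S)]
            T[s][a] = [x / sum(w) for x in w]
    return T, R

def perturb(T, R):
    """An 'alpha-good' estimate: nudge rewards and transition probabilities."""
    R2 = [[r + rng.uniform(-ALPHA, ALPHA) for r in row] for row in R]
    T2 = [[None] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for a in range(N_A):
            row = [p * (1 + rng.uniform(-ALPHA, ALPHA)) for p in T[s][a]]
            z = sum(row)
            T2[s][a] = [p / z for p in row]
    return T2, R2

def plan(T, R, iters=500):
    """Value iteration; returns the greedy policy and the optimal values."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = [max(R[s][a] + GAMMA * sum(T[s][a][s2] * V[s2] for s2 in range(N_S))
                 for a in range(N_A)) for s in range(N_S)]
    pi = [max(range(N_A), key=lambda a: R[s][a] + GAMMA *
              sum(T[s][a][s2] * V[s2] for s2 in range(N_S))) for s in range(N_S)]
    return pi, V

def evaluate(pi, T, R, iters=500):
    """Value of a fixed policy pi in the MDP (T, R)."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = [R[s][pi[s]] + GAMMA * sum(T[s][pi[s]][s2] * V[s2] for s2 in range(N_S))
             for s in range(N_S)]
    return V

T, R = random_mdp()
T_hat, R_hat = perturb(T, R)
pi_hat, _ = plan(T_hat, R_hat)          # optimize in the (alpha-good) estimate
_, V_star = plan(T, R)                  # true optimal values
V_pi_hat = evaluate(pi_hat, T, R)       # how that policy actually does
print("max value loss:", max(vs - vp for vs, vp in zip(V_star, V_pi_hat)))
```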