Exploration Flashcards
What is said (by Littman) to be the thing that separates reinforcement learning from all other types of machine learning?
Exploration
In the k-armed bandits scenario, should we use maximum likelihood to make our decisions? Why or why not? What about maximum confidence? Minimum confidence?
We shouldn’t use any of these strategies exclusively:
Maximum likelihood is bad because an unlucky early sample can make a suboptimal lever look best, and we may then never discover the better option (illustrated in the sketch below)
Maximum confidence is bad because we will never break away from it (we always pull the arm we are most confident about, which only makes us more confident in it)
Minimum confidence is bad because we will learn a lot about the game but never use that information to earn reward
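A minimal simulation sketch (my own, not from the lectures) of the first point: pure maximum-likelihood (greedy) arm selection versus an epsilon-greedy strategy that keeps exploring, on a two-armed Bernoulli bandit. The arm means, epsilon, and run lengths are arbitrary illustration values.

```python
import random

def run(policy, true_means=(0.2, 0.9), pulls=500, epsilon=0.1, rng=None):
    rng = rng or random.Random()
    k = len(true_means)
    counts, means = [0] * k, [0.0] * k       # maximum-likelihood estimate per arm
    total = 0.0
    for t in range(pulls):
        if t < k:
            arm = t                           # pull each arm once to initialize
        elif policy == "eps-greedy" and rng.random() < epsilon:
            arm = rng.randrange(k)            # occasional random exploration
        else:
            arm = max(range(k), key=lambda a: means[a])  # pure max-likelihood pick
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total / pulls

rng = random.Random(0)
for policy in ("greedy", "eps-greedy"):
    avg = sum(run(policy, rng=rng) for _ in range(2000)) / 2000
    print(policy, "average reward per pull:", round(avg, 3))
```

When the good arm's first pull happens to return 0, the greedy learner locks onto the worse arm forever, which drags its average reward below the exploring strategy's.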
What is the Gittins index (intuitively)? Does it work for everything?
Given your current information, how high would a fixed, guaranteed payoff have to be to convince you to stop trying to gather more information about the arm?
No, it only works for k-armed bandits; it does not generalize to full MDPs
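A rough numerical sketch of that intuition (not from the lectures), assuming a Bernoulli arm with a Beta posterior: binary-search the fixed per-step payoff at which playing the arm, with the option to retire later, is exactly as attractive as retiring immediately. GAMMA, HORIZON, and the tolerance are made-up values, and the finite-horizon truncation makes this only an approximation of the true Gittins index.

```python
from functools import lru_cache

GAMMA = 0.9       # discount factor (assumed)
HORIZON = 60      # truncation depth for the dynamic program (approximation)

def value_of_playing(a, b, lam):
    """Value of a Bernoulli arm with Beta(a, b) posterior when, at every step,
    we may instead retire and collect lam per step forever (discounted)."""
    retire = lam / (1 - GAMMA)

    @lru_cache(maxsize=None)
    def v(a, b, depth):
        if depth == HORIZON:
            return retire                      # past the horizon: just retire
        p = a / (a + b)                        # posterior probability of a success
        cont = p * (1 + GAMMA * v(a + 1, b, depth + 1)) \
             + (1 - p) * GAMMA * v(a, b + 1, depth + 1)
        return max(retire, cont)               # keep playing or retire now

    return v(a, b, 0), retire

def gittins_index(a, b, tol=1e-4):
    """Binary-search the per-step payoff lam at which playing the arm and
    retiring immediately are equally attractive: that crossover is the index."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        play, retire = value_of_playing(a, b, lam)
        if play > retire + 1e-9:
            lo = lam        # the arm is still worth more than retiring
        else:
            hi = lam
    return (lo + hi) / 2

print(gittins_index(1, 1))   # uniform prior: index exceeds the posterior mean 0.5
print(gittins_index(5, 2))   # more observations: index sits closer to the posterior mean
```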
What are some example metrics for the k-bandits problem?
- Identify optimal arm in the limit
- Maximize discounted expected reward (Gittins Index)
- Maximize expected reward over finite horizon
- Identify a near-optimal arm (within ε) with high probability (1-δ) in time T(k, ε, δ) that is polynomial in k, 1/ε, and 1/δ
- Nearly maximize reward (within ε) with high probability (1-δ) in time T(k, ε, δ) that is polynomial in k, 1/ε, and 1/δ
- Pull a non-near-optimal arm (more than ε worse than the best) no more than T(k, ε, δ) times, with high probability (1-δ)
What is the Hoeffding bound?
It gives a confidence interval that contains the true mean μ with probability at least 1-δ (for a chosen δ in (0,1)), based on the empirical mean μ' of n bounded (e.g., [0,1]-valued) observations.
The interval is: [μ' - z_δ/sqrt(n), μ' + z_δ/sqrt(n)]
z_δ = sqrt(0.5 ln(2/δ))
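A small sketch of exactly that formula, with a Monte Carlo check that the interval covers the true mean at least a 1-δ fraction of the time. The Bernoulli sampling distribution and the parameter values are arbitrary.

```python
import math
import random

def hoeffding_interval(samples, delta):
    """Confidence interval for the true mean of [0, 1]-bounded samples that
    holds with probability at least 1 - delta (Hoeffding bound)."""
    n = len(samples)
    m_hat = sum(samples) / n
    z = math.sqrt(0.5 * math.log(2 / delta))
    return m_hat - z / math.sqrt(n), m_hat + z / math.sqrt(n)

# Empirical coverage check: the true mean should land inside the interval in
# well over a 1 - delta fraction of repeated experiments.
rng = random.Random(0)
true_mean, delta, n, trials = 0.3, 0.05, 100, 2000
hits = 0
for _ in range(trials):
    samples = [1.0 if rng.random() < true_mean else 0.0 for _ in range(n)]
    lo, hi = hoeffding_interval(samples, delta)
    hits += lo <= true_mean <= hi
print("coverage:", hits / trials, ">= 1 - delta =", 1 - delta)
```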
What are some issues with random exploration as an optimization criterion/approach in MDPs? What would be better?
- We may never reach a desired state
- There may be trap states, and our criterion should take this into account
Mistake bounding is better: the number of ε-suboptimal action choices is bounded by poly(1/ε, 1/δ, n, k) with probability 1-δ. (The sketch below shows how badly purely random exploration can do.)
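A toy "combination lock" MDP, a standard illustration rather than something from these flashcards: from state i one specific action advances to i+1 and every other action resets to the start, so purely random exploration needs on the order of k^n steps to reach the goal. Sizes and trial counts are arbitrary.

```python
import random

def random_walk_steps(n_states, k_actions, rng):
    """Steps a purely random policy needs to reach the goal of a 'combination
    lock' MDP: one correct action per state advances, anything else resets."""
    correct = [rng.randrange(k_actions) for _ in range(n_states)]
    state, steps = 0, 0
    while state < n_states:
        action = rng.randrange(k_actions)          # purely random exploration
        state = state + 1 if action == correct[state] else 0
        steps += 1
    return steps

rng = random.Random(0)
for n in (4, 6, 8):
    trials = 50
    avg = sum(random_walk_steps(n, 4, rng) for _ in range(trials)) / trials
    print(f"n={n}: ~{avg:.0f} random steps to reach the goal")  # grows roughly like k^n
```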
Describe the deterministic Rmax algorithm
- Keep track of the MDP (the transitions and rewards observed so far)
- Any unknown state-action pair is assumed to give Rmax (and to lead to a state worth Rmax forever)
- Solve the MDP
- Take the action given by π* (a code sketch of this loop follows below)
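A minimal sketch of the deterministic version, assuming rewards bounded by RMAX: unknown (s, a) pairs point to a fictitious absorbing state worth RMAX forever, and value iteration on that optimistic model gives the π* the agent follows. The names (init_model, solve, rmax_action) and constants are my own, not from the lectures.

```python
RMAX = 1.0        # assumed upper bound on one-step reward
GAMMA = 0.95
OPT = "<unknown>" # fictitious absorbing state worth RMAX forever

def init_model(states, actions):
    """Every state-action pair starts unknown: optimistically it pays RMAX and
    leads to the fictitious optimistic state."""
    return {(s, a): (OPT, RMAX) for s in states for a in actions}

def solve(states, actions, model, iters=500):
    """Value iteration on the learned deterministic model."""
    V = {s: 0.0 for s in states}
    V[OPT] = RMAX / (1 - GAMMA)
    for _ in range(iters):
        for s in states:
            V[s] = max(model[s, a][1] + GAMMA * V[model[s, a][0]] for a in actions)
    return V

def rmax_action(s, states, actions, model):
    """The greedy action of the optimistic model (the pi* in the flashcard)."""
    V = solve(states, actions, model)
    return max(actions, key=lambda a: model[s, a][1] + GAMMA * V[model[s, a][0]])

# Agent loop (informal): act with rmax_action, and whenever we actually
# experience (s, a) -> (s_next, r), record it once and for all:
#     model[s, a] = (s_next, r)
# In a deterministic MDP a single observation makes the pair known.
```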
How do we guarantee Rmax will make no more mistakes for a deterministic MDP?
Visit every state-action pair; once all of them are known the learned model is the true MDP, so π* makes no further mistakes
When should we stop visiting unknown states in Rmax for a deterministic MDP?
Once the discounted reward achievable from the known states is higher than any discounted reward the unknown states could possibly provide (even assuming they are worth Rmax)
How many mistakes could we make while trying to get more information in Rmax for a deterministic MDP?
n, where n is the number of states: each trip to an unknown state-action pair takes at most n steps, and every one of those steps could be a mistake
What is the maximum number of mistakes we can make with Rmax for a deterministic MDP?
k·n²
n is the number of states, k is the number of actions
(there are k·n state-action pairs to learn, and reaching each one while it is still unknown can cost up to n mistakes)
What is the lower bound for an algorithm solving a deterministic MDP?
k·n², which matches Rmax's mistake bound
Describe the GENERAL Rmax algorithm
- Keep track of the MDP (visit counts, empirical transitions, and average rewards for each state-action pair)
- Any unknown state-action pair is assumed to give Rmax. Define unknown as a state-action pair that has been visited fewer than c times (the count bookkeeping is sketched below)
- Solve the MDP
- Take the action given by π*
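A sketch of the extra bookkeeping the general (stochastic) case needs, assuming a visit threshold C: keep counts, empirical transitions, and average rewards per (s, a), and treat the pair as known once it has C visits. The class and method names (RmaxModel, record, known, estimate) are hypothetical.

```python
from collections import defaultdict

C = 20            # visits before a pair counts as "known" (assumed threshold)

class RmaxModel:
    """Bookkeeping for general Rmax: empirical transition and reward estimates
    per (s, a), with the pair treated as unknown until it has C visits."""

    def __init__(self):
        self.counts = defaultdict(int)                             # (s, a) -> visits
        self.next_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                       # (s, a) -> total reward

    def record(self, s, a, r, s_next):
        self.counts[s, a] += 1
        self.next_counts[s, a][s_next] += 1
        self.reward_sum[s, a] += r

    def known(self, s, a):
        return self.counts[s, a] >= C

    def estimate(self, s, a):
        """Empirical T(s' | s, a) and R(s, a); only trusted once known."""
        n = self.counts[s, a]
        T = {s2: c / n for s2, c in self.next_counts[s, a].items()}
        return T, self.reward_sum[s, a] / n
```

Planning then works as in the deterministic sketch: known pairs use their empirical estimates, while unknown pairs are treated as leading to the Rmax-valued fictitious state.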
What is the explore or exploit lemma?
If every transition is either accurately estimated or unknown, and we optimize the resulting (optimistic) model, then the "optimal" policy we get is either near-optimal in the real MDP or reaches an unknown state quickly.
What is the simulation lemma?
If we have an α-good estimate of the MDP's transitions and rewards, then optimizing in the estimate yields a policy that is also (approximately) α-good in the real MDP.
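A toy numerical illustration of the lemma, not a proof: build a small random MDP, perturb its transitions and rewards by roughly ALPHA, optimize in the perturbed estimate, and measure how much value the resulting policy loses in the true MDP. All sizes and the perturbation scheme are arbitrary; the point is that the loss shrinks as ALPHA shrinks.

```python
import random

GAMMA, N_S, N_A, ALPHA = 0.9, 5, 3, 0.02    # small random MDP; ALPHA = estimate error
rng = random.Random(1)

def random_mdp():
    R = [[rng.random() for _ in range(N_A)] for _ in range(N_S)]
    T = [[None] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for a in range(N_A):
            w = [rng.random() for _ in range(N_S)]
            T[s][a] = [x / sum(w) for x in w]
    return T, R

def perturb(T, R):
    """An 'alpha-good' estimate: nudge rewards and transition probabilities."""
    R2 = [[r + rng.uniform(-ALPHA, ALPHA) for r in row] for row in R]
    T2 = [[None] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for a in range(N_A):
            row = [p * (1 + rng.uniform(-ALPHA, ALPHA)) for p in T[s][a]]
            z = sum(row)
            T2[s][a] = [p / z for p in row]
    return T2, R2

def plan(T, R, iters=500):
    """Value iteration; returns the greedy policy and the optimal values."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = [max(R[s][a] + GAMMA * sum(T[s][a][s2] * V[s2] for s2 in range(N_S))
                 for a in range(N_A)) for s in range(N_S)]
    pi = [max(range(N_A), key=lambda a: R[s][a] + GAMMA *
              sum(T[s][a][s2] * V[s2] for s2 in range(N_S))) for s in range(N_S)]
    return pi, V

def evaluate(pi, T, R, iters=500):
    """Value of a fixed policy pi in the MDP (T, R)."""
    V = [0.0] * N_S
    for _ in range(iters):
        V = [R[s][pi[s]] + GAMMA * sum(T[s][pi[s]][s2] * V[s2] for s2 in range(N_S))
             for s in range(N_S)]
    return V

T, R = random_mdp()
T_hat, R_hat = perturb(T, R)
pi_hat, _ = plan(T_hat, R_hat)          # optimize in the (alpha-good) estimate
_, V_star = plan(T, R)                  # true optimal values
V_pi_hat = evaluate(pi_hat, T, R)       # how that policy actually does
print("max value loss:", max(vs - vp for vs, vp in zip(V_star, V_pi_hat)))
```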