RL: Chapter 2: Multi-armed Bandits Flashcards
Original form of the k-armed bandit problem
You are faced repeatedly with a choice among k different options, or actions.
After each choice, you receive a numerical reward chosen from a stationary distribution that depends on the action you selected.
Your objective is to maximize the expected total reward over some time period, for example over 1000 action selections, or time steps.
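A minimal sketch of this setup in Python, assuming the common testbed variant where each action's reward is drawn from a stationary Gaussian distribution (the class and attribute names here are illustrative, not from the text):

```python
import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit testbed (illustrative sketch).

    Each action a has a true value q*(a) drawn once from N(0, 1);
    pulling arm a returns a reward drawn from N(q*(a), 1).
    """

    def __init__(self, k=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def step(self, action):
        # Reward from a stationary distribution that depends on the action.
        return self.rng.normal(self.q_star[action], 1.0)
```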
Greedy actions
An action whose estimated value is greatest at a given time step.
Exploiting vs Exploring
You are exploiting your current knowledge when you select one of the greedy actions.
You are exploring when you select a nongreedy action, as it enables you to improve your estimate of the nongreedy action’s value.
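As a rough illustration, given a vector Q of current action-value estimates (hypothetical numbers), exploiting picks a greedy action while exploring picks a nongreedy one:

```python
import numpy as np

Q = np.array([0.2, 1.5, -0.3, 0.9])  # hypothetical current estimates

greedy = int(np.argmax(Q))           # exploiting: select a greedy action
nongreedy = [a for a in range(len(Q)) if a != greedy]
explore = int(np.random.default_rng().choice(nongreedy))  # exploring
```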
ε-greedy methods
Methods that behave greedily most of the time, but every once in a while, with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
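A sketch of ε-greedy selection with incremental sample-average estimates, assuming the update Q(a) ← Q(a) + (R − Q(a)) / N(a); the function name and signature are illustrative:

```python
import numpy as np

def epsilon_greedy_run(bandit, k=10, epsilon=0.1, steps=1000, rng=None):
    """Run one ε-greedy agent for `steps` steps; return the total reward."""
    rng = rng or np.random.default_rng()
    Q = np.zeros(k)   # action-value estimates
    N = np.zeros(k)   # action selection counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(k))   # explore: uniform over all actions
        else:
            action = int(np.argmax(Q))      # exploit: a greedy action
        reward = bandit.step(action)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # sample-average update
        total += reward
    return total
```

Paired with the KArmedBandit sketch above, epsilon_greedy_run(KArmedBandit(), epsilon=0.1) would approximate the 1000-step experiment mentioned in the problem statement.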
Associative search task
A task that involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best.
A.k.a. contextual bandits
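One simple way to realize this (an illustrative assumption, not the book's specific algorithm) is to keep a separate estimate table per situation, so the ε-greedy logic above runs independently for each context:

```python
import numpy as np

class ContextualEpsilonGreedy:
    """ε-greedy with one estimate table per situation (illustrative sketch)."""

    def __init__(self, n_contexts, k, epsilon=0.1, rng=None):
        self.rng = rng or np.random.default_rng()
        self.epsilon = epsilon
        self.Q = np.zeros((n_contexts, k))  # per-context action-value estimates
        self.N = np.zeros((n_contexts, k))  # per-context selection counts

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))  # explore
        return int(np.argmax(self.Q[context]))              # exploit

    def update(self, context, action, reward):
        self.N[context, action] += 1
        self.Q[context, action] += (
            reward - self.Q[context, action]
        ) / self.N[context, action]
```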
Full Reinforcement Learning Problem
Tasks in which the action is allowed to affect the next situation as well as the reward.