RL: Chapter 2: Multi-armed Bandits Flashcards
Original form of the k-armed bandit problem
You are faced repeatedly with a choice among k different options, or actions.
After each choice, you receive a numerical reward chosen from a stationary distribution that depends on the action you selected.
Your objective is to maximize the expected total reward over some time period, for example over 1000 action selections, or time steps.
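A minimal sketch of this setup in Python, assuming the common testbed variant where each action's reward is drawn from a stationary Gaussian distribution (the class and attribute names here are illustrative, not from the text):

```python
import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit testbed (illustrative sketch).

    Each action a has a true value q*(a) drawn once from N(0, 1);
    pulling arm a returns a reward drawn from N(q*(a), 1).
    """

    def __init__(self, k=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def step(self, action):
        # Reward from a stationary distribution that depends on the action.
        return self.rng.normal(self.q_star[action], 1.0)
```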
Greedy actions
An action whose estimated value is greatest at a given time step.
Exploiting vs Exploring
You are exploiting your current knowledge when you select one of the greedy actions.
You are exploring when you select a nongreedy action, as it enables you to improve your estimate of the nongreedy action’s value.
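As a rough illustration, given a vector Q of current action-value estimates (hypothetical numbers), exploiting picks a greedy action while exploring picks a nongreedy one:

```python
import numpy as np

Q = np.array([0.2, 1.5, -0.3, 0.9])  # hypothetical current estimates

greedy = int(np.argmax(Q))           # exploiting: select a greedy action
nongreedy = [a for a in range(len(Q)) if a != greedy]
explore = int(np.random.default_rng().choice(nongreedy))  # exploring
```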
ε-greedy methods
Methods that behave greedily most of the time, but every once in a while, with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
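A sketch of ε-greedy selection with incremental sample-average estimates, assuming the update Q(a) ← Q(a) + (R − Q(a)) / N(a); the function name and signature are illustrative:

```python
import numpy as np

def epsilon_greedy_run(bandit, k=10, epsilon=0.1, steps=1000, rng=None):
    """Run one ε-greedy agent for `steps` steps; return the total reward."""
    rng = rng or np.random.default_rng()
    Q = np.zeros(k)   # action-value estimates
    N = np.zeros(k)   # action selection counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(k))   # explore: uniform over all actions
        else:
            action = int(np.argmax(Q))      # exploit: a greedy action
        reward = bandit.step(action)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]  # sample-average update
        total += reward
    return total
```

Paired with the KArmedBandit sketch above, epsilon_greedy_run(KArmedBandit(), epsilon=0.1) would approximate the 1000-step experiment mentioned in the problem statement.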
Associative search task
A task that involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best.
A.k.a. contextual bandits
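One simple way to realize this (an illustrative assumption, not the book's specific algorithm) is to keep a separate estimate table per situation, so the ε-greedy logic above runs independently for each context:

```python
import numpy as np

class ContextualEpsilonGreedy:
    """ε-greedy with one estimate table per situation (illustrative sketch)."""

    def __init__(self, n_contexts, k, epsilon=0.1, rng=None):
        self.rng = rng or np.random.default_rng()
        self.epsilon = epsilon
        self.Q = np.zeros((n_contexts, k))  # per-context action-value estimates
        self.N = np.zeros((n_contexts, k))  # per-context selection counts

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))  # explore
        return int(np.argmax(self.Q[context]))              # exploit

    def update(self, context, action, reward):
        self.N[context, action] += 1
        self.Q[context, action] += (
            reward - self.Q[context, action]
        ) / self.N[context, action]
```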
Full Reinforcement Learning Problem
Tasks in which the action is allowed to affect the next situation as well as the reward.