Options Flashcards
What makes RL hard?
- Delayed reward
- Bootstrapping (we build estimates from other estimates)
- The need for exploration
What is temporal abstraction?
Taking smaller actions and aggregating/abstracting them into larger actions.
e.g. instead of taking individual steps across a room, take the single abstract action of crossing the room and exiting through the door
What does temporal abstraction help with?
Temporal abstraction helps with the problem of delayed rewards by reducing the number of decisions between taking an action and receiving the reward
What is a temporal abstraction option?
The triple (I, Pi, Beta), where
I is the initiation set: the states from which the option can be invoked
Pi is the option's policy, mapping states to actions while the option runs
Beta is the termination condition: the probability that the option terminates in each state (a fixed termination set of states in the simplest case)
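A minimal sketch of the (I, Pi, Beta) tuple in Python. The names (Option, initiation, policy, termination) and the integer states/actions are illustrative assumptions, not from the lecture:

```python
from typing import Callable, NamedTuple, Set

State = int   # illustrative; states could be anything hashable
Action = int

class Option(NamedTuple):
    initiation: Set[State]                 # I: states where the option may be invoked
    policy: Callable[[State], Action]      # Pi: action to take while the option runs
    termination: Callable[[State], float]  # Beta: probability of terminating in a state

# Example: an option that walks right until it reaches the doorway at state 10.
walk_to_door = Option(
    initiation=set(range(10)),                      # can start anywhere in the room
    policy=lambda s: 1,                             # always move right
    termination=lambda s: 1.0 if s == 10 else 0.0,  # stop once at the door
)
```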
What is an SMDP?
A semi-Markov decision process. Instead of using a single fixed step size, an SMDP may make larger, variable-duration jumps using options. Once the options are properly defined, the SMDP can be treated as an MDP.
True/False - Temporal abstraction guarantees state space abstraction
False - It is possible that we can use temporal abstraction to abstract the state space, but it is not guaranteed
What are some benefits of temporal abstraction?
- Temporally abstracted MDPs inherit optimality (including convergence and stability)
- Allows us to ignore “boring” parts of the state space
- May allow for state abstraction (which makes the MDP significantly easier)
What is modular reinforcement learning?
A subfield of RL that focuses on arbitration processes using goal abstraction, i.e. how to decide between parallel, competing goals (in a predator-prey world these may be “eat” vs. “don't get eaten”)
What is greatest mass Q-learning?
Track multiple goals (each has its own Q-function). For each action, sum the Q-values across all goals and take the action with the largest total.
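A minimal sketch of greatest mass action selection, assuming each goal's Q-function is stored as a NumPy array of shape (n_states, n_actions); the function name and layout are hypothetical:

```python
import numpy as np

def greatest_mass_action(q_tables, state):
    """Sum Q-values over all goals' Q-tables, then take the action with the largest total."""
    totals = sum(q[state] for q in q_tables)  # elementwise sum over the action axis
    return int(np.argmax(totals))
```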
What is top Q-learning?
Track multiple goals (each has its own Q-function). Take the action with the single highest Q-value across all goals, i.e. the most confident goal's preferred action.
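A matching sketch for top Q-learning, under the same assumed (n_states, n_actions) Q-table layout:

```python
import numpy as np

def top_q_action(q_tables, state):
    """Take the action with the single highest Q-value across all goals."""
    stacked = np.stack([q[state] for q in q_tables])  # shape: (n_goals, n_actions)
    return int(np.argmax(stacked.max(axis=0)))        # action of the most confident goal
```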
What is negotiated W-learning?
Track multiple goals/agents (each has its own Q-function). The agent with the most to lose gets to choose the action, i.e. look at the difference between the best and worst options for each agent and let the agent with the largest gap choose.
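A sketch of that arbitration step, following the card's best-minus-worst simplification (full W-learning instead compares each agent's best action against the action the current winner would impose):

```python
import numpy as np

def negotiated_w_action(q_tables, state):
    """Let the agent with the most to lose in this state choose the action."""
    stakes = [q[state].max() - q[state].min() for q in q_tables]  # each agent's potential loss
    leader = int(np.argmax(stakes))                               # agent with the most at stake
    return int(np.argmax(q_tables[leader][state]))                # that agent picks its best action
```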
What is Arrow’s impossibility theorem?
It says, roughly, that no voting system over three or more options can satisfy a basic set of fairness criteria simultaneously. For modular RL, this means fair arbitration may be impossible: compatibility between goals is not guaranteed.
What is Monte Carlo Tree Search?
An algorithm for solving MDPs iteratively. Can be viewed as a policy search algorithm.
Select the most promising node -> Expand by trying an untried action -> Simulate using a rollout policy -> Backup the result -> Select -> …
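A minimal UCT-style sketch of that loop in Python. The toy chain world (step), the constants, and the UCB1 exploration term are all illustrative assumptions, not from the lecture:

```python
import math
import random

ACTIONS = [0, 1]               # 0 = step left, 1 = step right (toy domain)
GOAL, HORIZON, C = 5, 10, 1.4  # goal state, rollout depth, UCB1 exploration constant

def step(state, action):
    """Toy generative model: walk a chain of states 0..GOAL; reward 1 on reaching GOAL."""
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL and state != GOAL else 0.0), nxt == GOAL

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}     # action -> child Node
        self.visits = 0
        self.value = 0.0       # running mean of sampled returns

def rollout(state, depth):
    """Simulate to the horizon with a uniform-random rollout policy."""
    total = 0.0
    for _ in range(depth):
        state, reward, done = step(state, random.choice(ACTIONS))
        total += reward
        if done:
            break
    return total

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node, path = root, [root]
        # Select: descend through fully expanded nodes using UCB1.
        while len(node.children) == len(ACTIONS):
            node = max(node.children.values(),
                       key=lambda c: c.value + C * math.sqrt(math.log(node.visits + 1) / c.visits))
            path.append(node)
        # Expand: add a child for one untried action.
        action = random.choice([a for a in ACTIONS if a not in node.children])
        nxt, reward, done = step(node.state, action)
        child = Node(nxt)
        node.children[action] = child
        path.append(child)
        # Simulate: estimate the new child's return with a rollout.
        ret = reward + (0.0 if done else rollout(nxt, HORIZON))
        # Backup: propagate the sampled return up the visited path.
        for n in path:
            n.visits += 1
            n.value += (ret - n.value) / n.visits
    return max(root.children, key=lambda a: root.children[a].visits)  # most-visited root action

print(mcts(0))  # should print 1: keep stepping right toward the goal
```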
What is a way to improve rollout policies in MCTS?
Apply constraints that we expect will help the rollouts explore better without requiring additional domain knowledge (e.g. avoid getting eaten)
What are pros and cons (properties) of MCTS?
Pros - Useful for large state spaces; planning time is independent of the number of states
Cons - Requires many samples to get a good estimate; running time is exponential in the horizon, O((|A| * steps)^Horizon)