Options Flashcards
What makes RL hard?
- Delayed reward
- Bootstrapping (we build estimates from other estimates)
- The need for exploration
What is temporal abstraction?
Taking smaller actions and aggregating/abstracting them into larger actions.
e.g. instead of taking individual steps across a room, take the single abstract action of crossing the room and exiting through the door
What does temporal abstraction help with?
Temporal abstraction helps with the problem of delayed rewards by reducing the number of decisions between taking an action and receiving the reward
What is a temporal abstraction option?
The triple (I, Pi, Beta), where
I is the initiation set: the states from which the option can be invoked
Pi is the option's policy, mapping states to actions while the option runs
Beta is the termination condition: the probability that the option terminates in each state (a fixed termination set of states in the simplest case)
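A minimal sketch of the (I, Pi, Beta) tuple in Python. The names (Option, initiation, policy, termination) and the integer states/actions are illustrative assumptions, not from the lecture:

```python
from typing import Callable, NamedTuple, Set

State = int   # illustrative; states could be anything hashable
Action = int

class Option(NamedTuple):
    initiation: Set[State]                 # I: states where the option may be invoked
    policy: Callable[[State], Action]      # Pi: action to take while the option runs
    termination: Callable[[State], float]  # Beta: probability of terminating in a state

# Example: an option that walks right until it reaches the doorway at state 10.
walk_to_door = Option(
    initiation=set(range(10)),                      # can start anywhere in the room
    policy=lambda s: 1,                             # always move right
    termination=lambda s: 1.0 if s == 10 else 0.0,  # stop once at the door
)
```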
What is an SMDP?
A semi-Markov decision process. Instead of using a single fixed step size, an SMDP may make larger, variable-duration jumps using options. Once the options are properly defined, the SMDP can be treated as an MDP.
True/False - Temporal abstraction guarantees state space abstraction
False - It is possible that we can use temporal abstraction to abstract the state space, but it is not guaranteed
What are some benefits of temporal abstraction?
- Temporally abstracted MDPs inherit optimality (including convergence and stability)
- Allows us to ignore “boring” parts of the state space
- May allow for state abstraction (which makes the MDP significantly easier)
What is modular reinforcement learning?
A subfield of RL that focuses on arbitration processes using goal abstraction, i.e. how to decide between parallel, competing goals (in a predator-prey world these may be “eat” vs. “don't get eaten”)
What is greatest mass Q-learning?
Track multiple goals (each has its own Q-function). For each action, sum the Q-values across all goals and take the action with the largest total.
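A minimal sketch of greatest mass action selection, assuming each goal's Q-function is stored as a NumPy array of shape (n_states, n_actions); the function name and layout are hypothetical:

```python
import numpy as np

def greatest_mass_action(q_tables, state):
    """Sum Q-values over all goals' Q-tables, then take the action with the largest total."""
    totals = sum(q[state] for q in q_tables)  # elementwise sum over the action axis
    return int(np.argmax(totals))
```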
What is top Q-learning?
Track multiple goals (each has its own Q-function). Take the action with the single highest Q-value across all goals, i.e. the most confident goal's preferred action.
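A matching sketch for top Q-learning, under the same assumed (n_states, n_actions) Q-table layout:

```python
import numpy as np

def top_q_action(q_tables, state):
    """Take the action with the single highest Q-value across all goals."""
    stacked = np.stack([q[state] for q in q_tables])  # shape: (n_goals, n_actions)
    return int(np.argmax(stacked.max(axis=0)))        # action of the most confident goal
```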
What is negotiated W-learning?
Track multiple goals/agents (each has its own Q-function). The agent with the most to lose gets to choose the action, i.e. look at the difference between the best and worst options for each agent and let the agent with the largest gap choose.
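A sketch of that arbitration step, following the card's best-minus-worst simplification (full W-learning instead compares each agent's best action against the action the current winner would impose):

```python
import numpy as np

def negotiated_w_action(q_tables, state):
    """Let the agent with the most to lose in this state choose the action."""
    stakes = [q[state].max() - q[state].min() for q in q_tables]  # each agent's potential loss
    leader = int(np.argmax(stakes))                               # agent with the most at stake
    return int(np.argmax(q_tables[leader][state]))                # that agent picks its best action
```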
What is Arrow’s impossibility theorem?
It says, roughly, that no voting system over three or more options can satisfy a basic set of fairness criteria simultaneously. For modular RL, this means fair arbitration may be impossible: compatibility between goals is not guaranteed.
What is Monte Carlo Tree Search?
An algorithm for solving MDPs iteratively. Can be viewed as a policy search algorithm.
Select the most promising node -> Expand by trying an untried action -> Simulate using a rollout policy -> Backup the result -> Select -> …
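A minimal UCT-style sketch of that loop in Python. The toy chain world (step), the constants, and the UCB1 exploration term are all illustrative assumptions, not from the lecture:

```python
import math
import random

ACTIONS = [0, 1]               # 0 = step left, 1 = step right (toy domain)
GOAL, HORIZON, C = 5, 10, 1.4  # goal state, rollout depth, UCB1 exploration constant

def step(state, action):
    """Toy generative model: walk a chain of states 0..GOAL; reward 1 on reaching GOAL."""
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL and state != GOAL else 0.0), nxt == GOAL

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}     # action -> child Node
        self.visits = 0
        self.value = 0.0       # running mean of sampled returns

def rollout(state, depth):
    """Simulate to the horizon with a uniform-random rollout policy."""
    total = 0.0
    for _ in range(depth):
        state, reward, done = step(state, random.choice(ACTIONS))
        total += reward
        if done:
            break
    return total

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node, path = root, [root]
        # Select: descend through fully expanded nodes using UCB1.
        while len(node.children) == len(ACTIONS):
            node = max(node.children.values(),
                       key=lambda c: c.value + C * math.sqrt(math.log(node.visits + 1) / c.visits))
            path.append(node)
        # Expand: add a child for one untried action.
        action = random.choice([a for a in ACTIONS if a not in node.children])
        nxt, reward, done = step(node.state, action)
        child = Node(nxt)
        node.children[action] = child
        path.append(child)
        # Simulate: estimate the new child's return with a rollout.
        ret = reward + (0.0 if done else rollout(nxt, HORIZON))
        # Backup: propagate the sampled return up the visited path.
        for n in path:
            n.visits += 1
            n.value += (ret - n.value) / n.visits
    return max(root.children, key=lambda a: root.children[a].visits)  # most-visited root action

print(mcts(0))  # should print 1: keep stepping right toward the goal
```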
What is a way to improve rollout policies in MCTS?
Apply constraints that we expect will help the rollouts explore better without requiring additional domain knowledge (e.g. avoid getting eaten)
What are pros and cons (properties) of MCTS?
Pros - Useful for large state spaces; planning time is independent of the number of states
Cons - Requires many samples to get a good estimate; running time is exponential in the horizon, O((|A| * steps)^Horizon)