Final Flashcards
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0
In RL, recent moves influence outcomes more than moves further in the past.
False. You can lose a game at the very beginning (like the chess example Prof. Isbell mentioned in one of the earliest videos), and no matter how perfectly you play afterwards, you might still lose.
An MDP given a fixed policy is a Markov chain with rewards.
True. A fixed policy means the agent has no choice of action in each state; it transitions from state to state according to that policy, which is exactly a Markov chain (with rewards).
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you don’t need the transition function.
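For reference, the relation that makes the transition model unnecessary:

```latex
V^{*}(s) = \max_{a} Q^{*}(s, a)
```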
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True, assuming an infinite horizon the optimal policy will be unchanged
Markov means RL agents are amnesiacs and forget everything up until the current state.
True: the current state is all the agent needs to know.
Now if you want to discuss what a “current state” is… Well that can get more complicated.
A policy that is greedy–with respect to the optimal value function–is not necessarily an optimal policy.
False. Taking the greedy action with respect to the optimal value function is, by definition, optimal.
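One common way to write the greedy extraction this card refers to (T is the transition model):

```latex
\pi^{*}(s) = \operatorname*{argmax}_{a} \sum_{s'} T(s, a, s')\,\big[ R(s, a, s') + \gamma V^{*}(s') \big]
```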
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False. The learning rates must sum to infinity; it is the sum of their squares that must be finite.
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. Monte Carlo is unbiased but typically has higher variance than TD, so there are other trade-offs to consider as well.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. This is somewhat subjective, but computation time, data efficiency, and the experience required of the data scientist are all additional things to consider.
T/F POMDPs allow us to strike a balance between actions to gain reward and actions to gain information
TRUE. This is all folded into one model (no special mechanism is needed to trade off gaining reward against gaining information)
T/F DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions to maximize utility
TRUE
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPS
Actions are taken simultaneously by a finite set of agents (not just 1) in a DEC-POMDP
DEC-POMDPs vs POSG
In a DEC-POMDP, all agents share one reward function (they are working together); in a POSG, each agent has its own reward function.
T/F DEC-POMDPs are represented by Ri (diff reward for each agent)
FALSE. There is one shared reward (all agents are working together). If the reward weren’t shared, the model would be a POSG.
Properties of DEC-POMDP
- Elements of game theory and POMDPs
- NEXP-complete (for finite horizon)
Inverse RL
Input: behavior and the environment. Output: the reward function.
MLIRL (Maximum Likelihood Inverse Reinforcement Learning)
Guess R, compute the policy, measure the probability of the data given that policy, then compute the gradient with respect to R to find how to change the reward so it fits the data better.
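A minimal sketch of that loop, assuming a small tabular MDP with a known transition tensor P[s, a, s'], a Boltzmann (softmax) policy, and a finite-difference gradient standing in for the analytic one; all names here (P, demos, soft_value_iteration) are illustrative, not from the lectures:

```python
# Sketch of the guess-R / fit-policy / climb-likelihood loop (illustrative).
import numpy as np

def soft_value_iteration(R, P, gamma=0.95, beta=5.0, iters=100):
    """Boltzmann (softmax) policy for a candidate reward vector R[s]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R[:, None] + gamma * (P @ V)                    # Q[s, a]
        V = np.log(np.exp(beta * Q).sum(axis=1)) / beta     # soft max over actions
    Q = R[:, None] + gamma * (P @ V)
    pi = np.exp(beta * Q)
    return pi / pi.sum(axis=1, keepdims=True)               # pi[s, a]

def log_likelihood(R, P, demos):
    """Log-probability of demonstrated (state, action) pairs under pi_R."""
    pi = soft_value_iteration(R, P)
    return sum(np.log(pi[s, a]) for s, a in demos)

def mlirl(P, demos, steps=200, lr=0.1, eps=1e-4):
    """Guess R, then repeatedly nudge it so the demos become more likely."""
    R = np.zeros(P.shape[0])                                # initial guess for the reward
    for _ in range(steps):
        base = log_likelihood(R, P, demos)
        grad = np.zeros_like(R)
        for s in range(len(R)):                             # finite-difference gradient in R
            Rp = R.copy(); Rp[s] += eps
            grad[s] = (log_likelihood(Rp, P, demos) - base) / eps
        R += lr * grad                                      # change R to fit the data better
    return R
```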
In policy shaping, if the human believes actions X, Y, Z with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, what action should they choose?
Choose action Z, because its pairwise product is highest: argmax_a p(a|policy1) * p(a|policy2)
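Worked out with the numbers on this card (hypothetical dictionaries, just to show the pairwise-product rule):

```python
# Combine the two belief distributions by pairwise product, then take the argmax.
human = {"X": 2/3, "Y": 1/6, "Z": 1/6}
agent = {"X": 0.1, "Y": 0.1, "Z": 0.8}
combined = {a: human[a] * agent[a] for a in human}
# combined ≈ {'X': 0.067, 'Y': 0.017, 'Z': 0.133} -> Z wins
best_action = max(combined, key=combined.get)
```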
Drama Management
There is a 3rd agent, the “author,” who wants to build an agent that intervenes in the player’s experience. In the Pac-Man analogy: the author created Pac-Man, the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
Rather than states, TTD-MDPs have trajectories (the sequence so far), and rather than rewards, they have a target distribution p(T) over trajectories.
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence
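A minimal sketch of those three steps, assuming a tabular MDP with a transition tensor P[s, a, s'] and state rewards R[s] (names here are illustrative, not from the lectures):

```python
# Minimal value iteration sketch for a tabular MDP (illustrative names).
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s'] = transition probabilities, R[s] = state rewards."""
    n_states, n_actions, _ = P.shape
    U = np.zeros(n_states)                     # start with arbitrary utilities
    while True:
        # update utilities from reward + discounted expected neighbor utility
        Q = R[:, None] + gamma * (P @ U)       # Q[s, a]
        U_new = Q.max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:    # repeat until convergence
            return U_new
        U = U_new
```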
Supervised Learning
y = f(x). Function approximation: find the f that maps x to y.
Unsupervised Learning
f(x). Find a concise description (e.g., clusters) of the data.
Reinforcement Learning
y = f(x), but we are given x and z. We are still trying to find f to generate y; the reward r plays the role of z.
MDP stands for
Markov Decision Processes.
MDPs are made up of
States, actions, transitions (the model), and rewards; from these we derive a policy.
Markovian Property
Only the present matters AND things are stationary (rules/the world doesn’t change over time)
Delayed Rewards
In the chess example, you might make a bad move early on that you can never recover from; that bad move needs to be reflected in the reward, even though the reward arrives much later.
Temporal Credit Assignment Problem
The problem of determining which actions in a sequence led to a certain outcome.
How to change policies to account for finite horizons
π(s, t): the policy becomes a function of state AND time.
Utility of Sequences (stationary preferences)
if you prefer one sequence of states over another today, you prefer the same sequence tomorrow
How to calculate infinite horizons without infinity?
Use discounted future rewards (use gamma)
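The standard bound showing why the discounted sum stays finite (R_max is the largest single-step reward):

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1 - \gamma}, \qquad 0 \le \gamma < 1
```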
Reward vs Utility
Reward is the immediate payoff of a state; utility is the long-term payoff, which takes delayed rewards into account.
Policy Iteration
Start with an initial policy (a guess)
Evaluate: given the policy, calculate its utility
Improve: the policy at t+1 picks, in each state, the action that maximizes utility
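A minimal sketch of the evaluate/improve loop, under the same tabular-MDP assumptions as the value iteration sketch above (P[s, a, s'] and R[s] are illustrative):

```python
# Minimal policy iteration sketch for a tabular MDP (illustrative names).
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)         # start with an initial (guessed) policy
    while True:
        # evaluate: solve U = R + gamma * P_pi @ U for the current policy
        P_pi = P[np.arange(n_states), pi]      # P_pi[s, s']
        U = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # improve: pick, in each state, the action that maximizes utility
        Q = R[:, None] + gamma * (P @ U)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):         # converged: the policy stopped changing
            return pi, U
        pi = pi_new
```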
T/F Policy Iteration won’t converge
False. It is guaranteed to converge: there are finitely many policies and each iteration improves (or keeps) the current one.
On Policy vs Off-Policy
Off-policy methods estimate the Q values (state-action values) directly, regardless of the policy the agent is actually following (e.g., Q-learning).
On-policy methods update their Q values using the Q value of the next state s′ and the action a′ chosen by the current policy; they estimate the return for state-action pairs assuming the current policy continues to be followed (e.g., SARSA).
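The two update rules side by side, in standard notation (α is the learning rate):

```latex
\text{Q-learning (off-policy):}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]
\text{SARSA (on-policy):}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma\, Q(s',a') - Q(s,a) \,\big], \quad a' \text{ chosen by the current policy}
```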
T/F TD update rule always converges with any learning rate
False. It converges only if the learning rates sum to infinity while the sum of their squares is finite (conditions written out below).
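The two step-size conditions, written out (α_t is the learning rate at step t):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```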
T/F TD(1) is the same as outcome-based updates (if no repeated states)
True, and with even more learning, because updates don’t have to wait for the episode to end.
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare event happens in a run, the TD(1) estimate can be thrown off (high variance). (This leads us to TD(lambda).)
T/F TD(0) is the same as maximum likelihood estimate
TRUE, if we run over the data repeatedly.
T/F TD(lambda) is weighted combination of k step estimators
True
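The weighting, written out (E_k is the k-step estimator of the return):

```latex
V^{\lambda}(s_t) = (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{\,k-1} E_k(s_t)
```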
T/F TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
T/F TD(0) has the least amount of error usually
False. TD(lambda) with an intermediate lambda usually performs best; lambda values around 0.3–0.7 typically work well.
Temporal Difference is
The difference between value estimates (the reward plus the discounted next-state value vs. the current state’s value) as we go from one step to the next.
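In symbols, the temporal difference at step t:

```latex
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)
```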
T/F reward must be scalar
TRUE
T/F environment is visible to the agent
False, usually