Final Flashcards
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0.
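A minimal sketch of that conversion, using an assumed dict-of-transitions encoding (the states, actions, and rewards here are made up for illustration):

```python
# Hypothetical 3-state MDP stored as P[s][a] = [(prob, next_state, reward), ...].
# State 2 is terminal in the finite-horizon version.
P = {
    0: {"a": [(1.0, 1, 1.0)]},
    1: {"a": [(1.0, 2, 5.0)]},
    2: {},  # terminal: no outgoing transitions
}

# Infinite-horizon version: make the terminal state absorbing
# with a self-transition and reward 0.
P[2] = {"a": [(1.0, 2, 0.0)]}
```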
In RL, recent moves influence outcomes more than moves further in the past.
False. You can lose a game at the very beginning (like the chess game Prof. Isbell mentioned in one of the earliest videos), and no matter how perfectly you play afterwards, you might still lose it.
An MDP given a fixed policy is a Markov chain with rewards.
True. A fixed policy means the agent doesn't choose an action in each state; it transitions from state to state according to that fixed policy, which is exactly a Markov chain (with rewards).
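A small sketch of why, using an assumed dict encoding P[s][a][s'] and a made-up deterministic policy pi: fixing the policy removes the action choice and leaves plain state-to-state transition probabilities.

```python
# P[s][a][s'] = transition probability; pi[s] = action chosen by the fixed policy.
P = {
    0: {"left": {0: 1.0}, "right": {0: 0.2, 1: 0.8}},
    1: {"left": {0: 0.9, 1: 0.1}, "right": {1: 1.0}},
}
pi = {0: "right", 1: "right"}

# With the policy fixed, the action disappears and we are left with a
# Markov chain transition function P_pi[s][s'].
P_pi = {s: P[s][pi[s]] for s in P}
print(P_pi)  # {0: {0: 0.2, 1: 0.8}, 1: {1: 1.0}}
```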
If we know the optimal Q-values, we can get the optimal V-values only if we know the environment's transition function/matrix.
False. You don't need the transition function: V*(s) = max_a Q*(s, a).
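A tiny illustration with a made-up Q-table: recovering V* is just a max over actions, no model required.

```python
# Hypothetical optimal Q-table: Q[s][a] = optimal action value.
Q = {0: {"left": 1.0, "right": 2.5}, 1: {"left": 0.0, "right": -1.0}}

# V*(s) = max_a Q*(s, a) -- a max over actions, no transition model involved.
V = {s: max(Q[s].values()) for s in Q}
print(V)  # {0: 2.5, 1: 0.0}
```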
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True. Assuming an infinite (discounted) horizon, adding the same constant to every reward shifts every policy's value by the same amount, so the optimal policy is unchanged.
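A quick check of why, with c the added constant (10 here) and discount factor γ < 1:

```latex
V'_\pi(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,(r_t + c)\right]
        \;=\; V_\pi(s) \;+\; \frac{c}{1-\gamma}
```

Every policy gains the same constant c / (1 - γ), so the ordering of policies, and hence the argmax, does not change.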
Markov means RL agents are amnesiacs and forget everything up until the current state.
True: under the Markov assumption, the current state is all the agent needs to know.
Now, if you want to discuss what a "current state" is... well, that can get more complicated.
A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.
False. Acting greedily with respect to the optimal value function is, by definition, an optimal policy.
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False. The conditions are the other way around: the sum of the learning rates must diverge (sum of α_t = ∞), while the sum of their squares must converge (sum of α_t² < ∞).
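A minimal TD(0) sketch on a made-up one-step problem, using learning rates α_t = 1/t, which satisfy those conditions (the harmonic series diverges; the sum of 1/t² converges):

```python
import random

random.seed(0)
gamma = 1.0
V = {"s": 0.0, "terminal": 0.0}
visits = 0

# Episodes: from "s" we get reward +1 with prob 0.5, else 0, then terminate.
# The true value of "s" is therefore 0.5.
for _ in range(10_000):
    visits += 1
    alpha = 1.0 / visits                 # sum(alpha) diverges, sum(alpha^2) converges
    r = 1.0 if random.random() < 0.5 else 0.0
    td_target = r + gamma * V["terminal"]
    V["s"] += alpha * (td_target - V["s"])

print(V["s"])  # approaches 0.5
```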
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. Monte Carlo is indeed unbiased, but it typically has much higher variance than TD, so bias is not the only thing to consider; TD methods are often preferred even for episodic tasks.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. This is somewhat subjective, but machine time, data efficiency, and the expertise required of the data scientist are all additional things to consider.
T/F: POMDPs allow us to strike a balance between actions that gain reward and actions that gain information.
TRUE. This is all folded into one model (nothing special is needed to do it).
T/F: DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions to maximize utility
TRUE
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPs
In a DEC-POMDP, actions are taken simultaneously by a finite set of agents (not just one).
DEC-POMDPs vs. POSGs
In a DEC-POMDP, all agents share a single reward function (they are working together); in a POSG (Partially Observable Stochastic Game), each agent has its own reward function.
T/F: DEC-POMDPs are represented by R_i (a different reward for each agent)
FALSE. There is a single shared reward (all agents are working together). If the reward weren't shared, the model would be a POSG.
Properties of DEC-POMDP
1. Elements of game theory and POMDPs
2. NEXP-complete (for finite horizon)
Inverse RL
Input: behavior and environment. Output: the reward function.
MLIRL (Maximum Likelihood Inverse RL)
Guess R, compute a policy for it, measure the probability of the data given that policy, then compute the gradient with respect to R to find how to change the reward so it fits the data better; repeat.
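A toy sketch of that loop, under assumed details (a random 4-state, 2-action MDP, a Boltzmann policy with temperature beta, and a finite-difference gradient standing in for the analytic one); none of this is the lecture's reference implementation.

```python
import numpy as np

# P[a, s, s'] = transition probability; demos is a list of (state, action)
# pairs observed from the expert. All sizes and constants are illustrative.
np.random.seed(0)
n_states, n_actions, gamma, beta = 4, 2, 0.9, 2.0
P = np.random.dirichlet(np.ones(n_states), size=(n_actions, n_states))
demos = [(0, 1), (1, 1), (2, 0), (3, 1)]

def boltzmann_policy(R, iters=100):
    """Soft value iteration under reward R; returns pi(a|s) proportional to exp(beta * Q(s,a))."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
        m = Q.max(axis=1)
        V = m + np.log(np.exp(beta * (Q - m[:, None])).sum(axis=1)) / beta  # soft max over actions
    Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
    pi = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return pi / pi.sum(axis=1, keepdims=True)

def log_likelihood(R):
    """Log-probability of the demonstrated (state, action) pairs under the policy for R."""
    pi = boltzmann_policy(R)
    return sum(np.log(pi[s, a]) for s, a in demos)

# Gradient ascent on R: nudge the reward so the demonstrations become more likely.
R = np.zeros(n_states)
for _ in range(50):
    grad = np.zeros(n_states)
    for i in range(n_states):
        e = np.zeros(n_states)
        e[i] = 1e-4
        grad[i] = (log_likelihood(R + e) - log_likelihood(R - e)) / 2e-4
    R += 0.1 * grad

print(R, log_likelihood(R))
```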
In reward shaping, if the human believes in actions X, Y, Z with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, which action should they choose?
Choose action Z, because its pairwise product is the highest: argmax_a p(a|policy1) * p(a|policy2).
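Checking the arithmetic in code, using the numbers from the card:

```python
# The human's and the algorithm's action probabilities from the card.
human = {"X": 2/3, "Y": 1/6, "Z": 1/6}
algo = {"X": 1/10, "Y": 1/10, "Z": 8/10}

# Combine by pairwise product, then take the argmax.
combined = {a: human[a] * algo[a] for a in human}
print(combined)                         # X ~ 0.067, Y ~ 0.017, Z ~ 0.133
print(max(combined, key=combined.get))  # Z
```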
Drama Management
There is a third agent, the "Author," who wants to build an agent that intervenes in the player's experience. For example: the Author created Pac-Man, the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
Rather than states, TTD-MDPs have trajectories (the sequence of states so far), and rather than rewards, they have a target distribution over trajectories, P(T).
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence
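A minimal sketch of those three steps on a made-up 3-state, 2-action MDP (numpy; the transition tensor P[a, s, s'] and state rewards R are illustrative):

```python
import numpy as np

# P[a, s, s'] = transition probability, R[s] = reward for being in state s.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([
    [[1.0, 0.0, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([0.0, 0.0, 1.0])

U = np.zeros(n_states)                       # start with arbitrary utilities
while True:
    Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, U)
    U_new = Q.max(axis=1)                    # update from reward + discounted neighbors
    if np.max(np.abs(U_new - U)) < 1e-8:     # repeat until convergence
        break
    U = U_new

policy = Q.argmax(axis=1)                    # greedy policy w.r.t. the converged utilities
print(U, policy)
```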
Supervised Learning
y = f(x). Function approximation: find the f that maps x to y.