Reinforcement Learning Flashcards
Q: In a game, if all players are aware of their opponents’ strategies but no player can increase their own reward by changing only their own strategy, the game is in a state known as a “Nash equilibrium”
A: True. If no player can gain more reward by unilaterally changing their individual strategy, that is a Nash equilibrium. (Video Lectures – Lesson 12)
Q: The “Pavlov” strategy is sub-game perfect with respect to the Prisoner’s Dilemma
A: True. Starting from any combination of previous moves (i.e., in any subgame), two players who both employ the Pavlov strategy will quickly return to mutual cooperation and stay there, so the strategy remains in equilibrium in every subgame. (Video Lectures – Lesson 13)
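A hedged sketch of why this works (my own illustration, not from the lectures): Pavlov cooperates after a round in which both players made the same move and defects after they disagreed, so any starting history funnels both players back into mutual cooperation.

```python
# Pavlov ("win-stay, lose-shift"): cooperate if the two previous moves
# agreed, defect if they disagreed.
def pavlov(my_last, their_last):
    return 'C' if my_last == their_last else 'D'

# From any starting pair of moves, two Pavlov players reach mutual
# cooperation within two rounds and then stay there.
for start in [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]:
    a, b = start
    history = [start]
    for _ in range(3):
        a, b = pavlov(a, b), pavlov(b, a)
        history.append((a, b))
    print(start, '->', history)
```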
Q: The Credit Assignment Problem refers to the state that gives the most reward in an MDP.
A: False. The Credit Assignment Problem is the retrospective question, given the reward at the end of a trajectory, of which states/actions along that trajectory were most responsible for the ultimate result.
Q: Model-based reinforcement learning is the use of supervised learning models such as neural networks to solve large state space RL problems.
A: False. Model-based reinforcement learning means the learner iteratively builds a model of the environment (its transition function, reward function, and state space) and chooses actions by planning against the current model.
Q: Value iteration is one of the most important model-free reinforcement learning methods.
A: False. Value iteration is model-based: it requires the transition and reward functions in order to perform its updates (see the sketch below).
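A minimal value-iteration sketch (my own illustration, not from the lectures). Note that it needs the full model as input, which is exactly why it is model-based; the array shapes are assumptions for the example.

```python
import numpy as np

# Assumes T[s, a, s'] holds transition probabilities and R[s, a] expected rewards.
def value_iteration(T, R, gamma=0.9, tol=1e-6):
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # values and a greedy policy
        V = V_new
```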
Q: Off-policy agents learn the value of a policy different than the policy they are acting under.
A: True. Examples include Q-learning and DQN: the behavior policy (e.g., ε-greedy, possibly drawing on previously stored experience) differs from the target policy (greedy with respect to the current Q-values) whose value is being learned; see the update sketch below.
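A minimal tabular Q-learning step (illustrative sketch; the `env_step` interface and the `Q` table are assumptions, e.g. `Q = collections.defaultdict(float)`). The behavior policy is ε-greedy, but the update bootstraps from max over next actions, i.e. the greedy target policy, which is what makes it off-policy.

```python
import random

def q_learning_step(Q, s, actions, env_step, alpha=0.1, gamma=0.99, eps=0.1):
    # behavior policy: epsilon-greedy action selection
    if random.random() < eps:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q[(s, act)])
    r, s_next, done = env_step(s, a)  # assumed environment interface
    # target policy: greedy value of the next state
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return s_next, done
```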
Q: Only non-linear function approximation has been proven to converge when used with the right hyper-parameters.
A: False. It is linear function approximation that has been proven to converge (under suitable conditions); non-linear function approximation has not.
Q: POMDP are partially-observable because they are missing the MDP. An example of this is model-free reinforcement learning problems.
A: False. POMDPs are partially observable because states are not mapped one-to-one with observations; that is, the observed environment does not uniquely determine the underlying state.
Q: Grim Trigger strategy means a player will cooperate for the entire game regardless of other player’s action.
A: False. Grim Trigger cooperates until the other player defects, and then defects forever afterward.
Q: Bayesian RL is a model-based approach that relies heavily on statistical methods such as Bayes’ rule.
A: False, Bayesian RL is not necessarily model-based.
Q: DEC-POMDPs are a modeling framework for cooperative problems under uncertainty.
A: True. In a DEC-POMDP all agents share a single joint reward function, so the framework models decentralized, cooperative decision-making under uncertainty.
Q: Model-based reinforcement learning agents can solve environments with continuous state variables because they are able to learn the transition and reward function.
A: False. Learning the transition and reward functions does not by itself handle continuous state variables; that requires function approximation, which can be combined with either model-based or model-free methods.
Q: TD(1) is equivalent to a K-Step estimator with K = 1.
A: False. TD(0) is equivalent to a K-step estimator with K = 1; TD(1) is equivalent to a K-step estimator with K = ∞ (the full Monte Carlo return). See the returns sketched below.
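For reference, a sketch of the standard definitions involved (my own notation, not quoted from the lectures): the k-step return and the λ-return that TD(λ) averages over.

```latex
% k-step return: take k rewards, then bootstrap from the value estimate
G_t^{(k)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} + \gamma^{k} V(s_{t+k})

% lambda-return averaged by TD(lambda): lambda = 0 keeps only k = 1 (TD(0)),
% lambda = 1 keeps only the full Monte Carlo return (k -> infinity), i.e. TD(1)
G_t^{\lambda} = (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}
```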
Q: Potential based reward shaping is used to indirectly shape the optimal policy by modifying rewards.
A: False. Potential-based reward shaping is designed precisely so that it can speed up learning without changing the optimal policy. In Q-learning it is equivalent to starting from a good initialization of the Q function (see the shaping term below).
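A sketch of the standard potential-based shaping term (my own summary of the usual result, with Φ a potential function over states):

```latex
% Potential-based shaping adds, to the environment reward, a term driven by
% a potential function \Phi over states:
F(s, s') = \gamma \, \Phi(s') - \Phi(s)
% Standard result: the optimal Q-values of the shaped problem satisfy
% Q'_\ast(s, a) = Q_\ast(s, a) - \Phi(s); the offset does not depend on a,
% so the greedy (optimal) policy is unchanged.
```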
Q: An MDP reward function can be scaled, shifted by a constant, or augmented with non-linear potential-based rewards without changing the optimal policy.
A: False. The scale factor must be positive for it not to change the optimal policy.
Q: When exploring deterministic MDPs using the mistake bounded optimal algorithm, we assume any unknown state-action pair has a reward self loop of Rmax (equal to the largest reward seen so far) to ensure that every state-action pair is eventually explored.
A: False. The Rmax self-loop encourages exploration, but while exploring the agent can become stuck in a strongly connected component of the MDP graph that does not include all of the states (i.e., some parts of the MDP may not be reachable given prior actions), so not every state-action pair is guaranteed to be explored.
Q: Policy Search continuously updates the policy directly via a value update. This update is based on the reward which you receive.
A: False. Policy search updates the policy directly via policy updates (e.g., adjusting the policy’s parameters), not via value updates.
Q: Following a plan and constantly checking if the action was successful (and changing the plan if it was not) is called conditional planning.
A: False. That is dynamic re-planning; a conditional plan instead builds the contingencies (‘if/else’ branches) into the plan itself.
Q: V(s) can be expressed from Q(s,a) and vice versa
A: True
V(s) = max_a Q(s,a)
Q(s,a) = R(s,a) + γ ∑_{s'} T(s,a,s') V(s')
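A small illustrative sketch of these two conversions with tabular arrays (my own example; the shapes T: [S, A, S'] and R: [S, A] are assumptions):

```python
import numpy as np

def q_from_v(V, T, R, gamma=0.9):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    return R + gamma * T @ V

def v_from_q(Q):
    # V(s) = max_a Q(s, a)
    return Q.max(axis=1)
```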
Q: With potential-based learning, the agent receives higher rewards when it’s closer to the positive terminal state.
A: False. In potential-based shaping, the environment designer adds a bonus based on the change in a potential function over states, guiding the agent toward the desired terminal state: the agent gains γΦ(s') for entering a state and is charged Φ(s) when it leaves, so simply being near the goal does not by itself yield higher reward.
Q: Current eligibility traces of past events are used with current TD errors to compute updates for TD(λ) backward view.
A: True. In the backward view, the current TD error is distributed to previously visited states in proportion to their eligibility traces; the forward view instead uses future rewards and states, and the two views yield equivalent (offline) updates. See the sketch below.
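A minimal backward-view TD(λ) sketch for state-value prediction (my own illustration, not from the lectures). `V` is assumed to be a dict-like value table (e.g. `collections.defaultdict(float)`) and `episode` yields `(s, r, s_next, done)` transitions.

```python
from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.99, lam=0.9):
    e = defaultdict(float)                # eligibility traces
    for s, r, s_next, done in episode:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]             # current TD error
        e[s] += 1.0                       # accumulating trace for s
        for state in list(e):
            # every previously visited state is updated in proportion
            # to its decaying eligibility
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam
    return V
```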
Q: TD(1) gives a maximum likelihood estimate.
A: False. TD(0) gives the maximum-likelihood estimate; TD(1) gives the Monte Carlo estimate, which is the minimum mean-squared-error estimate on the observed returns.
Q: Temporal difference learning falls into the category of model-based learning.
A: False. TD is a class of model-free, value-based RL techniques: it builds up value estimates incrementally from experience, without learning a transition or reward model.
Q: Experience / sample complexity relates to how much data is needed to converge on the answer.
A: True, This is one of the criteria for evaluating an agent.
Q: Q-Learning is on-policy because it might not use the selected action a_t to update the Q-values.
A: False. That property is exactly why Q-learning is off-policy: its update target uses max_a Q(s', a) rather than the action the behavior policy actually takes next.
Q: potential-based shaping is equivalent to modifying initial Q-values. That is, the Q-values are the same.
A: False. The greedy policy (and the agent’s behavior) will be the same, but the learned Q-values themselves differ by the potential Φ(s), so they are not identical; see the sketch below.
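A sketch of the correspondence as I understand the standard result (my own summary; Φ is the shaping potential):

```latex
% If agent A runs Q-learning with shaping reward F(s, s') = \gamma\Phi(s') - \Phi(s)
% and agent B runs plain Q-learning but starts from Q-values shifted up by \Phi,
% then (given the same experience and learning rates) their tables stay related by
Q_B(s, a) = Q_A(s, a) + \Phi(s)
% The offset \Phi(s) is independent of a, so \arg\max_a agrees everywhere:
% same greedy policy, different raw Q-values.
```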
Q: Markov games are a type of MDP.
A: False. MDPs are a subset of Markov games; a Markov game with a single agent is just an MDP.