Final Review pt. 4 Flashcards
True or False: The only algorithms that work in POMDPs are planning algorithms. Why?
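False. RL algorithms can also work in POMDPs, for example by learning from observation histories or belief states rather than planning with a known model.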
True or False: Problems that can be represented as POMDPs cannot be represented as MDPs. Why?
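False. An MDP is a special case of a POMDP (one where the observation reveals the state), and any POMDP can be recast as an MDP whose states are belief states, although that belief space is continuous.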
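One standard way to view a POMDP as an MDP is to treat the belief (a probability distribution over hidden states) as the state. The sketch below shows only the belief update step. The table layouts `T[s][a][s']` and `O[a][s'][o]` and the two-state numbers are hypothetical, chosen for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes filter: b'(s') is proportional to O[a][s'][o] * sum_s T[s][a][s'] * b(s)."""
    n = len(belief)
    unnormalized = [
        O[action][s2][observation]
        * sum(T[s][action][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Hypothetical model: two hidden states, one action, two observations.
T = [[[0.9, 0.1]], [[0.2, 0.8]]]  # T[s][a][s']: transition probabilities
O = [[[0.8, 0.2], [0.3, 0.7]]]    # O[a][s'][o]: observation probabilities

b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=0, T=T, O=O)
```

Because the next belief depends only on the current belief, the action, and the observation, the belief process is Markov, which is what lets the POMDP be treated as an MDP over beliefs.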
True or False: Applying generalization with an “averager” on an MDP results in another MDP. Why?
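True. An averager estimates each state's value as a convex combination (a weighted average) of the values at a set of basis states, which is equivalent to solving another MDP defined over those basis states.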
True or False: With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal. Why?
False. It may not converge at all: in Baird's counterexample the weights diverge under linear function approximation.
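Baird's counterexample uses seven states. The same kind of divergence can be seen in a smaller two-state illustration, sketched here instead of Baird's construction. It assumes one weight `w`, state A with feature 1 (value `w`), state B with feature 2 (value `2w`), a zero-reward transition A -> B, and updates applied only at A. The function name and parameter values are hypothetical.

```python
def semi_gradient_td_weight(w0, alpha, gamma, steps):
    """Repeat the semi-gradient TD(0) update for the single transition A -> B."""
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error: reward (0) + gamma * V(B) - V(A), with V(A) = w and V(B) = 2w.
        td_error = 0.0 + gamma * (2.0 * w) - w
        # The gradient of V(A) with respect to w is A's feature, which is 1.
        w += alpha * td_error * 1.0
        history.append(w)
    return history


diverging = semi_gradient_td_weight(w0=1.0, alpha=0.1, gamma=0.9, steps=100)
shrinking = semi_gradient_td_weight(w0=1.0, alpha=0.1, gamma=0.4, steps=100)
```

Each step multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so under these assumptions the weight grows without bound whenever `gamma > 0.5` and decays toward zero when `gamma < 0.5`.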
True or False: RL with linear function approximation will not work on environments having a continuous state space. Why?
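False. Linear function approximation is linear in the weights, not in the raw state. With suitable features of a continuous state (tile coding, radial basis functions, polynomial terms) it can capture non-linearities and feature interactions, so it can work on continuous state spaces.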
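Whether a linear approximator can represent a nonlinear function of a continuous state depends on the features it is given. The sketch below is a supervised stand-in for the value-update step, not a full RL loop: it fits a fixed nonlinear target over the state range [0, 1] with radial basis features and a stochastic gradient update. The centers, width, target function, step size, and sample count are all illustrative assumptions.

```python
import math
import random

CENTERS = [i / 10 for i in range(11)]  # RBF centers spread over [0, 1] (assumed)
WIDTH = 0.1                            # RBF width (assumed)


def features(state):
    """Nonlinear features of a one-dimensional continuous state."""
    return [math.exp(-((state - c) ** 2) / (2 * WIDTH ** 2)) for c in CENTERS]


def predict(weights, state):
    """The estimate is linear in the weights: a dot product with the features."""
    return sum(w * f for w, f in zip(weights, features(state)))


def target(state):
    """Hypothetical nonlinear value function to be approximated."""
    return math.sin(2 * math.pi * state)


def mean_squared_error(weights, n_points=101):
    grid = [i / (n_points - 1) for i in range(n_points)]
    return sum((predict(weights, s) - target(s)) ** 2 for s in grid) / n_points


rng = random.Random(0)
weights = [0.0] * len(CENTERS)
mse_before = mean_squared_error(weights)

for _ in range(20000):
    state = rng.random()  # sample a continuous state uniformly from [0, 1)
    phi = features(state)
    error = target(state) - sum(w * f for w, f in zip(weights, phi))
    for i, f in enumerate(phi):
        weights[i] += 0.1 * error * f  # gradient step on the squared error

mse_after = mean_squared_error(weights)
```

The state is never discretized here; the features handle the continuous input and the weights stay a plain vector.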
Let’s say you want to use Q-learning with some function approximator. Recall that we learned a convergence theorem and we used that to conclude that Q-learning converges. Can we apply that theorem to prove that your Q-learning with some function approximator converges? Why or why not?
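No. That theorem covers the tabular case, where each state-action value is stored and updated separately. With a function approximator, updating one state-action pair also changes the estimates of others, so the theorem's conditions no longer hold and learning may diverge, as we saw with DQN in Project 2.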
Let’s say you want to use a function approximator like we learned in class. What function(s) are you approximating? What’s the input of that function and what’s the output of that function?
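We can approximate the action-value function Q. Its input is a state-action pair and its output is a scalar estimate of the expected return from taking that action in that state. (DQN-style networks instead take the state as input and output one Q-value per action.)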
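To make the input and output concrete, here is a minimal sketch of a linear Q approximator with one weight vector per action, together with a semi-gradient Q-learning update. The function names, the feature vectors, and the parameter values are hypothetical and not taken from the course material.

```python
def q_value(weights, features, action):
    """Input: state features and an action. Output: a scalar Q estimate."""
    return sum(w * f for w, f in zip(weights[action], features))


def q_learning_update(weights, features, action, reward,
                      next_features, done, alpha, gamma):
    """One semi-gradient Q-learning step; mutates weights and returns the TD error."""
    target = reward
    if not done:
        # Bootstrap from the best estimated action value in the next state.
        target += gamma * max(
            q_value(weights, next_features, a) for a in range(len(weights))
        )
    td_error = target - q_value(weights, features, action)
    for i, f in enumerate(features):
        weights[action][i] += alpha * td_error * f
    return td_error


# Two actions, two features, all weights initially zero.
weights = [[0.0, 0.0], [0.0, 0.0]]
td = q_learning_update(weights, features=[1.0, 0.5], action=1, reward=1.0,
                       next_features=[0.0, 0.0], done=True, alpha=0.5, gamma=0.99)
```

Only the weights of the chosen action change in this layout, but any state sharing those features gets a different estimate too, which is where generalization comes from.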
We learned about reward shaping in class. Could it be useful for solving Lunar Lander? If so, why and how?
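Yes. Lunar Lander has a continuous state space and a clear goal (reaching the landing pad), so extra reward for progress toward the pad gives the agent denser feedback and should accelerate learning. One way to do it is potential-based shaping: add γΦ(s') − Φ(s) to the reward, with a potential Φ that increases as the lander approaches the pad, which leaves the optimal policy unchanged.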
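A minimal sketch of potential-based shaping, under stated assumptions: the first two state components are taken to be the lander's x and y position, the pad is taken to sit at the origin, and the potential is the negative distance to it. The function names and the discount value are hypothetical.

```python
import math

GAMMA = 0.99  # assumed discount factor


def potential(state):
    """Higher (less negative) the closer the lander is to the assumed pad at (0, 0)."""
    x, y = state[0], state[1]
    return -math.hypot(x, y)


def shaped_reward(reward, state, next_state, done, gamma=GAMMA):
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    # Terminal states get potential 0 so the shaping terms telescope over an episode.
    next_potential = 0.0 if done else potential(next_state)
    return reward + gamma * next_potential - potential(state)


closer = shaped_reward(0.0, (1.0, 1.0), (0.5, 0.5), done=False)
farther = shaped_reward(0.0, (0.5, 0.5), (1.0, 1.0), done=False)
```

Moving toward the pad earns a positive bonus and moving away a negative one, while the form of the shaping term keeps the ranking of policies the same as under the original reward.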
Observe that the biggest difference between P2’s Lunar Lander problem and HW4’s Taxi problem is that there are infinitely many states in Lunar Lander. What are some good methods to handle this case? What are their pros and cons?
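DQN is well suited to continuous, high-dimensional state spaces such as Lunar Lander's, because it approximates Q with a neural network instead of a table. Other methods include model-based approaches such as DreamerV2, imitation learning, and policy gradient algorithms such as REINFORCE, PPO, A2C, and SAC [7]. Pros: these methods generalize across states and can reach strong performance. Cons: their training objectives are non-convex, so they are hard to train, sensitive to hyperparameters, and come without convergence guarantees.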