Deep learning for control in robotics Flashcards
Lecture 12
What is direct perception?
Use deep learning only for the perception part (mapping raw sensor input to a compact intermediate representation of the scene) and a manually designed policy/controller to compute the actions.
What is the main issue in using expert demonstrations?
Errors in the expert data will also be present in the learned behavior.
The learned policy will be reactive, without long-term goals.
What is the procedure behind DAgger and what is the motivation?
Train a policy on expert demonstrations, gather data by running the trained policy, label the newly visited states with the expert, and train a new policy on the aggregated dataset.
To combat compounding errors: the learned policy drifts into states the expert never visited, so it needs expert labels for those states too.
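A minimal sketch of the loop, assuming hypothetical placeholders `env`, `Policy` (with `fit`/`act`), and `expert_label` rather than any specific library API:

```python
# Hedged sketch of the DAgger loop; env, Policy, and expert_label are
# hypothetical placeholders, not a specific library API.

def rollout(env, act_fn, horizon):
    """Run act_fn in the environment and return the visited states."""
    states, s = [], env.reset()
    for _ in range(horizon):
        states.append(s)
        s, done = env.step(act_fn(s))
        if done:
            s = env.reset()
    return states

def dagger(env, expert_label, Policy, n_iters=10, horizon=200):
    # 1. Initial dataset: states visited by the expert, labeled by the expert.
    dataset = [(s, expert_label(s)) for s in rollout(env, expert_label, horizon)]
    policy = Policy()
    policy.fit(dataset)  # supervised learning on (state, action) pairs

    for _ in range(n_iters):
        # 2. Gather states by running the CURRENT learned policy.
        visited = rollout(env, policy.act, horizon)
        # 3. Label those states with the expert's actions.
        dataset += [(s, expert_label(s)) for s in visited]
        # 4. Retrain on the aggregated dataset; the aggregation is what
        #    counters compounding errors.
        policy = Policy()
        policy.fit(dataset)
    return policy
```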
Explain Guided Policy Search (GPS)
A method that combines trajectory optimization with policy search. You sample trajectories from your current trajectory distribution and then train two things: (1) the policy, supervised on these trajectories, and (2) a model of the environment dynamics, fit to the same data. Both are then used to update the trajectory distribution, which is optimized to minimize the cost c(τ), to maximize entropy (keeping the trajectory distribution diverse), and is constrained not to deviate too far from the current policy.
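One common way to write the trajectory-distribution update described above (a sketch; the exact constraint differs between GPS variants), with p(τ) the trajectory distribution, π_θ the policy, and ε a trust-region size:

```latex
\min_{p(\tau)} \; \mathbb{E}_{p(\tau)}\big[c(\tau)\big] \;-\; \mathcal{H}\big(p(\tau)\big)
\quad \text{s.t.} \quad D_{\mathrm{KL}}\big(p(\tau)\,\|\,\pi_\theta(\tau)\big) \le \epsilon
```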
What is the main point in Inverse Reinforcement learning?
The reward function is not known; instead, it is learned from expert demonstrations.
What is the problem with maximum entropy IRL?
Its objective is to maximize the probability of the expert demonstrations under the trajectory distribution induced by the learned cost. The problem is the partition function Z: evaluating it requires summing over all trajectories (e.g., via state visitation frequencies computed with dynamic programming), which becomes intractable for large state spaces and impossible when the dynamics are unknown.
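For reference, the standard MaxEnt IRL model this describes, with learned cost c_θ, demonstrations τ_i, and the partition function Z (the loss is the negative log-likelihood of the demonstrations):

```latex
p_\theta(\tau) = \frac{1}{Z}\exp\!\big(-c_\theta(\tau)\big),
\qquad
Z = \int \exp\!\big(-c_\theta(\tau)\big)\, d\tau,
\qquad
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} c_\theta(\tau_i) + \log Z
```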
What is the main difference between MaxEnt IRL and Guided cost learning?
The main difference is that instead of computing the partition function exactly (summing exp(-cost) over all possible trajectories in the state space), we estimate it by importance sampling from a proposal distribution; in practice the samples come from the current policy, which acts as an adaptive sampler.
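The importance-sampling estimate this refers to, with q the proposal (policy) distribution and τ_j ∼ q:

```latex
Z \;\approx\; \frac{1}{N}\sum_{j=1}^{N} \frac{\exp\!\big(-c_\theta(\tau_j)\big)}{q(\tau_j)}
```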
What are the steps in guided cost learning?
We sample trajectories from our current policy, then update the cost/reward function using both these samples and the expert demonstrations. In more detail, the policy samples are used to estimate the partition function Z (which is hard to compute analytically because of the large state space) and thus the second term of the loss, while the demonstrations are used to evaluate the first term of the gradient. The policy itself is then updated against the current cost, and the process repeats.
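A hedged sketch of the cost update described above, assuming a hypothetical `cost_net` that maps a trajectory to a scalar cost tensor and precomputed log-probabilities `log_q` of the policy samples:

```python
import math
import torch

def gcl_cost_loss(cost_net, demo_trajs, sample_trajs, log_q):
    """Hedged sketch of the guided cost learning update for the cost network.

    cost_net(traj) -> scalar tensor c_theta(traj)   (hypothetical interface)
    demo_trajs     -> expert demonstration trajectories
    sample_trajs   -> trajectories sampled from the current policy q
    log_q          -> tensor of log-probabilities of sample_trajs under q
    """
    # First term: mean cost of the expert demonstrations.
    demo_cost = torch.stack([cost_net(t) for t in demo_trajs]).mean()

    # Second term: log of the importance-sampled partition function,
    # Z ~= (1/N) * sum_j exp(-c_theta(tau_j)) / q(tau_j).
    sample_cost = torch.stack([cost_net(t) for t in sample_trajs])
    log_weights = -sample_cost - log_q            # log[ exp(-c) / q ]
    log_Z = torch.logsumexp(log_weights, dim=0) - math.log(len(sample_trajs))

    # Negative log-likelihood of the demos under the MaxEnt trajectory model.
    return demo_cost + log_Z
```

Backpropagating through this loss yields exactly the two terms mentioned above: the demonstration term and the importance-weighted sample term.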
Describe the main similarities between Guided Cost Learning and Generative Adversarial Networks.
While GCL works with trajectories, a GAN works with generic samples. The policy in GCL corresponds to the generator in a GAN (it produces trajectory samples), and the cost/reward function (trained to assign high probability to expert trajectories and low probability to policy samples) corresponds to the discriminator.
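One way this correspondence is often made precise (a sketch, using the notation above, with q the policy's trajectory distribution): the discriminator in the GAN view takes the form

```latex
D_\theta(\tau) \;=\; \frac{p_\theta(\tau)}{p_\theta(\tau) + q(\tau)},
\qquad
p_\theta(\tau) = \frac{1}{Z}\exp\!\big(-c_\theta(\tau)\big)
```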