Class 6 - Guest Lecture - Deep Reinforcement Learning Flashcards
Difference between reinforcement learning and deep learning
The difference is that deep learning learns from a fixed training set and then applies that learning to new data, while reinforcement learning learns dynamically, adjusting actions based on continuous feedback in order to maximize a reward.
Traditional methods for robotics
work well for demos and narrow applications, but they don’t generalize well and require expensive and tedious adaptation to any new task or environment (e.g., Boston Dynamics).
Deep Learning for Robotics
has proven effective at achieving (super)human-level performance on many tasks:
- object detection and face recognition
- speech recognition
- dexterous manipulation
- still in the very early stages
How to apply DL to robotics options:
- “easy” fix (what is it about)?
- “harder” fix (what is it about)?
- easy fix: replace some components with neural networks, BUT we still have to engineer the entire system and design (and train) the different components separately (issues that may arise: mistakes propagate through the pipeline, movements do not generalize well)
- hard fix: end-to-end learning, an automatic learning approach where the model learns all the steps between the initial input and the final output. It takes the raw input and returns a distribution over actions (see the sketch below).
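A minimal sketch of what end-to-end learning looks like in code, assuming PyTorch and made-up input/action sizes: a single network maps the raw observation straight to a distribution over actions, with no hand-engineered pipeline in between.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: a flattened camera image as input, 4 discrete actions.
OBS_DIM, N_ACTIONS = 64 * 64 * 3, 4

# One network replaces the whole perception -> planning -> control pipeline.
policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, N_ACTIONS),
)

obs = torch.rand(1, OBS_DIM)                               # raw sensory input
action_dist = torch.distributions.Categorical(logits=policy(obs))
action = action_dist.sample()                              # the policy returns a distribution over actions
```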
The reinforcement learning approach to learning a solution… (pick one):
A. uses simulations to train the agent
B. places the agent in an environment and lets it explore by performing actions, each of which produces a new state and a reward for the agent.
B
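A sketch of the interaction loop described in option B, assuming the Gymnasium-style reset/step API; the environment name and the random stand-in policy are placeholders.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")               # placeholder environment
state, info = env.reset()

for _ in range(200):
    action = env.action_space.sample()      # a random policy standing in for the agent
    next_state, reward, terminated, truncated, info = env.step(action)
    # the agent would use (state, action, reward, next_state) to update itself
    state = next_state
    if terminated or truncated:
        state, info = env.reset()
```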
A solution to the fact that reinforcement learning generally requires a lot of time and a lot of repetitions is…
to use simulations to train thousands of agents in parallel
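A rough sketch of that idea, assuming Gymnasium's vectorized-environment API: many copies of the simulation are stepped in lockstep, so every step yields many transitions at once.

```python
import gymnasium as gym

# Several copies of the same simulated environment, stepped in parallel.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])
obs, info = envs.reset()

for _ in range(100):
    actions = envs.action_space.sample()    # one action per parallel environment
    obs, rewards, terminated, truncated, info = envs.step(actions)
    # each step now yields 8 transitions instead of 1, speeding up data collection
```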
Although deep learning has proven to be more robust against perturbations during training / testing, it still has one main issue, namely…
even the best simulations are too different from reality
reality gap (in the context of deep learning)
a policy that works well in simulation can lose much of its performance when moved to the real world
Fill in:
In the context of deep learning simulations, small errors that compound at each time step might result in very - similar / different - trajectories between simulation and the real world.
different
One approach that tries to solve the reality gap issue is the…
Sim2Real approach
The Sim2Real approach uses dynamics randomization to…
train robots in simulation using a wide range of physics (e.g., amount of gravity, size of each robot component, friction, visual appearance and lighting, etc.) to force the robot to work across many different environments, with the hope that the real world ends up being one of them (sketched below).
Disadvantage: very computationally expensive.
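A rough sketch of dynamics randomization with hypothetical physics parameters (the `SimParams` fields and the commented helper functions are made up): each episode samples new physics, so the policy cannot rely on one exact world.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    # Hypothetical physics parameters exposed by a simulator.
    gravity: float = 9.81
    friction: float = 1.0
    link_mass_scale: float = 1.0
    light_intensity: float = 1.0

def randomize_dynamics() -> SimParams:
    """Sample a new set of physics parameters for one training episode."""
    return SimParams(
        gravity=random.uniform(8.0, 12.0),        # m/s^2, around Earth's 9.81
        friction=random.uniform(0.5, 1.5),
        link_mass_scale=random.uniform(0.8, 1.2), # scale each robot link's mass
        light_intensity=random.uniform(0.3, 1.0), # visual appearance also varies
    )

# Each episode runs under different physics, so the policy must work across all of them:
# for episode in range(num_episodes):
#     params = randomize_dynamics()
#     run_episode(simulator_with(params), policy)   # hypothetical helpers
```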
One big problem in reinforcement learning is that (multiple picks are possible):
A. we have to design the reward function by hand
B. the reward function is always the same
C. the reward function we choose may not result in the behavior we want
A, C
The idea behind imitation learning is to…
collect demonstrations from humans solving the target task (in the demonstration phase), and use them to train an agent (training phase + test phase).
3 main approaches to imitation learning in the context of deep learning
- behavior cloning
- inverse reinforcement learning
- sequence modeling
In the context of imitation learning in deep learning, behavior cloning…
- treats the problem as supervised learning.
- Collects (state, action) pairs from many demonstration episodes.
- Trains a neural network to produce the same actions in the same states (see the sketch below).
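A minimal behavior-cloning sketch, assuming PyTorch and random stand-in data for the demonstrated (state, action) pairs: the policy is trained exactly like a supervised classifier on the demonstrator's actions.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 4                          # made-up dimensions

# Stand-in demonstration data: in practice these come from human demonstrations.
demo_states = torch.rand(1000, STATE_DIM)
demo_actions = torch.randint(0, N_ACTIONS, (1000,))

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    logits = policy(demo_states)                     # predict actions for demonstrated states
    loss = loss_fn(logits, demo_actions)             # supervised loss against the human actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```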
Main issues with behavior cloning
Since the agent tends to overfit the demonstrator’s trajectories, if the agent’s trajectory deviates from the demonstrations it can quickly diverge due to compounding errors.
Solution to the behavior cloning problem
Do not train the agent to memorize a trajectory; instead, train it to assign a probability to a trajectory, indicating how likely it is that a human would have taken it. If the probability is high, the action was likely one a human would take, so the robot will follow it.
Compliance mode
the robot produces motor commands for its motors but still allows humans to physically move them.
In the inverse reinforcement learning approach…
the agent infers the reward function that generated the behavior of another agent (from human demonstrations) and attempts to reproduce the same behavior using the inferred reward function
Main issue with the inverse reinforcement learning approach
There are many reward functions that can explain the same observed behavior, so how can we differentiate between them?
In the sequence modeling approach…
the goal is to predict a full sequence of actions that leads to a sequence of high rewards. Specifically, we want to learn the probability distribution over the most successful sequences of actions.
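One rough way to make this concrete (a simplified stand-in with made-up data, not necessarily the exact method from the lecture), assuming PyTorch: keep only the highest-return episodes and train a recurrent model to assign high probability to the action sequences they contain.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, SEQ_LEN = 8, 4, 20              # made-up dimensions

# Stand-in episodes: sequences of states and actions, plus a total return per episode.
states = torch.rand(500, SEQ_LEN, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (500, SEQ_LEN))
returns = torch.rand(500)

# Keep only the most successful episodes and model the action sequences they contain.
top = returns.topk(100).indices
states, actions = states[top], actions[top]

class SeqPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(STATE_DIM, 64, batch_first=True)
        self.head = nn.Linear(64, N_ACTIONS)

    def forward(self, s):
        h, _ = self.rnn(s)                             # summarize the state sequence so far
        return self.head(h)                            # action logits at every time step

model = SeqPolicy()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    logits = model(states)                             # (batch, seq, n_actions)
    loss = loss_fn(logits.reshape(-1, N_ACTIONS), actions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```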
Task specification
- The reward function determines what behavior is learnt.
- Different reward functions can make learning easier or harder on the same task.
Reward hacking
occurs when an agent learns to exploit a poorly specified reward function to obtain high rewards by ‘cheating’.
Example of reward hacking
a coffee-making robot is incentivized to learn all the steps to make a cup of coffee. One of the steps that is rewarded is ‘turn on the coffee machine’. A naive implementation of the reward function may lead to the robot repeatedly turning the machine on and off to keep collecting the same reward multiple times.
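A toy illustration of the coffee-machine example, with hypothetical reward terms: because the naive reward pays for the ‘turn on’ event every time it occurs, toggling the switch forever scores higher than actually making coffee.

```python
def naive_reward(action, machine_on):
    # Naive specification: reward the 'turn on' event itself, every time it occurs.
    if action == "turn_on" and not machine_on:
        return 1.0
    return 0.0

def run(actions):
    machine_on, total = False, 0.0
    for a in actions:
        total += naive_reward(a, machine_on)
        if a == "turn_on":
            machine_on = True
        elif a == "turn_off":
            machine_on = False
    return total

intended = ["turn_on", "add_water", "add_beans", "brew"]
hacked = ["turn_on", "turn_off"] * 10                  # exploit: toggle the switch forever

print(run(intended), run(hacked))                      # 1.0 vs 10.0: the hack wins
```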
A cooperation game shows NO Nash equilibrium when…
one of the teams can improve its strategy without waiting for the other team to change its strategy
A competition game shows NO Nash equilibrium when…
one of the teams can improve its strategy without waiting for the other team to change its strategy