C5 Flashcards
model-based methods
the agent first builds its own internal transition model from environment feedback and uses this local model to learn the effects of actions on states and rewards. The agent can then generate policy updates from the internal model (planning), without causing further changes to the environment
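A minimal sketch of this idea for a tabular MDP; the toy `env_step` environment, the counting-based model estimate, and the value-iteration planner are illustrative assumptions, not from the card:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

def env_step(s, a):
    """Toy stochastic environment (illustrative): action 1 tends to move right."""
    s_next = min(s + 1, n_states - 1) if (a == 1 and rng.random() < 0.8) else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

# 1. Build an internal model from environment feedback: count transitions, average rewards.
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions, n_states))
for _ in range(5000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s_next, r = env_step(s, a)
    counts[s, a, s_next] += 1
    reward_sum[s, a, s_next] += r

T_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)   # estimated T(s'|s,a)
R_hat = reward_sum / np.maximum(counts, 1)                          # estimated R(s,a,s')

# 2. Plan on the learned model (value iteration): no further environment interaction needed.
V = np.zeros(n_states)
for _ in range(100):
    Q = (T_hat * (R_hat + gamma * V)).sum(axis=2)   # Q(s,a) under the learned model
    V = Q.max(axis=1)
print("greedy policy from the internal model:", Q.argmax(axis=1))
```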
advantage of model-based
the agent has its own model of the state transitions of the world, so it can learn the best policy for free, without incurring the further cost of acting in the environment => low sample complexity
disadvantage of model-based
the learned transition function may be inaccurate, and the resulting policy may then be of low quality => uncertainty and model bias
model-based planning and learning
- the agent uses the Q function as the behaviour policy to sample the new state and reward from the environment and to update the policy
- the agent records the new state and reward in a local transition and reward function. We can then choose to sample from the (cheap) local transition function or from the (expensive) environment transition function (see the sketch below)
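One way to picture the local transition and reward function is a table keyed by (state, action); the `record`/`sample` helpers below are hypothetical names for illustration:

```python
import random

local_model = {}          # (s, a) -> list of observed (reward, s_next) pairs

def record(s, a, r, s_next):
    """Store an environment transition in the local transition/reward function."""
    local_model.setdefault((s, a), []).append((r, s_next))

def sample(s, a, env_step):
    """Prefer the cheap local model; fall back on the expensive environment."""
    if (s, a) in local_model:
        return random.choice(local_model[(s, a)])   # cheap: no environment call
    s_next, r = env_step(s, a)                      # expensive: real environment sample
    record(s, a, r, s_next)
    return r, s_next
```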
the goal of model-based methods
- to solve larger and more complex problems in the same amount of time (lower sample complexity and a deeper understanding of the environment)
- to improve generalization so much that new classes of problems can be solved
why does the model-based agent learn the local transition function?
once the accuracy of this local function is good enough, the agent can sample from it to improve the policy without incurring the cost of actual environment samples
what is Dyna?
a hybrid approach between model-based and model-free learning
imagination: in addition to using environment samples to update the policy directly, the agent also performs planning updates with the local transition function
what happens when we turn on planning in model-based learning?
for each environment sample we perform N planning steps; planning amplifies any useful reward information that the agent has learned from the environment and quickly plows it back into the policy function
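A compact Dyna-Q-style sketch of this loop; the toy chain environment, the epsilon-greedy behaviour policy, and the hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 2
alpha, gamma, epsilon, N = 0.1, 0.9, 0.1, 10    # N planning steps per environment step

def env_step(s, a):
    """Toy chain environment (illustrative): action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

Q = np.zeros((n_states, n_actions))
model = {}                                       # (s, a) -> (r, s_next): the local model

s = 0
for _ in range(500):
    # behaviour policy: epsilon-greedy on Q, sampling from the (expensive) environment
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = env_step(s, a)

    # direct RL update from the real sample
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    model[(s, a)] = (r, s_next)                  # record in the local transition/reward function

    # planning: N extra updates from the (cheap) local model, amplifying the reward signal
    keys = list(model)
    for _ in range(N):
        ps, pa = keys[rng.integers(len(keys))]
        pr, ps_next = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])

    s = 0 if s_next == n_states - 1 else s_next  # restart the toy episode at the goal

print("greedy policy after Dyna-Q:", Q.argmax(axis=1))
```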
how can we reduce the uncertainty of the transition model in model-based methods?
- increase the number of environment samples
- use Gaussian processes, which learn the dynamics model as an estimate of the function together with an estimate of its uncertainty, via a covariance matrix over the entire dataset
- use ensemble methods (see the sketch below)
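A sketch of the ensemble idea: train several dynamics models on bootstrap resamples of the data and read their disagreement as an uncertainty estimate; the linear models and toy dataset are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy dataset of transitions: next_state = f(state, action) + noise
X = rng.uniform(-1, 1, size=(200, 2))                  # columns: state, action
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

# train an ensemble of linear dynamics models on bootstrap resamples of the data
ensemble = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    ensemble.append(w)

def predict(state, action):
    """Mean prediction plus disagreement (std) across the ensemble as uncertainty."""
    preds = np.array([w @ np.array([state, action]) for w in ensemble])
    return preds.mean(), preds.std()

mean, uncertainty = predict(0.3, -0.2)
print(f"predicted next state {mean:.3f} +/- {uncertainty:.3f}")
```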
latent models
goal: dimensionality reduction
idea: in most high-dimensional problems some elements are less important => abstract these away from the model
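A minimal sketch of the latent-model idea, assuming a linear (PCA-style) encoder for the dimensionality reduction and a least-squares latent dynamics model; real latent models use learned neural encoders instead:

```python
import numpy as np

rng = np.random.default_rng(3)

# high-dimensional observations that actually live on a low-dimensional manifold
latent_true = np.cumsum(0.1 * rng.normal(size=(500, 2)), axis=0)      # hidden 2-D state
mix = rng.normal(size=(2, 50))
obs = latent_true @ mix + 0.01 * rng.normal(size=(500, 50))           # 50-D observations

# encoder: project observations onto their top-2 principal components (abstract the rest away)
mean_obs = obs.mean(axis=0)
_, _, Vt = np.linalg.svd(obs - mean_obs, full_matrices=False)
encode = lambda o: (o - mean_obs) @ Vt[:2].T                          # 50-D -> 2-D latent code

# learn the transition model in the small latent space (here: z' = z A, fitted by least squares)
z, z_next = encode(obs[:-1]), encode(obs[1:])
A, *_ = np.linalg.lstsq(z, z_next, rcond=None)
print("latent dynamics matrix:\n", A)
```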
what is Model-Predictive Control (MPC)?
planning with the model is performed only for a limited time into the future, and the plan (and model) is redone after each environment step. In this way small errors do not get a chance to accumulate and greatly influence the outcome.
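A sketch of the MPC loop with random-shooting optimization; the `learned_model` stand-in, the cost function, and the horizon are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
horizon, n_candidates = 10, 100

def learned_model(state, action):
    """Stand-in for the learned dynamics model: a simple point-mass prediction."""
    return state + 0.1 * action

def cost(state):
    return abs(state - 1.0)            # drive the state towards 1.0

def mpc_action(state):
    """Optimize a short action sequence with the model, return only its first action."""
    best_cost, best_first_action = np.inf, 0.0
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=horizon)
        s, total = state, 0.0
        for a in actions:                       # roll out the model over a limited horizon
            s = learned_model(s, a)
            total += cost(s)
        if total < best_cost:
            best_cost, best_first_action = total, actions[0]
    return best_first_action

# re-plan after every environment step, so model errors cannot accumulate over long rollouts
state = 0.0
for _ in range(20):
    a = mpc_action(state)
    state = learned_model(state, a) + 0.01 * rng.normal()   # "real" environment step (toy)
print("final state:", round(state, 3))
```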
name 3 methods that perform planning with a neural network
- Value Iteration Networks (VIN)
- TreeQN
- Predictron
Value Iteration Networks (VIN)
- differentiable multi-layer convolutional networks used for planning in grid worlds (the value-iteration module is sketched below)
- uses backpropagation to learn the value iteration parameters, allowing it to navigate unseen environments
- can generalize to unknown transition probabilities
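The core of a VIN is a value-iteration module that can be written as a convolution followed by a max over actions, unrolled K times. A non-learned sketch of that recurrence on a small grid follows; in a real VIN the reward map and kernels are learned by backpropagation:

```python
import numpy as np
from scipy.signal import convolve2d

grid, gamma, K = 8, 0.95, 30
reward = np.full((grid, grid), -0.04)   # small step cost everywhere
reward[7, 7] = 1.0                      # goal cell

# one 3x3 kernel per move action (the single 1 marks which neighbour the action reaches)
kernels = [
    np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], float),
    np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]], float),
    np.array([[0, 0, 0], [1, 0, 0], [0, 0, 0]], float),
    np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], float),
]

# value-iteration module: Q^a = R + gamma * conv(V, W^a), V = max_a Q^a, repeated K times
V = np.zeros((grid, grid))
for _ in range(K):
    Q = np.stack([reward + gamma * convolve2d(V, W, mode="same") for W in kernels])
    V = Q.max(axis=0)
print("value of the start cell (0, 0):", round(float(V[0, 0]), 3))
```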
TreeQN
extended version of VIN, uses observation abstraction to handle irregular shapes
end-to-end learning for planning
hand-crafted planning algorithms are replaced by differentiable approaches, so the system can learn to plan and make decisions directly from raw input data