C5 Flashcards
model-based methods
the agent first builds its own internal transition model from environment feedback and uses this local model to learn the effects of actions on states and rewards. The agent can then generate policy updates from the internal model (planning), without causing further changes to the environment
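A minimal sketch of this idea for a tabular MDP; the toy `env_step` environment, the counting-based model estimate, and the value-iteration planner are illustrative assumptions, not from the card:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

def env_step(s, a):
    """Toy stochastic environment (illustrative): action 1 tends to move right."""
    s_next = min(s + 1, n_states - 1) if (a == 1 and rng.random() < 0.8) else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

# 1. Build an internal model from environment feedback: count transitions, average rewards.
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions, n_states))
for _ in range(5000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s_next, r = env_step(s, a)
    counts[s, a, s_next] += 1
    reward_sum[s, a, s_next] += r

T_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)   # estimated T(s'|s,a)
R_hat = reward_sum / np.maximum(counts, 1)                          # estimated R(s,a,s')

# 2. Plan on the learned model (value iteration): no further environment interaction needed.
V = np.zeros(n_states)
for _ in range(100):
    Q = (T_hat * (R_hat + gamma * V)).sum(axis=2)   # Q(s,a) under the learned model
    V = Q.max(axis=1)
print("greedy policy from the internal model:", Q.argmax(axis=1))
```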
advantage of model-based
the agent has its own model of the state transitions of the world, so it can learn the best policy for free, without incurring the further cost of acting in the environment => low sample complexity
disadvantage of model-based
the learned transition function may be inaccurate, and the resulting policy may then be of low quality => uncertainty and model bias
model-based planning and learning
- the agent uses the Q function as the behaviour policy to sample the new state and reward from the environment and to update the policy
- the agent records the new state and reward in a local transition and reward function. We can then choose to sample from the (cheap) local transition function or from the (expensive) environment transition function (see the sketch below)
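One way to picture the local transition and reward function is a table keyed by (state, action); the `record`/`sample` helpers below are hypothetical names for illustration:

```python
import random

local_model = {}          # (s, a) -> list of observed (reward, s_next) pairs

def record(s, a, r, s_next):
    """Store an environment transition in the local transition/reward function."""
    local_model.setdefault((s, a), []).append((r, s_next))

def sample(s, a, env_step):
    """Prefer the cheap local model; fall back on the expensive environment."""
    if (s, a) in local_model:
        return random.choice(local_model[(s, a)])   # cheap: no environment call
    s_next, r = env_step(s, a)                      # expensive: real environment sample
    record(s, a, r, s_next)
    return r, s_next
```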
the goal of model-based methods
- to solve larger and more complex problems in the same amount of time (lower sample complexity and a deeper understanding of the environment)
- to improve generalization so much that new classes of problems can be solved
why does the model-based agent learn the local transition function?
once the accuracy of this local function is good enough, the agent can sample from it to improve the policy without incurring the cost of actual environment samples
what is Dyna?
a hybrid approach between model-based and model-free learning
imagination: in addition to using environment samples to update the policy directly, the agent also performs planning updates with the local transition function
what happens when we turn on planning in model-based learning?
for each environment sample we perform N planning steps; planning amplifies any useful reward information that the agent has learned from the environment and quickly plows it back into the policy function
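A compact Dyna-Q-style sketch of this loop; the toy chain environment, the epsilon-greedy behaviour policy, and the hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 2
alpha, gamma, epsilon, N = 0.1, 0.9, 0.1, 10    # N planning steps per environment step

def env_step(s, a):
    """Toy chain environment (illustrative): action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

Q = np.zeros((n_states, n_actions))
model = {}                                       # (s, a) -> (r, s_next): the local model

s = 0
for _ in range(500):
    # behaviour policy: epsilon-greedy on Q, sampling from the (expensive) environment
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = env_step(s, a)

    # direct RL update from the real sample
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    model[(s, a)] = (r, s_next)                  # record in the local transition/reward function

    # planning: N extra updates from the (cheap) local model, amplifying the reward signal
    keys = list(model)
    for _ in range(N):
        ps, pa = keys[rng.integers(len(keys))]
        pr, ps_next = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])

    s = 0 if s_next == n_states - 1 else s_next  # restart the toy episode at the goal

print("greedy policy after Dyna-Q:", Q.argmax(axis=1))
```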
how can we reduce the uncertainty of the transition model in model-based methods?
- increase the number of environment samples
- use Gaussian processes, which learn the dynamics model as an estimate of the function together with an estimate of its uncertainty, via a covariance matrix over the entire dataset
- use ensemble methods (see the sketch below)
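A sketch of the ensemble idea: train several dynamics models on bootstrap resamples of the data and read their disagreement as an uncertainty estimate; the linear models and toy dataset are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy dataset of transitions: next_state = f(state, action) + noise
X = rng.uniform(-1, 1, size=(200, 2))                  # columns: state, action
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

# train an ensemble of linear dynamics models on bootstrap resamples of the data
ensemble = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    ensemble.append(w)

def predict(state, action):
    """Mean prediction plus disagreement (std) across the ensemble as uncertainty."""
    preds = np.array([w @ np.array([state, action]) for w in ensemble])
    return preds.mean(), preds.std()

mean, uncertainty = predict(0.3, -0.2)
print(f"predicted next state {mean:.3f} +/- {uncertainty:.3f}")
```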
latent models
goal: dimensionality reduction
idea: in most high-dimensional problems some elements are less important => abstract these away from the model
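A minimal sketch of the latent-model idea, assuming a linear (PCA-style) encoder for the dimensionality reduction and a least-squares latent dynamics model; real latent models use learned neural encoders instead:

```python
import numpy as np

rng = np.random.default_rng(3)

# high-dimensional observations that actually live on a low-dimensional manifold
latent_true = np.cumsum(0.1 * rng.normal(size=(500, 2)), axis=0)      # hidden 2-D state
mix = rng.normal(size=(2, 50))
obs = latent_true @ mix + 0.01 * rng.normal(size=(500, 50))           # 50-D observations

# encoder: project observations onto their top-2 principal components (abstract the rest away)
mean_obs = obs.mean(axis=0)
_, _, Vt = np.linalg.svd(obs - mean_obs, full_matrices=False)
encode = lambda o: (o - mean_obs) @ Vt[:2].T                          # 50-D -> 2-D latent code

# learn the transition model in the small latent space (here: z' = z A, fitted by least squares)
z, z_next = encode(obs[:-1]), encode(obs[1:])
A, *_ = np.linalg.lstsq(z, z_next, rcond=None)
print("latent dynamics matrix:\n", A)
```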
what is Model-Predictive Control (MPC)?
planning with the model is performed only for a limited time into the future, and the plan (and model) is redone after each environment step. In this way small errors do not get a chance to accumulate and greatly influence the outcome.
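A sketch of the MPC loop with random-shooting optimization; the `learned_model` stand-in, the cost function, and the horizon are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
horizon, n_candidates = 10, 100

def learned_model(state, action):
    """Stand-in for the learned dynamics model: a simple point-mass prediction."""
    return state + 0.1 * action

def cost(state):
    return abs(state - 1.0)            # drive the state towards 1.0

def mpc_action(state):
    """Optimize a short action sequence with the model, return only its first action."""
    best_cost, best_first_action = np.inf, 0.0
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=horizon)
        s, total = state, 0.0
        for a in actions:                       # roll out the model over a limited horizon
            s = learned_model(s, a)
            total += cost(s)
        if total < best_cost:
            best_cost, best_first_action = total, actions[0]
    return best_first_action

# re-plan after every environment step, so model errors cannot accumulate over long rollouts
state = 0.0
for _ in range(20):
    a = mpc_action(state)
    state = learned_model(state, a) + 0.01 * rng.normal()   # "real" environment step (toy)
print("final state:", round(state, 3))
```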
name 3 methods that perform planning with a neural network
- Value Iteration Networks (VIN)
- TreeQN
- Predictron
Value Iteration Networks (VIN)
- differentiable multi-layer convolutional networks used for planning in grid worlds (the value-iteration module is sketched below)
- uses backpropagation to learn the value iteration parameters, allowing it to navigate unseen environments
- can generalize to unknown transition probabilities
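The core of a VIN is a value-iteration module that can be written as a convolution followed by a max over actions, unrolled K times. A non-learned sketch of that recurrence on a small grid follows; in a real VIN the reward map and kernels are learned by backpropagation:

```python
import numpy as np
from scipy.signal import convolve2d

grid, gamma, K = 8, 0.95, 30
reward = np.full((grid, grid), -0.04)   # small step cost everywhere
reward[7, 7] = 1.0                      # goal cell

# one 3x3 kernel per move action (the single 1 marks which neighbour the action reaches)
kernels = [
    np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], float),
    np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]], float),
    np.array([[0, 0, 0], [1, 0, 0], [0, 0, 0]], float),
    np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], float),
]

# value-iteration module: Q^a = R + gamma * conv(V, W^a), V = max_a Q^a, repeated K times
V = np.zeros((grid, grid))
for _ in range(K):
    Q = np.stack([reward + gamma * convolve2d(V, W, mode="same") for W in kernels])
    V = Q.max(axis=0)
print("value of the start cell (0, 0):", round(float(V[0, 0]), 3))
```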
TreeQN
extended version of VIN, uses observation abstraction to handle irregular shapes
end-to-end learning for planning
hand-crafted planning algorithms are replaced by differentiable approaches, so the system can learn to plan and make decisions directly from raw input data