Reinforcement Learning Flashcards
What type is the observation (in the context of observation, reward, done, info)?
An object representing what the agent perceives of the environment, such as camera pixel data or a robot's joint angles
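As a concrete illustration, here is a minimal sketch of where the observation appears in an environment step, assuming the classic Gym API (newer Gymnasium versions return a five-element tuple from step instead):

```python
import gym  # assumes the classic Gym API; newer Gymnasium versions differ slightly

env = gym.make("CartPole-v1")
obs = env.reset()                            # initial observation (a NumPy vector for CartPole)
action = env.action_space.sample()           # any valid action
obs, reward, done, info = env.step(action)   # the observation is the first element of the tuple
```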
What is a state?
A state s is a complete description of the world
What is an observation?
A partial description of a state (which may omit info)
How are states and observations represented?
Vectors, matrices or tensors (e.g. RGB matrix, joint angles, etc.)
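For illustration, a tiny sketch (assuming NumPy) of the two common representations:

```python
import numpy as np

# A low-dimensional state/observation: joint angles of a robot arm (a vector).
joint_angles = np.array([0.1, -0.4, 0.0, 1.2, 0.3, -0.7, 0.05])

# A visual observation: an RGB image (a height x width x channels tensor).
rgb_image = np.zeros((84, 84, 3), dtype=np.uint8)
```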
What is fully observed?
When the agent is able to observe the complete state
What is action space?
The set of all valid actions in a given environment (think of the set of legal moves in Go). Action spaces can be discrete, with finitely many actions, or continuous, where actions are real-valued vectors (e.g., robot joint torques).
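A minimal sketch of the two kinds of action space, assuming the classic Gym spaces API:

```python
import numpy as np
from gym import spaces  # assumes the classic Gym spaces API

# Discrete action space: a finite set of actions, e.g. 4 possible moves.
discrete_actions = spaces.Discrete(4)

# Continuous action space: real-valued vectors, e.g. 2-D torques in [-1, 1].
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(discrete_actions.sample())    # an integer in {0, 1, 2, 3}
print(continuous_actions.sample())  # a length-2 float vector
```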
What is a policy?
A policy is the rule an agent uses to decide which actions to take; it is effectively the agent's brain. In deep RL, policies are parameterized, e.g., by the weights and biases of a neural network, and written \pi_{\theta}.
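As a sketch, a parameterized policy can be a small neural network whose weights and biases are the parameters \theta; the layer sizes below are illustrative, assuming PyTorch:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# A minimal parameterized policy: an MLP mapping observations to action logits.
# The layer sizes are illustrative, not tied to any particular environment.
policy_net = nn.Sequential(
    nn.Linear(4, 64),   # hypothetical 4-dimensional observation
    nn.Tanh(),
    nn.Linear(64, 2),   # hypothetical 2 discrete actions
)

obs = torch.randn(1, 4)                        # a dummy observation
logits = policy_net(obs)                       # unnormalized action preferences
action = Categorical(logits=logits).sample()   # the policy's chosen action
```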
What are the two types of stochastic policies and when are each used?
Categorical (discrete action spaces) and Diagonal Gaussian (continuous action spaces)
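A minimal sketch of sampling from each, assuming PyTorch's torch.distributions (the logits, means, and log-stds below are placeholders):

```python
import torch
from torch.distributions import Categorical, Normal

# Categorical policy (discrete action spaces): a distribution over a finite set of actions.
logits = torch.tensor([1.0, 0.5, -0.2])        # hypothetical logits for 3 actions
discrete_action = Categorical(logits=logits).sample()

# Diagonal Gaussian policy (continuous action spaces): an independent Normal per
# action dimension, i.e. a multivariate Gaussian with diagonal covariance.
mean = torch.tensor([0.0, 0.3])                # hypothetical mean action
log_std = torch.tensor([-0.5, -0.5])           # hypothetical log standard deviations
continuous_action = Normal(mean, log_std.exp()).sample()
```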
What is a trajectory?
A trajectory \tau is a sequence of states and actions in the world, \tau = (s_0, a_0, s_1, a_1, \ldots). Trajectories are also called rollouts or episodes.
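For example, a rollout-collection sketch, assuming the classic Gym API and a placeholder policy(obs) -> action function:

```python
# Collect one trajectory (rollout) as a list of (state, action) pairs.
# Assumes the classic Gym API and a placeholder policy(obs) -> action function.
def collect_trajectory(env, policy):
    trajectory = []
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        trajectory.append((obs, action))
        obs, reward, done, info = env.step(action)
    return trajectory
```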
What does the reward function depend on?
The current state, the action taken, and the next state: r_t = R(s_t, a_t, s_{t+1})
What is the difference between model-free and model-based?
Model-free algorithms do not have access to a model of the environment, i.e., a function that predicts state transitions and rewards, whereas model-based algorithms use such a model.
What are the two main approaches to training in model-free? What is the difference?
Policy Optimization and Q-Learning. Policy optimization methods optimize the policy parameters \theta directly, by gradient ascent on the performance objective J(\pi_{\theta}) or on a local approximation of it (typically on-policy). Q-learning methods train an approximator Q_{\theta} for the optimal action-value function (typically off-policy).
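Illustrative loss sketches of the two approaches, assuming PyTorch tensors of matching shapes (not a full training loop):

```python
import torch

# Policy optimization (on-policy): push up the log-probabilities of actions
# in proportion to the return that followed them (REINFORCE-style surrogate loss).
def policy_gradient_loss(log_probs, returns):
    # log_probs: log pi_theta(a_t | s_t) for the actions taken; returns: observed returns
    return -(log_probs * returns).mean()

# Q-learning (off-policy): regress Q_theta(s, a) toward the Bellman target
# r + gamma * max_a' Q_theta(s', a').
def q_learning_loss(q_sa, rewards, next_q_all, gamma=0.99):
    # q_sa: Q-values of the actions actually taken; next_q_all: Q-values over all next actions
    targets = rewards + gamma * next_q_all.max(dim=1).values.detach()
    return ((q_sa - targets) ** 2).mean()
```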
What are the Trade-offs Between Policy Optimization and Q-Learning?
The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training Q_{\theta} to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.
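For reference, the self-consistency condition referred to here is the standard Bellman equation for the optimal action-value function, Q^*(s,a) = E_{s' \sim P}[ r(s,a) + \gamma \max_{a'} Q^*(s',a') ], which Q_{\theta} is trained to approximately satisfy.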
How would you set up Reset Free Learning?
Learn a perturbation controller that is trained to take the agent to less-explored states of the world. During training of the actual task policy, alternate between running episodes of the perturbation controller and episodes of the task policy, and train both simultaneously.
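A heavily hedged sketch of that alternating scheme; all names and the act/store/update interface are hypothetical placeholders, not the original method's code:

```python
# All names and the act/store/update interface below are hypothetical placeholders.
def reset_free_training(env, task_policy, perturbation_controller, num_rounds, episode_len):
    obs = env.reset()  # a single reset at the very start; no resets during training
    for round_idx in range(num_rounds):
        # Alternate: even rounds run the task policy, odd rounds run the perturbation controller.
        actor = task_policy if round_idx % 2 == 0 else perturbation_controller
        for _ in range(episode_len):
            action = actor.act(obs)
            obs, reward, done, info = env.step(action)
            actor.store(obs, action, reward)
        # Both the task policy and the perturbation controller are trained simultaneously.
        task_policy.update()
        perturbation_controller.update()
    return task_policy
```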
What is a method to estimate the state and improve the speed of RL training?
Observations of the environment taken from on-board sensors, for example an RGB camera, are often high-dimensional, which can make reinforcement learning difficult and slow. To address this, unsupervised representation learning techniques can be used to condense images into latent features. Ideally, the latent features retain the key information while making the learning problem much easier. While many representation learning methods could be used, one option is a variational autoencoder (Kingma et al., 2013) for feature learning.
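A minimal sketch of such an encoder, assuming PyTorch; the architecture, sizes, and use of the latent mean as the state are illustrative rather than any paper's exact setup:

```python
import torch
import torch.nn as nn

# Minimal VAE-style encoder: compresses an image observation into a low-dimensional
# latent feature vector that can serve as the RL state. Sizes are illustrative.
class ImageEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.mu = nn.LazyLinear(latent_dim)       # mean of the latent Gaussian
        self.log_var = nn.LazyLinear(latent_dim)  # log-variance of the latent Gaussian

    def forward(self, image):
        h = self.conv(image)
        return self.mu(h), self.log_var(h)

encoder = ImageEncoder()
image = torch.zeros(1, 3, 64, 64)    # dummy RGB observation
mu, log_var = encoder(image)
latent_state = mu                    # use the latent mean as the compressed state for RL
```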