Reinforcement Learning Flashcards

1
Q

What type is the observation (in the (observation, reward, done, info) tuple returned by the environment)?

A

An object representing the agent's view of the environment, such as camera pixel data or the joint angles of a robot.
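
A minimal interaction loop as a sketch, assuming the classic Gym API where env.step returns the 4-tuple (observation, reward, done, info); newer Gymnasium versions return five values:

    import gym  # classic Gym API assumed

    env = gym.make("CartPole-v1")
    observation = env.reset()   # initial observation (e.g. cart position, pole angle)
    done = False
    while not done:
        action = env.action_space.sample()   # random action, for illustration only
        # observation: array describing the environment; reward: float;
        # done: bool (episode over); info: dict of diagnostic data
        observation, reward, done, info = env.step(action)
    env.close()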

2
Q

What is a state?

A

A state s is a complete description of the world; no information about the world is hidden from the state.

3
Q

What is an observation?

A

A partial description of a state, which may omit information.

4
Q

How are states and observations represented?

A

Vectors, matrices, or higher-order tensors of real numbers (e.g. an RGB image as a tensor of pixel values, joint angles as a vector).
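
As a small illustration (the shapes and values below are made up), an image observation is naturally a 3-D tensor while a robot's proprioceptive state is a flat vector:

    import numpy as np

    rgb_frame = np.zeros((84, 84, 3), dtype=np.uint8)   # camera image: height x width x 3 channels
    joint_state = np.array([0.1, -0.5, 1.2, 0.0])       # joint angles/velocities as a flat vector
    print(rgb_frame.shape, joint_state.shape)            # (84, 84, 3) (4,)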

5
Q

What is fully observed?

A

When the agent is able to observe the complete state

6
Q

What is action space?

A

The set of all valid actions in a given environment (e.g. all legal moves in Go for AlphaGo). Action spaces can be discrete (a finite set of moves) or continuous, in which case actions are real-valued vectors.
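
A sketch of the two kinds of action space using Gym's space classes (the sizes and bounds are illustrative):

    import numpy as np
    from gym import spaces   # Gymnasium exposes the same classes

    discrete = spaces.Discrete(4)                                              # e.g. up/down/left/right
    continuous = spaces.Box(low=-1.0, high=1.0, shape=(6,), dtype=np.float32)  # e.g. 6 joint torques

    print(discrete.sample())     # an integer in {0, 1, 2, 3}
    print(continuous.sample())   # a real-valued vector of length 6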

7
Q

What is a policy?

A

A policy is a rule used by an agent to decide what actions to take; it is the agent's brain. Policies are typically parameterized, e.g. by the weights and biases of a neural network, and written \pi_{\theta}.

8
Q

What are the two types of stochastic policies and when are each used?

A

Categorical (discrete action spaces) and Diagonal Gaussian (continuous action spaces)
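
A minimal PyTorch sketch of the two policy heads; the logits, mean, and log-std below are placeholder values that would normally be produced by (or learned alongside) the policy network:

    import torch
    from torch.distributions import Categorical, Normal

    # Categorical policy (discrete actions): sample from logits.
    logits = torch.tensor([1.0, 0.2, -0.5])
    discrete_action = Categorical(logits=logits).sample()

    # Diagonal Gaussian policy (continuous actions): independent Normal per dimension.
    mean = torch.tensor([0.3, -0.1])
    log_std = torch.tensor([-0.5, -0.5])
    continuous_action = Normal(mean, log_std.exp()).sample()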

9
Q

What is a trajectory?

A

A trajectory \tau is a sequence of states and actions in the world, \tau = (s_0, a_0, s_1, a_1, ...). Trajectories are also called rollouts or episodes.
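
A sketch of collecting one trajectory, assuming the classic Gym API and a hypothetical policy(obs) -> action function:

    def collect_trajectory(env, policy):
        trajectory = []                      # list of (state, action, reward) tuples
        obs, done = env.reset(), False
        while not done:
            action = policy(obs)
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs
        return trajectory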

10
Q

What does the reward function depend on?

A

The current state, the action taken, and the next state: r_t = R(s_t, a_t, s_{t+1}).

11
Q

What is the difference between model-free and model-based?

A

Model-free does not have access to a model of the environment. It does not have a function that predicts state transitions and rewards like a model-based algorithm does.

12
Q

What are the two main approaches to training in model-free? What is the difference?

A

Policy Optimization and Q-Learning. Policy optimization directly optimizes the parameters of the policy \pi_{\theta} to maximize the performance objective J(\pi_{\theta}) (on-policy). Q-learning trains an approximator Q_{\theta}(s, a) for the optimal action-value function (off-policy).
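
For the Q-learning side, a minimal tabular sketch of the self-consistency (Bellman) update that deep variants approximate with Q_{\theta} (the environment loop and exploration strategy are assumed):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        target = r + gamma * np.max(Q[s_next])   # bootstrapped Bellman target
        Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
        return Q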

13
Q

What are the Trade-offs Between Policy Optimization and Q-Learning?

A

The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training Q_{\theta} to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.

14
Q

How would you set up Reset Free Learning?

A

Learn a perturbation controller that is trained to take the agent to less-explored states of the world. While training the actual task policy, alternate between running episodes of the perturbation controller and episodes of the policy, and train both simultaneously.
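
A rough sketch of that alternation; run_episode and both agents are hypothetical placeholders for whatever learner and data collection you use:

    def reset_free_training(env, task_policy, perturbation_controller, run_episode, iterations=1000):
        # Alternate episodes between the task policy and the perturbation controller,
        # training both, without ever resetting the environment.
        for i in range(iterations):
            agent = task_policy if i % 2 == 0 else perturbation_controller
            run_episode(env, agent)   # run_episode should also store data and update the agent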

15
Q

What is a method to estimate the state and improve the speed of RL training?

A

Observations of the environment taken from on-board sensors, for example an RGB camera, are often high-dimensional, which can make reinforcement learning difficult and slow. To address this, we use unsupervised representation learning techniques to condense images into latent features. Ideally, the latent features contain the key information while making the learning problem much easier. While many representation learning methods could be used, we explored the use of a variational autoencoder (VAE) (Kingma et al., 2013) for feature learning.
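
A hedged PyTorch sketch of such an encoder (layer sizes and latent dimension are illustrative, not the paper's architecture); the latent z would be fed to the policy instead of raw pixels:

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, latent_dim=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Flatten(),
            )
            self.mu = nn.LazyLinear(latent_dim)       # mean of the latent Gaussian
            self.log_var = nn.LazyLinear(latent_dim)  # log-variance of the latent Gaussian

        def forward(self, x):
            h = self.conv(x)
            mu, log_var = self.mu(h), self.log_var(h)
            z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
            return z, mu, log_var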

16
Q

What are classifier-based rewards?

A

To learn tasks with minimal human instrumentation in the learning process, we allow the learning system to assign itself reward based on a simple, pre-provided specification of the desired task by a human operator: the operator provides images that depict successful outcomes. Given these images, we learn a success classifier and use its likelihood to self-assign reward throughout learning. To train the classifier, the human-provided examples of success are treated as positive examples and the policy's own data is treated as negative examples.

With these ingredients put together, we have a real-world robotic RL (R3L) system that can learn tasks in environments without instrumentation or intervention. To train this system, the user just has to (1) provide success images of the task to be completed and (2) leave the system to train unattended. Finally, (3) the learned policy is able to successfully perform the task from any start state.
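
A minimal sketch of the self-assigned reward, where classifier is a hypothetical binary success classifier (outputting a logit) trained on the human-provided success images (positives) versus the policy's own observations (negatives):

    import torch

    def classifier_reward(classifier, observation):
        # Reward = predicted probability that the observation depicts success.
        with torch.no_grad():
            return torch.sigmoid(classifier(observation)).item()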

17
Q

What is the k-armed bandit problem?

A

A fixed, limited set of resources must be allocated among competing (alternative) choices in a way that maximizes the expected gain.

18
Q

What is the incremental update rule?

A

NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate). For sample-average action values this is Q_{n+1} = Q_n + (1/n)(R_n - Q_n).
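
A one-function sketch of the sample-average form of this rule for a single action's value estimate:

    def update_estimate(q, n, reward):
        # Equivalent to recomputing the mean of all rewards seen so far for this action.
        n += 1
        q += (1.0 / n) * (reward - q)   # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
        return q, n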

19
Q

What is a non-stationary bandit problem?

A

The distribution of rewards changes over time.
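
One standard remedy (an assumption beyond what the card states) is to replace the 1/n step size with a constant alpha, giving an exponential recency-weighted average that tracks a drifting reward distribution:

    def update_nonstationary(q, reward, alpha=0.1):
        # Constant step size weights recent rewards more heavily than old ones.
        return q + alpha * (reward - q)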

20
Q

Why could the greedy action be bad?

A

The greedy action is based on the current value estimates, which are only estimates and may be wrong; exploring other actions is often better for long-term reward.

21
Q

What is epsilon-greedy?

A

Epsilon-greedy is a simple method to balance exploration and exploitation by choosing between them randomly: with probability \epsilon the agent explores (takes a random action), otherwise it exploits (takes the greedy action).
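
A minimal sketch of epsilon-greedy selection over an array Q of action-value estimates:

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(Q, epsilon=0.1):
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))   # explore: uniformly random action
        return int(np.argmax(Q))               # exploit: current greedy action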

22
Q

What are optimistic initial values?

A

Initialize the action-value estimates optimistically high, which drives early exploration; the drawback is that it is hard to know how high to set them.
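
For example (the value 5.0 is arbitrary and assumes true rewards sit well below it):

    import numpy as np

    # Optimistic initialization for a 10-armed bandit: every action initially looks
    # great, so each one disappoints when tried, pushing the agent to try the rest.
    Q = np.full(10, 5.0)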

23
Q

What is upper-confidence-bound (UCB) action selection?

A

Picking the action with the highest upper confidence bound on its value estimate, A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]. This combines exploration and exploitation in a single rule.
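
A UCB1-style sketch, where Q holds value estimates, N holds action counts, t is the timestep, and c controls exploration strength:

    import numpy as np

    def ucb_select(Q, N, t, c=2.0):
        untried = np.where(N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])              # try every action at least once
        bonus = c * np.sqrt(np.log(t) / N)      # shrinks as an action is tried more often
        return int(np.argmax(Q + bonus))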

24
Q

What are the shifts in priorities for real-world RL?

A

Generalization matters as much as temporal credit assignment; the environment cannot be fully controlled; learning must work from features/observations rather than the true state; and the performance of every policy during learning matters, not just the last policy.

25
Q

What is the action value?

A

The expected reward when taking that action: q_*(a) = E[R_t | A_t = a].

26
Q

What is a Markov decision process (MDP)?

A

A framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. The state transitions are probability distributions, and MDPs are used for sequential decision making.
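
A toy two-state MDP written out explicitly (all numbers are made up), where P[state][action] lists (probability, next_state, reward) outcomes:

    P = {
        "s0": {"stay": [(1.0, "s0", 0.0)],
               "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},   # outcome partly random
        "s1": {"stay": [(1.0, "s1", 0.0)],
               "go":   [(1.0, "s0", 0.0)]},
    }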

27
Q

Give examples of S, A, and r for a pick-and-place robot.

A

S: joint angles and velocities
A: voltage applied to motor
r: +100 when placed, -1 for energy consumed

28
Q

What is the difference between episodic and continuing tasks?

A

Episodic tasks end in a terminal state, and the next episode does not depend on the previous one.

Continuing tasks have no terminal state (e.g. a thermostat running indefinitely).

29
Q

How are returns kept finite in continuing tasks?

A

Discounting: with a discount factor 0 <= \gamma < 1, the return G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} is a convergent geometric series, so it stays finite.
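
A small sketch of computing a discounted return; with gamma < 1 the terms shrink geometrically, so the sum stays finite even for very long (continuing) reward streams:

    def discounted_return(rewards, gamma=0.99):
        G = 0.0
        for r in reversed(rewards):   # G_t = r_t + gamma * G_{t+1}
            G = r + gamma * G
        return G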