Reinforcement Learning Flashcards

1
Q

What type is the observation (in the (observation, reward, done, info) tuple returned by the environment)?

A

An object representing the agent's view of the environment, such as camera pixel data or the joint angles of a robot.
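
A minimal interaction loop as a sketch, assuming the classic Gym API where env.step returns the 4-tuple (observation, reward, done, info); newer Gymnasium versions return five values:

    import gym  # classic Gym API assumed

    env = gym.make("CartPole-v1")
    observation = env.reset()   # initial observation (e.g. cart position, pole angle)
    done = False
    while not done:
        action = env.action_space.sample()   # random action, for illustration only
        # observation: array describing the environment; reward: float;
        # done: bool (episode over); info: dict of diagnostic data
        observation, reward, done, info = env.step(action)
    env.close()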

2
Q

What is a state?

A

A state s is a complete description of the world; no information about the world is hidden from the state.

3
Q

What is an observation?

A

A partial description of a state, which may omit information.

4
Q

How are states and observations represented?

A

Vectors, matrices, or higher-order tensors of real numbers (e.g. an RGB image as a tensor of pixel values, joint angles as a vector).
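
As a small illustration (the shapes and values below are made up), an image observation is naturally a 3-D tensor while a robot's proprioceptive state is a flat vector:

    import numpy as np

    rgb_frame = np.zeros((84, 84, 3), dtype=np.uint8)   # camera image: height x width x 3 channels
    joint_state = np.array([0.1, -0.5, 1.2, 0.0])       # joint angles/velocities as a flat vector
    print(rgb_frame.shape, joint_state.shape)            # (84, 84, 3) (4,)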

5
Q

What is fully observed?

A

When the agent is able to observe the complete state

6
Q

What is action space?

A

The set of all valid actions in a given environment (e.g. all legal moves in Go for AlphaGo). Action spaces can be discrete (a finite set of moves) or continuous, in which case actions are real-valued vectors.
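
A sketch of the two kinds of action space using Gym's space classes (the sizes and bounds are illustrative):

    import numpy as np
    from gym import spaces   # Gymnasium exposes the same classes

    discrete = spaces.Discrete(4)                                              # e.g. up/down/left/right
    continuous = spaces.Box(low=-1.0, high=1.0, shape=(6,), dtype=np.float32)  # e.g. 6 joint torques

    print(discrete.sample())     # an integer in {0, 1, 2, 3}
    print(continuous.sample())   # a real-valued vector of length 6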

7
Q

What is a policy?

A

A policy is a rule used by an agent to decide what actions to take; it is the agent's brain. Policies are typically parameterized, e.g. by the weights and biases of a neural network, and written \pi_{\theta}.

8
Q

What are the two types of stochastic policies and when are each used?

A

Categorical (discrete action spaces) and Diagonal Gaussian (continuous action spaces)
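
A minimal PyTorch sketch of the two policy heads; the logits, mean, and log-std below are placeholder values that would normally be produced by (or learned alongside) the policy network:

    import torch
    from torch.distributions import Categorical, Normal

    # Categorical policy (discrete actions): sample from logits.
    logits = torch.tensor([1.0, 0.2, -0.5])
    discrete_action = Categorical(logits=logits).sample()

    # Diagonal Gaussian policy (continuous actions): independent Normal per dimension.
    mean = torch.tensor([0.3, -0.1])
    log_std = torch.tensor([-0.5, -0.5])
    continuous_action = Normal(mean, log_std.exp()).sample()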

9
Q

What is a trajectory?

A

A trajectory \tau is a sequence of states and actions in the world, \tau = (s_0, a_0, s_1, a_1, ...). Trajectories are also called rollouts or episodes.
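
A sketch of collecting one trajectory, assuming the classic Gym API and a hypothetical policy(obs) -> action function:

    def collect_trajectory(env, policy):
        trajectory = []                      # list of (state, action, reward) tuples
        obs, done = env.reset(), False
        while not done:
            action = policy(obs)
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs
        return trajectory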

10
Q

What does the reward function depend on?

A

The current state, the action taken, and the next state: r_t = R(s_t, a_t, s_{t+1}).

11
Q

What is the difference between model-free and model-based?

A

Model-free does not have access to a model of the environment. It does not have a function that predicts state transitions and rewards like a model-based algorithm does.

12
Q

What are the two main approaches to training in model-free? What is the difference?

A

Policy Optimization and Q-Learning. Policy optimization directly optimizes the parameters of the policy \pi_{\theta} to maximize the performance objective J(\pi_{\theta}) (on-policy). Q-learning trains an approximator Q_{\theta}(s, a) for the optimal action-value function (off-policy).
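
For the Q-learning side, a minimal tabular sketch of the self-consistency (Bellman) update that deep variants approximate with Q_{\theta} (the environment loop and exploration strategy are assumed):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        target = r + gamma * np.max(Q[s_next])   # bootstrapped Bellman target
        Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
        return Q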

13
Q

What are the Trade-offs Between Policy Optimization and Q-Learning?

A

The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training Q_{\theta} to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.

14
Q

How would you set up Reset Free Learning?

A

Learn a perturbation controller that is trained to take the agent to less-explored states of the world. While training the actual task policy, alternate between running episodes of the perturbation controller and episodes of the policy, and train both simultaneously.
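
A rough sketch of that alternation; run_episode and both agents are hypothetical placeholders for whatever learner and data collection you use:

    def reset_free_training(env, task_policy, perturbation_controller, run_episode, iterations=1000):
        # Alternate episodes between the task policy and the perturbation controller,
        # training both, without ever resetting the environment.
        for i in range(iterations):
            agent = task_policy if i % 2 == 0 else perturbation_controller
            run_episode(env, agent)   # run_episode should also store data and update the agent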

15
Q

What is a method to estimate the state and improve the speed of RL training?

A

Observations of the environment taken from on-board sensors, for example an RGB camera, are often high-dimensional, which can make reinforcement learning difficult and slow. To address this, we use unsupervised representation learning techniques to condense images into latent features. Ideally, the latent features contain the key information while making the learning problem much easier. While many representation learning methods could be used, we explored the use of a variational autoencoder (VAE) (Kingma et al., 2013) for feature learning.
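
A hedged PyTorch sketch of such an encoder (layer sizes and latent dimension are illustrative, not the paper's architecture); the latent z would be fed to the policy instead of raw pixels:

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, latent_dim=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Flatten(),
            )
            self.mu = nn.LazyLinear(latent_dim)       # mean of the latent Gaussian
            self.log_var = nn.LazyLinear(latent_dim)  # log-variance of the latent Gaussian

        def forward(self, x):
            h = self.conv(x)
            mu, log_var = self.mu(h), self.log_var(h)
            z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
            return z, mu, log_var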

16
Q

What are classifier-based rewards?

A

To learn tasks with minimal human instrumentation in the learning process, we allow the learning system to assign itself reward based on a simple, pre-provided specification of the desired task by a human operator: the operator provides images that depict successful outcomes. Given these images, we learn a success classifier and use its likelihood to self-assign reward throughout learning. To train the classifier, the human-provided examples of success are treated as positive examples and the policy's own data is treated as negative examples.

With these ingredients put together, we have a real-world robotic RL (R3L) system that can learn tasks in environments without instrumentation or intervention. To train this system, the user just has to (1) provide success images of the task to be completed and (2) leave the system to train unattended. Finally, (3) the learned policy is able to successfully perform the task from any start state.
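
A minimal sketch of the self-assigned reward, where classifier is a hypothetical binary success classifier (outputting a logit) trained on the human-provided success images (positives) versus the policy's own observations (negatives):

    import torch

    def classifier_reward(classifier, observation):
        # Reward = predicted probability that the observation depicts success.
        with torch.no_grad():
            return torch.sigmoid(classifier(observation)).item()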

17
Q

What is the k-armed bandit problem?

A

A fixed, limited set of resources must be allocated among competing (alternative) choices in a way that maximizes the expected gain.

18
Q

What is the incremental update rule?

A

NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate). For sample-average action values this is Q_{n+1} = Q_n + (1/n)(R_n - Q_n).
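
A one-function sketch of the sample-average form of this rule for a single action's value estimate:

    def update_estimate(q, n, reward):
        # Equivalent to recomputing the mean of all rewards seen so far for this action.
        n += 1
        q += (1.0 / n) * (reward - q)   # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
        return q, n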

19
Q

What is a non-stationary bandit problem?

A

The distribution of rewards changes over time.
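
One standard remedy (an assumption beyond what the card states) is to replace the 1/n step size with a constant alpha, giving an exponential recency-weighted average that tracks a drifting reward distribution:

    def update_nonstationary(q, reward, alpha=0.1):
        # Constant step size weights recent rewards more heavily than old ones.
        return q + alpha * (reward - q)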

20
Q

Why could the greedy action be bad?

A

The greedy action is based on the current value estimates, which are only estimates and may be wrong; exploring other actions is often better for long-term reward.

21
Q

What is epsilon-greedy?

A

Epsilon-greedy is a simple method to balance exploration and exploitation by choosing between them randomly: with probability \epsilon the agent explores (takes a random action), otherwise it exploits (takes the greedy action).
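
A minimal sketch of epsilon-greedy selection over an array Q of action-value estimates:

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(Q, epsilon=0.1):
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))   # explore: uniformly random action
        return int(np.argmax(Q))               # exploit: current greedy action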

22
Q

What are optimistic initial values?

A

Initialize the action-value estimates optimistically high, which drives early exploration; the drawback is that it is hard to know how high to set them.
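
For example (the value 5.0 is arbitrary and assumes true rewards sit well below it):

    import numpy as np

    # Optimistic initialization for a 10-armed bandit: every action initially looks
    # great, so each one disappoints when tried, pushing the agent to try the rest.
    Q = np.full(10, 5.0)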

23
Q

What is upper-confidence-bound (UCB) action selection?

A

Picking the action with the highest upper confidence bound on its value estimate, A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]. This combines exploration and exploitation in a single rule.
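
A UCB1-style sketch, where Q holds value estimates, N holds action counts, t is the timestep, and c controls exploration strength:

    import numpy as np

    def ucb_select(Q, N, t, c=2.0):
        untried = np.where(N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])              # try every action at least once
        bonus = c * np.sqrt(np.log(t) / N)      # shrinks as an action is tried more often
        return int(np.argmax(Q + bonus))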

24
Q

What are the shifts in priorities for real-world RL?

A

Generalization matters as much as temporal credit assignment; the environment cannot be fully controlled; learning must work from features/observations rather than the true state; and the performance of every policy during learning matters, not just the last policy.

25
Q

What is the action value?

A

The expected reward when taking that action: q_*(a) = E[R_t | A_t = a].

26
Q

What is a Markov decision process (MDP)?

A

A framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. The state transitions are probability distributions, and MDPs are used for sequential decision making.
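
A toy two-state MDP written out explicitly (all numbers are made up), where P[state][action] lists (probability, next_state, reward) outcomes:

    P = {
        "s0": {"stay": [(1.0, "s0", 0.0)],
               "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},   # outcome partly random
        "s1": {"stay": [(1.0, "s1", 0.0)],
               "go":   [(1.0, "s0", 0.0)]},
    }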

27
Q

Give examples of S, A, and r for a pick-and-place robot.

A

S: joint angles and velocities
A: voltage applied to motor
r: +100 when placed, -1 for energy consumed

28
Q

What is the difference between episodic and continuing tasks?

A

Episodic tasks end in a terminal state, and the next episode does not depend on the previous one.

Continuing tasks have no terminal state (e.g. a thermostat running indefinitely).

29
Q

How are returns kept finite in continuing tasks?

A

Discounting: with a discount factor 0 <= \gamma < 1, the return G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} is a convergent geometric series, so it stays finite.
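
A small sketch of computing a discounted return; with gamma < 1 the terms shrink geometrically, so the sum stays finite even for very long (continuing) reward streams:

    def discounted_return(rewards, gamma=0.99):
        G = 0.0
        for r in reversed(rewards):   # G_t = r_t + gamma * G_{t+1}
            G = r + gamma * G
        return G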