0 - Terminology & Definitions Flashcards

1
Q

Actions

A

Actions are the Agent’s means of interacting with and changing its Environment, and thus of transitioning between states. Every action performed by the Agent yields a reward from the Environment. The decision of which action to choose is made by the policy.

2
Q

Actor-Critic

A

When attempting to solve a Reinforcement Learning problem, there are two main methods one can choose from: calculating the Value Functions or Q-Values of each state and choosing actions according to those, or directly computing a policy which defines the probabilities with which each action should be taken given the current state, and acting according to it. Actor-Critic algorithms combine the two methods in order to create a more robust approach.

3
Q

Advantage Function

A

Usually denoted as A(s,a), the Advantage function is a measure of how good or bad a certain action is as a decision in a certain state; more simply, it is the advantage of selecting a certain action from a certain state. It is defined mathematically as:

A(s,a) = E[r(s,a) - r(s)]

where r(s,a) is the expected reward of action a from state s, and r(s) is the expected reward of the entire state s, before an action was selected. It can also be viewed as:

A(s,a) = Q(s,a) - V(s)

where Q(s,a) is the Q Value and V(s) is the Value function.
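
A minimal sketch of this relationship in code, using made-up tabular values for Q and V (the state and action names are invented for the example):

```python
# Minimal sketch: computing advantages from hypothetical tabular Q and V values.
Q = {("s0", "left"): 1.0, ("s0", "right"): 3.0}   # assumed Q-Values for state s0
V = {"s0": 2.0}                                    # assumed Value of state s0

def advantage(state, action):
    """A(s,a) = Q(s,a) - V(s)."""
    return Q[(state, action)] - V[state]

print(advantage("s0", "left"))    # -1.0: worse than the state's average
print(advantage("s0", "right"))   #  1.0: better than the state's average
```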

4
Q

Agent

A

The learning and acting part of a Reinforcement Learning problem, which tries to maximize the rewards it is given by the Environment. Put simply, the Agent is the model you are trying to design.

5
Q

Bandits

A

Formally named “k-Armed Bandits” after the nickname “one-armed bandit” given to slot machines, these are considered the simplest type of Reinforcement Learning tasks. Bandits have no different states, but only one, and only the immediate reward is taken under consideration. Hence, bandits can be thought of as having single-state episodes. Each of the k arms is considered an action, and the objective is to learn the policy which will maximize the expected reward after each action (or arm-pull).

6
Q

Contextual Bandits

A

Contextual Bandits are a slightly more complex task, where each state may be different and affect the outcome of the actions; hence the context is different each time. Still, the task remains a single-state episodic task, and one context cannot have an influence on others.

7
Q

Bellman Equation

A

Formally, the Bellman equation defines the relationship between a given state (or state-action pair) and its successors. While many forms exist, the one most commonly encountered in Reinforcement Learning tasks is the Bellman equation for the optimal Q-Value, which is given by:

Q*(s,a) = E[ r(s,a) + γ·max Q*(s’,a’) ]

(where the expectation is taken over the possible next states s’, and the max is taken over the next actions a’)

…or when no uncertainty exists (meaning, probabilities are either 1 or 0):

Q*(s,a) = r(s,a) + γ·max Q*(s’,a’)

where the asterisk indicates the optimal value. Some algorithms, such as Q-Learning, base their learning procedure on it.
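
As an illustration, here is a hedged sketch that repeatedly applies the deterministic form above to a tiny, made-up MDP (all states, actions and rewards are invented for the example) until the Q-Values converge:

```python
# Sketch: solving for Q* on a small, hypothetical deterministic MDP by
# repeatedly applying Q*(s,a) = r(s,a) + γ·max_a' Q*(s',a').
gamma = 0.9

# (state, action) -> (reward, next_state); "T" is a terminal state.
transitions = {
    ("s0", "a"): (0.0, "s1"),
    ("s0", "b"): (1.0, "T"),
    ("s1", "a"): (5.0, "T"),
}

Q = {sa: 0.0 for sa in transitions}

def max_next_q(state):
    vals = [Q[(s, a)] for (s, a) in Q if s == state]
    return max(vals) if vals else 0.0  # terminal (or unseen) states contribute 0

for _ in range(100):  # more than enough sweeps for this tiny example to converge
    for (s, a), (r, s_next) in transitions.items():
        Q[(s, a)] = r + gamma * max_next_q(s_next)

print(Q)  # Q*(s0,a)=4.5, Q*(s0,b)=1.0, Q*(s1,a)=5.0
```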

8
Q

Continuous Tasks

A

Reinforcement Learning tasks which are not made of episodes, but rather last forever. These tasks have no terminal states. For simplicity, they are usually assumed to consist of one never-ending episode.

9
Q

Deep Reinforcement Learning

A

The use of a Reinforcement Learning algorithm with a deep neural network as an approximator for the learning part. This is usually done in order to cope with problems where the number of possible states and actions scales quickly, and an exact solution is no longer feasible.

10
Q

Discount Factor (γ)

A

The discount factor, usually denoted as γ, is a factor multiplying the future expected reward, and takes a value in the range [0,1]. It controls the importance of future rewards versus immediate ones. The lower the discount factor, the less important future rewards are, and the Agent will tend to focus on actions which yield immediate rewards only.
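
A short illustration of the effect, using an assumed reward sequence (the numbers are made up for the example):

```python
# Sketch: how γ changes the discounted sum of an assumed reward sequence.
rewards = [1.0, 1.0, 1.0, 10.0]  # a large reward arriving three steps later

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.99))  # ≈ 12.67: the future reward still matters
print(discounted_return(rewards, 0.10))  # ≈ 1.12:  the Agent is nearly myopic
```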

11
Q

Environment

A

Everything which isn’t the Agent; everything the Agent can interact with, either directly or indirectly. The environment changes as the Agent performs actions; every such change is considered a state-transition. Every action the Agent performs yields a reward received by the Agent.

12
Q

Episode

A

All states that come between an initial state and a terminal state; for example, one game of Chess. The Agent’s goal is to maximize the total reward it receives during an episode. In situations where there is no terminal state, we consider an infinite episode. It is important to remember that different episodes are completely independent of one another.

13
Q

Episodic Tasks

A

Reinforcement Learning tasks which are made of different episodes (meaning, each episode has a terminal state).

14
Q

Expected Return

A

Sometimes referred to as the “overall reward” and occasionally denoted as G, the expected return is the expected cumulative reward over an entire episode.

15
Q

Experience Replay

A

As Reinforcement Learning tasks have no pre-generated training sets to learn from, the Agent must keep records of all the state-transitions it encountered so it can learn from them later. The memory buffer used to store these is often referred to as Experience Replay. There are several types and architectures of these memory buffers, but two very common ones are the cyclic memory buffer (which makes sure the Agent keeps training on its recent behavior rather than on things that might no longer be relevant) and the reservoir-sampling-based memory buffer (which guarantees every recorded state-transition has an equal probability of being inserted into the buffer).
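
Hedged sketches of the two buffer types mentioned above; the class names, capacity and transition layout are illustrative assumptions, not part of any particular library:

```python
import random
from collections import deque

# Cyclic buffer: old transitions are overwritten once capacity is reached.
class CyclicReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):           # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

# Reservoir-sampling buffer: every transition ever seen has an equal
# probability of ending up in the buffer.
class ReservoirReplayBuffer:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def store(self, transition):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = transition

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```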

16
Q

Exploitation & Exploration

A

Reinforcement Learning tasks have no pre-generated training sets to learn from; the Agent creates its own experience and learns “on the fly”. To be able to do so, the Agent needs to try many different actions in many different states, in order to learn all the available possibilities and find the path which will maximize its overall reward; this is known as Exploration, as the Agent explores the Environment. On the other hand, if all the Agent does is explore, it will never maximize the overall reward; it must also use the information it has learned. This is known as Exploitation, as the Agent exploits its knowledge to maximize the rewards it receives.
The trade-off between the two is one of the greatest challenges of Reinforcement Learning problems, as the two must be balanced so the Agent both explores the Environment enough and exploits what it has learned, repeating the most rewarding path it found.

17
Q

Greedy Policy, ε-Greedy Policy

A

A greedy policy means the Agent constantly performs the action believed to yield the highest expected reward. Obviously, such a policy will not allow the Agent to explore at all. In order to still allow some exploration, an ε-greedy policy is often used instead: a number ε in the range [0,1] is selected, and prior to selecting an action, a random number in the range [0,1] is drawn. If that number is larger than ε, the greedy action is selected, but if it is lower, a random action is selected. Note that if ε=0, this becomes the greedy policy, and if ε=1, the Agent always explores.
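
A minimal sketch of the selection rule described above, assuming a tabular Q-function for the current state (the action names and values are made up):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated Q-Value for the current state."""
    if random.random() < epsilon:               # explore: pick a random action
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)      # exploit: pick the greedy action

# Example: with ε = 0.1 the greedy action "b" is chosen roughly 90% of the time.
print(epsilon_greedy({"a": 0.2, "b": 1.5}, epsilon=0.1))
```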

18
Q

Markov Decision Process (MDP)

A

The Markov Property means that each state depends solely on its preceding state, the action selected from that state and the reward received immediately after that action was executed. Mathematically, it means s’ = s’(s,a,r), where s’ is the future state, s is its preceding state and a and r are the action and reward. No prior knowledge of what happened before s is needed; the Markov Property assumes that s holds all the relevant information within it. A Markov Decision Process is a decision process based on these assumptions.

19
Q

Model-Based & Model-Free

A

Model-based and model-free are two different approaches an Agent can take when trying to optimize its policy. This is best explained with an example: assume you are trying to learn how to play Blackjack. You can do so in two ways. One, you calculate in advance, before the game begins, the winning probabilities of all states and all the state-transition probabilities given all the possible actions, and then simply act according to your calculations. The second option is to simply play without any prior knowledge, and gain information using trial and error. Note that using the first approach, you are basically modeling your Environment, while the second approach requires no information about the Environment. This is exactly the difference between model-based and model-free: the first method is model-based, while the latter is model-free.

20
Q

Monte Carlo (MC)

A

Monte Carlo methods are algorithms which use repeated random sampling to achieve a result. They are used quite often in Reinforcement Learning algorithms to obtain expected values; for example, calculating a state’s Value function by returning to the same state over and over again and averaging the actual cumulative reward received each time.
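
A hedged sketch of the averaging idea, assuming a hypothetical, environment-specific function that plays one episode from a given state and returns the cumulative reward observed:

```python
# Sketch: Monte Carlo estimate of V(s) by averaging sampled episode returns.
# `run_episode_from` is an assumed callable that plays one full episode starting
# in `state` and returns the (discounted) return observed for that episode.
def mc_value_estimate(state, run_episode_from, n_episodes=1000):
    returns = [run_episode_from(state) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```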

21
Q

On-Policy & Off-Policy

A

Every Reinforcement Learning algorithm must follow some policy in order to decide which actions to perform in each state. Still, the learning procedure of the algorithm doesn’t have to take that policy into account while learning. Algorithms which account for the policy that yielded past state-action decisions are referred to as on-policy algorithms, while those ignoring it are known as off-policy.
A well-known off-policy algorithm is Q-Learning, as its update rule uses the action which will yield the highest Q-Value, while the policy actually followed might restrict that action or choose another. The on-policy variation of Q-Learning is known as Sarsa, where the update rule uses the action chosen by the followed policy.

22
Q

Policy (π)

A

The policy, denoted as π (or sometimes π(a|s)), is a mapping from some state s to the probabilities of selecting each possible action given that state. For example, a greedy policy outputs for every state the action with the highest expected Q-Value.

23
Q

Q-Learning

A

Q-Learning is an off-policy Reinforcement Learning algorithm, considered one of the most basic ones. In its most simplified form, it uses a table to store the Q-Values of all possible state-action pairs. It updates this table using the Bellman equation, while action selection is usually made with an ε-greedy policy.
In its simplest form (no uncertainties in state-transitions and expected rewards), the update rule of Q-Learning is:

Q(s,a) = r(s,a) + γ·max Q(s’,a’)
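
A hedged sketch of tabular Q-Learning using the more common learning-rate form of the update. The environment interface is assumed: `env.reset()` returns a state and `env.step(action)` returns `(next_state, reward, done)`; the hyperparameters are illustrative.

```python
import random
from collections import defaultdict

# Sketch: tabular Q-Learning on an assumed environment `env` with the
# interface described above; `actions` is the list of possible actions.
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # (state, action) -> Q-Value, 0 by default

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # ε-greedy behaviour policy
            action = random.choice(actions) if random.random() < epsilon else greedy(state)
            next_state, reward, done = env.step(action)
            # Off-policy update: bootstraps from the best next action,
            # regardless of which action the policy will actually take next.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```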

24
Q

Q Value (Q Function)

A

Usually denoted as Q(s,a) (sometimes with a π subscript, and sometimes as Q(s,a; θ) in Deep RL), the Q Value is a measure of the overall expected reward assuming the Agent is in state s, performs action a, and then continues playing until the end of the episode following some policy π. Its name is an abbreviation of the word “Quality”, and it is defined mathematically as:

Q(s,a) = E[ Σ γⁿ·rⁿ ],  summing over n = 0, …, N

where N is the number of states from state s until the terminal state, γ is the discount factor and r⁰ is the immediate reward received after performing action a in state s.

25
Q

REINFORCE Algorithms

A

REINFORCE algorithms are a family of Reinforcement Learning algorithms which update their policy parameters along the gradient of the expected reward with respect to those parameters. The name is typically written in capital letters only, as it was originally an acronym describing the design of the original group of algorithms: “REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility”.
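
A minimal, hedged illustration of this kind of update, using a stateless softmax policy on a made-up 3-armed bandit; the arm means, learning rate and iteration count are invented for the example, and this is a sketch rather than the original formulation:

```python
import numpy as np

# Sketch: REINFORCE-style update for a stateless softmax policy over 3 actions.
rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])   # hypothetical mean rewards of the arms
theta = np.zeros(3)                       # policy parameters
lr = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 1.0)
    # For a softmax policy, ∇θ log π(a) = one_hot(a) - probs (the "eligibility")
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * reward * grad_log_pi    # reward increment ∝ reward × eligibility

print(softmax(theta))  # probability mass should concentrate on the best arm
```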

26
Q

Reinforcement Learning (RL)

A

Reinforcement Learning is, like Supervised Learning and Unsupervised Learning, one of the main areas of Machine Learning and Artificial Intelligence. It is concerned with the learning process of an arbitrary being, formally known as the Agent, in the world surrounding it, known as the Environment. The Agent seeks to maximize the rewards it receives from the Environment, and performs different actions in order to learn how the Environment responds and gain more rewards. One of the greatest challenges of RL tasks is associating actions with postponed rewards, which are rewards received by the Agent long after the reward-generating action was made. RL is therefore heavily used to solve different kinds of games, from Tic-Tac-Toe, Chess and Atari 2600 all the way to Go and StarCraft.

27
Q

Reward

A

A numerical value received by the Agent from the Environment as a direct response to the Agent’s actions. The Agent’s goal is to maximize the overall reward it receives during an episode, so rewards are the motivation the Agent needs in order to behave in a desired way. All actions yield rewards, which can be roughly divided into three types: positive rewards, which emphasize a desired action; negative rewards, which emphasize an action the Agent should stray away from; and zero, which means the Agent didn’t do anything special or unique.

28
Q

SARSA

A

The Sarsa algorithm is essentially the Q-Learning algorithm with a slight modification that makes it an on-policy algorithm. The Q-Learning update rule is based on the Bellman equation for the optimal Q-Value, and so in the case of no uncertainties in state-transitions and expected rewards, the Q-Learning update rule is:

Q(s,a) = r(s,a) + γ·max Q(s’,a’)

In order to transform this into an on-policy algorithm, the max over the last term is dropped:

Q(s,a) = r(s,a) + γ·Q(s’,a’)

where here both actions a and a’ are chosen by the same policy. The name of the algorithm is derived from its update rule, which is based on (s,a,r,s’,a’), all coming from the same policy.
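
A hedged, side-by-side sketch of the two updates in their usual learning-rate form; the Q table layout, hyperparameters and argument names are assumptions for the example:

```python
# Sketch: Q-Learning vs. Sarsa updates. Q is a dict of (state, action) -> value.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstraps from the best next action, whatever the policy does.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstraps from a', the action actually chosen by the followed policy.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```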

29
Q

State

A

Every scenario the Agent encounters in the Environment is formally called a state. The Agent transitions between different states by performing actions. It is also worth mentioning the terminal states, which mark the end of an episode. There are no possible states after a terminal state has been reached, and a new episode begins. Quite often, a terminal state is represented as a special state where all actions transition to the same terminal state with reward 0.

30
Q

Temporal-Difference (TD)

A

Temporal Difference is a learning method which combines Dynamic Programming and Monte Carlo principles; it learns “on the fly” similarly to Monte Carlo, yet updates its estimates by bootstrapping like Dynamic Programming. One of the simplest Temporal Difference algorithms is known as one-step TD or TD(0). It updates the Value Function according to the following update rule:

V(sₜ) = V(sₜ) + α·[ rₜ₊₁ + γ·V(sₜ₊₁) − V(sₜ) ]

where V is the Value Function, s is the state, r is the reward, γ is the discount factor, α is a learning rate, t is the time-step and the ‘=’ sign is used as an update operator rather than equality. The term found in the square brackets is known as the temporal difference error.
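
A minimal sketch of the TD(0) update described above; the value table layout and hyperparameters are assumptions for the example:

```python
# Sketch: one-step TD (TD(0)) update of a tabular Value Function.
# V is a dict of state -> value estimate.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    bootstrap = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * bootstrap - V.get(s, 0.0)   # the temporal difference error
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```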

31
Q

Upper Confidence Bound (UCB)

A

UCB is an exploration method which tries to ensure that each action is well explored. Consider an exploration policy which is completely random, meaning each possible action has the same chance of being selected. There is a chance that some actions will be explored much more than others. The less an action is selected, the less confident the Agent can be about its expected reward, and its exploitation phase might be harmed. Exploration by UCB takes into account the number of times each action was selected, and gives extra weight to the less-explored ones. Formalizing this mathematically, the selected action is picked by:

a = argmaxₐ [ R(a) + c·√( ln(t) / N(a) ) ]

where R(a) is the expected overall reward of action a, t is the number of steps taken (how many actions were selected overall), N(a) is the number of times action a was selected and c is a configurable hyperparameter. This method is also sometimes referred to as “exploration through optimism”, as it gives less-explored actions a higher value, encouraging the model to select them.
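
A hedged sketch of the selection rule; the data structures (running mean rewards and selection counts) and the default value of c are assumptions for the example:

```python
import math

# Sketch: UCB action selection. `avg_reward[a]` is the running mean reward of
# action a, `counts[a]` is how many times a was selected, t is the total step count.
def ucb_select(avg_reward, counts, t, c=2.0):
    def score(a):
        if counts[a] == 0:
            return float("inf")            # make sure every action is tried at least once
        return avg_reward[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(avg_reward, key=score)
```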

32
Q

Value Function

A

Usually denoted as V(s) (sometimes with a π subscript), the Value function is a measure of the overall expected reward assuming the Agent is in state s and then continues playing until the end of the episode following some policy π. It is defined mathematically as:

V(s) = E[ Σ γⁿ·rⁿ ],  summing over n = 0, …, N

While this does seem similar to the definition of the Q Value, there is an implicit, yet important, difference: for n=0, the reward r⁰ of V(s) is the expected reward from just being in state s, before any action was played, while in the Q Value, r⁰ is the expected reward after a certain action was played. This difference also yields the Advantage function.

33
Q

Controller

A

Same as an agent: the system (such as a robot) that interacts with and acts on the environment.

34
Q

Episode

A

Playing out the whole sequence of states and actions until reaching the terminal state or a predefined maximum number of actions.

35
Q

Value function (state-value function)

A

The value function of a state V(s) is the total amount of expected rewards that an agent can collect from that state to the end of the episode.

36
Q

Action-value function

A

The action-value function Q(s, a) is the total amount of expected rewards of taking an action from the state until the end of the episode.

37
Q

Model (system dynamics)

A

A model describes how an environment may change upon taking an action from a state, p(s’ | a, s). This is the system dynamics: the laws of physics, or the rules of the game. For a mechanical problem, it is the dynamics model of how things move.

38
Q

Discount rate γ

A

The discount rate values future rewards at their present value. If γ is smaller than one, we value future rewards less at the current time.

39
Q

Model-Free RL

A

In model-free RL, the system dynamics is unknown or not needed to solve the task.

40
Q

Model-based RL

A

We use a known or learned model to plan the optimal controls that maximize rewards. Alternatively, we can collect those sampled controls to train a policy and hope that it generalizes to other tasks we have not trained on before.

41
Q

Monte Carlo Method

A

Monte Carlo methods play through complete episodes. They compute the average of the sampled returns from multiple episodes to estimate value functions, or use the following running average to update the result:

V(s) = V(s) + α·[ G − V(s) ]

where G is the return observed for the episode and α is a learning rate.

42
Q

Monte Carlo control

A

We use the Monte Carlo method to evaluate the Q-value function of the current policy, and find the optimal action by locating the action with the maximum Q-value.

43
Q

Actor-critic

A

Actor-critic combines the concepts of Policy Gradient and Value-learning in solving an RL task. We optimize the actor, which is based on Policy Gradient, to determine what actions to take based on observations. However, Policy Gradient often has a high variance in its gradient, which hurts convergence. We therefore introduce a critic to evaluate a trajectory. This critic makes use of past sampled experience and other techniques that reduce variance, which allows the actor to be trained with better convergence.

44
Q

Policy Gradients

A

We refine our policy by making actions more likely when we expect them to yield large expected rewards (and less likely otherwise).

45
Q

Natural Policy Gradients

A

It is similar to Policy Gradients, but Natural Policy Gradient uses a second-order optimization concept, which is more accurate yet more complex than Policy Gradient’s first-order optimization.

46
Q

Sample efficiency

A

Sample efficiency relates to how many data samples are needed to optimize or find the solution. Tasks requiring physical simulation can be expensive to sample, and therefore sample efficiency is an important evaluation factor in selecting RL algorithms.

47
Q

On-policy learning vs. off-policy learning

A

In on-policy learning, we optimize the current policy and use it to determine what spaces and actions to explore and sample next. Since the current policy is not optimized in early training, a stochastic policy allows some form of exploration.

Off-policy learning allows a second policy. This policy can be used to enhance how exploration is done, but its key purpose is to collect samples. Off-policy learning gives more control over how we explore the unknown, and allows the use of older samples in the calculation. The latter improves sample efficiency, since we don’t need to recollect samples whenever the policy is changed.

48
Q

Markov Decision Process (MDP)

A

It is composed of states, actions, the model P, rewards and the discount factor. Our objective is to find a policy that maximizes the expected rewards.

49
Q

Partially Observable Markov Decision Process POMDP

A

Not all states are observable. If we have enough state information, we can solve the MDP using the states we have (π(s)). Otherwise, we have to derive our policy based on those observables (π(o)).

50
Q

Temporal-Difference Learning (TD)

A

Instead of completing the whole episode like Monte Carlo (calculating up to the end or terminal state), we roll out k steps and collect the rewards. We estimate the value function based on the collected rewards and the value function after k steps. Below is 1-step TD learning: we find out the reward after taking one action, and use the following running average for V:

V(s) = V(s) + α·[ r + γ·V(s’) − V(s) ]

51
Q

Compare TD vs Monte Carlo vs Dynamic Programming vs Exhaustive Search

A

(Comparison figure not reproduced here.)

52
Q

Planning

A

We use the system dynamics (the model) to generate simulated experience and use it to refit the value functions or the policy.

The difference between learning and planning is that learning uses real experience generated by the environment, while planning uses simulated experience generated by a model.

53
Q

Q-learning

A

We learn the Q-value function by first taking an action (under a policy like ε-greedy) and observing the reward R. Then we bootstrap from the next action with the best Q-value to update the Q-value function.

54
Q

Trajectory optimization

A

Finding the state and action sequence that minimizes a cost function.

55
Q

Open-loop system

A

We observe the initial state of a system and plan actions to minimize a cost function.

56
Q

Closed-loop system

A

We observe the initial state of a system and plan the actions. But during the course, we can observe the next state and readjust our actions. For a stochastic model, this allows us to readjust the response based on what actually occurred. Hence, a closed-loop system can be optimized better than an open-loop system.

57
Q

Shooting methods

A

Optimize the trajectory based on an open-loop system: observe the initial (first) state and optimize the corresponding actions.

For a stochastic system, this is suboptimal because we do not readjust the actions based on the states we actually transition to.

58
Q

Collocation methods

A

Optimize the trajectory based on a closed-loop system, in which we take actions based on the observed states. We manipulate both actions and states in optimizing the cost function.

59
Q

Imitation learning

A

Imitate how an expert acts. The expert can be a human or a program which produces quality samples for the model to learn from and generalize.

60
Q

Inverse reinforcement learning

A

Try to model a reward function (for example, using a deep network) from expert demonstrations, so we can backpropagate rewards to improve the policy.

61
Q

Taylor series

A

f(x) = f(a) + f’(a)·(x − a) + (f’’(a)/2!)·(x − a)² + (f’’’(a)/3!)·(x − a)³ + …

62
Q

Taylor Series (in vector form):

A

In vector form:

f(x) ≈ f(x₀) + gᵀ·(x − x₀) + ½·(x − x₀)ᵀ·H·(x − x₀)

where g is the Jacobian matrix and H is the Hessian matrix.
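
A small numeric check of the second-order expansion on a made-up function; the function, expansion point and step are all invented for the example:

```python
import numpy as np

# Sketch: second-order Taylor approximation of f(x) = x0**2 + x0*x1 + exp(x1)
# around the point x0_point, compared against the true value for a small step.
def f(x):
    return x[0] ** 2 + x[0] * x[1] + np.exp(x[1])

x0_point = np.array([1.0, 0.0])
g = np.array([2 * x0_point[0] + x0_point[1],             # gradient at x0_point
              x0_point[0] + np.exp(x0_point[1])])
H = np.array([[2.0, 1.0],                                 # Hessian at x0_point
              [1.0, np.exp(x0_point[1])]])

dx = np.array([0.1, -0.05])
taylor = f(x0_point) + g @ dx + 0.5 * dx @ H @ dx
print(f(x0_point + dx), taylor)   # the two values should be close for a small step
```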

63
Q

Jacobian matrix

A

In vector calculus, the Jacobian matrix (/dʒəˈkoʊbiən/, /dʒɪ-, jɪ-/) of a vector-valued function of several variables is the matrix of all its first-order partial derivatives.

The Jacobian matrix represents the differential of f at every point where f is differentiable.

64
Q

Hessian matrix

A

In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables.

65
Q

KL-divergence

A

In deep learning, we want a model that predicts a distribution Q resembling the distribution P observed in the data. The difference between two probability distributions can be measured by the KL Divergence.
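
A minimal sketch of the discrete form, KL(P‖Q) = Σ P(x)·log(P(x)/Q(x)), using two made-up distributions:

```python
import math

# Sketch: KL divergence between two made-up discrete distributions.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]   # "true" data distribution (assumed)
Q = [0.4, 0.4, 0.2]   # model's predicted distribution (assumed)
print(kl_divergence(P, Q))   # ≈ 0.025; it is zero only when P and Q are identical
```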

66
Q

Positive-definite matrix

A

Matrix A is positive-definite if

zᵀAz > 0

for all real non-zero vectors z.

I.e., a positive-definite matrix is a symmetric matrix where every eigenvalue is positive.

Intuitively: take a vector z, which points in a certain direction. When we multiply the matrix A by z, the result generally no longer points in the same direction; the direction of z is transformed by A.

If A is a positive-definite matrix, the new direction will always point in “the same general” direction (here “the same general” means less than a π/2 angle change). It won’t reverse the original direction (i.e., change it by more than 90 degrees).

Why do we want positive-definite matrices?

Wouldn’t it be nice, in an abstract sense, if you could multiply some matrices multiple times and they would not change the sign of the vectors? If you multiply positive numbers by other positive numbers, the sign doesn’t change. It’s a neat property for a matrix to have.

Also, if the Hessian of a function is positive semi-definite, then the function is convex. (In calculus, the derivative must be zero at the maximum or minimum of a function. To know which, we check the sign of the second derivative. In multiple dimensions, we no longer have just one number to check; we have a matrix, the Hessian. It’s a minimum if the Hessian is positive definite and a maximum if it’s negative definite.)
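
A quick numeric check of both characterizations above, using a made-up symmetric matrix (the matrix and random vector are invented for the example):

```python
import numpy as np

# Sketch: checking positive-definiteness of an assumed symmetric matrix
# both via the quadratic form z^T A z and via its eigenvalues.
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

rng = np.random.default_rng(0)
z = rng.normal(size=2)
print(z @ A @ z > 0)                      # True for this (and any nonzero) z
print(np.all(np.linalg.eigvalsh(A) > 0))  # True: every eigenvalue is positive
```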

67
Q

Stochastic policy model

A

A stochastic policy models the real world better, as we act on incomplete information. In addition, a small change of action can have a big swing on the rewards. The expected rewards will have a much steeper curvature with a deterministic policy; this destabilizes models easily and makes them vulnerable to noise. A stochastic policy will have a smoother surface with fewer sharp cliffs, since it tries out combinations of actions which smooth out the total rewards. So a stochastic policy will work better with optimization methods like gradient descent.
