0 - Terminology & Definitions Flashcards

1
Q

Actions

A

Actions are the Agent’s means of interacting with and changing its Environment, and thus of transitioning between states. Every action performed by the Agent yields a reward from the Environment. The decision of which action to choose is made by the policy.

2
Q

Actor-Critic

A

When attempting to solve a Reinforcement Learning problem, there are two main methods one can choose from: calculating the Value Functions or Q-Values of each state and choosing actions according to those, or directly computing a policy which defines the probabilities with which each action should be taken given the current state, and acting according to it. Actor-Critic algorithms combine the two methods in order to create a more robust approach.

3
Q

Advantage Function

A

Usually denoted as A(s,a), the Advantage function is a measure of how good or bad a certain action is as a decision in a certain state; more simply, it is the advantage of selecting a certain action from a certain state. It is defined mathematically as:

A(s,a) = E[r(s,a) - r(s)]

where r(s,a) is the expected reward of action a from state s, and r(s) is the expected reward of the entire state s, before an action was selected. It can also be viewed as:

A(s,a) = Q(s,a) - V(s)

where Q(s,a) is the Q Value and V(s) is the Value function.
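
A minimal sketch of this relationship in code, using made-up tabular values for Q and V (the state and action names are invented for the example):

```python
# Minimal sketch: computing advantages from hypothetical tabular Q and V values.
Q = {("s0", "left"): 1.0, ("s0", "right"): 3.0}   # assumed Q-Values for state s0
V = {"s0": 2.0}                                    # assumed Value of state s0

def advantage(state, action):
    """A(s,a) = Q(s,a) - V(s)."""
    return Q[(state, action)] - V[state]

print(advantage("s0", "left"))    # -1.0: worse than the state's average
print(advantage("s0", "right"))   #  1.0: better than the state's average
```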

4
Q

Agent

A

The learning and acting part of a Reinforcement Learning problem, which tries to maximize the rewards it is given by the Environment. Put simply, the Agent is the model you are trying to design.

5
Q

Bandits

A

Formally named “k-Armed Bandits” after the nickname “one-armed bandit” given to slot machines, these are considered the simplest type of Reinforcement Learning tasks. Bandits have no different states, but only one, and only the immediate reward is taken under consideration. Hence, bandits can be thought of as having single-state episodes. Each of the k arms is considered an action, and the objective is to learn the policy which will maximize the expected reward after each action (or arm-pull).

6
Q

Contextual Bandits

A

Contextual Bandits are a slightly more complex task, where each state may be different and affect the outcome of the actions; hence the context is different each time. Still, the task remains a single-state episodic task, and one context cannot have an influence on others.

7
Q

Bellman Equation

A

Formally, the Bellman equation defines the relationship between a given state (or state-action pair) and its successors. While many forms exist, the one most commonly encountered in Reinforcement Learning tasks is the Bellman equation for the optimal Q-Value, which is given by:

Q*(s,a) = E[ r(s,a) + γ·max Q*(s’,a’) ]

(where the expectation is taken over the possible next states s’, and the max is taken over the next actions a’)

…or when no uncertainty exists (meaning, probabilities are either 1 or 0):

Q*(s,a) = r(s,a) + γ·max Q*(s’,a’)

where the asterisk indicates the optimal value. Some algorithms, such as Q-Learning, base their learning procedure on it.
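
As an illustration, here is a hedged sketch that repeatedly applies the deterministic form above to a tiny, made-up MDP (all states, actions and rewards are invented for the example) until the Q-Values converge:

```python
# Sketch: solving for Q* on a small, hypothetical deterministic MDP by
# repeatedly applying Q*(s,a) = r(s,a) + γ·max_a' Q*(s',a').
gamma = 0.9

# (state, action) -> (reward, next_state); "T" is a terminal state.
transitions = {
    ("s0", "a"): (0.0, "s1"),
    ("s0", "b"): (1.0, "T"),
    ("s1", "a"): (5.0, "T"),
}

Q = {sa: 0.0 for sa in transitions}

def max_next_q(state):
    vals = [Q[(s, a)] for (s, a) in Q if s == state]
    return max(vals) if vals else 0.0  # terminal (or unseen) states contribute 0

for _ in range(100):  # more than enough sweeps for this tiny example to converge
    for (s, a), (r, s_next) in transitions.items():
        Q[(s, a)] = r + gamma * max_next_q(s_next)

print(Q)  # Q*(s0,a)=4.5, Q*(s0,b)=1.0, Q*(s1,a)=5.0
```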

8
Q

Continuous Tasks

A

Reinforcement Learning tasks which are not made of episodes, but rather last forever. These tasks have no terminal states. For simplicity, they are usually assumed to consist of one never-ending episode.

9
Q

Deep Reinforcement Learning

A

The use of a Reinforcement Learning algorithm with a deep neural network as an approximator for the learning part. This is usually done in order to cope with problems where the number of possible states and actions scales quickly, and an exact solution is no longer feasible.

10
Q

Discount Factor (γ)

A

The discount factor, usually denoted as γ, is a factor multiplying the future expected reward, and takes a value in the range [0,1]. It controls the importance of future rewards versus immediate ones. The lower the discount factor, the less important future rewards are, and the Agent will tend to focus on actions which yield immediate rewards only.
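
A short illustration of the effect, using an assumed reward sequence (the numbers are made up for the example):

```python
# Sketch: how γ changes the discounted sum of an assumed reward sequence.
rewards = [1.0, 1.0, 1.0, 10.0]  # a large reward arriving three steps later

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.99))  # ≈ 12.67: the future reward still matters
print(discounted_return(rewards, 0.10))  # ≈ 1.12:  the Agent is nearly myopic
```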

11
Q

Environment

A

Everything which isn’t the Agent; everything the Agent can interact with, either directly or indirectly. The environment changes as the Agent performs actions; every such change is considered a state-transition. Every action the Agent performs yields a reward received by the Agent.

12
Q

Episode

A

All states that come between an initial state and a terminal state; for example, one game of Chess. The Agent’s goal is to maximize the total reward it receives during an episode. In situations where there is no terminal state, we consider an infinite episode. It is important to remember that different episodes are completely independent of one another.

13
Q

Episodic Tasks

A

Reinforcement Learning tasks which are made of different episodes (meaning, each episode has a terminal state).

14
Q

Expected Return

A

Sometimes referred to as the “overall reward” and occasionally denoted as G, the expected return is the expected cumulative reward over an entire episode.

15
Q

Experience Replay

A

As Reinforcement Learning tasks have no pre-generated training sets to learn from, the Agent must keep records of all the state-transitions it encountered so it can learn from them later. The memory buffer used to store these is often referred to as Experience Replay. There are several types and architectures of these memory buffers, but two very common ones are the cyclic memory buffer (which makes sure the Agent keeps training on its recent behavior rather than on things that might no longer be relevant) and the reservoir-sampling-based memory buffer (which guarantees every recorded state-transition has an equal probability of being inserted into the buffer).
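
Hedged sketches of the two buffer types mentioned above; the class names, capacity and transition layout are illustrative assumptions, not part of any particular library:

```python
import random
from collections import deque

# Cyclic buffer: old transitions are overwritten once capacity is reached.
class CyclicReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):           # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

# Reservoir-sampling buffer: every transition ever seen has an equal
# probability of ending up in the buffer.
class ReservoirReplayBuffer:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def store(self, transition):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = transition

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```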

16
Q

Exploitation & Exploration

A

Reinforcement Learning tasks have no pre-generated training sets to learn from; the Agent creates its own experience and learns “on the fly”. To be able to do so, the Agent needs to try many different actions in many different states, in order to learn all the available possibilities and find the path which will maximize its overall reward; this is known as Exploration, as the Agent explores the Environment. On the other hand, if all the Agent does is explore, it will never maximize the overall reward; it must also use the information it has learned. This is known as Exploitation, as the Agent exploits its knowledge to maximize the rewards it receives.
The trade-off between the two is one of the greatest challenges of Reinforcement Learning problems, as the two must be balanced so the Agent both explores the Environment enough and exploits what it has learned, repeating the most rewarding path it found.

17
Q

Greedy Policy, ε-Greedy Policy

A

A greedy policy means the Agent constantly performs the action believed to yield the highest expected reward. Obviously, such a policy will not allow the Agent to explore at all. In order to still allow some exploration, an ε-greedy policy is often used instead: a number ε in the range [0,1] is selected, and prior to selecting an action, a random number in the range [0,1] is drawn. If that number is larger than ε, the greedy action is selected, but if it is lower, a random action is selected. Note that if ε=0, this becomes the greedy policy, and if ε=1, the Agent always explores.
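
A minimal sketch of the selection rule described above, assuming a tabular Q-function for the current state (the action names and values are made up):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated Q-Value for the current state."""
    if random.random() < epsilon:               # explore: pick a random action
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)      # exploit: pick the greedy action

# Example: with ε = 0.1 the greedy action "b" is chosen roughly 90% of the time.
print(epsilon_greedy({"a": 0.2, "b": 1.5}, epsilon=0.1))
```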

18
Q

Markov Decision Process (MDP)

A

The Markov Property means that each state depends solely on its preceding state, the action selected from that state and the reward received immediately after that action was executed. Mathematically, it means s’ = s’(s,a,r), where s’ is the future state, s is its preceding state and a and r are the action and reward. No prior knowledge of what happened before s is needed; the Markov Property assumes that s holds all the relevant information within it. A Markov Decision Process is a decision process based on these assumptions.

19
Q

Model-Based & Model-Free

A

Model-based and model-free are two different approaches an Agent can take when trying to optimize its policy. This is best explained with an example: assume you are trying to learn how to play Blackjack. You can do so in two ways. One, you calculate in advance, before the game begins, the winning probabilities of all states and all the state-transition probabilities given all the possible actions, and then simply act according to your calculations. The second option is to simply play without any prior knowledge, and gain information using trial and error. Note that using the first approach, you are basically modeling your Environment, while the second approach requires no information about the Environment. This is exactly the difference between model-based and model-free: the first method is model-based, while the latter is model-free.

20
Q

Monte Carlo (MC)

A

Monte Carlo methods are algorithms which use repeated random sampling to achieve a result. They are used quite often in Reinforcement Learning algorithms to obtain expected values; for example, calculating a state’s Value function by returning to the same state over and over again and averaging the actual cumulative reward received each time.
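
A hedged sketch of the averaging idea, assuming a hypothetical, environment-specific function that plays one episode from a given state and returns the cumulative reward observed:

```python
# Sketch: Monte Carlo estimate of V(s) by averaging sampled episode returns.
# `run_episode_from` is an assumed callable that plays one full episode starting
# in `state` and returns the (discounted) return observed for that episode.
def mc_value_estimate(state, run_episode_from, n_episodes=1000):
    returns = [run_episode_from(state) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```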

21
Q

On-Policy & Off-Policy

A

Every Reinforcement Learning algorithm must follow some policy in order to decide which actions to perform in each state. Still, the learning procedure of the algorithm doesn’t have to take that policy into account while learning. Algorithms which account for the policy that yielded past state-action decisions are referred to as on-policy algorithms, while those ignoring it are known as off-policy.
A well-known off-policy algorithm is Q-Learning, as its update rule uses the action which will yield the highest Q-Value, while the policy actually followed might restrict that action or choose another. The on-policy variation of Q-Learning is known as Sarsa, where the update rule uses the action chosen by the followed policy.

22
Q

Policy (π)

A

The policy, denoted as π (or sometimes π(a|s)), is a mapping from some state s to the probabilities of selecting each possible action given that state. For example, a greedy policy outputs for every state the action with the highest expected Q-Value.

23
Q

Q-Learning

A

Q-Learning is an off-policy Reinforcement Learning algorithm, considered one of the most basic ones. In its most simplified form, it uses a table to store the Q-Values of all possible state-action pairs. It updates this table using the Bellman equation, while action selection is usually made with an ε-greedy policy.
In its simplest form (no uncertainties in state-transitions and expected rewards), the update rule of Q-Learning is:

Q(s,a) = r(s,a) + γ·max Q(s’,a’)
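
A hedged sketch of tabular Q-Learning using the more common learning-rate form of the update. The environment interface is assumed: `env.reset()` returns a state and `env.step(action)` returns `(next_state, reward, done)`; the hyperparameters are illustrative.

```python
import random
from collections import defaultdict

# Sketch: tabular Q-Learning on an assumed environment `env` with the
# interface described above; `actions` is the list of possible actions.
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # (state, action) -> Q-Value, 0 by default

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # ε-greedy behaviour policy
            action = random.choice(actions) if random.random() < epsilon else greedy(state)
            next_state, reward, done = env.step(action)
            # Off-policy update: bootstraps from the best next action,
            # regardless of which action the policy will actually take next.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```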

24
Q

Q Value (Q Function)

A

Usually denoted as Q(s,a) (sometimes with a π subscript, and sometimes as Q(s,a; θ) in Deep RL), the Q Value is a measure of the overall expected reward assuming the Agent is in state s, performs action a, and then continues playing until the end of the episode following some policy π. Its name is an abbreviation of the word “Quality”, and it is defined mathematically as:

Q(s,a) = E[ Σ γⁿ·rⁿ ],  summing over n = 0, …, N

where N is the number of states from state s until the terminal state, γ is the discount factor and r⁰ is the immediate reward received after performing action a in state s.

25
Q

REINFORCE Algorithms

A

REINFORCE algorithms are a family of Reinforcement Learning algorithms which update their policy parameters along the gradient of the expected reward with respect to those parameters. The name is typically written in capital letters only, as it was originally an acronym describing the design of the original group of algorithms: “REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility”.
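
A minimal, hedged illustration of this kind of update, using a stateless softmax policy on a made-up 3-armed bandit; the arm means, learning rate and iteration count are invented for the example, and this is a sketch rather than the original formulation:

```python
import numpy as np

# Sketch: REINFORCE-style update for a stateless softmax policy over 3 actions.
rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])   # hypothetical mean rewards of the arms
theta = np.zeros(3)                       # policy parameters
lr = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 1.0)
    # For a softmax policy, ∇θ log π(a) = one_hot(a) - probs (the "eligibility")
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * reward * grad_log_pi    # reward increment ∝ reward × eligibility

print(softmax(theta))  # probability mass should concentrate on the best arm
```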

26
Q

Reinforcement Learning (RL)

A

Reinforcement Learning is, like Supervised Learning and Unsupervised Learning, one of the main areas of Machine Learning and Artificial Intelligence. It is concerned with the learning process of an arbitrary being, formally known as the Agent, in the world surrounding it, known as the Environment. The Agent seeks to maximize the rewards it receives from the Environment, and performs different actions in order to learn how the Environment responds and gain more rewards. One of the greatest challenges of RL tasks is associating actions with postponed rewards, which are rewards received by the Agent long after the reward-generating action was made. RL is therefore heavily used to solve different kinds of games, from Tic-Tac-Toe, Chess and Atari 2600 all the way to Go and StarCraft.

27
Q

Reward

A

A numerical value received by the Agent from the Environment as a direct response to the Agent’s actions. The Agent’s goal is to maximize the overall reward it receives during an episode, so rewards are the motivation the Agent needs in order to behave in a desired way. All actions yield rewards, which can be roughly divided into three types: positive rewards, which emphasize a desired action; negative rewards, which emphasize an action the Agent should stray away from; and zero, which means the Agent didn’t do anything special or unique.

28
Q

SARSA

A

The Sarsa algorithm is essentially the Q-Learning algorithm with a slight modification that makes it an on-policy algorithm. The Q-Learning update rule is based on the Bellman equation for the optimal Q-Value, and so in the case of no uncertainties in state-transitions and expected rewards, the Q-Learning update rule is:

Q(s,a) = r(s,a) + γ·max Q(s’,a’)

In order to transform this into an on-policy algorithm, the max over the last term is dropped:

Q(s,a) = r(s,a) + γ·Q(s’,a’)

where here both actions a and a’ are chosen by the same policy. The name of the algorithm is derived from its update rule, which is based on (s,a,r,s’,a’), all coming from the same policy.
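
A hedged, side-by-side sketch of the two updates in their usual learning-rate form; the Q table layout, hyperparameters and argument names are assumptions for the example:

```python
# Sketch: Q-Learning vs. Sarsa updates. Q is a dict of (state, action) -> value.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstraps from the best next action, whatever the policy does.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstraps from a', the action actually chosen by the followed policy.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```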

29
Q

State

A

Every scenario the Agent encounters in the Environment is formally called a state. The Agent transitions between different states by performing actions. It is also worth mentioning the terminal states, which mark the end of an episode. There are no possible states after a terminal state has been reached, and a new episode begins. Quite often, a terminal state is represented as a special state where all actions transition to the same terminal state with reward 0.

30
Q

Temporal-Difference (TD)

A

Temporal Difference is a learning method which combines Dynamic Programming and Monte Carlo principles; it learns “on the fly” similarly to Monte Carlo, yet updates its estimates by bootstrapping like Dynamic Programming. One of the simplest Temporal Difference algorithms is known as one-step TD or TD(0). It updates the Value Function according to the following update rule:

V(sₜ) = V(sₜ) + α·[ rₜ₊₁ + γ·V(sₜ₊₁) − V(sₜ) ]

where V is the Value Function, s is the state, r is the reward, γ is the discount factor, α is a learning rate, t is the time-step and the ‘=’ sign is used as an update operator rather than equality. The term found in the square brackets is known as the temporal difference error.
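
A minimal sketch of the TD(0) update described above; the value table layout and hyperparameters are assumptions for the example:

```python
# Sketch: one-step TD (TD(0)) update of a tabular Value Function.
# V is a dict of state -> value estimate.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    bootstrap = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * bootstrap - V.get(s, 0.0)   # the temporal difference error
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```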

31
Q

Upper Confidence Bound (UCB)

A

UCB is an exploration method which tries to ensure that each action is well explored. Consider an exploration policy which is completely random, meaning each possible action has the same chance of being selected. There is a chance that some actions will be explored much more than others. The less an action is selected, the less confident the Agent can be about its expected reward, and its exploitation phase might be harmed. Exploration by UCB takes into account the number of times each action was selected, and gives extra weight to the less-explored ones. Formalizing this mathematically, the selected action is picked by:

a = argmaxₐ [ R(a) + c·√( ln(t) / N(a) ) ]

where R(a) is the expected overall reward of action a, t is the number of steps taken (how many actions were selected overall), N(a) is the number of times action a was selected and c is a configurable hyperparameter. This method is also sometimes referred to as “exploration through optimism”, as it gives less-explored actions a higher value, encouraging the model to select them.
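
A hedged sketch of the selection rule; the data structures (running mean rewards and selection counts) and the default value of c are assumptions for the example:

```python
import math

# Sketch: UCB action selection. `avg_reward[a]` is the running mean reward of
# action a, `counts[a]` is how many times a was selected, t is the total step count.
def ucb_select(avg_reward, counts, t, c=2.0):
    def score(a):
        if counts[a] == 0:
            return float("inf")            # make sure every action is tried at least once
        return avg_reward[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(avg_reward, key=score)
```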

32
Q

Value Function

A

Usually denoted as V(s) (sometimes with a π subscript), the Value function is a measure of the overall expected reward assuming the Agent is in state s and then continues playing until the end of the episode following some policy π. It is defined mathematically as:

V(s) = E[ Σ γⁿ·rⁿ ],  summing over n = 0, …, N

While this does seem similar to the definition of the Q Value, there is an implicit, yet important, difference: for n=0, the reward r⁰ of V(s) is the expected reward from just being in state s, before any action was played, while in the Q Value, r⁰ is the expected reward after a certain action was played. This difference also yields the Advantage function.

33
Q

Controller

A

Same as an agent: the system (such as a robot) that interacts with and acts on the environment.

34
Q

Episode

A

Playing out the whole sequence of states and actions until reaching the terminal state or a predefined maximum number of actions.

35
Q

Value function (state-value function)

A

The value function of a state V(s) is the total amount of expected rewards that an agent can collect from that state to the end of the episode.

36
Q

Action-value function

A

The action-value function Q(s, a) is the total amount of expected rewards of taking an action from the state until the end of the episode.

37
Q

Model (system dynamics)

A

A model describes how an environment may change upon taking an action from a state, p(s’ | a, s). This is the system dynamics: the laws of physics, or the rules of the game. For a mechanical problem, it is the dynamics model of how things move.

38
Q

Discount rate γ

A

The discount rate values future rewards at their present value. If γ is smaller than one, we value future rewards less at the current time.

39
Q

Model-Free RL

A

In model-free RL, the system dynamics is unknown or not needed to solve the task.

40
Q

Model-based RL

A

We use a known or learned model to plan the optimal controls that maximize rewards. Alternatively, we can collect those sampled controls to train a policy and hope that it generalizes to other tasks we have not trained on before.

41
Q

Monte Carlo Method

A

Monte Carlo methods play through complete episodes. They compute the average of the sampled returns from multiple episodes to estimate value functions, or use the following running average to update the result:

V(s) = V(s) + α·[ G − V(s) ]

where G is the return observed for the episode and α is a learning rate.

42
Q

Monte Carlo control

A

We use the Monte Carlo method to evaluate the Q-value function of the current policy, and find the optimal action by locating the action with the maximum Q-value.

43
Q

Actor-critic

A

Actor-critic combines the concepts of Policy Gradient and Value-learning in solving an RL task. We optimize the actor, which is based on Policy Gradient, to determine what actions to take based on observations. However, Policy Gradient often has a high variance in its gradient, which hurts convergence. We therefore introduce a critic to evaluate a trajectory. This critic makes use of past sampled experience and other techniques that reduce variance, which allows the actor to be trained with better convergence.

44
Q

Policy Gradients

A

We refine our policy by making actions more likely when we expect them to yield large expected rewards (and less likely otherwise).

45
Q

Natural Policy Gradients

A

It is similar to Policy Gradients, but Natural Policy Gradient uses a second-order optimization concept, which is more accurate yet more complex than Policy Gradient’s first-order optimization.

46
Q

Sample efficiency

A

Sample efficiency relates to how many data samples are needed to optimize or find the solution. Tasks requiring physical simulation can be expensive to sample, and therefore sample efficiency is an important evaluation factor in selecting RL algorithms.

47
Q

On-policy learning vs. off-policy learning

A

In on-policy learning, we optimize the current policy and use it to determine what spaces and actions to explore and sample next. Since the current policy is not optimized in early training, a stochastic policy allows some form of exploration.

Off-policy learning allows a second policy. This policy can be used to enhance how exploration is done, but its key purpose is to collect samples. Off-policy learning gives more control over how we explore the unknown, and allows the use of older samples in the calculation. The latter improves sample efficiency, since we don’t need to recollect samples whenever the policy is changed.

48
Q

Markov Decision Process (MDP)

A

It is composed of states, actions, the model P, rewards and the discount factor. Our objective is to find a policy that maximizes the expected rewards.

49
Q

Partially Observable Markov Decision Process POMDP

A

Not all states are observable. If we have enough state information, we can solve the MDP using the states we have (π(s)). Otherwise, we have to derive our policy based on those observables (π(o)).

50
Q

Temporal-Difference Learning (TD)

A

Instead of completing the whole episode like Monte Carlo (calculating up to the end or terminal state), we roll out k steps and collect the rewards. We estimate the value function based on the collected rewards and the value function after k steps. Below is 1-step TD learning: we find out the reward after taking one action, and use the following running average for V:

V(s) = V(s) + α·[ r + γ·V(s’) − V(s) ]

51
Q

Compare TD vs Monte Carlo vs Dynamic Programming vs Exhaustive Search

A

(Comparison figure not reproduced here.)

52
Q

Planning

A

We use the system dynamics (the model) to generate simulated experience and use it to refit the value functions or the policy.

The difference between learning and planning is that learning uses real experience generated by the environment, while planning uses simulated experience generated by a model.

53
Q

Q-learning

A

We learn the Q-value function by first taking an action (under a policy like ε-greedy) and observing the reward R. Then we bootstrap from the next action with the best Q-value to update the Q-value function.

54
Q

Trajectory optimization

A

Finding the state and action sequence that minimizes a cost function.

55
Q

Open-loop system

A

We observe the initial state of a system and plan actions to minimize a cost function.

56
Q

Closed-loop system

A

We observe the initial state of a system and plan the actions. But during the course, we can observe the next state and readjust our actions. For a stochastic model, this allows us to readjust the response based on what actually occurred. Hence, a closed-loop system can be optimized better than an open-loop system.

57
Q

Shooting methods

A

Optimize the trajectory based on an open-loop system: observe the initial (first) state and optimize the corresponding actions.

For a stochastic system, this is suboptimal because we do not readjust the actions based on the states we actually transition to.

58
Q

Collocation methods

A

Optimize the trajectory based on a closed-loop system, in which we take actions based on the observed states. We manipulate both actions and states in optimizing the cost function.

59
Q

Imitation learning

A

Imitate how an expert acts. The expert can be a human or a program which produces quality samples for the model to learn from and generalize.

60
Q

Inverse reinforcement learning

A

Try to model a reward function (for example, using a deep network) from expert demonstrations, so we can backpropagate rewards to improve the policy.

61
Q

Taylor series

A

f(x) = f(a) + f’(a)·(x − a) + (f’’(a)/2!)·(x − a)² + (f’’’(a)/3!)·(x − a)³ + …

62
Q

Taylor Series (in vector form):

A

In vector form:

f(x) ≈ f(x₀) + gᵀ·(x − x₀) + ½·(x − x₀)ᵀ·H·(x − x₀)

where g is the Jacobian matrix and H is the Hessian matrix.
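
A small numeric check of the second-order expansion on a made-up function; the function, expansion point and step are all invented for the example:

```python
import numpy as np

# Sketch: second-order Taylor approximation of f(x) = x0**2 + x0*x1 + exp(x1)
# around the point x0_point, compared against the true value for a small step.
def f(x):
    return x[0] ** 2 + x[0] * x[1] + np.exp(x[1])

x0_point = np.array([1.0, 0.0])
g = np.array([2 * x0_point[0] + x0_point[1],             # gradient at x0_point
              x0_point[0] + np.exp(x0_point[1])])
H = np.array([[2.0, 1.0],                                 # Hessian at x0_point
              [1.0, np.exp(x0_point[1])]])

dx = np.array([0.1, -0.05])
taylor = f(x0_point) + g @ dx + 0.5 * dx @ H @ dx
print(f(x0_point + dx), taylor)   # the two values should be close for a small step
```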

63
Q

Jacobian matrix

A

In vector calculus, the Jacobian matrix (/dʒəˈkoʊbiən/, /dʒɪ-, jɪ-/) of a vector-valued function of several variables is the matrix of all its first-order partial derivatives.

The Jacobian matrix represents the differential of f at every point where f is differentiable.

64
Q

Hessian matrix

A

In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables.

65
Q

KL-divergence

A

In deep learning, we want a model that predicts a distribution Q resembling the distribution P observed in the data. The difference between two probability distributions can be measured by the KL Divergence.
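
A minimal sketch of the discrete form, KL(P‖Q) = Σ P(x)·log(P(x)/Q(x)), using two made-up distributions:

```python
import math

# Sketch: KL divergence between two made-up discrete distributions.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]   # "true" data distribution (assumed)
Q = [0.4, 0.4, 0.2]   # model's predicted distribution (assumed)
print(kl_divergence(P, Q))   # ≈ 0.025; it is zero only when P and Q are identical
```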

66
Q

Positive-definite matrix

A

Matrix A is positive-definite if

zᵀAz > 0

for all real non-zero vectors z.

I.e., a positive-definite matrix is a symmetric matrix where every eigenvalue is positive.

Intuitively: take a vector z, which points in a certain direction. When we multiply the matrix A by z, the result generally no longer points in the same direction; the direction of z is transformed by A.

If A is a positive-definite matrix, the new direction will always point in “the same general” direction (here “the same general” means less than a π/2 angle change). It won’t reverse the original direction (i.e., change it by more than 90 degrees).

Why do we want positive-definite matrices?

Wouldn’t it be nice, in an abstract sense, if you could multiply some matrices multiple times and they would not change the sign of the vectors? If you multiply positive numbers by other positive numbers, the sign doesn’t change. It’s a neat property for a matrix to have.

Also, if the Hessian of a function is positive semi-definite, then the function is convex. (In calculus, the derivative must be zero at the maximum or minimum of a function. To know which, we check the sign of the second derivative. In multiple dimensions, we no longer have just one number to check; we have a matrix, the Hessian. It’s a minimum if the Hessian is positive definite and a maximum if it’s negative definite.)
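
A quick numeric check of both characterizations above, using a made-up symmetric matrix (the matrix and random vector are invented for the example):

```python
import numpy as np

# Sketch: checking positive-definiteness of an assumed symmetric matrix
# both via the quadratic form z^T A z and via its eigenvalues.
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

rng = np.random.default_rng(0)
z = rng.normal(size=2)
print(z @ A @ z > 0)                      # True for this (and any nonzero) z
print(np.all(np.linalg.eigvalsh(A) > 0))  # True: every eigenvalue is positive
```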

67
Q

Stochastic policy model

A

A stochastic policy models the real world better, as we act on incomplete information. In addition, a small change of action can have a big swing on the rewards. The expected rewards will have a much steeper curvature with a deterministic policy; this destabilizes models easily and makes them vulnerable to noise. A stochastic policy will have a smoother surface with fewer sharp cliffs, since it tries out combinations of actions which smooth out the total rewards. So a stochastic policy will work better with optimization methods like gradient descent.
