2 - MDP Planning Flashcards
What is policy iteration?
In policy iteration algorithms, you start with a random (or arbitrary) policy, find the value function of that policy (the policy evaluation step), then derive a new, improved policy from that value function (the policy improvement step), and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained by repeatedly applying the *Bellman operator*.
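Below is a minimal Python sketch of policy iteration, assuming a tiny made-up two-state, two-action MDP (the transition probabilities, rewards, and discount factor are illustrative, not taken from these flashcards):

```python
import numpy as np

# Illustrative two-state, two-action MDP; all numbers are made up for this sketch.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],                1: [(0.7, 0, 0.0), (0.3, 1, 2.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2

def evaluate_policy(policy, tol=1e-8):
    """Policy evaluation: repeatedly apply the Bellman operator for the fixed policy."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def improve_policy(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return [
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ]

policy = [0, 0]  # start from an arbitrary policy
while True:
    V = evaluate_policy(policy)
    new_policy = improve_policy(V)
    if new_policy == policy:  # no change: the policy is already optimal
        break
    policy = new_policy

print("optimal policy:", policy, "value function:", V)
```

Here evaluate_policy plays the role of the policy evaluation step and improve_policy the improvement step; the loop stops once the greedy policy no longer changes.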
What is value iteration?
In value iteration, you start with a random (or arbitrary) value function and repeatedly compute a new, improved value function until you reach the optimal value function. The optimal policy can then be derived easily from the optimal value function. This process is based on the *optimality Bellman operator*. Unlike the Bellman operator used in policy evaluation, the optimality Bellman operator contains a max operator, which makes it nonlinear and gives it different properties.
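A matching sketch of value iteration on the same made-up MDP (again, all numbers are illustrative); note the max over actions, which is where the optimality Bellman operator differs from the Bellman operator used in policy evaluation:

```python
import numpy as np

# Same illustrative two-state MDP as in the policy-iteration sketch (made-up numbers).
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],                1: [(0.7, 0, 0.0), (0.3, 1, 2.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2

V = np.zeros(n_states)  # start from an arbitrary value function
while True:
    # Optimality Bellman operator: max over actions of the expected one-step return.
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions))
        for s in range(n_states)
    ])
    converged = np.max(np.abs(V_new - V)) < 1e-8
    V = V_new
    if converged:
        break

# Derive the optimal (greedy) policy from the optimal value function.
policy = [
    max(range(n_actions),
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in range(n_states)
]
print("optimal value function:", V, "greedy policy:", policy)
```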
What does P(s’ | s,a) stand for?
This is the Transition Probability. It indicates the probability of reaching state s’ from state s given that you take the action a.
What does P(r | s,a) stand for?
This is the Reward Probability. It indicates the probability of receiving reward r, given that you take action a in state s.
What is pi(s) and what does it stand for?
pi(s): states (s) –> actions (a). This is the Policy, which can be either deterministic or stochastic.
How do we mathematically represent choosing the optimal pi(s)?
pi* = argmax over pi of V^pi. In other words, to find the optimal policy pi*, we choose the policy whose value function is maximal.
How do we represent a *sequence* of states and actions that *leads to* a desired end state?
We can represent our trajectory as Tau = (S0, A0, S1, A1, S2, A2, ...).
How do we calculate the Value V of a policy?
We sum our rewards from t=0 to infinity (or to our horizon), each weighted by the discount factor gamma (a value between 0 and 1) raised to the power t, for the given policy. Because these rewards are stochastic, we take the expected value of this sum of rewards. Thus, V^pi = E[ Summation from t=0 to infinity of (gamma^t * Reward_t) ].
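A small sketch of this computation, assuming some hypothetical reward sequences sampled under a policy (the numbers are invented): the value is estimated by averaging the discounted returns of the episodes.

```python
# Reward sequences from three illustrative episodes gathered under some policy pi
# (the numbers are made up purely to show the computation).
episodes = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, 0.0, 0.0, 3.0],
]
gamma = 0.9

def discounted_return(rewards, gamma):
    """Sum over t of gamma^t * Reward_t (t starts at 0)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Because the rewards are stochastic, V^pi is the *expected* return; here we
# estimate that expectation by averaging the returns of the sampled episodes.
V_estimate = sum(discounted_return(rw, gamma) for rw in episodes) / len(episodes)
print(V_estimate)
```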
What is a Discount Factor and how do we represent it?
The discount factor is gamma, usually a value between 0 and 1. It gives exponentially decreasing weight to rewards farther and farther into the future: a reward received t time steps from now is weighted by gamma raised to the power t.
What does the Q stand for in Q-Learning and what does it mean?
It stands for “Quality”. The optimal Q function, Q*(s,a), is the expected total reward received by an agent starting in state s, picking action a, and then behaving optimally afterwards. It indicates how good it is for an agent to pick action a while being in state s.
How do we calculate the Q(s,a) of a given policy pi?
Q^pi(s,a) is the expected total discounted reward obtained by starting in state s, taking action a, and then following pi: Q^pi(s,a) = E[ Summation (gamma^t * Reward_t) | S0 = s, A0 = a ]. Equivalently, it is the expected immediate reward for (s,a) plus gamma times the expected value V^pi of the next state, weighted by the transition probabilities P(s'|s,a).
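A short sketch of that one-step lookahead, assuming a made-up MDP and placeholder V^pi values (in practice V^pi would come from policy evaluation):

```python
# One-step lookahead: Q^pi(s,a) from an already-computed V^pi, on an illustrative MDP.
# P[s][a] is a list of (probability, next_state, reward) tuples; all numbers are made up.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],                1: [(0.7, 0, 0.0), (0.3, 1, 2.0)]},
}
gamma = 0.9
V_pi = {0: 4.7, 1: 9.1}  # placeholder values; in practice these come from policy evaluation

def q_value(s, a):
    """Q^pi(s,a) = expected immediate reward + gamma * expected V^pi of the next state."""
    return sum(p * (r + gamma * V_pi[s2]) for p, s2, r in P[s][a])

print({(s, a): round(q_value(s, a), 2) for s in P for a in P[s]})
```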
Is Q-learning a model-based or model-free algorithm?
Model-free: it does not require knowledge of the transition or reward probabilities; the agent learns directly from its interactions with the environment.
In reinforcement learning the goal of the agent is to discover the optimal ____? And what does this term mean?
Policy = what action to take in each state, so as to maximize the total reward received from the environment in response to the agent's actions.
How do we calculate Total Discounted Reward?
We multiply each reward by gamma raised to the power of its time step t and sum these terms from t=0 to infinity (or to our horizon): Total Discounted Reward = Summation (gamma^t * Reward_t).
What is the difference between *Offline* vs *Online* learning?
These are two fundamental approaches to solving MDPs. Both value iteration and policy iteration assume that the agent knows the MDP model of the world (i.e., the state-transition and reward probability functions). They can therefore be used by the agent to plan its actions offline, using knowledge of the environment before interacting with it. In Q-learning, by contrast, the agent improves its behavior online, learning from its history of interactions with the environment.
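A minimal sketch of online, model-free tabular Q-learning on a made-up two-state environment (the simulator and all constants below are illustrative assumptions, not from these flashcards):

```python
import random

# Minimal tabular Q-learning sketch on an illustrative two-state, two-action environment.
# The simulator below is made up; the point is the online update rule.
def step(s, a):
    """Environment simulator: returns (next_state, reward). Its dynamics are unknown to the agent."""
    if s == 0:
        return (1, 1.0) if (a == 1 and random.random() < 0.8) else (0, 0.0)
    return (1, 2.0) if a == 0 else (0, 0.0)

gamma, alpha, epsilon = 0.9, 0.1, 0.1  # discount factor, learning rate, exploration rate
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

s = 0
for _ in range(10000):
    # epsilon-greedy action selection from the current Q estimates
    if random.random() < epsilon:
        a = random.choice((0, 1))
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)])
    s2, r = step(s, a)
    # Online update: move Q(s,a) toward r + gamma * max over a' of Q(s',a')
    target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2

print(Q)
```

Unlike the policy-iteration and value-iteration sketches above, the agent here never reads the transition or reward probabilities; it only observes samples returned by the environment.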