Final Exam CS 7642 Flashcards
Markovian Property
Future is independent of past given present.
The probability of ending up in state s' given that you started in state s and took action "a" depends only on the current state "s" (and the action), not on the history.
You can turn anything into a Markovian process by making certain that your current state "s" remembers whatever you need to remember from the past.
Problem of markov property
In the Smoov & Curly's Bogus Journey lecture it is mentioned that you can turn any process into a Markovian one by making certain that the current state remembers everything from the past. The problem with this is that if the state has to remember everything from the past, every state is effectively unique and gets visited only once, so it becomes very difficult to learn anything.
Characteristics of the Markov Property
- The model/environment is stationary: the rules don't change.
- A transition probability graph provides a complete representation of a discrete-time-step, finite-state Markov chain model.
What are delayed rewards
It is assumed that rewards in the distant future are not as valuable as immediate rewards.
-> Minor changes in reward function matter
What is Temporal Credit Assignment Problem
Actions taken now can lead to higher rewards many time steps in the future. Figuring out which actions in a sequence of time steps deserve the credit (or blame) for a reward received later is the temporal credit assignment problem.
Why Rewards in RL
Rewards are basically domain knowledge that we inject into the system.
Policy and utilities
-> The policy depends on the time horizon: with a finite horizon the optimal policy can change depending on how much time is left (it is non-stationary), while with an infinite horizon it does not depend on time.
-> IF Utility(s0, s1, s2, s3, ...) > Utility(s0, s1', s2', s3', s4', ...)
Then:
Utility(s1, s2, s3, s4, ...) > Utility(s1', s2', s3', s4', ...)
-> If I prefer one sequence of states over another sequence of states today, then I would also prefer that sequence of states over the other one tomorrow.
-> This is called stationarity of preferences.
Optimal Policy
Maximizes the long-term expected reward.
The optimal policy is the one that, for every state, returns the action that maximizes the expected utility.
Reward and Utility
The reward for entering a state is not the same as the utility of entering that state. Reward is immediate; utility is the accumulated reward from that point on (long-term feedback).
Utility
U(s) = R(s) + γ max_a Σ_s' T(s,a,s') U(s')   (Eq. 1)
n states, n equations, non-linear because of the max.
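A minimal sketch of evaluating Eq. 1 for a single state, assuming a tabular MDP stored as NumPy arrays (the names T[s, a, s'], R[s], gamma are my own illustrative placeholders, not course code):

```python
import numpy as np

def bellman_backup(s, U, T, R, gamma):
    """One Bellman optimality backup: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    # T has shape (n_states, n_actions, n_states); R has shape (n_states,)
    expected = T[s] @ U              # expected utility of each action, shape (n_actions,)
    return R[s] + gamma * expected.max()
```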
Finding Policies
- START with an initial policy π_0 (a guess)
- EVALUATE: given π_t, calculate U_t = U^(π_t), the utility you get by following that policy
- IMPROVE: π_{t+1}(s) = argmax_a Σ_s' T(s,a,s') U_t(s')
U_t(s) = R(s) + γ Σ_s' T(s, π_t(s), s') U_t(s')   (Eq. 2)
n states, n equations, and linear: it can be solved with matrix methods (a direct linear solve), regression, etc.
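Since Eq. 2 is linear, the evaluation step can be done with one matrix solve. A hedged sketch under the same illustrative tabular setup, with the policy pi given as an array of action indices:

```python
import numpy as np

def evaluate_policy(pi, T, R, gamma):
    """Solve Eq. 2 exactly as the linear system (I - gamma * T_pi) U = R."""
    n = len(R)
    T_pi = T[np.arange(n), pi]       # T_pi[s, s'] = T(s, pi(s), s')
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)
```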
Policy Iteration
In policy iteration you randomly select a policy and find the value function corresponding to it,
then do policy evaluation: making the value function consistent with the current policy,
then find a new (improved) policy by making the policy greedy with respect to the current value function,
and so on; this leads to the optimal policy.
More about policy Iteration
- sequence of monotonically improving policies and value functions:
π_0 --e--> V_π0 --i--> π_1 --e--> V_π1 --i--> π_2 --e--> V_π2 --> ...
e is policy evaluation
i is policy improvement
Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
Guaranteed convergence!!
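A compact sketch of the evaluate/improve loop above, using state-based rewards R(s) as in Eq. 2 (T, R, gamma are the same illustrative placeholders as in the earlier snippets):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    n, k, _ = T.shape
    pi = np.zeros(n, dtype=int)                  # arbitrary initial policy
    while True:
        # Policy evaluation (e): make U consistent with the current policy (Eq. 2)
        T_pi = T[np.arange(n), pi]
        U = np.linalg.solve(np.eye(n) - gamma * T_pi, R)
        # Policy improvement (i): act greedily with respect to U
        new_pi = (T @ U).argmax(axis=1)          # argmax_a sum_s' T(s,a,s') U(s')
        if np.array_equal(new_pi, pi):           # stable => optimal
            return pi, U
        pi = new_pi
```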
What are we trying to learn?
Behaviour structures
- PLAN:
- -> Fixed sequence of actions
- -> During learning / after learning
- -> Problem: stochasticity
- CONDITIONAL PLAN:
- -> Instead of a fixed set of actions...
- -> Has "IF" and "ELSE" conditions
- -> Sequence of actions, then branching... just-in-time planning
- STATIONARY POLICY / UNIVERSAL PLAN:
- -> Mapping from states to actions; can handle stochasticity very well
- -> In a stationary policy there is an "IF" at every possible state
- -> Very large
- -> There is always an optimal stationary policy
- -> Capable of expressing the optimal behaviour
Evaluating a policy
- Map state transitions to immediate rewards
- Truncate to a finite horizon
- Summarize each sequence: e.g. the discounted sum Σ_i γ^i r_i
- Summarize over sequences: e.g. the average (expected value)
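A tiny sketch of the two ways of summarizing a truncated reward sequence listed above:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^i * r_i over a truncated reward list."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def average_return(rewards):
    """Simple average of the rewards in the sequence."""
    return sum(rewards) / len(rewards)

# e.g. discounted_return([1, 0, 2], gamma=0.9) == 1 + 0 + 0.81 * 2
```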
Evaluating a learner
Parameters to evaluate a learner:
- Value of the returned policy (how good the returned policy is)
- Computational complexity (time)
- Sample complexity (how much data it needs to learn)
Space complexity could also be included, but we usually don't run into space-complexity problems with RL learners.
Drawback of Policy Iteration
One drawback of policy iteration is that each iteration includes a policy evaluation step, which is itself computationally expensive and requires several sweeps through the state set. To overcome this limitation, it was observed that the number of evaluation sweeps can be truncated without losing the convergence guarantee.
In value iteration, the policy evaluation step is truncated to just one sweep over the state set.
We start with the value function of an arbitrary policy and then find a new (improved) value function in an iterative process, until reaching the optimal value function. The optimal policy can then be easily derived from the optimal value function by acting greedily (a one-step Bellman backup followed by an argmax).
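A hedged sketch of value iteration under the same illustrative tabular setup; the stopping tolerance tol is my own assumption:

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    n, k, _ = T.shape
    U = np.zeros(n)
    while True:
        # One truncated "evaluation" sweep folded into improvement (Bellman optimality backup)
        U_new = R + gamma * (T @ U).max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            break
        U = U_new
    pi = (T @ U).argmax(axis=1)      # greedy policy read off the optimal value function
    return pi, U
```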
Classes of Reinforcement Learning algorithms
3 classes of RL algorithms:
- Model Based methods
- Model Free methods
- Policy Search methods
Model Based --> Model Free --> Policy Search
Learning is more direct as you move from left to right.
Model Based Algorithms
<s, a, r> --> Model Learner --> T/R --> MDP Solver --> Q* --> argmax --> π
- Send the (state, action, reward) tuples to the Model Learner.
- The Model Learner learns the transition function and reward; those T/R become the input to the MDP solver.
- The MDP solver outputs the optimal value function Q*.
- Apply argmax to Q* at the state you are in, and it gives you the action to take in that state.
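A rough sketch of what the "Model Learner" box could look like: estimate T and R by counting over (s, a, r, s') tuples, then hand the estimates to any MDP solver (e.g. the value-iteration sketch above). All names here are illustrative assumptions, not code from the course:

```python
import numpy as np

def learn_model(experience, n_states, n_actions):
    """experience: iterable of (s, a, r, s_next) tuples with integer state/action ids."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, a, r, s_next in experience:
        counts[s, a, s_next] += 1                # transition counts
        reward_sum[s] += r                       # accumulate R(s) samples
        visits[s] += 1
    # Empirical estimates (unvisited entries are left at zero in this sketch)
    T_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    R_hat = reward_sum / np.maximum(visits, 1)
    return T_hat, R_hat
```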
Model Free Algorithms
Model Free algorithms are VALUE FUNCTION BASED algorithms
<s, a, r> --> Value Update --> Q* --> argmax --> π
The input and output are the same as in the model-based case: the input is the (state, action, reward) tuple and the output is a policy.
We update the value function, feeding Q back into the value update.
It directly learns the Q value function; simply apply argmax to Q at your state and it gives you the action you should take in that state.
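A minimal sketch of the "Value Update" box in a Q-learning style; the step size alpha and discount gamma are illustrative defaults:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One model-free update of a tabular Q array from a (s, a, r, s') sample."""
    target = r + gamma * Q[s_next].max()         # bootstrapped target
    Q[s, a] += alpha * (target - Q[s, a])        # move the estimate toward the target

def greedy_action(Q, s):
    return int(np.argmax(Q[s]))                  # argmax over actions in state s
```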
Policy Search based Algorithm
<s, a, r> --> Policy Update --> π
Here you take the policy and feed it back into the policy update, which directly updates the policy based on the (state, action, reward) tuples you receive.
It is more direct, but harder, because updating the policy from state, action and reward is not very straightforward: you get very little feedback with which to update the policy.
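One very naive way to make the "Policy Update" box concrete is random search over tabular policies, keeping a change only when rollouts score better. This is purely an illustrative sketch (evaluate_by_rollouts is an assumed helper), not the method from the lectures; real policy-search algorithms such as policy gradients are more sophisticated:

```python
import numpy as np

def random_policy_search(evaluate_by_rollouts, n_states, n_actions, iters=1000, seed=0):
    """evaluate_by_rollouts(pi) is an assumed helper returning the average return of pi."""
    rng = np.random.default_rng(seed)
    pi = rng.integers(n_actions, size=n_states)  # random initial tabular policy
    best = evaluate_by_rollouts(pi)
    for _ in range(iters):
        candidate = pi.copy()
        s = rng.integers(n_states)               # tweak the action in one state
        candidate[s] = rng.integers(n_actions)
        score = evaluate_by_rollouts(candidate)
        if score > best:                         # keep only improvements
            pi, best = candidate, score
    return pi
```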
Dynamic Programming
- Both Policy iteration and value iteration are Dynamic Programming methods.
- Policy iteration consists of two interacting processes: one makes the value function consistent with the current policy (policy evaluation) and the other makes the policy greedy with respect to the current value function (policy improvement).
- Policy iteration alternates between these two processes, each completing before the other begins.
- In value iteration there is less policy evaluation, interleaved between policy improvements.
- When both of these processes converge (stabilize), i.e. they no longer produce any change, the optimal policy has been reached.
- The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function.
- These two processes are "competing and cooperating" at the same time: making the policy greedy with respect to the value function makes the value function incorrect for the changed policy, and making the value function consistent with the current policy causes the policy to no longer be greedy.
Efficiency of DP -1
- Finite state MDPs can either be solved using Linear programming or Dynamic programming.
- MDPs can be solved with dynamic programming (with a guarantee) in polynomial time. If "n" is the number of states and "k" is the number of actions, a DP method is guaranteed to find an optimal policy in time polynomial in n and k, even though the total number of deterministic policies is k^n (e.g. n = 10 states and k = 2 actions already give 2^10 = 1024 policies).
Efficiency in DP-2
Linear programming methods can sometimes provide better worst-case convergence guarantees than DP, but they become impractical at a much smaller number of states than DP methods do.
Efficiency in DP-3
For large problems, DP methods are still feasible, but they suffer from the "CURSE OF DIMENSIONALITY", where the number of states grows exponentially with the number of state variables.
DP methods are still far better than direct search, because direct search would have to exhaustively examine every policy to provide the same convergence guarantee.
Temporal Difference Learning TD (λ)
A sequence of states (a Markov chain). We are learning to predict the expected future reward.
S0 --r0--> S1 --r1--> S2 --r2--> SF
V_T(S1) = V_{T-1}(S1) + α_T (R_T(S1) - V_{T-1}(S1))
where α_T = 1/T is the LEARNING RATE,
and (R_T(S1) - V_{T-1}(S1)) is the error.
The learning rate decays over time.
Incremental Implementation
So far we have learned that the estimated value is the sample average of observed rewards:
Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n - 1)
According to the equation above, every time we receive a new reward we would have to recompute the average, which would need a lot of memory to store every reward and recompute the sum.
Q_{n+1} = Q_n + (1/n)(R_n - Q_n), where 1/n plays the role of the step size alpha.
New Estimate = Old Estimate + StepSize * (Target - Old Estimate).
(Target - Old Estimate) is the error in the estimate, which is reduced every time by taking a step toward the target. The target indicates the desirable direction to move.
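A tiny sketch of the incremental update: only the running estimate and the count are stored, never the list of past rewards.

```python
def incremental_mean(Q, n, reward):
    """New estimate after the n-th reward (n starts at 1): Q + (1/n)(R_n - Q)."""
    return Q + (1.0 / n) * (reward - Q)

# usage sketch:
# Q, n = 0.0, 0
# for r in rewards:
#     n += 1
#     Q = incremental_mean(Q, n, r)
```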
Why we need Temporal Difference Learning
Monte Carlo (MC) techniques attempt to sample their way to an estimation of a value function. They can do this without explicit knowledge of the transition probabilities and can efficiently sample large state spaces. But they need to run for an entire episode before the agent can update the policy.
Conversely, dynamic programming (DP) methods bootstrap by updating the policy after a single time step. But DP algorithms must have complete knowledge of the transition probabilities and visit every possible state and action before they can find an optimal policy.
Temporal-difference (TD) learning is a combination of these two approaches. It learns directly from experience by sampling, but it also bootstraps. This was a breakthrough in capability: before it, learning was either too slow to make problems tractable or required a full model of the environment.
Empirically, TD(λ) tends to perform best for λ values in the range of roughly 0.3 to 0.7.
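A minimal sketch of a TD(0)-style update, which samples one step of experience and bootstraps off the current estimate of the next state; alpha and gamma are illustrative:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value table V from a single (s, r, s') step."""
    td_error = r + gamma * V[s_next] - V[s]      # sampled, bootstrapped target minus estimate
    V[s] += alpha * td_error
```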