Final Exam CS 7642 Flashcards
Markovian Property
Future is independent of past given present.
The probability of ending up in state s' given that you started in state s and took action "a" depends only on the current state "s" (and the action), not on the history.
You can turn anything into a Markovian process by making certain that your current state "s" remembers whatever you need to remember from the past.
Problem of markov property
In the Smoov & Curly's Bogus Journey lecture it is mentioned that you can turn any process into a Markovian one by making certain that the current state remembers everything from the past. The problem with this is that if the state has to remember everything from the past, every state is effectively unique and gets visited only once, so it becomes very difficult to learn anything.
Characteristics of the Markov Property
- The model/environment is stationary: the rules don't change.
- A transition probability graph provides a complete representation of a discrete-time-step, finite-state Markov chain model.
What are delayed rewards
It is assumed that rewards in the distant future are not as valuable as immediate rewards.
-> Minor changes in reward function matter
What is Temporal Credit Assignment Problem
Actions taken now can lead to higher rewards many time steps in the future. Figuring out which actions in a sequence of time steps deserve the credit (or blame) for a reward received later is the temporal credit assignment problem.
Why Rewards in RL
Rewards are basically domain knowledge that we inject into the system.
Policy and utilities
-> The policy depends on the time horizon: with a finite horizon the optimal policy can change depending on how much time is left (it is non-stationary), while with an infinite horizon it does not depend on time.
-> IF Utility(s0, s1, s2, s3, ...) > Utility(s0, s1', s2', s3', s4', ...)
Then:
Utility(s1, s2, s3, s4, ...) > Utility(s1', s2', s3', s4', ...)
-> If I prefer one sequence of states over another sequence of states today, then I would also prefer that sequence of states over the other one tomorrow.
-> This is called stationarity of preferences.
Optimal Policy
Maximizes the long-term expected reward.
The optimal policy is the one that, for every state, returns the action that maximizes the expected utility.
Reward and Utility
The reward for entering a state is not the same as the utility of entering that state. Reward is immediate; utility is the accumulated reward from that point on (long-term feedback).
Utility
U(s) = R(s) + γ max_a Σ_s' T(s,a,s') U(s')   (Eq. 1)
n states, n equations, non-linear because of the max.
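A minimal sketch of evaluating Eq. 1 for a single state, assuming a tabular MDP stored as NumPy arrays (the names T[s, a, s'], R[s], gamma are my own illustrative placeholders, not course code):

```python
import numpy as np

def bellman_backup(s, U, T, R, gamma):
    """One Bellman optimality backup: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    # T has shape (n_states, n_actions, n_states); R has shape (n_states,)
    expected = T[s] @ U              # expected utility of each action, shape (n_actions,)
    return R[s] + gamma * expected.max()
```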
Finding Policies
- START with an initial policy π_0 (a guess)
- EVALUATE: given π_t, calculate U_t = U^(π_t), the utility you get by following that policy
- IMPROVE: π_{t+1}(s) = argmax_a Σ_s' T(s,a,s') U_t(s')
U_t(s) = R(s) + γ Σ_s' T(s, π_t(s), s') U_t(s')   (Eq. 2)
n states, n equations, and linear: it can be solved with matrix methods (a direct linear solve), regression, etc.
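Since Eq. 2 is linear, the evaluation step can be done with one matrix solve. A hedged sketch under the same illustrative tabular setup, with the policy pi given as an array of action indices:

```python
import numpy as np

def evaluate_policy(pi, T, R, gamma):
    """Solve Eq. 2 exactly as the linear system (I - gamma * T_pi) U = R."""
    n = len(R)
    T_pi = T[np.arange(n), pi]       # T_pi[s, s'] = T(s, pi(s), s')
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)
```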
Policy Iteration
In policy iteration you randomly select a policy and find the value function corresponding to it,
then do policy evaluation: making the value function consistent with the current policy,
then find a new (improved) policy by making the policy greedy with respect to the current value function,
and so on; this leads to the optimal policy.
More about policy Iteration
- sequence of monotonically improving policies and value functions:
π_0 --e--> V_π0 --i--> π_1 --e--> V_π1 --i--> π_2 --e--> V_π2 --> ...
e is policy evaluation
i is policy improvement
Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
Guaranteed convergence!!
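A compact sketch of the evaluate/improve loop above, using state-based rewards R(s) as in Eq. 2 (T, R, gamma are the same illustrative placeholders as in the earlier snippets):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    n, k, _ = T.shape
    pi = np.zeros(n, dtype=int)                  # arbitrary initial policy
    while True:
        # Policy evaluation (e): make U consistent with the current policy (Eq. 2)
        T_pi = T[np.arange(n), pi]
        U = np.linalg.solve(np.eye(n) - gamma * T_pi, R)
        # Policy improvement (i): act greedily with respect to U
        new_pi = (T @ U).argmax(axis=1)          # argmax_a sum_s' T(s,a,s') U(s')
        if np.array_equal(new_pi, pi):           # stable => optimal
            return pi, U
        pi = new_pi
```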
What are we trying to learn?
Behaviour structures
- PLAN:
- -> Fixed sequence of actions
- -> During learning / after learning
- -> Problem: stochasticity
- CONDITIONAL PLAN:
- -> Instead of a fixed set of actions...
- -> Has "IF" and "ELSE" conditions
- -> Sequence of actions, then branching... just-in-time planning
- STATIONARY POLICY / UNIVERSAL PLAN:
- -> Mapping from states to actions; can handle stochasticity very well
- -> In a stationary policy there is an "IF" at every possible state
- -> Very large
- -> There is always an optimal stationary policy
- -> Capable of expressing the optimal behaviour
Evaluating a policy
- Map state transitions to immediate rewards
- Truncate to a finite horizon
- Summarize each sequence: e.g. the discounted sum Σ_i γ^i r_i
- Summarize over sequences: e.g. the average (expected value)
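A tiny sketch of the two ways of summarizing a truncated reward sequence listed above:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^i * r_i over a truncated reward list."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def average_return(rewards):
    """Simple average of the rewards in the sequence."""
    return sum(rewards) / len(rewards)

# e.g. discounted_return([1, 0, 2], gamma=0.9) == 1 + 0 + 0.81 * 2
```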
Evaluating a learner
Parameters to evaluate a learner:
- Value of the returned policy (how good the returned policy is)
- Computational complexity (time)
- Sample complexity (how much data it needs to learn)
Space complexity could also be included, but we usually don't run into space-complexity problems with RL learners.
Drawback of Policy Iteration
One drawback of policy iteration is that each iteration includes a policy evaluation step, which is itself computationally expensive and requires several sweeps through the state set. To overcome this limitation, it was observed that the number of evaluation sweeps can be truncated without losing the convergence guarantee.
In value iteration, the policy evaluation step is truncated to just one sweep over the state set.
We start with the value function of an arbitrary policy and then find a new (improved) value function in an iterative process, until reaching the optimal value function. The optimal policy can then be easily derived from the optimal value function by acting greedily (a one-step Bellman backup followed by an argmax).
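A hedged sketch of value iteration under the same illustrative tabular setup; the stopping tolerance tol is my own assumption:

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    n, k, _ = T.shape
    U = np.zeros(n)
    while True:
        # One truncated "evaluation" sweep folded into improvement (Bellman optimality backup)
        U_new = R + gamma * (T @ U).max(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            break
        U = U_new
    pi = (T @ U).argmax(axis=1)      # greedy policy read off the optimal value function
    return pi, U
```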
Classes of Reinforcement Learning algorithms
3 classes of RL algorithms:
- Model Based methods
- Model Free methods
- Policy Search methods
Model Based --> Model Free --> Policy Search
Learning is more direct as you move from left to right.
Model Based Algorithms
<s, a, r> --> Model Learner --> T/R --> MDP Solver --> Q* --> argmax --> π
- Send the (state, action, reward) tuples to the Model Learner.
- The Model Learner learns the transition function and reward; those T/R become the input to the MDP solver.
- The MDP solver outputs the optimal value function Q*.
- Apply argmax to Q* at the state you are in, and it gives you the action to take in that state.
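A rough sketch of what the "Model Learner" box could look like: estimate T and R by counting over (s, a, r, s') tuples, then hand the estimates to any MDP solver (e.g. the value-iteration sketch above). All names here are illustrative assumptions, not code from the course:

```python
import numpy as np

def learn_model(experience, n_states, n_actions):
    """experience: iterable of (s, a, r, s_next) tuples with integer state/action ids."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, a, r, s_next in experience:
        counts[s, a, s_next] += 1                # transition counts
        reward_sum[s] += r                       # accumulate R(s) samples
        visits[s] += 1
    # Empirical estimates (unvisited entries are left at zero in this sketch)
    T_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    R_hat = reward_sum / np.maximum(visits, 1)
    return T_hat, R_hat
```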
Model Free Algorithms
Model Free algorithms are VALUE FUNCTION BASED algorithms
<s, a, r> --> Value Update --> Q* --> argmax --> π
The input and output are the same as in the model-based case: the input is the (state, action, reward) tuple and the output is a policy.
We update the value function, feeding Q back into the value update.
It directly learns the Q value function; simply apply argmax to Q at your state and it gives you the action you should take in that state.
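A minimal sketch of the "Value Update" box in a Q-learning style; the step size alpha and discount gamma are illustrative defaults:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One model-free update of a tabular Q array from a (s, a, r, s') sample."""
    target = r + gamma * Q[s_next].max()         # bootstrapped target
    Q[s, a] += alpha * (target - Q[s, a])        # move the estimate toward the target

def greedy_action(Q, s):
    return int(np.argmax(Q[s]))                  # argmax over actions in state s
```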
Policy Search based Algorithm
<s, a, r> --> Policy Update --> π
Here you take the policy and feed it back into the policy update, which directly updates the policy based on the (state, action, reward) tuples you receive.
It is more direct, but harder, because updating the policy from state, action and reward is not very straightforward: you get very little feedback with which to update the policy.
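One very naive way to make the "Policy Update" box concrete is random search over tabular policies, keeping a change only when rollouts score better. This is purely an illustrative sketch (evaluate_by_rollouts is an assumed helper), not the method from the lectures; real policy-search algorithms such as policy gradients are more sophisticated:

```python
import numpy as np

def random_policy_search(evaluate_by_rollouts, n_states, n_actions, iters=1000, seed=0):
    """evaluate_by_rollouts(pi) is an assumed helper returning the average return of pi."""
    rng = np.random.default_rng(seed)
    pi = rng.integers(n_actions, size=n_states)  # random initial tabular policy
    best = evaluate_by_rollouts(pi)
    for _ in range(iters):
        candidate = pi.copy()
        s = rng.integers(n_states)               # tweak the action in one state
        candidate[s] = rng.integers(n_actions)
        score = evaluate_by_rollouts(candidate)
        if score > best:                         # keep only improvements
            pi, best = candidate, score
    return pi
```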
Dynamic Programming
- Both Policy iteration and value iteration are Dynamic Programming methods.
- Policy iteration consists of two interacting processes: one makes the value function consistent with the current policy (policy evaluation) and the other makes the policy greedy with respect to the current value function (policy improvement).
- Policy iteration alternates between these two processes, each completing before the other begins.
- In value iteration there is less policy evaluation, interleaved between policy improvements.
- When both of these processes converge (stabilize), i.e. they no longer produce any change, the optimal policy has been reached.
- The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function.
- These two processes are "competing and cooperating" at the same time: making the policy greedy with respect to the value function makes the value function incorrect for the changed policy, and making the value function consistent with the current policy causes the policy to no longer be greedy.
Efficiency of DP -1
- Finite state MDPs can either be solved using Linear programming or Dynamic programming.
- MDPs can be solved with dynamic programming (with a guarantee) in polynomial time. If "n" is the number of states and "k" is the number of actions, a DP method is guaranteed to find an optimal policy in time polynomial in n and k, even though the total number of deterministic policies is k^n (e.g. n = 10 states and k = 2 actions already give 2^10 = 1024 policies).
Efficiency in DP-2
Linear programming methods can sometimes provide better worst-case convergence guarantees than DP, but they become impractical at a much smaller number of states than DP methods do.
Efficiency in DP-3
For large problems, DP methods are still feasible, but they suffer from the "CURSE OF DIMENSIONALITY", where the number of states grows exponentially with the number of state variables.
DP methods are still far better than direct search, because direct search would have to exhaustively examine every policy to provide the same convergence guarantee.
Temporal Difference Learning TD (λ)
A sequence of states (a Markov chain). We are learning to predict the expected future reward.
S0 --r0--> S1 --r1--> S2 --r2--> SF
V_T(S1) = V_{T-1}(S1) + α_T (R_T(S1) - V_{T-1}(S1))
where α_T = 1/T is the LEARNING RATE,
and (R_T(S1) - V_{T-1}(S1)) is the error.
The learning rate decays over time.
Incremental Implementation
So far we have learned that the estimated value is the sample average of observed rewards:
Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n - 1)
According to the equation above, every time we receive a new reward we would have to recompute the average, which would need a lot of memory to store every reward and recompute the sum.
Q_{n+1} = Q_n + (1/n)(R_n - Q_n), where 1/n plays the role of the step size alpha.
New Estimate = Old Estimate + StepSize * (Target - Old Estimate).
(Target - Old Estimate) is the error in the estimate, which is reduced every time by taking a step toward the target. The target indicates the desirable direction to move.
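A tiny sketch of the incremental update: only the running estimate and the count are stored, never the list of past rewards.

```python
def incremental_mean(Q, n, reward):
    """New estimate after the n-th reward (n starts at 1): Q + (1/n)(R_n - Q)."""
    return Q + (1.0 / n) * (reward - Q)

# usage sketch:
# Q, n = 0.0, 0
# for r in rewards:
#     n += 1
#     Q = incremental_mean(Q, n, r)
```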
Why we need Temporal Difference Learning
Monte Carlo (MC) techniques attempt to sample their way to an estimation of a value function. They can do this without explicit knowledge of the transition probabilities and can efficiently sample large state spaces. But they need to run for an entire episode before the agent can update the policy.
Conversely, dynamic programming (DP) methods bootstrap by updating the policy after a single time step. But DP algorithms must have complete knowledge of the transition probabilities and visit every possible state and action before they can find an optimal policy.
Temporal-difference (TD) learning is a combination of these two approaches. It learns directly from experience by sampling, but it also bootstraps. This was a breakthrough in capability: before it, learning was either too slow to make problems tractable or required a full model of the environment.
Empirically, TD(λ) tends to perform best for λ values in the range of roughly 0.3 to 0.7.
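A minimal sketch of a TD(0)-style update, which samples one step of experience and bootstraps off the current estimate of the next state; alpha and gamma are illustrative:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value table V from a single (s, r, s') step."""
    td_error = r + gamma * V[s_next] - V[s]      # sampled, bootstrapped target minus estimate
    V[s] += alpha * td_error
```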