Final Exam CS 7642 Flashcards
Markovian Property
Future is independent of past given present.
The probability of ending up in state s' given that you started in state s and took action "a" depends only on the current state "s" (and the action), not on the history.
You can make anything into a Markovian process by making certain that your current state "s" remembers anything you need to remember from the past.
Problem with the Markov Property
In Lecture Smoov and bogus … it is mentioned that you can turn any process into a Markovian one by making certain that the current state remembers everything from the past. The problem with this is that if the state has to remember everything from the past, then every state is effectively visited only once, and it becomes very difficult to learn anything.
Characteristics of the Markov Property
- The model/environment is stationary: the rules don't change
- A transition probability graph provides a complete representation of a discrete-time-step, finite-state Markov chain model
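For illustration, a minimal sketch of such a transition probability representation in Python (the 3 states and their probabilities are hypothetical, made up for this example):

```python
import numpy as np

# Hypothetical 3-state Markov chain: T[s, s'] = P(next state = s' | current state = s).
# The model is stationary, so these probabilities never change.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.1, 0.0, 0.9],
])
assert np.allclose(T.sum(axis=1), 1.0)   # each row is a probability distribution

# Markov property: the next-state distribution depends only on the current state.
def step(state, rng=np.random.default_rng()):
    return rng.choice(len(T), p=T[state])
```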
What are delayed rewards?
It is assumed that rewards in the distant future are not as valuable as immediate rewards.
-> Minor changes in the reward function matter
What is the Temporal Credit Assignment Problem?
At any state, actions that lead to higher reward in the future should have more value. Figuring out which actions in a sequence of time steps deserve credit for that (delayed) reward is the temporal credit assignment problem.
Why Rewards in RL
Rewards are basically domain knowledge that we inject into the system.
Policy and utilities
-> The policy depends on the time horizon: with a finite horizon the optimal policy may change as time runs out; with an infinite horizon a stationary policy suffices
-> If Utility(s0, s1, s2, s3, …) > Utility(s0, s1', s2', s3', …)
Then:
Utility(s1, s2, s3, …) > Utility(s1', s2', s3', …)
-> If I prefer one sequence of states over another sequence today, then I would prefer that same sequence over the other tomorrow (with the shared first state removed).
-> This is called stationarity of preferences
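A worked form of this idea (a standard result, consistent with the discounted sums used in Eq. 1 and in the "Evaluating a policy" card below): stationary preferences imply that the utility of a sequence can be written as a discounted sum of rewards,

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \qquad 0 \le \gamma < 1 .
```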
Optimal Policy
Maximizes the long-term expected reward.
The optimal policy is the one that, for every state, returns the action that maximizes the expected utility: π*(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
Reward and Utility
The reward for entering a state is not the same as the utility of entering that state. Reward is immediate; utility is the accumulated reward from that point on (long-term feedback).
Utility
U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')   (Eq. 1)
n states, n equations, non-linear (because of the max)
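Because of the max, Eq. 1 cannot be solved as a linear system, but it can be solved by applying it repeatedly as an update (value iteration). A minimal sketch, assuming a small tabular MDP stored as hypothetical NumPy arrays T with shape (S, A, S) and R with shape (S,):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply Eq. 1 as an update until U stops changing."""
    U = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R(s) + gamma * sum_s' T(s, a, s') * U(s')
        Q = R[:, None] + gamma * (T @ U)
        U_new = Q.max(axis=1)        # the max over actions is what makes Eq. 1 non-linear
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new
```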
Finding Policies
- START with a policy π_0 (a guess)
- EVALUATE: given π_t, calculate U_t = U^{π_t} (the utility you get by following that policy)
- IMPROVE: π_{t+1}(s) = argmax_a Σ_{s'} T(s, a, s') U_t(s')
U_t(s) = R(s) + γ Σ_{s'} T(s, π_t(s), s') U_t(s')   (Eq. 2)
n states, n equations, and linear: can be solved using matrix methods, regression, etc.
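A minimal sketch of the EVALUATE step, solving Eq. 2 as a linear system, using the same hypothetical T (shape (S, A, S)) and R (shape (S,)) arrays as above; `policy` is an array giving π_t(s) for each state:

```python
import numpy as np

def evaluate_policy(T, R, policy, gamma=0.9):
    """Solve Eq. 2 exactly: U = R + gamma * T_pi U  <=>  (I - gamma * T_pi) U = R."""
    n_states = R.shape[0]
    # Transition matrix under the fixed policy: T_pi[s, s'] = T[s, policy[s], s']
    T_pi = T[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
```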
Policy Iteration
In policy iteration you start with a randomly selected policy and find the value function corresponding to it,
then policy evaluation: making the value function consistent with the current policy,
then policy improvement: finding a new (improved) policy by making the policy greedy with respect to the current value function,
and so on; this process leads to the optimal policy.
More about Policy Iteration
- sequence of monotonically improving policies and value functions:
π_0 --e--> V_{π_0} --i--> π_1 --e--> V_{π_1} --i--> π_2 --e--> V_{π_2} --> …
e is policy evaluation
i is policy improvement
Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
Guaranteed convergence!!
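Putting the e/i steps together, a minimal sketch of the full loop (reusing the hypothetical evaluate_policy sketched above and the same assumed T, R layout):

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Alternate evaluation (e) and greedy improvement (i) until the policy stops changing."""
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        U = evaluate_policy(T, R, policy, gamma)     # e: make U consistent with the policy
        Q = R[:, None] + gamma * (T @ U)             # Q[s, a] = R(s) + gamma * sum_s' T(s,a,s') U(s')
        new_policy = Q.argmax(axis=1)                # i: act greedily with respect to U
        if np.array_equal(new_policy, policy):       # no strict improvement left => optimal
            return policy, U
        policy = new_policy
```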
What are we trying to learn?
Behaviour structures
- PLAN:
  -> A fixed sequence of actions
  -> During learning / after learning (when is the plan made?)
  -> Stochasticity (a problem for fixed plans)
- CONDITIONAL PLAN:
  -> Instead of a fixed set of actions…
  -> Has "if" and "else" conditions
  -> A sequence of actions, then branching… just-in-time planning
- STATIONARY POLICY / UNIVERSAL PLAN:
  -> A mapping from states to actions; can handle stochasticity very well
  -> In a stationary policy there is an "if" at every possible state
  -> Very large
  -> There is always an optimal stationary policy
  -> Able to express the optimal behaviour
Evaluating a policy
- Map state transitions to immediate rewards
- Truncate to a finite horizon
- Summarize the sequence: discounted sum, Σ_i γ^i · r_i
- Summarize the sequence: average reward
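A small sketch of the two "summarize sequence" options on a truncated (finite-horizon) sequence of rewards; the reward values here are made up for illustration:

```python
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # hypothetical truncated sequence r_0 ... r_{H-1}
gamma = 0.9

# Discounted sum: sum_i gamma^i * r_i
discounted = sum(gamma**i * r for i, r in enumerate(rewards))

# Average reward over the finite horizon
average = sum(rewards) / len(rewards)

print(discounted, average)
```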