Final Exam CS 7642 Flashcards

1
Q

Markovian Property

A

The future is independent of the past given the present.
The probability of ending up in state s' given that you started in state s and took action a depends only on s and a, not on the history before them.
You can make anything into a Markov process by making certain that your current state s remembers anything you need to remember from the past.

2
Q

Problem with the Markov property

A

In the lecture (Smoov & Curly's Bogus Journey) it is mentioned that you can turn any process into a Markov process by making certain that the current state remembers everything from the past. The problem with this is that if every state has to remember everything from the past, each state is effectively unique and gets visited only once, so it becomes very difficult to learn anything.

3
Q

Characteristics of the Markov Property

A
  1. The model of the environment is stationary: the rules don't change over time.
  2. A transition probability graph provides a complete representation of a discrete-time, finite-state Markov chain model.
4
Q

What are delayed rewards

A

In RL, the reward for an action may only arrive many time steps later. It is also assumed that rewards in the distant future are not as valuable as immediate rewards.
-> Minor changes in the reward function matter

5
Q

What is the Temporal Credit Assignment Problem

A

Over a sequence of time steps, working out which actions deserve credit for a reward that only arrives later is the temporal credit assignment problem. At any state, actions that lead to higher reward in the future should be assigned more value.

6
Q

Why Rewards in RL

A

Rewards are essentially domain knowledge that we inject into the system.

7
Q

Policy and utilities

A

-> The policy depends on the time horizon: with a finite horizon the best action from a state can change as time runs out, whereas with an infinite horizon the optimal policy is stationary.
-> If Utility(s0, s1, s2, s3, ...) > Utility(s0, s1', s2', s3', s4', ...)
then:
Utility(s1, s2, s3, s4, ...) > Utility(s1', s2', s3', s4', ...)
-> If I prefer one sequence of states over another sequence of states today, then I would prefer that same sequence over the other tomorrow.
-> This is called stationarity of preferences over sequences.
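A consequence worth keeping in mind (under the usual assumption that utilities add up rewards): stationary preferences force the utility of an infinite sequence of states to take the discounted-sum form used on the later cards:

$$U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^t R(s_t), \qquad 0 \le \gamma < 1$$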

8
Q

Optimal Policy

A

Maximizes the long-term expected reward.

The optimal policy is the one that, for every state, returns the action that maximizes the expected utility.
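In symbols (using the same T and U as Eq. 1 on the Utility card):

$$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U(s')$$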

9
Q

Reward and Utility

A

The reward for entering a state is not the same as the utility of entering that state. Reward is immediate feedback; utility is the accumulated reward from that point on (long-term feedback).

10
Q

Utility

A

U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s') ————— Eq. 1

n states, n equations, non-linear because of the max

11
Q

Finding Policies

A
  1. START with a policy π0 (a guess)
  2. EVALUATE: given π_t, calculate U_t = U^{π_t}, the utility you get by following that policy
  3. IMPROVE: π_{t+1}(s) = argmax_a Σ_{s'} T(s, a, s') U_t(s')

U_t(s) = R(s) + γ Σ_{s'} T(s, π_t(s), s') U_t(s') ———— Eq. 2
n states, n equations, and linear, so it can be solved exactly with matrix methods (see the sketch below).
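A minimal numpy sketch of one evaluate/improve round, assuming the MDP is given as an array T[s, a, s'] of transition probabilities, a reward vector R[s], and a discount gamma (these names are illustrative, not from the course code):

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma):
    """Solve Eq. 2 exactly: U = R + gamma * T_pi U (n linear equations)."""
    n = len(R)
    T_pi = T[np.arange(n), pi]                      # n x n matrix of T(s, pi(s), s')
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def improve_policy(T, U):
    """Greedy improvement: pi(s) = argmax_a sum_s' T(s, a, s') U(s')."""
    return np.argmax(T @ U, axis=1)                 # T @ U has shape (n states, k actions)
```

Alternating these two functions until the policy stops changing is the loop described on the Policy Iteration cards.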

12
Q

Policy Iteration

A

In policy iteration you randomly select a policy and find the value function corresponding to it.
Policy evaluation then makes the value function consistent with the current policy,
and policy improvement finds a new (improved) policy by making it greedy with respect to the current value function,

and so on; this process leads to the optimal policy.

13
Q

More about policy Iteration

A
  1. Sequence of monotonically improving policies and value functions:
    π0 --E--> V_π0 --I--> π1 --E--> V_π1 --I--> π2 --E--> V_π2 ...

E is policy evaluation
I is policy improvement

Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.

Guaranteed convergence!!

14
Q

What are we trying to learn?

Behaviour structures

A
  1. PLAN:
    -> A fixed sequence of actions
    -> Problems: what to do during learning (vs. after learning), and it breaks under stochasticity
  2. CONDITIONAL PLAN:
    -> Instead of a fixed sequence of actions, it has "if"/"else" branching
    -> A sequence of actions, then branching: just-in-time planning
  3. STATIONARY POLICY / UNIVERSAL PLAN:
    -> A mapping from states to actions; handles stochasticity very well
    -> Effectively an "if" at every possible state
    -> Very large
    -> There is always an optimal stationary policy, so it can express the optimal behaviour
15
Q

Evaluating a policy

A
  1. Map the state transitions to a sequence of immediate rewards
  2. Truncate the sequence to a finite horizon
  3. Summarize the sequence as the discounted sum Σ γ^i r_i
  4. Or summarize the sequence as the average reward (a sketch of steps 2-4 is below)
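A tiny sketch of steps 2-4, assuming `rewards` is one sampled reward sequence and `gamma` the discount factor (hypothetical inputs):

```python
def discounted_return(rewards, gamma, horizon=None):
    """Sum of gamma^i * r_i over the (optionally truncated) sequence."""
    rewards = rewards[:horizon]          # truncate to a finite horizon
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def average_reward(rewards, horizon=None):
    """Average-reward summary of the same (truncated) sequence."""
    rewards = rewards[:horizon]
    return sum(rewards) / len(rewards)

# Example: discounted_return([1, 0, 2], gamma=0.9) == 1 + 0.9 * 0 + 0.9**2 * 2
```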
16
Q

Evaluating a learner

A

Parameters to evaluate a learner:

  1. Value of the returned policy (how good the returned policy is)
  2. Computational complexity (time)
  3. Sample complexity (how much data it needs to learn)

Space complexity could also be included, but we usually don't run into space limits with RL learners.

17
Q

Drawback of Policy Iteration

A

One drawback of policy iteration is that each iteration contains a policy evaluation step, which is itself computationally expensive and requires several sweeps over the state set. To overcome this limitation, it was observed that the number of evaluation sweeps can be truncated without losing the convergence guarantee.
In value iteration, the policy evaluation step is truncated to just one sweep over the state set.
We start with the value function of an arbitrary policy and then find a new (improved) value function in an iterative process, until we reach the optimal value function. The optimal policy can then easily be derived from the optimal value function using the Bellman operator (a one-step greedy lookahead).
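A minimal value-iteration sketch along these lines, reusing the illustrative T[s, a, s'], R[s], and gamma arrays assumed on the Finding Policies card:

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Repeat one-sweep Bellman backups until the utilities stop changing."""
    U = np.zeros(len(R))
    while True:
        U_new = R + gamma * np.max(T @ U, axis=1)   # one sweep over the state set
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new

# Greedy (optimal) policy from the converged utilities:
# pi = np.argmax(T @ U, axis=1)
```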

18
Q

Classes of Reinforcement Learning algorithms

A

3 classes of RL algorithms:

  1. Model Based methods
  2. Model Free methods
  3. Policy Search methods

Model-based -----> Model-free -----> Policy search
Learning is more direct from left to right.

19
Q

Model Based Algorithms

A
  • (s, a, r) --> Model Learner --> Transition function / Reward --> MDP Solver --> Q* --> argmax --> π
  1. The (state, action, reward) tuples are sent to the model learner
  2. The model learner learns the transition and reward functions; those T/R become the input to the MDP solver
  3. The MDP solver outputs the optimal value function Q*
  4. Apply argmax at the state you are in, and it gives you which action to take in that state (see the sketch of the model learner below)
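A rough sketch of the "model learner" box, assuming experience arrives as (s, a, r, s') tuples and estimating T and R by counting (class and method names here are made up for illustration):

```python
from collections import defaultdict

class ModelLearner:
    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s, a) -> summed reward
        self.visits = defaultdict(int)                            # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def T(self, s, a, s_next):
        """Empirical transition probability estimate."""
        return self.next_counts[(s, a)][s_next] / self.visits[(s, a)]

    def R(self, s, a):
        """Empirical expected reward estimate."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

# The estimated T and R are then handed to an MDP solver (e.g., value iteration)
# to get Q*, and the policy is pi(s) = argmax_a Q*(s, a).
```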
20
Q

Model Free Algorithms

A

Model-free algorithms are VALUE FUNCTION BASED algorithms.
(s, a, r) --> Value Update --> Q* --> argmax --> π

The input and output are the same as before: the input is the (state, action, reward) tuple and the output is a policy.
We update the value function, feeding Q back into the update.
The algorithm directly learns the Q value function; you then simply apply argmax to Q at your state and it gives you the action you should take in that state (see the sketch below).
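A minimal tabular sketch of the "value update" box, using the Q-learning rule as one canonical model-free update (the table sizes, alpha, and gamma below are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4                  # illustrative sizes
Q = np.zeros((n_states, n_actions))

def value_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])   # reward plus discounted best next value
    Q[s, a] += alpha * (target - Q[s, a])

# Acting is then just: pi(s) = np.argmax(Q[s])
```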

21
Q

Policy Search based Algorithm

A
  • (s, a, r) --> Policy Update --> π
    Here you take the policy and feed it back into the policy update, which directly updates the policy based on the (state, action, reward) tuples you have received.
    It is more direct, but also more complex, because updating the policy from state, action and reward data is not very straightforward: you get very little feedback with which to update the policy (see the example below).
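As one concrete (hedged) example of such a direct update, not taken from these cards: the REINFORCE policy-gradient step nudges the policy parameters θ using only the sampled return G_t from the observed (state, action, reward) data:

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$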
22
Q

Dynamic Programming

A
  1. Both policy iteration and value iteration are dynamic programming methods.
  2. Policy iteration consists of 2 simultaneous, interacting processes: one makes the value function consistent with the current policy (policy evaluation) and the other makes the policy greedy with respect to the current value function (policy improvement).
  3. Policy iteration alternates between these 2 processes, each completing before the other begins.
  4. In value iteration there is less policy evaluation, interleaved between policy improvements.
  5. When both of these processes converge (stabilize), that is, they no longer produce any change, we say that the optimal policy has been reached.
  6. The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function.
  7. These 2 processes are "competing and cooperating" at the same time: making the policy greedy with respect to the value function makes the value function incorrect for the changed policy, while making the value function consistent with the current policy causes the policy to no longer be greedy.
23
Q

Efficiency of DP -1

A
  1. Finite-state MDPs can be solved using either linear programming or dynamic programming.
  2. DP methods are guaranteed to find an optimal policy in time polynomial in the number of states n and the number of actions k, even though the total number of deterministic policies is k^n.
24
Q

Efficiency in DP-2

A

For a small number of states, linear programming may provide better worst-case convergence guarantees than DP, but linear programming methods become impractical at a much smaller number of states than DP does.

25
Q

Efficiency in DP-3

A

For large problems, DP methods are still feasible, but they do suffer from the "curse of dimensionality": the number of states often grows exponentially with the number of state variables.

DP methods are still much better than direct search in policy space, because direct search would have to exhaustively examine each policy to provide the same convergence guarantee.

26
Q

Temporal Difference Learning TD (λ)

A

A sequence of states (a Markov chain); the goal is learning to predict the expected future reward.
S0 --r0--> S1 --r1--> S2 --r2--> SF

V_T(S1) = V_{T-1}(S1) + α_T * (R_T(S1) - V_{T-1}(S1))

where α_T = 1/T is the LEARNING RATE, R_T(S1) is the return observed on trajectory T, and (R_T(S1) - V_{T-1}(S1)) is the error.
The learning rate decays over time (see the sketch below).
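A small sketch of this update, assuming `observed_return` is R_T(s), the return observed from state s on trajectory number T (names are illustrative):

```python
V = {}   # state -> current value estimate
N = {}   # state -> number of trajectories that have passed through the state

def update_value(s, observed_return):
    """V_T(s) = V_{T-1}(s) + alpha_T * (R_T(s) - V_{T-1}(s)), with alpha_T = 1/T."""
    N[s] = N.get(s, 0) + 1
    alpha = 1.0 / N[s]                          # decaying learning rate
    v = V.get(s, 0.0)
    V[s] = v + alpha * (observed_return - v)    # step toward the observed return
```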

27
Q

Incremental Implementation

A

So far we have learnt that the estimated value is the "sample average of observed rewards":
Q_n = (R_1 + R_2 + ... + R_{n-1}) / (n - 1)

According to this equation, every time we receive a new reward we would have to recompute the average from scratch, which requires storing every reward and can need a lot of memory.

The incremental form avoids this:
Q_{n+1} = Q_n + (1/n) * (R_n - Q_n), where 1/n plays the role of the step size α
NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)

(Target - OldEstimate) is the error in the estimate, which is reduced every time by taking a step toward the target. The target indicates the desirable direction to move (see the sketch below).
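A tiny sketch of the incremental form, tracking only the running estimate and the count (illustrative):

```python
Q, n = 0.0, 0   # current estimate and number of rewards seen so far

def observe(reward):
    """Incremental sample average: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""
    global Q, n
    n += 1
    Q += (1.0 / n) * (reward - Q)   # step toward the target (the new reward)
    return Q

# observe(2.0); observe(4.0)  ->  Q == 3.0, matching (2 + 4) / 2
```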

28
Q

Why do we need Temporal Difference Learning

A

Monte Carlo (MC) techniques attempt to sample their way to an estimation of a value function. They can do this without explicit knowledge of the transition probabilities and can efficiently sample large state spaces. But they need to run for an entire episode before the agent can update the policy.

Conversely, dynamic programming (DP) methods bootstrap by updating the policy after a single time step. But DP algorithms must have complete knowledge of the transition probabilities and visit every possible state and action before they can find an optimal policy.

Temporal-difference (TD) learning is a combination of these two approaches. It learns directly from experience by sampling, but it also bootstraps. This was a breakthrough in capability: before it, learning was either so slow that problems became intractable, or it required a full model of the environment.

Empirically, TD(λ) tends to perform best for λ values in the range 0.3 to 0.7.