Quiz 5 Flashcards
MDP: State
The possible situations (configurations of the world) the agent can be in.
MDP: Actions
The set of actions the agent can take, given its current state.
MDP: Environment
- Produces the state that the agent perceives
- Gives rewards to the agent for the actions it takes
- May be unknown, non-linear, stochastic, and complex
Dynamic programming methods for solving MDPs
Bellman Optimality Equation - update the value table at each iteration by applying the Bellman equation until convergence (value iteration).
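A minimal sketch of that update loop (illustrative only; it assumes the MDP is given as hypothetical `P[s][a]` lists of `(prob, next_state)` pairs and a reward vector `R`):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s] is the reward in state s."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Bellman optimality update: reward plus best discounted expected next value.
            V_new[s] = R[s] + gamma * max(
                sum(p * V[s2] for p, s2 in P[s][a]) for a in range(len(P[s]))
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```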
RL: Why is setting the data-gathering policy to be the same as the greedy training policy a bad idea?
- A greedy policy has little incentive to explore less immediately rewarding states that may lead to higher long-term reward
- Breaks the IID assumption on the training data
State value function (V-function)
“Expected discounted sum of rewards from state s”
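As a formula (standard notation; γ is the discount factor, r_t the reward at step t, and the expectation is over trajectories generated by the policy π):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s\right]$$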
State-action value function (Q-value)
“Expected cumulative reward upon taking action a in state s”
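In the same notation:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a\right]$$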
RL: 4 challenges of RL
- Evaluative feedback - need trial and error to find the right action
- Delayed feedback - actions may not lead to immediate reward
- Non-stationary - Data distribution of visited states changes when policy changes
- Fleeting nature of time and online data
RL: Components of DQN
- Experience replay
- Epsilon-greedy exploration
- Q-update
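A rough sketch of the epsilon-greedy selection and the Q-update in PyTorch (hedged: the `q_net`/`q_target` networks, the optimizer, and the sampled `batch` tensors are assumed; the replay buffer itself is sketched under the Experience replay card below):

```python
import random
import torch
import torch.nn.functional as F

def select_action(q_net, state, epsilon, n_actions):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def q_update(q_net, q_target, optimizer, batch, gamma=0.99):
    # batch is a tuple of tensors sampled from the replay buffer.
    states, actions, rewards, next_states, dones = batch
    # Current estimate Q(s, a) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Q-update target: r + gamma * max_a' Q_target(s', a'), with Q_target held fixed.
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_target(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```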
MDP: Model
The transition function: given a state s and an action a, the probability P(s′ | s, a) that the agent ends up in the next state s′.
MDP: Policy
A mapping that gives an action for each state the agent can be in. RL attempts to find the optimal policy, which maximizes the expected cumulative reward.
MDP: Markovian property
Only the present matters: the next state depends only on the current state and action, not on the full history.
Bellman’s Equation
The true utility of a state is its immediate reward plus all discounted future rewards (utility)
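One standard way to write this (assuming reward R(s), discount γ, and transition model P):

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$$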
Difference between value iteration and policy iteration
VI: Finds optimal value functions + policy extraction (just once)
PI: Policy evaluation + policy improvement (repeated)
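For comparison, a compact sketch of the policy-iteration loop, using the same hypothetical `P`/`R` MDP representation as the value-iteration sketch above:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s] is the reward in state s."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: iterate the Bellman expectation equation for the fixed policy.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([
                R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
                for s in range(n_states)
            ])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to the evaluated values.
        new_policy = np.array([
            int(np.argmax([sum(p * V[s2] for p, s2 in P[s][a]) for a in range(len(P[s]))]))
            for s in range(n_states)
        ])
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```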
Experience replay
The agent keeps a memory bank (replay buffer) that stores past experiences. Instead of learning only from the most recent experience, it samples from the memory buffer.
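A minimal replay-buffer sketch (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```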
REINFORCE (policy gradient)
- Define parameterized policy
- Generate trajectories by running the policy, collecting states, actions, and rewards
- Compute the objective function (expected sum of rewards over all time steps)
- Compute the gradient of the objective with respect to the policy parameters
- Update policy params
- Repeat until convergence
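A rough single-update sketch of these steps in PyTorch (hedged: the `policy` network, the optimizer, and an `env` with the classic Gym-style `reset()`/`step()` interface are assumed, and actions are discrete):

```python
import torch

def reinforce_update(policy, optimizer, env, gamma=0.99):
    # 1. Generate a trajectory by running the current parameterized policy.
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # 2. Compute the discounted return from each time step.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # 3. Objective is E[sum_t log pi(a_t | s_t) * G_t]; minimize its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    # 4. Update the policy parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```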
Drawbacks of policy gradients
Coarse rewards: credit cannot be assigned to the subset of actions within a trajectory that were actually good or bad.
How does experience replay solve problem of correlated data
By randomly sampling from the replay buffer, the training data becomes less correlated. This helps to stabilize and accelerate the learning process.
Difference between Q-learning and Deep Q-Networks
How Q-values are represented.
Q-learning uses a table indexed by discrete states and actions.
DQN uses NN to approximate Q-values.
VI: Time complexity per iteration
O(|S|^2 |A|) - for each of the |S| states, maximize over |A| actions, each requiring a sum over up to |S| successor states.
VI / Q-learning - how do they differ in how they perform updates?
Q-learning's updates loop over actions as well as states (a value is maintained for every state-action pair).
Why do policy iteration?
The policy typically converges in fewer iterations than the value function does.
Deep Q-learning - What 2 things to do for stability during learning
- Freeze Q_old and update Q_new parameters
- Set Q_old <- Q_new at regular intervals
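In PyTorch, that periodic sync is typically a single parameter copy (a minimal sketch; the tiny `nn.Linear` here stands in for a real Q-network):

```python
import copy
import torch.nn as nn

q_new = nn.Linear(4, 2)        # stand-in for the online Q-network
q_old = copy.deepcopy(q_new)   # frozen target network, not updated by the optimizer

# Every C steps, copy the online network's parameters into the frozen target.
q_old.load_state_dict(q_new.state_dict())
```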
Loss for Deep Q-learning
MSE loss between the predicted Q-value and the target r + γ max_a′ Q_old(s′, a′).
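Written out (the standard DQN objective; θ are the parameters of Q_new, and Q_old is held fixed):

$$L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{new}}(s, a; \theta)\big)^{2}\Big]$$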
Dependency of value/policy iteration
Must know transition and reward functions
2 strategies if transition and reward function unknown
- Estimate transition / reward function.
- Estimate Q-values from data (DQNs, etc)
What 2 components of traditional RL does policy gradient not require?
- Environment model
- Reward function
Policy gradient: likelihood ratio policy gradient
Increases the (log) probability of trajectories with high reward and decreases the (log) probability of trajectories with low reward.
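In equation form (the standard likelihood-ratio / REINFORCE gradient, where τ is a trajectory and R(τ) its total reward):

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\left(\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\right) R(\tau)\right]$$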
Key difference between TD Learning and SARSA
TD: The action in the next state can be any action; the update is based on the expected value over all possible next actions.
SARSA: The action in the next state is the one actually taken in the environment; the update is based on the Q-value of that action.
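As update rules (hedged: the "TD" side is written here as the Q-learning update with a max over next actions; if the card's "expected value over all possible next actions" is meant literally, that corresponds to Expected SARSA, which averages under the policy instead of taking the max):

$$\text{Q-learning:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]$$

$$\text{SARSA:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$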