Module 8 Flashcards
What is reinforcement learning
- based on rewarding desired behaviors / punishing undesired ones
What is a reinforcement learning agent capable of
- it can perceive and interpret its environment, take actions, and learn through trial and error
Where can reinforcement learning operate?
- in any environment, as long as a clear reward signal can be applied
What is optimal policy
- the policy that yields the highest expected utility
What does a Markov decision process contain
- Possible world states S
- Set of models (transition models)
- Set of possible actions A
- reward function R(s,a)
- A policy π, the solution of the MDP
What is a state in MDP
- set of tokens that represent every state the agent can be in
What is a model / transition model in MDP
- Gives an action’s effect in a state
How is the transition model defined
- defined by T(S, a, S')
- in state S, taking action a ends in state S'
How does the model differ for stochastic actions?
- add a probability P(S' | S, a): the probability of ending in S' given state S and action a (see the sketch below)
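A minimal sketch of how such a stochastic transition model could be stored, using a made-up two-state fragment (the state names and probabilities are illustrative only):

```python
# T[s][a] maps each possible next state s' to P(s' | s, a).
T = {
    "s1": {
        "right": {"s2": 0.8, "s1": 0.1, "s3": 0.1},  # intended move succeeds 80% of the time
        "up":    {"s3": 0.9, "s1": 0.1},
    },
}

def transition_prob(T, s, a, s_next):
    """Return P(s' | s, a), or 0.0 if the transition was never defined."""
    return T.get(s, {}).get(a, {}).get(s_next, 0.0)

print(transition_prob(T, "s1", "right", "s2"))  # 0.8
```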
What is the key feature of the Markov property
- the effects of an action taken in a state depend only on that state, not on prior history
What is an action in MDP
- the set of all possible actions
- A(s) defines the set of actions that can be taken in state s
What is a reward in MDP
- a real-valued reward function
- R(s) indicates the reward for being in state s
- R(s, a) indicates the reward for being in state s after taking action a
- R(s, a, s') indicates the reward for ending in state s' from s after action a
What is policy in MDP
- solution to the MDP
- maps states to actions
- indicates the action a to be taken while in state s
What do MDP solutions usually involve?
dynamic programming
- recursively breaking a problem into pieces while remembering optimal solutions to each piece
How is the quality of a policy measured
- measured through expected utility
- the optimal policy (highest expected utility) is denoted π*
What is the goal of MDP and what role does RL play
- Goal: maximize cumulative reward in the long term
- RL: transitions and rewards are usually not known in advance
- how to change the policy given experience
- how to explore the environment
Describe Episodic vs continuing tasks in MDP (optimality/horizon)?
Episodic
- finite horizon: the game ends after N steps
- the optimal policy depends on N, which is harder to analyze
- the policy depends on time, i.e., it is nonstationary
Continuing tasks
- infinite horizon: no time limit
- the optimal action depends only on the current state and is stationary
What are additive rewards
- the utility of a state sequence is the plain (undiscounted) sum of its rewards
- this yields infinite utilities for continuing tasks
What are discounted rewards
- the utility is a sum of rewards weighted by a discount factor γ, with 0 < γ ≤ 1; γ describes an agent's preference for current rewards over future rewards
- when γ is close to 0, rewards in the distant future are insignificant
- when γ is close to 1, the agent is more willing to wait for long-term rewards
- when γ is exactly 1, discounted rewards reduce to the special case of purely additive rewards (see the sketch below)
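A small sketch of computing a discounted return over a finite reward sequence (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 0, 0, 10]                  # hypothetical rewards from one trial
print(discounted_return(rewards, 0.9))   # 1 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, 1.0))   # gamma = 1 reduces to additive rewards: 11
```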
What is the utility of the state
- the expected reward for the next transition plus the discounted utility of the next state, assuming the agent chooses optimally
- given by the Bellman equation (written out below)
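Written out, the Bellman equation for the state utility (assuming a reward of the form R(s) and discount factor γ) is:

```latex
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')
```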
What is the state value function
- denoted U^π(s)
- the expected return when starting in s and following π
What is the state-action value function
- denoted Q^π(s, a), a.k.a. the Q-function
- the expected return when starting in s, performing a, and then following π
What are value functions useful for
- useful for finding the optimal policy
- can be estimated from experience
- pick the best action using the Q-function (see the sketch below)
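A minimal sketch of picking the best action from an estimated Q-function stored as a dict of dicts (the states, actions, and values are hypothetical):

```python
# Q[s][a] holds the estimated return for taking action a in state s and following the policy after.
Q = {"s1": {"up": 0.4, "right": 0.7},
     "s2": {"up": 0.1, "right": 0.3}}

def greedy_action(Q, s):
    """Choose the action with the highest estimated Q-value; no model T or R is needed."""
    return max(Q[s], key=Q[s].get)

print(greedy_action(Q, "s1"))  # 'right'
```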
How does RL differ from MDP
- Don’t know transition model T or Reward R
- must try actions and states to learn
What are the basic ideas of RL
Exploration - you have to try unknown actions to get info
Exploitation - you have to use what you know
Sampling - you may need to repeat many times to get good estimates
Generalization - what you learn in one state may apply to others
How do you receive feedback in RL, and what is the agent's utility in RL
- receive feedback in form of rewards
- utility is reward function
Offline vs Online
- solving a known MDP is done offline; RL is learned online through interaction
What is the idea of model based learning
- the agent uses the transition model of the environment to make decisions
- Assumes learned model is correct
- learns an approximate model based on experiences
What is step 1 of MBL
Learn an empirical MDP model
- count outcomes s' for each (s, a)
- normalize to get an estimate of T(s, a, s')
- record each reward R(s, a, s') as it is experienced (see the sketch below)
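A sketch of step 1, assuming experience arrives as (s, a, r, s') tuples; the experience data below is made up:

```python
from collections import defaultdict

# Hypothetical experience gathered while acting: (s, a, reward, s') tuples.
experience = [("s1", "right", 0, "s2"), ("s1", "right", 0, "s2"),
              ("s1", "right", -1, "s1"), ("s2", "up", 10, "s3")]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
R_hat = {}                                       # R_hat[(s, a, s')] = observed reward

for s, a, r, s_next in experience:
    counts[(s, a)][s_next] += 1
    R_hat[(s, a, s_next)] = r

# Normalize the counts to get the empirical transition model T_hat(s, a, s').
T_hat = {(s, a): {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
         for (s, a), outcomes in counts.items()}

print(T_hat[("s1", "right")])   # {'s2': 0.666..., 's1': 0.333...}
```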
What is step 2 of MBL
- solve the learned MDP (see the value iteration sketch below)
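One common way to do step 2 is value iteration on the learned model; a minimal sketch assuming the T_hat and R_hat structures built in the step 1 sketch above:

```python
def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, iters=100):
    """Repeatedly apply the Bellman update using the learned (approximate) model."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        # For each state, back up the best action's expected discounted value.
        U = {s: max(sum(p * (R_hat.get((s, a, s2), 0.0) + gamma * U.get(s2, 0.0))
                        for s2, p in T_hat.get((s, a), {}).items())
                    for a in actions)
             for s in states}
    return U
```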
Pros and Cons of MBL
Pro - makes efficient use of experiences
Con - may not scale to large state spaces
- learns model one state-action pair at a time
What is the simplified task of passive reinforcement learning
- Policy evaluation
- Input: a fixed policy π(s)
- the agent tries to learn the utility U^π(s)
- the transition model and rewards are not known; the goal is to learn the state values
Run through passive reinforcement learning
- the agent executes a set of trials in the environment using the policy
- the agent starts in the initial state and reaches one of the terminal states
- the agent's percepts supply the current state and the reward for the transition that reached that state
What is the utility in Passive reinforcement learning
- defined as the expected sum of (discounted) rewards obtained if the policy is followed
What is the goal of direct evaluation
- compute the value of each state under π
What is the idea of Direct Evaluation
Average observed sample values
- act according to π
- write down the sum of discounted rewards observed from each visit to a state
- average those samples (see the sketch below)
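A sketch of direct evaluation, assuming each trial has already been turned into (state, discounted return from that visit onward) samples; the numbers are made up:

```python
from collections import defaultdict

# (state, observed discounted return from that visit onward) samples from several trials.
samples = [("s1", 8.2), ("s1", 7.6), ("s2", 10.0), ("s2", 9.4), ("s2", 9.7)]

totals, visits = defaultdict(float), defaultdict(int)
for s, ret in samples:
    totals[s] += ret
    visits[s] += 1

# Each state's value estimate is just the average of its observed returns.
U_hat = {s: totals[s] / visits[s] for s in totals}
print(U_hat)   # {'s1': 7.9, 's2': 9.7}
```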
What does direct evaluation do to RL
- reduces it to standard supervised learning on (state, observed return) pairs
Pros and cons of direct evaluation
Pros
- Easy to understand
- No knowledge of T, R required
- eventually computes correct avg values using just sample transitions
Cons
- wastes information about state connections
- each state learned separately
- violates Bellman equations
- slow
Why not use policy evaluation
- although it exploits the connections between states, it still needs T and R
What is the idea of sample based policy evaluation
take samples of outcomes by doing the action and then average
What is the idea of Temporal difference learning TDL
- update U(s) each time we experience a transition
- likely outcomes will contribute to updates more often
How does TDL learn
- the policy is still fixed; we are still evaluating it
- keep a running average: move values toward the value of the successor state (see the TD update sketch below)
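A minimal sketch of the TD(0) update applied to a single observed transition; the learning rate alpha and the example values are assumptions:

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move U[s] toward the sample value r + gamma * U[s'] (an exponential running average)."""
    sample = r + gamma * U[s_next]
    U[s] += alpha * (sample - U[s])

U = {"s1": 0.0, "s2": 5.0}
td_update(U, "s1", r=1.0, s_next="s2")   # sample = 1 + 0.9 * 5 = 5.5
print(U["s1"])                           # 0.55 (moved 10% of the way toward 5.5)
```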
Problems of TD value learning and solution
- the values alone cannot be turned into a new policy (acting on them would require T and R)
- solution: learn Q-values instead, which makes action selection model-free (see the sketch below)
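A sketch of the Q-learning update this points to: it learns Q(s, a) directly from sampled transitions, so the greedy action can be chosen without T or R (the parameters and data layout are assumptions):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning: move Q[s][a] toward r + gamma * max over a' of Q[s'][a']."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```

Acting greedily with respect to the learned Q then yields a policy without ever needing the transition model.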
Known MDP vs Unknown MDP
Known MDP
- offline solution
- policy evaluation
Unknown MDP, model-based
- a fixed policy is evaluated on an approximate learned MDP
Unknown MDP, model-free
- evaluate a fixed policy via value learning
- Q-learning
- passive RL
- direct evaluation
- TD learning