Module 8 Flashcards
What is reinforcement learning?
- learning based on rewarding desired behaviors and punishing undesired ones
What is a reinforcement learning agent capable of?
- it is able to perceive and interpret its environment, take actions, and learn through trial and error
Where can reinforcement learning operate?
- in any environment where a clear reward signal can be applied
What is an optimal policy?
- the policy that yields the highest expected utility
What does a Markov decision process contain?
- Possible world states S
- A set of models
- A set of possible actions A
- A reward function R(s, a)
- A policy, the solution of the MDP (these components are sketched in code below)
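A minimal Python sketch of these components; the class and the two-state toy example are illustrative assumptions, not from the course material:

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class MDP:
    """Toy container mirroring the four components listed above."""
    states: Set[str]                              # possible world states S
    actions: Callable[[str], Set[str]]            # A(s): actions available in s
    transition: Callable[[str, str, str], float]  # T(s, a, s'): the model
    reward: Callable[[str, str], float]           # R(s, a)

# A made-up two-state example, reused in the sketches below:
toy = MDP(
    states={"cool", "hot"},
    actions=lambda s: {"fast", "slow"},
    transition=lambda s, a, s2: 1.0 if s2 == s else 0.0,  # deterministic: stay put
    reward=lambda s, a: 2.0 if a == "fast" else 1.0,
)
```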
What is a state in an MDP?
- the set of tokens that represent every state the agent can be in
What is a model / transition model in an MDP?
- Gives an action’s effect in a state
How is the transition model defined?
- defined by T(s, a, s')
- taking action a in state s ends in state s'
How does the model differ for stochastic actions?
- add a probability P(s' | s, a), the probability of reaching s' given state s and action a (see the sketch below)
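A stochastic model can be sketched as a table of next-state distributions. All probabilities here are invented for illustration and reuse the toy states above:

```python
# P(s' | s, a): for each (state, action) pair, a distribution over next states.
P = {
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("cool", "slow"): {"cool": 1.0},
    ("hot",  "fast"): {"hot": 1.0},
    ("hot",  "slow"): {"cool": 0.5, "hot": 0.5},
}

# Each distribution must sum to 1; a deterministic T(s, a, s') is the
# special case where a single next state has probability 1.
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```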
What is the key feature of the Markov property?
- the effects of an action taken in a state depend only on that state, not on the prior history
What is an action in an MDP?
- the set of all possible actions
- A(s) defines the set of actions that can be taken in state s
What is a reward in an MDP?
- a real-valued reward function
- R(s) indicates the reward for being in state s
- R(s, a) indicates the reward for being in state s and taking action a
- R(s, a, s') indicates the reward for ending in state s' after taking action a in state s
What is a policy in an MDP?
- the solution to the MDP
- a mapping from states to actions
- pi(s) indicates the action a to be taken while in state s (see the lookup-table sketch below)
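Concretely, a policy can be stored as a simple lookup table from state to action; this particular mapping is arbitrary, purely for illustration:

```python
# pi: S -> A, one chosen action per state (values are made up).
pi = {"cool": "fast", "hot": "slow"}

def act(state: str) -> str:
    """Return the action the policy prescribes for this state."""
    return pi[state]
```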
What do MDP solutions usually involve?
Dynamic programming
- recursively breaking the problem into pieces while remembering the optimal solution to each piece (see the value-iteration sketch below)
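As a sketch of that dynamic-programming idea, here is minimal value iteration, a standard MDP solution method (the cards don't name one). The discount factor gamma is an added assumption, and the example reuses the toy MDP and the table P from above:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update
       V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s,a) * V(s') ]
    until the values stop changing; gamma < 1 is an assumed discount factor."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R(s, a) + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Example with the toy MDP defined earlier:
V = value_iteration(toy.states, toy.actions, P, toy.reward)
```

The optimal policy pi* can then be read off by taking, in each state, the action that attains the max in the Bellman update.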
How is the quality of a policy measured?
- measured by its expected utility
- the optimal policy, the one with the highest expected utility, is denoted pi*
What is the goal of an MDP, and what role does RL play?
Goal: maximize cumulative reward over the long term
RL: the transition model and rewards are usually not known in advance, so the agent must learn
- how to change its policy given experience
- how to explore the environment (see the Q-learning sketch below)
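One standard way to learn from experience without a known model is tabular Q-learning (my choice of example; the cards don't name an algorithm). A minimal sketch, assuming epsilon-greedy exploration and illustrative hyperparameters:

```python
import random

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from an experienced transition (s, a, r, s').
    alpha (step size) and gamma (discount) are illustrative assumptions."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions(s2))
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore a random action with probability epsilon, else exploit Q."""
    acts = list(actions(s))
    if random.random() < epsilon:
        return random.choice(acts)
    return max(acts, key=lambda a: Q.get((s, a), 0.0))
```

Repeating epsilon_greedy to act and q_update on each observed transition addresses both bullets above: the policy improves from experience while epsilon keeps the agent exploring.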
Describe episodic vs. continuing tasks in an MDP (optimality/horizon)
Episodic
- finite horizon: the game ends after N steps
- the optimal policy depends on N, which makes it harder to analyze
- the policy depends on time, i.e. it is nonstationary
Continuing tasks
- infinite horizon: no time limit
- the optimal action depends only on the current state, so the policy is stationary