Reinforcement Learning Flashcards
What is reinforcement learning?
Analogous to operant conditioning: an agent interacts with its environment, building up an internal model of it through trial and error.
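A minimal sketch of this trial-and-error interaction loop. The `env` object and its `reset()`/`step()` methods are assumptions for illustration (in the style of Gym-like interfaces), and the agent here just acts at random:

```python
import random

def run_episode(env, actions, max_steps=100):
    """Trial-and-error loop: the agent acts, the environment
    responds with a new state and a reward."""
    state = env.reset()                           # initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)           # agent picks an action
        state, reward, done = env.step(action)    # environment transitions
        total_reward += reward                    # reward reinforces behaviour
        if done:
            break
    return total_reward
```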
Components of an RL agent?
- State
- Transition
- Action
- Reward
What's the difference between deterministic and stochastic actions?
Deterministic: the chosen action always occurs (the resulting transition is certain)
Stochastic: the chosen action only has a chance of occurring, where the probability represents uncertainty in the outcome
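A small sketch of the difference, using hand-made transition tables (state and action names are made up for illustration):

```python
import random

# Deterministic: taking the action always lands in the same next state.
deterministic_T = {("s0", "right"): "s1"}

# Stochastic: the action leads to each next state with some probability.
stochastic_T = {("s0", "right"): [("s1", 0.8), ("s0", 0.2)]}

def step_deterministic(state, action):
    return deterministic_T[(state, action)]

def step_stochastic(state, action):
    next_states, probs = zip(*stochastic_T[(state, action)])
    return random.choices(next_states, weights=probs)[0]
```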
How do we represent solutions?
Using a policy: mapping from states to actions
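In code, a deterministic policy can be as simple as a dictionary from states to actions (the states and actions below are made up for illustration):

```python
# A policy maps each state to the action to take there.
policy = {
    "s0": "right",
    "s1": "right",
    "s2": "up",
}

def act(policy, state):
    return policy[state]
```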
How do we evaluate policies?
Deterministic: sum the total rewards received by following the policy
Stochastic: sum the expected rewards received by following the policy
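A sketch of both cases, reusing the hypothetical `env` interface assumed above: in the deterministic case a single rollout gives the total reward, while in the stochastic case averaging over many rollouts estimates the expected reward:

```python
def rollout(env, policy, max_steps=100):
    """Total reward collected by following the policy once."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy[state])
        total += reward
        if done:
            break
    return total

def evaluate(env, policy, episodes=1000):
    """Expected reward: average the return over many rollouts."""
    return sum(rollout(env, policy) for _ in range(episodes)) / episodes
```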
What are the two RL algorithm types? Describe them.
- Model based - agent knows STAR (states, transitions, actions, rewards)
- Model free - agent does not know T and R and must learn them through trial and error
Explain model based RL.
Agent is given:
- all STATES in the environment
- the set of all ACTIONS in each state
- TRANSITION probabilities between s and s' for a given action
- a REWARD for each action in each state
Explain model free RL.
The agent tries different actions in different states to build an estimate of the TRANSITION probabilities and REWARDS for performing actions.
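One way to build those estimates is to count observed transitions and average observed rewards over many (s, a, r, s') experiences; a minimal sketch (variable names are made up):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
reward_sums = defaultdict(float)                 # (s, a) -> running reward total
visits = defaultdict(int)                        # (s, a) -> number of visits

def record(s, a, r, s_next):
    """Update the empirical estimates after one trial."""
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visits[(s, a)] += 1

def estimated_T(s, a, s_next):
    return counts[(s, a)][s_next] / visits[(s, a)]

def estimated_R(s, a):
    return reward_sums[(s, a)] / visits[(s, a)]
```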
Exploration vs Exploitation
Exploration: gathers more information about the environment
Exploitation: uses known information to maximise reward
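A common way to balance the two is an epsilon-greedy rule: with probability epsilon the agent explores (random action), otherwise it exploits its current estimates. A minimal sketch, assuming a `Q` table of estimated action values:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
```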
Passive vs Active RL
Passive: agent executes a fixed policy then evaluates it
Active: agent updates a policy as it learns
Fully vs Partially Observable Environments
Fully: agent is initialised with the state information and the reward and transition functions - it knows its current state, the actions that transition it to the next state, and the reward for doing so
Partially: agent maintains an internal model of the environment and refines it through trial and error, gradually learning the states and transition functions
What is a Markov Decision Process?
A model of an environment that consists of:
- finite # of states
- probabilistic transitions between states
- possible actions at a state
- rewards for performing a specific action in a specific state
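A toy MDP written out explicitly with these four components (the states, actions, transition probabilities and rewards below are all made up for illustration):

```python
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}

# T[(s, a)] is a list of (next_state, probability) pairs.
T = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s1", 0.9), ("s0", 0.1)],
    ("s1", "stay"): [("s1", 1.0)],
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):   1.0,
    ("s1", "stay"): 2.0,
}
```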
What is the Markov Property?
A Markov process is a stochastic process whose future state depends only on the current state and current action, not on past states/actions
Future is independent of the past given the present
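Written as an equation (standard notation, not from the card itself):

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```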
What are the 4 types of Markov Models?
- MDP - control over state transitions, fully observable
- POMDP - control over state transitions, partially observable
- Markov Chain - no control over state transitions, fully observable
- HMM - no control over state transitions, partially observable
Note that in MDPs and POMDPs an agent chooses actions that influence transitions; Markov Chains and HMMs are the corresponding models without actions (the process evolves on its own).
What is the purpose of Gamma?
Discounts future rewards
We can control how much the agent cares about the future
Gamma close to 1 => agent cares a lot about the distant future; gamma close to 0 => agent focuses on immediate rewards
Gamma is a hyperparameter
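Gamma enters through the discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...; a small sketch of how the choice of gamma changes it:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where each step into the future is scaled by gamma."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# gamma close to 1 keeps distant rewards important; close to 0 the agent is myopic.
print(discounted_return([1, 1, 1, 1], gamma=0.99))  # ~3.94
print(discounted_return([1, 1, 1, 1], gamma=0.10))  # ~1.11
```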