Reinforcement Learning Flashcards
What is reinforcement learning?
Operant conditioning (learning from rewards and punishments)
An agent interacts with its environment and builds up an internal model through trial and error, guided by rewards
Components of an RL agent?
State
Transition
Action
Reward
What's the difference between deterministic and stochastic actions?
Deterministic: the chosen action always has its intended outcome
Stochastic: each possible outcome of the chosen action occurs with some probability, where the probability represents uncertainty
How do we represent solutions?
Using a policy: mapping from states to actions
How do we evaluate policies?
Deterministic: Sum total rewards for following a policy
Stochastic: Sum expected rewards for following a policy
What are the two RL algorithm types? Describe them.
- Model based - agent knows S, T, A and R (states, transitions, actions, rewards)
- Model free - agent does not know T and R and must learn their effects through trial and error
Explain model based RL.
Agent is given:
- all STATES in the environment
- the set of all ACTIONS in each state
- TRANSITION probabilities between s and s' for each action
- a REWARD for each action in each state
Explain model free RL.
The agent tries different actions in different states to build an estimate of the TRANSITION probabilities and REWARDS for performing actions (see the sketch below)
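A minimal sketch of how such estimates could be built from experience, assuming the agent logs (state, action, reward, next state) transitions; all names and data structures here are illustrative, not from the flashcards:

```python
from collections import defaultdict

# Illustrative sketch: estimate T(s, a, s') and R(s, a) from logged transitions.
# 'experience' is an assumed list of (s, a, r, s_next) tuples.
def estimate_model(experience):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    reward_sums = defaultdict(float)                 # (s, a) -> summed reward
    visits = defaultdict(int)                        # (s, a) -> visit count

    for s, a, r, s_next in experience:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1

    # Transition probabilities as relative frequencies, rewards as averages
    T = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
         for sa, nxt in counts.items()}
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return T, R
```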
Exploration vs Exploitation
Exploration: Finds more info about environment
Exploitation: Uses known info to maximise reward
Passive vs Active RL
Passive: agent executes a fixed policy then evaluates it
Active: agent updates a policy as it learns
Fully vs Partially Observable Environments
Fully: the agent is initialised with the state information and the reward and transition functions - it knows its current state, the actions that transition it to the next state, and the reward for doing so
Partially: the agent keeps an internal model of the environment that it refines through trial and error, gradually learning the states and transition functions
What is a Markov Decision Process?
A model of an environment that consists of:
- finite # of states
- probabilistic transitions between states
- possible actions at a state
- rewards for performing a specific action in a specific state
What is the Markov Property?
A Markov process is a stochastic process whose future state depends only on the current state and current action, not on past states/actions
Future is independent of the past given the present
What are the 4 types of Markov Models?
- MDP - control over state transitions, fully observable
- POMDP - control over state transitions, partially observable
- Markov Chain - no control over state transitions, fully observable
- HMM - no control over state transitions, partially observable
Note that MDP and POMDP give the agent control over transitions through its actions; the Markov Chain and HMM are their uncontrolled (autonomous) counterparts
What is the purpose of Gamma?
Discounts future rewards
We can control how much the agent cares about the future
Gamma close to 1 => agent cares a lot about the distant future; gamma close to 0 => agent is short-sighted
Gamma is a hyperparameter (see the sketch below)
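A small sketch of how gamma discounts a stream of rewards; the reward list and gamma value are made up for illustration:

```python
# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# gamma near 1 weights distant rewards almost as much as immediate ones;
# gamma near 0 makes the agent near-sighted.
print(discounted_return([1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```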
What are some of the advantages of value iteration?
Will converge
Good for small number of states
What are some disadvantages of value iteration?
Slow to converge for many states
Knowing T and R is unrealistic
How do you perform value iteration?
Initialise the value of each state to zero and iteratively refine V(s) at each time step until it converges to V*(s)
Stop when the difference between the values of 2 consecutive iterations is below some threshold theta (see the sketch below)
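A minimal value iteration sketch, assuming a small dict-based model where T[s][a][s2] is a transition probability and R[s][a] is a reward; these structures are illustrative assumptions, not part of the flashcards:

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}              # initialise V(s) = 0
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best expected value over actions
            q_values = [
                sum(T[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in states)
                for a in actions
            ]
            new_v = max(q_values)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                     # converged: change below threshold
            return V
```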
What is Direct Evaluation?
Model free RL
Every time you visit a state, write down what the sum of rewards turns out to be, then average those samples
Do this for every state to estimate its value (no T or R needed)
Can then start exploiting
How do you perform direct evaluation?
- Choose a starting state
- Choose an action
- Set every state traversed to -1 to ensure no backtracking
- Average the sum of rewards received from that starting state until a terminal state is reached, over many runs
- Repeat for all states
That average is the estimated value of the starting state s (see the sketch below)
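A rough sketch of direct evaluation under these assumptions: each episode is a list of (state, reward) pairs collected while following the fixed policy, and the value of a state is the average return observed from it (names are illustrative):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    totals = defaultdict(float)   # summed returns observed from each state
    counts = defaultdict(int)     # number of times each state was visited
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the return from each visited state onwards
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}  # averaged V(s)
```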
What are some advantages of direct evaluation?
Easy to understand
Do not need T and R
Eventually computes correct state values
What are some of the disadvantages of direct evaluation?
Takes a long time to converge
Each state is learned separately, so information is not shared between states
What is policy iteration?
Model - based learning
Initialise, Evaluate, Improve
How to perform policy iteration?
Initialise an arbitrary policy
Evaluate it using Bellman updates (as in value iteration)
Improve it by checking, for every state, whether the best action under the current values is the one the policy already prescribes; if not, update the policy and repeat (see the sketch below)
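A minimal policy iteration sketch using the same illustrative dict-based T and R as the value iteration sketch above; the evaluation step is truncated to a fixed number of sweeps for simplicity:

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
    policy = {s: actions[0] for s in states}            # arbitrary initial policy
    while True:
        # 1. Evaluate: approximate V under the current policy (Bellman updates)
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = sum(T[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in states)
        # 2. Improve: for every state, pick the action that looks best under V
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(
                T[s][a][s2] * (R[s][a] + gamma * V[s2]) for s2 in states))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                                       # no change => done
            return policy, V
```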
What are some advantages of policy iteration?
Converges on an optimal policy
What are some disadvantages of policy iteration?
Slow.
What is Q learning?
Model free RL (formulated for MDPs)
In every iteration:
1. agent observes state s
2. takes action a
3. observes new state s'
4. receives reward r
What is the basis of Q learning? Explain this concept.
Temporal credit assignment: the agent takes the path that maximises total future reward, i.e. it does not follow a purely greedy policy but balances current and future rewards.
What is temporal difference learning?
Basis of Q learning
Adjust the estimated value of a state based on the immediate reward and the estimated value of the next state (see the sketch below)
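A one-step TD(0) update sketch under these assumptions: V is a dict of state values, alpha is the learning rate, and all names are illustrative:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Nudge V(s) toward the observed target r + gamma * V(s')
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V
```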
How do you Q learn?
- Initialise the Q table to 0
- observe the current state s
- loop:
a. select an action
b. receive the immediate reward and observe the new state s'
c. update the Q table using the update rule (see the sketch below)
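A tabular Q-learning sketch following those steps. The environment interface (reset()/step() returning next state, reward, done) is an assumption for illustration and is not defined in these notes:

```python
import random

def q_learning(env, states, actions, episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}   # init Q table to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)                # assumed interface
            # Q update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```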
What are some Q learning properties?
Converges if:
Enough exploration
Exploits over time
learning rate between 0 and 1
Please read over methods of changing learning rate
What does the learning rate do in Q learning?
It controls how drastically the Q value of the current state changes with each update
How do we choose actions in Q learning?
Epsilon-greedy approach
Epsilon determines the likelihood that we take a random action
Over time epsilon must decrease so the agent shifts from exploring to exploiting (see the sketch below)
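A small epsilon-greedy sketch with a decaying epsilon; the decay schedule and values are illustrative choices, not prescribed by these notes:

```python
import random

def choose_action(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])     # exploit: best known action

# Decay epsilon over episodes so the agent explores early and exploits later
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for episode in range(1000):
    epsilon = max(min_epsilon, epsilon * decay)
```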
What are the three types of learning in multi-agent systems?
Cooperation
Competition
Individual