Reinforcement Learning: Flashcards for all 6 exercise videos
What is an interaction loop?
Humans and animals learn from interaction with their environment, without explicit examples
Learning is goal-directed.
Two types of learning in psychology (associative learning)
- classical conditioning
- operant conditioning
classical conditioning
- the subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a response; after learning, the CS alone elicits a conditioned response (CR).
CS = stimulus that was once neutral but now leads to a response; US = stimulus that automatically (reflexively) triggers a response; CR = learned response
operant conditioning
- the subject learns the relationship between a stimulus and its own behavior
- the stimulus is only presented in response to an action and serves as a reinforcer that increases or decreases the probability of that action.
reinforcement learning cycle
- the environment is in state St
- the agent takes action At
- the environment is influenced and transitions to state St+1
- the agent receives reward Rt
repeat
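A minimal Python sketch of this cycle (the two-state environment and the random agent below are made up for illustration, not taken from the videos):

```python
import random

class ToyEnv:
    """Toy two-state environment (illustrative only)."""
    def reset(self):
        self.state = 0                      # initial state
        return self.state

    def step(self, action):
        # action 1 moves towards the terminal state 1 and earns a reward once
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = 1 if action == 1 else 0
        done = self.state == 1
        return self.state, reward, done

def run_episode(env, max_steps=10):
    state = env.reset()
    total_return = 0.0
    for t in range(max_steps):
        action = random.choice([0, 1])               # agent chooses action At
        next_state, reward, done = env.step(action)  # environment reacts with reward and next state
        total_return += reward                       # accumulate the scalar reward signal
        state = next_state
        if done:
            break
    return total_return

print(run_episode(ToyEnv()))
```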
reward hypothesis
Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward); this cumulative sum is called the return.
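Written out in the usual discounted form, with discount factor gamma in [0, 1] (standard notation, assumed here):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```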
Markov process / Markov decision process (MDP)
- a sequence of states is a Markov process if the probability of the next state depends only on its immediate predecessor state
- MDP: a Markov process extended with actions and rewards, so that actions can steer the states in a desired direction
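The Markov property in standard notation (assumed, not quoted from the videos):

```latex
\Pr\left(S_{t+1} = s' \mid S_t = s_t\right) = \Pr\left(S_{t+1} = s' \mid S_1 = s_1, \dots, S_t = s_t\right)
```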
state-value function
the expected return when starting in a given state and following a specific policy thereafter
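In standard notation (assumed):

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
```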
action-value function
expected return when a specific policy is followed after choosing an action in a particular state
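Analogously, in standard notation (assumed):

```latex
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
```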
Generalized Policy Iteration (GPI)
- the value function depends on the policy, and the policy depends on the value function.
- we therefore iteratively apply policy evaluation and policy improvement.
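A compact sketch of GPI as policy iteration in Python; the deterministic two-state MDP below (transition table P, rewards, gamma = 0.9) is invented purely for illustration:

```python
# Toy MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# P[s][a] = (next_state, reward)
P = {
    0: {0: (1, 0.0), 1: (0, -1.0)},
    1: {0: (2, 1.0), 1: (0, 0.0)},
}
gamma = 0.9

def evaluate(policy, theta=1e-8):
    """Policy evaluation step of GPI (iterative, in-place sweeps)."""
    V = {0: 0.0, 1: 0.0, 2: 0.0}
    while True:
        delta = 0.0
        for s in policy:
            s2, r = P[s][policy[s]]
            v_new = r + gamma * V[s2]        # Bellman backup for a deterministic policy
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(V):
    """Greedy policy improvement step of GPI."""
    return {s: max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]]) for s in P}

policy = {0: 1, 1: 1}                        # start from an arbitrary policy
while True:
    V = evaluate(policy)                     # evaluation
    new_policy = improve(V)                  # improvement
    if new_policy == policy:                 # policy stable -> done
        break
    policy = new_policy
print(policy, V)
```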
What is policy evaluation called in the context of classical conditioning?
prediction
What is the corresponding task in operant conditioning called?
control
DP (dynamic programming) prediction
- bootstrapping: propagating value between consecutive states by iteratively exploiting the recursive relationship formulated by the Bellman equation.
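The recursive relationship referred to here is the Bellman expectation equation for the state-value function (standard form, assumed):

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr]
```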
value iteration
- a variation of policy iteration that does not use exhaustive policy evaluation but only a single evaluation sweep per improvement step
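The single sweep combines evaluation and improvement in one backup, the standard value iteration update (assumed notation):

```latex
v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_k(s') \bigr]
```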
Monte Carlo Prediction (MC)
- does not require knowledge of the MDP, as it learns from sampled state trajectories
- MC methods are an approach to learning without prior knowledge of the environment's dynamics
- the return is calculated for all states in each sampled trajectory; the experienced returns are averaged
- goal: estimate state-action values
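A minimal first-visit MC prediction sketch in Python for state values (the same averaging works for state-action values); the two episodes below are invented dummy data, given as (state, reward received afterwards) pairs:

```python
from collections import defaultdict

gamma = 1.0  # undiscounted for simplicity

# Dummy sampled trajectories (illustrative data only): (S_t, reward received after S_t)
episodes = [
    [("A", 0.0), ("B", 1.0), ("B", 2.0)],
    [("A", 1.0), ("B", 0.0)],
]

returns = defaultdict(list)
for episode in episodes:
    first_visit = {}                          # earliest index of each state in this episode
    for t, (s, _) in enumerate(episode):
        first_visit.setdefault(s, t)
    # walk backwards, accumulating the return G_t for every time step
    Gs = [0.0] * len(episode)
    G = 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][1] + gamma * G
        Gs[t] = G
    for s, t in first_visit.items():
        returns[s].append(Gs[t])              # record the return following the first visit

V = {s: sum(g) / len(g) for s, g in returns.items()}   # MC estimate = average of sampled returns
print(V)
```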
Temporal Difference
- mixture of DP and MC that samples and bootstraps
- the Bellman equation is employed by iteratively updating the value estimate after every time step.
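For state values this is the tabular TD(0) update (standard form, alpha is the step size): it samples one transition and bootstraps from the current estimate of the next state.

```latex
V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \bigr]
```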
Dilemma
Learning action values is conditional on subsequent optimal behavior, but behaving non-optimally is necessary in order to explore all actions and find the optimal ones.
How can the agent learn about the optimal policy while behaving according to the exploratory policy?
Through off-policy learning
target policy:
the policy whose value function the agent estimates (the policy it wants to learn about)
behavior policy:
the policy the agent actually follows to sample actions and interact with the environment
on-policy
behavior policy = target policy
off-policy
behavior policy != target policy
What does off-policy learning help with?
- it deals with the exploration problem: the agent can explore with the behavior policy while learning about the target policy
When is off-policy learning valid?
- the chosen behavior policy must cover the target policy: every action the target policy might select must have non-zero probability under the behavior policy
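The coverage condition in symbols, with pi as the target policy and b as the behavior policy (standard formulation, assumed):

```latex
\pi(a \mid s) > 0 \;\Rightarrow\; b(a \mid s) > 0 \quad \text{for all } s, a
```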
What unifies TD and MC?
n-step bootstrapping
the 1-step case is TD; the infinite-step (full-episode) case is MC
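The n-step return that interpolates between the two extremes (standard notation, assumed; V is the current value estimate):

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, V(S_{t+n})
```

For n = 1 this is the one-step TD target; when n reaches the end of the episode it becomes the full MC return.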
key ideas the methods have in common
- estimating value functions
- backing up value functions
- generalized policy iteration (GPI)
Which methods use sample updates?
MC and TD
Which methods use bootstrapping?
DP and TD
Where is the depth of update highest?
MC (backups span entire episodes)
Where is the width of update largest?
DP (dynamic programming), since its expected updates branch over all successor states
Which method has low depth and low width of update?
temporal difference (one-step TD)