Reinforcement Learning all 6 exercise videos Flashcards
What is an interaction loop?
Humans and animals learn from interaction with their environment, without explicit examples.
Learning is goal-directed.
Two types of learning in psychology (associative learning)
- classical conditioning
- operant conditioning
classical conditioning
- the subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a response; after learning, the CS alone elicits a conditioned response (CR).
CS = once-neutral stimulus that now leads to a response; US = stimulus that automatically triggers a response; CR = the learned response
operant conditioning
- the subject learns the relationship between a stimulus and its own behavior
- the stimulus is only presented in response to an action and serves as a reinforcer that increases or decreases the probability of that action.
reinforcement learning cycle
- The environment is in state S_t.
- The agent takes action A_t.
- The environment is influenced and transitions to state S_{t+1}.
- The agent receives reward R_{t+1}.
repeat (see the sketch below)
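A minimal sketch of this loop in Python; `env` and `agent` are hypothetical objects, with `env.reset()` returning the initial state and `env.step(action)` returning (next_state, reward, done):

```python
# Minimal agent-environment interaction loop (sketch).
# `env` and `agent` are hypothetical objects: env.reset() returns the initial
# state, env.step(action) returns (next_state, reward, done).

def run_episode(env, agent):
    state = env.reset()                              # S_0
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                    # agent takes action A_t
        next_state, reward, done = env.step(action)  # environment yields S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```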
reward hypothesis
Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward); this cumulative sum is called the return.
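In standard notation (the usual definition, not spelled out on the card), the discounted return is

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where \gamma \in [0, 1] is the discount factor.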
Markov process / Markov decision process (MDP)
- a sequence is a Markov process if the probability of the next state depends only on the predecessor state (the Markov property)
- an MDP additionally has actions that steer the states in a desired direction
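Written out in the standard form, the Markov property (with actions, for an MDP) is

    P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = P(S_{t+1} \mid S_t, A_t)

for a plain Markov process, drop the action terms.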
state-value function
the expected return when starting in a given state and following a specific policy
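In standard notation:

    v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]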
action-value function
the expected return when choosing a given action in a given state and following a specific policy thereafter
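In standard notation:

    q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]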
Generalized Policy Iteration (GPI)
- the value function depends on the policy, and the policy depends on the value function.
- we therefore iteratively alternate policy evaluation and policy improvement (see the sketch below).
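A minimal sketch of the GPI idea in Python; `evaluate` and `improve` are hypothetical placeholders for whatever concrete methods are plugged in (DP sweeps, MC estimates, ...):

```python
# Generalized Policy Iteration (sketch). `evaluate` and `improve` are
# hypothetical placeholders for a concrete evaluation/improvement method.

def generalized_policy_iteration(policy, evaluate, improve, max_iters=100):
    values = None
    for _ in range(max_iters):
        values = evaluate(policy)      # policy evaluation: estimate v_pi
        new_policy = improve(values)   # policy improvement: e.g. greedy w.r.t. values
        if new_policy == policy:       # stable policy: no further improvement
            break
        policy = new_policy
    return policy, values
```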
What is policy evaluation called in the context of classical conditioning?
prediction
What is the learning problem corresponding to operant conditioning called?
control
DP (dynamic programming) prediction
- bootstrapping: propagating value between consecutive states by iteratively exploiting the recursive relationship formulated by the Bellman equation (see the sketch below).
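A minimal sketch of iterative policy evaluation for a finite MDP, assuming a hypothetical transition model `P[s][a]` given as a list of (prob, next_state, reward) tuples and a stochastic policy `policy[s][a]`:

```python
# Iterative policy evaluation (DP prediction, sketch).
# P[s][a]: hypothetical model, a list of (prob, next_state, reward) tuples;
# every reachable state is assumed to appear as a key of P.
# policy[s][a]: probability of taking action a in state s.

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup: bootstrap from successor values.
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```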
value iteration
- a variation of policy iteration that does not use exhaustive evaluation but only a single evaluation sweep per improvement step (see the sketch below)
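A sketch of value iteration under the same hypothetical model format; each sweep applies the Bellman optimality (max) backup directly:

```python
# Value iteration (sketch): a single sweep applies the Bellman optimality
# backup (max over actions) instead of a full policy evaluation.

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off the greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, policy
```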
Monte Carlo Prediction (MC)
- does not require knowledge of the MDP, as it learns from sampled state trajectories
- MC methods are an approach to learning without prior knowledge of the environment's dynamics
- the return is calculated for all states in each sampled trajectory; the experienced returns are averaged
- goal: estimate state (or state-action) values (see the sketch below)
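A sketch of first-visit MC prediction for state values; `episodes` is a hypothetical list of trajectories, each a list of (state, reward) pairs (the reward received after leaving that state). Averaging per (state, action) pair instead yields state-action values:

```python
# First-visit Monte Carlo prediction (sketch).
# episodes: hypothetical list of trajectories, each a list of (state, reward)
# pairs, where reward is the reward received after leaving that state.

from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        # Remember the first time step at which each state was visited.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk the trajectory backwards, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:          # record first visits only
                returns[s].append(G)
    # Average the experienced returns per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```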