Reinforcement Learning all 6 exercise videos Flashcards
What is an interaction loop?
Humans and animals learn from interaction with their environment, without explicit examples.
Learning is goal-directed.
Two types of learning in psychology (associative learning)
- classical conditioning
- operant conditioning
classical conditioning
- the subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a response; after learning, the CS alone elicits a conditioned response (CR).
CS = once-neutral stimulus that now leads to a response; US = stimulus that automatically triggers a response; CR = the learned response
operant conditioning
- the subject learns the relationship between a stimulus and its own behavior
- the stimulus is only presented in response to an action and serves as a reinforcer that increases or decreases the probability of that action.
reinforcement learning cycle
- The environment is in state S_t.
- The agent takes action A_t.
- The environment is influenced and transitions to state S_{t+1}.
- The agent receives reward R_{t+1}.
repeat (see the sketch below)
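A minimal sketch of this loop in Python; `env` and `agent` are hypothetical objects, with `env.reset()` returning the initial state and `env.step(action)` returning (next_state, reward, done):

```python
# Minimal agent-environment interaction loop (sketch).
# `env` and `agent` are hypothetical objects: env.reset() returns the initial
# state, env.step(action) returns (next_state, reward, done).

def run_episode(env, agent):
    state = env.reset()                              # S_0
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                    # agent takes action A_t
        next_state, reward, done = env.step(action)  # environment yields S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```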
reward hypothesis
Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward); this cumulative sum is called the return.
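In standard notation (the usual definition, not spelled out on the card), the discounted return is

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where \gamma \in [0, 1] is the discount factor.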
Markov process / Markov decision process (MDP)
- a sequence is a Markov process if the probability of the next state depends only on the predecessor state (the Markov property)
- an MDP additionally has actions that steer the states in a desired direction
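Written out in the standard form, the Markov property (with actions, for an MDP) is

    P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = P(S_{t+1} \mid S_t, A_t)

for a plain Markov process, drop the action terms.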
state-value function
the expected return when starting in a given state and following a specific policy
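In standard notation:

    v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]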
action-value function
the expected return when choosing a given action in a given state and following a specific policy thereafter
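In standard notation:

    q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]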
Generalized Policy Iteration (GPI)
- the value function depends on the policy, and the policy depends on the value function.
- we therefore iteratively alternate policy evaluation and policy improvement (see the sketch below).
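A minimal sketch of the GPI idea in Python; `evaluate` and `improve` are hypothetical placeholders for whatever concrete methods are plugged in (DP sweeps, MC estimates, ...):

```python
# Generalized Policy Iteration (sketch). `evaluate` and `improve` are
# hypothetical placeholders for a concrete evaluation/improvement method.

def generalized_policy_iteration(policy, evaluate, improve, max_iters=100):
    values = None
    for _ in range(max_iters):
        values = evaluate(policy)      # policy evaluation: estimate v_pi
        new_policy = improve(values)   # policy improvement: e.g. greedy w.r.t. values
        if new_policy == policy:       # stable policy: no further improvement
            break
        policy = new_policy
    return policy, values
```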
What is policy evaluation called in the context of classical conditioning?
prediction
What is the learning problem corresponding to operant conditioning called?
control
DP (dynamic programming) prediction
- bootstrapping: propagating value between consecutive states by iteratively exploiting the recursive relationship formulated by the Bellman equation (see the sketch below).
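A minimal sketch of iterative policy evaluation for a finite MDP, assuming a hypothetical transition model `P[s][a]` given as a list of (prob, next_state, reward) tuples and a stochastic policy `policy[s][a]`:

```python
# Iterative policy evaluation (DP prediction, sketch).
# P[s][a]: hypothetical model, a list of (prob, next_state, reward) tuples;
# every reachable state is assumed to appear as a key of P.
# policy[s][a]: probability of taking action a in state s.

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup: bootstrap from successor values.
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```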
value iteration
- a variation of policy iteration that does not use exhaustive evaluation but only a single evaluation sweep per improvement step (see the sketch below)
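A sketch of value iteration under the same hypothetical model format; each sweep applies the Bellman optimality (max) backup directly:

```python
# Value iteration (sketch): a single sweep applies the Bellman optimality
# backup (max over actions) instead of a full policy evaluation.

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off the greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, policy
```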
Monte Carlo Prediction (MC)
- does not require knowledge of the MDP, as it learns from sampled state trajectories
- MC methods are an approach to learning without prior knowledge of the environment's dynamics
- the return is calculated for all states in each sampled trajectory; the experienced returns are averaged
- goal: estimate state (or state-action) values (see the sketch below)
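A sketch of first-visit MC prediction for state values; `episodes` is a hypothetical list of trajectories, each a list of (state, reward) pairs (the reward received after leaving that state). Averaging per (state, action) pair instead yields state-action values:

```python
# First-visit Monte Carlo prediction (sketch).
# episodes: hypothetical list of trajectories, each a list of (state, reward)
# pairs, where reward is the reward received after leaving that state.

from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        # Remember the first time step at which each state was visited.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk the trajectory backwards, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:          # record first visits only
                returns[s].append(G)
    # Average the experienced returns per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```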