7 - Reinforcement Learning Flashcards
4 things RL is built on
- A policy
- A reward
- A value function
- A model of the environment
Policy
Defines agent’s way of behaving.
Maps from states to probabilities of selecting each action
If the agent follows policy π at time t, then π(a|s) is the probability that At = a given St = s
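A minimal sketch of a tabular stochastic policy (state names and probabilities are made up for illustration): a lookup table of π(a|s), sampled to pick an action.

```python
import random

# Hypothetical tabular policy: pi[s][a] = probability of selecting action a in state s.
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Sample an action a with probability pi(a|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))  # "left" roughly 80% of the time
```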
Reward signal
Defines the goal
Value function
Specifies what is good in the long term.
The value of a state s under policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter.
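One rough way to see what vπ(s) means in code (the toy dynamics below are invented for illustration): run many episodes starting from s under π and average the discounted returns, i.e. a Monte Carlo sample average of the return.

```python
import random

# Toy chain (illustrative): from "s0" the episode moves to "s1", then terminates
# with reward 1 half the time and reward 0 otherwise.
def step(state):
    """Hypothetical one-step dynamics under a fixed policy pi: returns (next_state, reward, done)."""
    if state == "s0":
        return ("s1", 0.0, False)
    return (None, 1.0 if random.random() < 0.5 else 0.0, True)

def estimate_v(start, episodes=10000, gamma=0.9):
    """Monte Carlo estimate of v_pi(start): average discounted return over many episodes."""
    total = 0.0
    for _ in range(episodes):
        s, g, discount, done = start, 0.0, 1.0, False
        while not done:
            s, r, done = step(s)
            g += discount * r
            discount *= gamma
        total += g
    return total / episodes

print(estimate_v("s0"))  # roughly 0.9 * 0.5 = 0.45 for this toy chain
```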
Rewards
Immediate desirability of a state
Values
Long term desirability of a state
Model
Predicts or simulates environment.
Model-based RL: uses the model for planning
Model-free RL: trial-and-error learning
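A minimal sketch of what a learned one-step model might look like in the tabular case (all names invented): a table of predicted next state and reward for each (state, action) pair, which planning can query instead of the real environment.

```python
# Hypothetical one-step model: model[(state, action)] -> (predicted next state, predicted reward)
model = {
    ("s0", "right"): ("s1", 0.0),
    ("s1", "right"): ("s2", 1.0),
}

def simulate(state, action):
    """Planning queries the model rather than acting in the real environment."""
    return model[(state, action)]

print(simulate("s0", "right"))  # ("s1", 0.0)
```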
Animal Learning
Behaviours that lead to reward are reinforced; behaviours that do not lead to reward are abandoned/reduced
Dynamic Programming
Always remember the answers to subproblems you have already solved
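A tiny illustration of that idea outside RL: memoised Fibonacci, where each subproblem is solved once and its answer is reused.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each fib(k) is computed once; later calls reuse the cached answer.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))
```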
Temporal Difference
One stimulus, the secondary reinforcer, predicts arrival of a primary reinforcer.
Eg time.
Multi Armed bandit Problems
- Choose among k options
- After each choice, you receive a numerical reward (based on the choice)
- Maximise the reward over some time period (eg 1000 actions)
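A minimal sketch of such a testbed (arm count and reward distributions are illustrative): k arms, each with a fixed but unknown mean reward; pulling an arm returns a noisy reward drawn around that mean.

```python
import random

class KArmedBandit:
    """k arms with fixed but unknown mean rewards; pulls return noisy rewards."""
    def __init__(self, k=10):
        self.means = [random.gauss(0, 1) for _ in range(k)]  # true q*(a), hidden from the agent

    def pull(self, a):
        return random.gauss(self.means[a], 1)  # numerical reward for choosing arm a

bandit = KArmedBandit(k=10)
print(bandit.pull(3))
```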
N-armed bandit problem
Each of the n actions has an expected reward, called its value q
The value of an action a is the expected reward for selecting a
…. (if we know/don’t know)
q*(a) (q with a subscript star, not q times a) is the expected value of the reward Rt given that action At = a is selected: q*(a) = E[Rt | At = a]
If we knew q*(a) we could simply always pick the action with the highest value; since we don't, the task is to estimate it
Greedy actions
Go for the greatest Qt(a).
At = argmax_a Qt(a)
Exploitation
Non-Greedy Actions
Choose an action other than the one with the greatest Qt(a).
Exploration
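One common way to mix exploitation and exploration is epsilon-greedy (a sketch; the epsilon value is illustrative): take the greedy action most of the time, but pick a random action with small probability epsilon.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Q is a list of current estimates Qt(a), indexed by action."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore: a non-greedy action
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit: argmax_a Qt(a)

print(epsilon_greedy([0.1, 0.5, 0.2]))  # usually 1, occasionally a random arm
```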
A natural way to estimate q*(a) is to average the rewards actually received (think of sample averages)
Qt(a) = (sum of rewards when a taken prior to t)/(number of times a taken prior to t)
Sample average
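A self-contained sketch putting the pieces together (all parameters are illustrative): maintain Qt(a) as the sample average of rewards received for each action, updated incrementally so past rewards don't need to be stored, with epsilon-greedy action selection.

```python
import random

k = 10
true_means = [random.gauss(0, 1) for _ in range(k)]  # hidden q*(a), used only to simulate rewards
Q = [0.0] * k   # value estimates Qt(a)
N = [0] * k     # number of times each action has been taken
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(k)                # explore
    else:
        a = max(range(k), key=lambda i: Q[i])  # exploit: argmax_a Qt(a)
    r = random.gauss(true_means[a], 1)         # reward for action a
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                  # incremental form of the sample average

print(Q)  # compare the learned estimates with the hidden true values
```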