7 - Reinforcement Learning Flashcards

1
Q

4 things RL is built on

A
  • A policy
  • A reward
  • A value function
  • A model of the environment
2
Q

Policy

A

Defines the agent's way of behaving.
Maps states to probabilities of selecting each available action.

If the agent follows policy π at time t, then π(a|s) is the probability that At = a given St = s.
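
As a sketch only, the same idea in Python (the state names, actions and probabilities below are invented for illustration): a stochastic policy is just a mapping from states to action probabilities, and acting means sampling from it.

```python
import random

# Hypothetical stochastic policy pi(a|s): each state maps to
# probabilities of selecting each action (each row sums to 1).
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    """Return A_t = a with probability pi(a|s) for the given state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))  # "left" roughly 80% of the time
```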

3
Q

Reward signal

A

Defines the goal

4
Q

Value function

A

Specifies what is good in the long term.

The value of a state s under policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter.

5
Q

Rewards

A

Immediate desirability of a state.

6
Q

Values

A

Long-term desirability of a state.

7
Q

Model

A

Predicts or simulates the environment.

Model-based RL: uses a model, similar to planning.
Model-free RL: learns purely by trial and error.

8
Q

Animal Learning

A

Behaviours that lead to reward are reinforced; behaviours that do not lead to reward are abandoned or reduced.

9
Q

Dynamic Programming

A

Always remember answers to sub-problems you have already solved.
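
A minimal illustration of the "remember answers to sub-problems" (memoisation) idea, using Fibonacci as a stand-in problem rather than anything RL-specific:

```python
from functools import lru_cache

@lru_cache(maxsize=None)            # remember answers to sub-problems
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)  # overlapping sub-problems, each solved once

print(fib(50))  # fast, because every fib(k) is computed exactly once
```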

10
Q

Temporal Difference

A

One stimulus, the secondary reinforcer, predicts the arrival of a primary reinforcer.

E.g. time.

11
Q

Multi-Armed Bandit Problems

A
  • Choose among k options
  • After each choice, you receive a numerical reward (based on the choice)
  • Maximise the total reward over some time period (e.g. 1000 actions); see the sketch below
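
A minimal simulation of this setting, assuming (as in one common testbed) that each arm pays a Gaussian reward around a fixed, unknown mean:

```python
import numpy as np

k = 10                                   # number of arms/options
rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, size=k)    # true (unknown) value of each arm

def pull(a):
    """Numerical reward for choosing arm a: noisy sample around its true value."""
    return rng.normal(q_star[a], 1.0)

# Goal: maximise total reward over some period, e.g. 1000 pulls.
# (Here: a uniformly random baseline agent, for comparison only.)
total = sum(pull(rng.integers(k)) for _ in range(1000))
```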
12
Q

N-armed bandit problem

Each of the n actions has an expected reward, called its value q.

The value of an action a is the expected reward for a
…. (if we know / don't know it)

A

q*(a) (q with a star subscript, not q multiplied by a) is the expected value of the reward Rt given that action At = a is selected.
If we don't know q*(a), the task is to estimate it.

13
Q

Greedy actions

A

Select the action with the greatest current estimate Qt(a):

At = argmaxa Qt(a)

Exploitation.

14
Q

Non-Greedy Actions

A

Choose an action other than the one with the greatest Qt(a).

Exploration.

15
Q

Natural way to estimate q(a): average the rewards actually received (think sample averages)

A

Qt(a) = (sum of rewards when a taken prior to t)/(number of times a taken prior to t)

Sample average
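
One possible direct translation of the sample-average estimate into Python (it just keeps a running sum and count per action):

```python
import numpy as np

k = 10
reward_sum = np.zeros(k)   # sum of rewards received when a was taken prior to t
count = np.zeros(k)        # number of times a was taken prior to t

def record(a, reward):
    reward_sum[a] += reward
    count[a] += 1

def Q(a):
    """Sample-average estimate Qt(a); defaults to 0 if a has never been tried."""
    return reward_sum[a] / count[a] if count[a] > 0 else 0.0
```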

16
Q

Near-greedy actions or epsilon-greedy

A

Behave greedily most of the time, but with small probability ε pick an action at random (uniformly over all actions)
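
A sketch of epsilon-greedy selection over the current estimates Qt(a), assuming they are stored in an array Q:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: any action, uniformly
    return int(np.argmax(Q))               # exploit: greedy action
```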

17
Q

How can we compute the averages Qt(a) efficiently?

This is about updating the ongoing estimate incrementally, not recomputing the sum of rewards divided by the number of times taken

A

Qn = Qn-1 + (1/n)(Rn - Qn-1)

NewEst = OldEst + (1/N)(NewSample - OldEst)
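
The same incremental rule as code; only the current estimate and a count per action need to be stored (a sketch, with Q and N as per-action arrays):

```python
import numpy as np

k = 10
Q = np.zeros(k)   # current estimates Qn(a)
N = np.zeros(k)   # number of times each action has been taken

def update(a, reward):
    """NewEst = OldEst + (1/N)(NewSample - OldEst)."""
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```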

18
Q

How can you give more recent rewards more weight in the average?

A

Use a constant (non-changing) step size α (alpha).

Qn = Qn-1 + α(Rn - Qn-1)
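
The constant-step-size variant as code (alpha = 0.1 is an arbitrary illustrative value); recent rewards now carry exponentially more weight than older ones:

```python
def update_constant_alpha(Q, a, reward, alpha=0.1):
    """Qn = Qn-1 + alpha * (Rn - Qn-1): a recency-weighted average."""
    Q[a] += alpha * (reward - Q[a])
```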

19
Q

Upper confidence bounds

A

Sample actions that we know little about

At = argmaxa [Qt(a)+c*sqrt(ln(t)/Nt(a))]

Nt(a) here means the number of times a has been selected prior to time t.

c > 0 controls the degree of exploration
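
A sketch of the UCB rule, assuming Q holds the current estimates and N the selection counts; actions never tried are taken first, since Nt(a) = 0 makes the bound unbounded:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [ Qt(a) + c * sqrt(ln(t) / Nt(a)) ], with t the current step (>= 1)."""
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])   # treat untried actions as maximally uncertain
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```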

20
Q

Gradient Bandit Algorithms

A

Learn a preference Ht(a) for each action.

Pr(At = a) = e^(Ht(a)) / sum(b=1..k) e^(Ht(b))

= πt(a) (the probability of taking action a at time t)

That is: e to the power of Ht(a), divided by the sum over b from 1 to k of e to the power of Ht(b) (a soft-max distribution).
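
The same soft-max written as code (the maximum of H is subtracted before exponentiating purely for numerical stability; the resulting probabilities are unchanged):

```python
import numpy as np

def action_probabilities(H):
    """pi_t(a) = exp(Ht(a)) / sum_b exp(Ht(b)): soft-max over action preferences."""
    z = np.exp(H - np.max(H))   # shift by max(H); cancels in the ratio
    return z / z.sum()
```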

21
Q

Updating action preferences in gradient bandit algorithms (computing Ht+1)

A

Ht+1(At) = Ht(At) + α(Rt - avgRt)(1 - πt(At))
Ht+1(a) = Ht(a) - α(Rt - avgRt) πt(a)

The first line is for the selected action At; the second is for every a ≠ At. avgRt is the average of all rewards up to time t, used as a baseline.
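
Both update lines applied in one step (a sketch; H is the preference array, pi the soft-max probabilities for this step, avg_reward the running baseline avgRt):

```python
import numpy as np

def update_preferences(H, pi, a, reward, avg_reward, alpha=0.1):
    """Gradient-bandit update of the preferences Ht(a) after taking action a."""
    delta = reward - avg_reward
    H -= alpha * delta * pi    # every action b gets  -alpha*(Rt - avgRt)*pi(b)
    H[a] += alpha * delta      # selected action nets +alpha*(Rt - avgRt)*(1 - pi(a))
    return H
```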