7 - Reinforcement Learning 2 Flashcards

1
Q

Finite Markov Decision Processes

A

Mathematically idealised form of the reinforcement learning problem

e.g. a transition graph that draws states as one kind of circle and actions as another, with arrows labelled by transition probabilities and rewards

2
Q

Markov Property

Hint: States depend on…

A

The next state s’ depends on the current state s and the decision maker’s action a,

but, given s and a, s’ is conditionally independent of all previous states and actions.

3
Q

Markov Chains

A

- Multiple states.
- An agent that transitions between them.
- Time measured in discrete time steps.
- A set of states, e.g. {Happy, Hungry, Sad}.
- A transition function, which gives the probability of switching from one state to another (see the sketch below).
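
A minimal sketch of these pieces in Python; the mood states come from the cards, but the transition probabilities here are illustrative assumptions (only Happy -> Hungry = 0.4 is given on a later card):

```python
import random

# A minimal Markov chain sketch for the mood example. The state names come
# from the card; the probabilities are illustrative assumptions.
transition = {
    "Happy":  {"Happy": 0.5, "Hungry": 0.4, "Sad": 0.1},
    "Hungry": {"Happy": 0.3, "Hungry": 0.4, "Sad": 0.3},
    "Sad":    {"Happy": 0.4, "Hungry": 0.2, "Sad": 0.4},
}

def step(state):
    """Sample the next state from the transition function T(state -> next)."""
    next_states = list(transition[state])
    weights = [transition[state][s] for s in next_states]
    return random.choices(next_states, weights=weights)[0]

# The agent transitions between states, one time step at a time.
state = "Happy"
for t in range(5):
    state = step(state)
    print(t, state)
```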

4
Q

Hidden Markov Model

A

Like a Markov chain, but the states are hidden; the agent only sees observations.

For the mood example, an observation function gives the probability of each observation given each hidden state, e.g.
O(Hungry, Eating) = 0.5
O(Hungry, Crying) = 0.5
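
A small sketch of the observation function sitting on top of hidden states; the Hungry row uses the values above, while the other rows and observation names are assumptions:

```python
import random

# O(state, observation): probability of each observation given the hidden state.
# The Hungry row is from the card; the other rows are illustrative assumptions.
observe_prob = {
    "Hungry": {"Eating": 0.5, "Crying": 0.5},
    "Happy":  {"Playing": 0.8, "Eating": 0.2},
    "Sad":    {"Crying": 0.9, "Playing": 0.1},
}

def emit(hidden_state):
    """Sample an observation; the hidden state itself is never revealed."""
    observations = list(observe_prob[hidden_state])
    weights = [observe_prob[hidden_state][o] for o in observations]
    return random.choices(observations, weights=weights)[0]

print(emit("Hungry"))  # prints 'Eating' or 'Crying', not the state itself
```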

5
Q

Markov chain Transition function

A

Gives the probability of switching from one state to another in a chain

eg T(Happy->Hungry) = 0.4

6
Q

Markov Decision Process, compared to a Markov chain

A

Like a Markov chain, with the addition of actions: the transition function now also depends on the action taken.

Example: T(Happy, Play -> Hungry) = 0.3
Each state can have one or more actions (see the sketch below).
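
A sketch extending the transition table with actions; only T(Happy, Play -> Hungry) = 0.3 comes from the card, the rest of the numbers and action names are assumptions:

```python
import random

# T[state][action][next_state]: probability of moving to next_state when
# taking action in state. Only Happy/Play -> Hungry = 0.3 is from the card;
# the remaining values are illustrative assumptions.
T = {
    "Happy": {
        "Play":  {"Happy": 0.6, "Hungry": 0.3, "Sad": 0.1},
        "Sleep": {"Happy": 0.8, "Hungry": 0.2},
    },
    "Hungry": {
        "Eat":   {"Happy": 0.9, "Hungry": 0.1},
    },
    "Sad": {
        "Play":  {"Happy": 0.5, "Hungry": 0.2, "Sad": 0.3},
    },
}

def step(state, action):
    """Sample the next state given the current state and the chosen action."""
    outcomes = T[state][action]
    return random.choices(list(outcomes), weights=list(outcomes.values()))[0]

print(step("Happy", "Play"))
```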

7
Q

Partially Observable MDP

A

An MDP where states are hidden

8
Q

MDP Agent-environment Interaction

A

The environment gives a state to the agent.
The agent gives an action.
The environment gives a reward and the next state.

What happens next must depend only on the current state (the Markov property); a sketch of this loop is below.
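
A minimal sketch of the loop with a toy environment and a random policy; the environment, its states, actions and rewards are all assumptions, not from the cards:

```python
import random

# A toy environment with the Markov property: the next state and reward
# depend only on the current state and the action just taken.
class MoodEnv:
    def __init__(self):
        self.state = "Happy"

    def step(self, action):
        if action == "Feed":
            next_state, reward = "Happy", 1.0
        elif self.state == "Happy":
            next_state, reward = "Hungry", -1.0
        else:
            next_state, reward = "Sad", -2.0
        self.state = next_state
        return next_state, reward

env = MoodEnv()
state = env.state                                 # environment gives a state
for t in range(5):
    action = random.choice(["Feed", "Ignore"])    # agent gives an action
    next_state, reward = env.step(action)         # environment gives reward and next state
    print(t, state, action, reward, next_state)
    state = next_state
```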

9
Q

MDP Reward hypothesis

A

Goals and purposes can be thought of as the maximisation of the expected value of the cumulative sum of a received scalar signal (the reward)

10
Q

Does the sum of rewards converge to a finite value?

A

Yes, provided γ < 1 and the rewards are bounded:
Gt = Rt+1 + γRt+2 + (γ^2)Rt+3 + …
= sum(k=0 to inf)((γ^k) * Rt+k+1)

γ (gamma) is the discount factor, 0 ≤ γ ≤ 1; it weights future rewards less than immediate rewards.
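
A short sketch computing Gt from a finite list of rewards; the reward sequence and γ = 0.9 are assumptions:

```python
# Discounted return: G_t = sum over k of gamma^k * R_{t+k+1}.
rewards = [1.0, 0.0, -1.0, 2.0, 1.0]   # R_{t+1}, R_{t+2}, ... (assumed values)
gamma = 0.9                            # discount factor (assumed value)

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*(-1.0) + 0.729*2.0 + 0.6561*1.0
```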

11
Q

Sum of rewards, written recursively (when we already know the later return)

A

Gt = Rt+1 + γ*Gt+1
(γ is the discount rate, a parameter with 0 ≤ γ ≤ 1)

Think of the original sum of rewards expression
Gt = Rt+1 + γRt+2 + (γ^2)Rt+3 + …
and factorise γ out of every term after Rt+1:
Gt = Rt+1 + γ(Rt+2 + γRt+3 + …) = Rt+1 + γ*Gt+1
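
A sketch using this recursion to compute the return at every time step by working backwards through an episode; the reward list and γ are assumptions:

```python
# Returns via the recursion G_t = R_{t+1} + gamma * G_{t+1}, computed
# backwards from the end of the episode (G is 0 after the final reward).
rewards = [1.0, 0.0, -1.0, 2.0, 1.0]   # R_1, ..., R_T (assumed values)
gamma = 0.9

returns = [0.0] * len(rewards)
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G

print(returns)  # returns[0] equals the forward discounted sum of all rewards
```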

12
Q

Reward vs Value

A

Reward is the immediate signal received after each transition; value is the expected long-term return. An instant reward might be available, but the move may be counterproductive in the long term.

13
Q

action-value function qπ

From the sum of rewards Gt

A

qπ(s,a) = Eπ[Gt | St=s, At=a]

(Gt is the sum of rewards;
Eπ denotes the expected value of a random variable given that the agent
follows policy π, and t is any time step.)
We can estimate it from experience.

14
Q

Monte Carlo Methods for estimating action-value function

A

Sample and average the returns for each state-action pair (like bandit methods).

The difference is that there are multiple states, each acting like a different bandit problem (see the sketch below).
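
A minimal every-visit Monte Carlo sketch for estimating q(s, a) by averaging sampled returns; the episodes, γ and state/action names are assumptions:

```python
from collections import defaultdict

# Every-visit Monte Carlo: average the return observed after each occurrence
# of a (state, action) pair. Episodes are lists of (state, action, reward).
gamma = 0.9
episodes = [
    [("Happy", "Play", 1.0), ("Hungry", "Eat", 2.0), ("Happy", "Play", 1.0)],
    [("Happy", "Sleep", 0.0), ("Happy", "Play", -1.0)],
]

returns_sum = defaultdict(float)
returns_count = defaultdict(int)

for episode in episodes:
    G = 0.0
    # Walk backwards so G is the discounted return from each step onwards.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        returns_sum[(state, action)] += G
        returns_count[(state, action)] += 1

q = {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
print(q[("Happy", "Play")])
```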

15
Q

Objective of Monte Carlo Methods

A

To learn vπ(s),

the value of state s under policy π.

16
Q

Temporal Difference in context of rewards

How to update value function. NOT the definition of temporal difference

A

Update the value function by moving the estimate we already have towards a target, by a step size α times the difference between the target and the current estimate:

V(St) <- V(St) + α[Gt - V(St)]
Gt can be substituted with Rt+1 + γ*V(St+1), giving the TD(0) update (see the sketch below).
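
A small TD(0) sketch, updating V after each observed transition; the transitions, α and γ are assumptions:

```python
# TD(0): V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).
alpha, gamma = 0.1, 0.9
V = {"Happy": 0.0, "Hungry": 0.0, "Sad": 0.0}

# Observed (state, reward, next_state) transitions (assumed values).
transitions = [("Happy", 1.0, "Hungry"), ("Hungry", -1.0, "Sad"), ("Sad", 0.0, "Happy")]

for s, r, s_next in transitions:
    target = r + gamma * V[s_next]     # better estimate of V(s)
    V[s] += alpha * (target - V[s])    # move V(s) towards the target
print(V)
```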

17
Q

TD error

A

The difference between the current estimate V(St) and the better estimate Rt+1 + γV(St+1); the value function is updated using this error:

V(St) <- V(St) + α[Rt+1 + γV(St+1) - V(St)]

It may be helpful to notice this structure is similar to the Q-learning update (see the sketch below).
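
A sketch that computes the TD error explicitly and, for comparison, the Q-learning update, which has the same error structure but uses action values and a max over next actions; all numbers here are assumptions:

```python
alpha, gamma = 0.1, 0.9

# TD error for state values: delta = R_{t+1} + gamma * V(S_{t+1}) - V(S_t).
V = {"Happy": 0.5, "Hungry": -0.2}
s, r, s_next = "Happy", 1.0, "Hungry"
delta = r + gamma * V[s_next] - V[s]
V[s] += alpha * delta

# Q-learning: the same error structure, but on action values with a max
# over the actions available in the next state.
Q = {("Happy", "Play"): 0.5, ("Hungry", "Eat"): 0.3, ("Hungry", "Ignore"): -0.1}
a = "Play"
best_next = max(Q[(s_next, a2)] for a2 in ("Eat", "Ignore"))
delta_q = r + gamma * best_next - Q[(s, a)]
Q[(s, a)] += alpha * delta_q

print(round(delta, 3), round(delta_q, 3))
```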