C2 Flashcards

1
Q

what is the goal of reinforcement learning?

A

to find the sequence of actions that gives the highest reward, or, more formally, to find the optimal policy that gives the best action to take in each state

generally, the objective is to achieve the highest possible average return from the start state

find the optimal policy pi* = argmax_pi V^pi(s_0), i.e. the policy that maximizes the value of the start state s_0

2
Q

what is a sequential decision problem?

A

the agent has to make a sequence of decisions in order to solve a problem

3
Q

what is the Markov property?

A

the next state depends only on the current state and the actions available in it (no influence of historical memory of previous states)

4
Q

how is a Markov decision process defined for reinforcement learning?

A

a 5-tuple:
- S: a finite set of legal states in the env
- A: a finite set of actions
- T_a(s, s’): the probability that action a in state s at time t will transition to state s’ at time t+1 in the env (internal to the env, the agent does not know this)
- R_a(s, s’): the reward received after action a transitions state s into state s’
- gamma: the discount factor that weighs future rewards relative to present rewards
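
to make the 5-tuple concrete, a minimal tabular MDP could be written out as plain Python data like this; the states, actions, probabilities and rewards below are made up for illustration only:

# T[s][a] is a list of (probability, next_state, reward) triples
S = ["s0", "s1"]
A = ["left", "right"]
T = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},  # stochastic transition
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 0.0)]},
}
gamma = 0.9  # discount factor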

5
Q

what is a stochastic environment?

A

a non-deterministic environment, where the outcome of an action depends on elements in the environment that are not known to the agent

6
Q

transition function in model-free RL

A

it is implicit in the solution algorithm: the env has access to the transition function and uses it to compute the next state, but the agent does not have access to it

7
Q

how are action selections and rewards propagated through the tree

A

action selections are propagated downwards, rewards are propagated to parent states upwards

8
Q

what is a return

A

the cumulative reward of a full sequence (episode)

9
Q

what is the value function V^pi (s)?

A

the expected cumulative discounted future reward of a state (where actions are chosen according to policy pi)
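
written out in the standard textbook form (paraphrased, not a quote from the book):
V^pi(s) = E_pi[ sum_{t=0..inf} gamma^t * r_t | s_0 = s ]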

10
Q

what is the policy?

A

a conditional probability distribution that specifies, for each possible state, the probability of each possible action; i.e. a mapping from the state space to a probability distribution over the action space:
pi: S -> p(A)
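
as an illustration, a tabular stochastic policy can be stored as a simple table; the states, actions and probabilities here are made up:

# maps each state to a probability distribution over actions
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}
# the probabilities for each state sum to 1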

11
Q

what is the state value of a terminal state?

A

it is by definition 0. The same goes for the state-action value of a terminal state.

12
Q

what is the state-action value Q?

A

the estimated average return we expect to achieve when taking action a in state s and following policy pi afterwards

maps every state-action pair to a real number:
Q: S x A -> R

13
Q

what is a potential benefit of using the state-action values Q instead of the state values V?

A

Q-values directly tell us what every action is worth, and from the optimal Q-function we can directly obtain the optimal policy by taking the highest-valued action in each state
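
a minimal sketch of this, assuming a learned Q-table that maps (state, action) pairs to values:

# pick the action with the highest Q-value in state s
def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])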

14
Q

Bellman equation

A

see book equation 2.7
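
for reference, the standard discrete Bellman equation for the state value under policy pi (presumably what equation 2.7 expresses in the book's notation) is:
V^pi(s) = sum_a pi(a|s) * sum_s' T_a(s, s') * [ R_a(s, s') + gamma * V^pi(s') ]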

15
Q

when do we apply model-free learning?

A

when the exact transition probabilities are not known to the agent, so the agent must be able to compute the policy without them. The role of the transition function is replaced by an iterative sequence of environment samples

16
Q

bootstrapping

A

solves the problem of computing a final value when we only know how to compute step-by-step intermediate values. Old estimates of a value are refined with new updates

17
Q

temporal difference learning

A

updating the current estimate of the state value with an error term based on the estimate of the next state, obtained by sampling the environment
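
a minimal sketch of the tabular TD(0) update, with assumed hyperparameters alpha (learning rate) and gamma (discount):

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]  # bootstrapped target minus old estimate
    V[s] = V[s] + alpha * td_error           # move the old estimate towards the target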

18
Q

in value-based learning, how do we find the optimal policy?

A

if we have the optimal state-value function V*, then the optimal policy can be found by choosing, in each state, the action that leads to the highest-valued next state. This way we can recover the optimal sequence of best actions

19
Q

what is the greedy approach in RL?

A

maximum exploitation: taking, in each state, the action with the currently highest Q-value. This has high variance because we use only a few samples, resulting in high uncertainty

20
Q

on-policy learning

A

learning takes place by consistently backing up the value of the selected action into the same behaviour policy function that was used to select the action

21
Q

off-policy learning

A

learning takes place by backing up values of a different action from the one that was selected by the behaviour policy. This can be more efficient in the case of exploration, because then it can back up the value of an older, better action, instead of stubbornly backing up the value of the actual action taken

22
Q

SARSA advantage and disadvantage

A

on-policy, see equation 2.9

advantage: it directly optimizes the target of interest and converges quickly by learning from the direct behaviour values; convergence is also more stable (low variance)

disadvantage: sample inefficiency, because the target policy is updated with sub-optimal explorative rewards
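
a minimal sketch of the tabular SARSA update (presumably what equation 2.9 expresses); alpha and gamma are assumed hyperparameters, and a_next is the action actually selected by the behaviour policy in s_next:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s_next, a_next)]  # on-policy: uses the action actually taken next
    Q[(s, a)] += alpha * (target - Q[(s, a)])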

23
Q

Q-learning

A

off-policy, see equation 2.10
uses separate behaviour and target policies: one for exploratory downward selection behaviour and one to update as the current target backup policy

can be unstable due to the max operation, but low bias
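
a minimal sketch of the tabular Q-learning update (presumably what equation 2.10 expresses); alpha and gamma are assumed hyperparameters:

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # off-policy: greedy max backup
    Q[(s, a)] += alpha * (target - Q[(s, a)])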

24
Q

what 2 elements are central to reinforcement learning?

A

agent and environment

25
Q

what problems is RL applied to?

A

sequential decision problems

26
Q

what is the name of the algorithm that computes the Bellman relation?

A

value iteration
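
a minimal sketch of value iteration on a small tabular MDP, reusing the made-up (probability, next_state, reward) transition format from the MDP card above:

def value_iteration(S, A, T, gamma, n_iters=100):
    V = {s: 0.0 for s in S}
    for _ in range(n_iters):
        # Bellman optimality backup: best action value under the current estimate
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in A)
             for s in S}
    return V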

27
Q

what is model-free?

A

When the dynamics model (the reward function and the transition function) is only in the environment and the agent does not have access to it.

28
Q

what is model-based?

A

when the agent has (or learns) its own model of the dynamics of the environment

29
Q

how can we compute minimal regret?

A

regret is the difference between the reward of the optimal action and the reward of the action you actually took (multi-armed bandit theory)
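
for example (with made-up numbers): if the optimal arm pays 1.0 on average and the arm you pulled pays 0.7 on average, the expected regret of that choice is 1.0 - 0.7 = 0.3; summed over all time steps this gives the total regret, which a strategy aiming for minimal regret tries to keep as small as possible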