C2 Flashcards

1
Q

what is the goal of reinforcement learning?

A

to find the sequence of actions that gives the highest reward, or, more formally, to find the optimal policy that gives the best action to take in each state

generally, the objective is to achieve the highest possible average return from the start state

find the optimal policy pi* = argmax_pi V^pi(s_0), i.e. the policy that maximizes the value of the start state s_0

2
Q

what is a sequential decision problem?

A

the agent has to make a sequence of decisions in order to solve a problem

3
Q

what is the Markov property?

A

the next state depends only on the current state and the actions available in it (no influence of historical memory of previous states)

4
Q

how is a Markov decision process defined for reinforcement learning?

A

a 5-tuple:
- S: a finite set of legal states in the env
- A: a finite set of actions
- T_a(s, s’): the probability that action a in state s at time t will transition to state s’ at time t+1 in the env (internal to the env, the agent does not know this)
- R_a(s, s’): the reward received after action a transitions state s into state s’
- gamma: the discount factor that weighs future rewards relative to present rewards
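
to make the 5-tuple concrete, a minimal tabular MDP could be written out as plain Python data like this; the states, actions, probabilities and rewards below are made up for illustration only:

# T[s][a] is a list of (probability, next_state, reward) triples
S = ["s0", "s1"]
A = ["left", "right"]
T = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},  # stochastic transition
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 0.0)]},
}
gamma = 0.9  # discount factor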

5
Q

what is a stochastic environment?

A

a non-deterministic environment, where the outcome of an action depends on elements in the environment that are not known to the agent

6
Q

transition function in model-free RL

A

it is implicit in the solution algorithm: the env has access to the transition function and uses it to compute the next state, but the agent does not have access to it

7
Q

how are action selections and rewards propagated through the tree

A

action selections are propagated downwards, rewards are propagated to parent states upwards

8
Q

what is a return

A

the cumulative reward of a full sequence (episode)

9
Q

what is the value function V^pi (s)?

A

the expected cumulative discounted future reward of a state (where actions are chosen according to policy pi)
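
written out in the standard textbook form (paraphrased, not a quote from the book):
V^pi(s) = E_pi[ sum_{t=0..inf} gamma^t * r_t | s_0 = s ]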

10
Q

what is the policy?

A

a conditional probability distribution that specifies, for each possible state, the probability of each possible action; i.e. a mapping from the state space to a probability distribution over the action space:
pi: S -> p(A)
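
as an illustration, a tabular stochastic policy can be stored as a simple table; the states, actions and probabilities here are made up:

# maps each state to a probability distribution over actions
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}
# the probabilities for each state sum to 1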

11
Q

what is the state value of a terminal state?

A

it is by definition 0. The same goes for the state-action value of a terminal state.

12
Q

what is the state-action value Q?

A

the estimated average return we expect to achieve when taking action a in state s and following policy pi afterwards

maps every state-action pair to a real number:
Q: S x A -> R

13
Q

what is a potential benefit of using the state-action values Q instead of the state values V?

A

Q-values directly tell us what every action is worth, and from the optimal Q-function we can directly obtain the optimal policy by taking the highest-valued action in each state
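
a minimal sketch of this, assuming a learned Q-table that maps (state, action) pairs to values:

# pick the action with the highest Q-value in state s
def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])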

14
Q

Bellman equation

A

see book equation 2.7
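
for reference, the standard discrete Bellman equation for the state value under policy pi (presumably what equation 2.7 expresses in the book's notation) is:
V^pi(s) = sum_a pi(a|s) * sum_s' T_a(s, s') * [ R_a(s, s') + gamma * V^pi(s') ]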

15
Q

when do we apply model-free learning?

A

when the exact transition probabilities are not known to the agent, so the agent must be able to compute the policy without them. The role of the transition function is replaced by an iterative sequence of environment samples

16
Q

bootstrapping

A

solves the problem of computing a final value when we only know how to compute step-by-step intermediate values. Old estimates of a value are refined with new updates

17
Q

temporal difference learning

A

updating the current estimate of the state value with an error term based on the estimate of the next state, obtained by sampling the environment
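
a minimal sketch of the tabular TD(0) update, with assumed hyperparameters alpha (learning rate) and gamma (discount):

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]  # bootstrapped target minus old estimate
    V[s] = V[s] + alpha * td_error           # move the old estimate towards the target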

18
Q

in value-based learning, how do we find the optimal policy?

A

if we have the optimal state-value function V*, then the optimal policy can be found by choosing, in each state, the action that leads to the highest-valued next state. This way we can recover the optimal sequence of best actions

19
Q

what is the greedy approach in RL?

A

maximum exploitation: taking, in each state, the action with the currently highest Q-value. This has high variance because we use only a few samples, resulting in high uncertainty

20
Q

on-policy learning

A

learning takes place by consistently backing up the value of the selected action into the same behaviour policy function that was used to select the action

21
Q

off-policy learning

A

learning takes place by backing up values of a different action from the one that was selected by the behaviour policy. This can be more efficient in the case of exploration, because then it can back up the value of an older, better action, instead of stubbornly backing up the value of the actual action taken

22
Q

SARSA advantage and disadvantage

A

on-policy, see equation 2.9

advantage: it directly optimizes the target of interest and converges quickly by learning from the direct behaviour values; convergence is also more stable (low variance)

disadvantage: sample inefficiency, because the target policy is updated with sub-optimal explorative rewards
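
a minimal sketch of the tabular SARSA update (presumably what equation 2.9 expresses); alpha and gamma are assumed hyperparameters, and a_next is the action actually selected by the behaviour policy in s_next:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s_next, a_next)]  # on-policy: uses the action actually taken next
    Q[(s, a)] += alpha * (target - Q[(s, a)])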

23
Q

Q-learning

A

off-policy, see equation 2.10
uses separate behaviour and target policies: one for exploratory downward selection behaviour and one to update as the current target backup policy

can be unstable due to the max operation, but low bias
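
a minimal sketch of the tabular Q-learning update (presumably what equation 2.10 expresses); alpha and gamma are assumed hyperparameters:

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # off-policy: greedy max backup
    Q[(s, a)] += alpha * (target - Q[(s, a)])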

24
Q

what 2 elements are central to reinforcement learning?

A

agent and environment

25
Q

what problems is RL applied to?

A

sequential decision problems

26
Q

what is the name of the algorithm that computes the Bellman relation?

A

value iteration
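
a minimal sketch of value iteration on a small tabular MDP, reusing the made-up (probability, next_state, reward) transition format from the MDP card above:

def value_iteration(S, A, T, gamma, n_iters=100):
    V = {s: 0.0 for s in S}
    for _ in range(n_iters):
        # Bellman optimality backup: best action value under the current estimate
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in A)
             for s in S}
    return V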

27
Q

what is model-free?

A

When the dynamics model (the reward function and the transition function) is only in the environment and the agent does not have access to it.

28
Q

what is model-based?

A

when the agent has (or learns) its own model of the dynamics of the environment

29
Q

how can we compute minimal regret?

A

regret is the difference between the reward of the optimal action and the reward of the action you actually took (multi-armed bandit theory)
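
for example (with made-up numbers): if the optimal arm pays 1.0 on average and the arm you pulled pays 0.7 on average, the expected regret of that choice is 1.0 - 0.7 = 0.3; summed over all time steps this gives the total regret, which a strategy aiming for minimal regret tries to keep as small as possible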