C2 Flashcards
what is the goal of reinforcement learning?
to find the sequence of actions that yields the highest reward or, more formally, to find the optimal policy that specifies the best action to take in each state
generally, the objective is to achieve the highest possible average return from the start state
find the optimal policy pi* = argmax_pi V^pi(s_0), i.e. the policy that maximizes the value of the start state s_0
what is a sequential decision problem?
the agent has to make a sequence of decisions in order to solve a problem
what is the Markov property?
the next state depends only on the current state and the action taken in it (the history of earlier states has no influence)
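in symbols (a standard formulation, not a quote from the book):
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)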
how is a Markov decision process defined for reinforcement learning?
a 5-tuple (a minimal code sketch follows after this list):
- S: a finite set of legal states in the env
- A: a finite set of actions
- T_a(s, s’): the probability that action a in state s at time t will transition to state s’ at time t+1 in the env (internal to the env, the agent does not know this)
- R_a(s, s’): the reward received after action a transitions state s into state s’
- gamma: the discount factor representing the difference between future and present rewards
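a minimal sketch of such a 5-tuple in Python, for a made-up two-state, two-action example (the states, actions, probabilities and rewards are invented here purely for illustration):
    # hypothetical two-state MDP written out as the 5-tuple (S, A, T, R, gamma)
    S = ["s0", "s1"]                       # finite set of states
    A = ["left", "right"]                  # finite set of actions
    # T[a][s][s2] = probability that action a in state s leads to state s2
    T = {
        "left":  {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.8, "s1": 0.2}},
        "right": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.1, "s1": 0.9}},
    }
    # R[a][s][s2] = reward for the transition from s to s2 under action a
    R = {
        "left":  {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
        "right": {"s0": {"s0": 0.0, "s1": 5.0}, "s1": {"s0": 0.0, "s1": 0.0}},
    }
    gamma = 0.9                            # discount factor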
what is a stochastic environment?
a non-deterministic environment, where the outcome of an action depends on elements in the environment that are not known to the agent
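a minimal sketch of this from the environment's side, reusing the hypothetical T from the MDP sketch above: the env draws the next state from a probability distribution, so the same action in the same state can lead to different outcomes:
    import random

    def sample_next_state(T, s, a):
        # the env samples s' according to T_a(s, .); the agent never sees T
        next_states = list(T[a][s].keys())
        probs = list(T[a][s].values())
        return random.choices(next_states, weights=probs, k=1)[0]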
transition function in model-free RL
it is implicit in the solution algorithm: the env has access to the transition function and uses it to compute the next state, but the agent does not
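a minimal sketch of what the agent does see, assuming the gymnasium package and its CartPole-v1 environment (any environment would do): the agent only calls step and receives a next state and a reward, while the transition function stays hidden inside the env:
    import gymnasium as gym

    env = gym.make("CartPole-v1")    # the transition function lives inside env
    obs, info = env.reset(seed=0)
    done = False
    while not done:
        action = env.action_space.sample()    # stand-in for a real policy
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated        # agent only ever sees obs and reward
    env.close()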
how are action selections and rewards propagated through the tree?
action selections are propagated downwards, rewards are propagated to parent states upwards
what is a return?
the cumulative reward of a full sequence of states and actions (an episode)
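written out with the discount factor gamma (standard definition, indexing may differ slightly from the book):
R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... = sum_{k=0..inf} gamma^k * r_{t+k}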
what is the value function V^pi (s)?
the expected cumulative discounted future reward of a state (where actions are chosen according to policy pi)
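as a formula, consistent with the definition of the return above:
V^pi(s) = E_pi[ sum_{k=0..inf} gamma^k * r_{t+k} | s_t = s ]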
what is the policy?
a conditional probability distribution that for each possible state specifies the probability of each possible action, so a mapping from the state space to a probability distribution over the action space:
pi: S -> p(A)
what is the state value of a terminal state?
it is by definition 0. the same goes for the state-action value of a terminal state.
what is the state-action value Q?
the estimated average return we expect to achieve when taking action a in state s and following policy pi afterwards
maps every state-action pair to a real number:
Q: S x A -> R
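as a formula, analogous to V^pi but conditioned on the first action as well:
Q^pi(s, a) = E_pi[ sum_{k=0..inf} gamma^k * r_{t+k} | s_t = s, a_t = a ]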
what is a potential benefit of using the state-action values Q instead of the state values V?
Q-values directly tell us what every action is worth; from the optimal Q-function we can obtain the optimal policy directly, by choosing in each state the action with the highest Q-value
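a minimal sketch of that last step, assuming a tabular Q-function stored as a nested dict Q[s][a] (hypothetical layout):
    def greedy_policy(Q, s):
        # optimal action read off directly from the Q-function: pi*(s) = argmax_a Q*(s, a)
        return max(Q[s], key=Q[s].get)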
Bellman equation
see book equation 2.7
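a common form of the Bellman equation for V^pi, in the notation of the MDP definition above (notation may differ slightly from the book's equation 2.7):
V^pi(s) = sum_a pi(a|s) * sum_{s'} T_a(s, s') * [ R_a(s, s') + gamma * V^pi(s') ]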
when do we apply model-free learning?
when the exact transition probabilities are not known to the agent, so the agent must be able to compute the policy without them. The role of the transition function is replaced by an iterative sequence of environment samples
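a minimal sketch of this idea as tabular Q-learning, driven purely by sampled transitions (s, a, r, s') from a gymnasium-style env as in the interaction loop above; it assumes hashable (e.g. discrete) states, and the learning rate alpha and the epsilon-greedy exploration are standard choices, not prescribed by the definition:
    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q-table Q[s][a], filled in from samples; no access to T_a(s, s') is needed
        Q = defaultdict(lambda: defaultdict(float))
        for _ in range(episodes):
            s, _ = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection
                if random.random() < epsilon or not Q[s]:
                    a = env.action_space.sample()
                else:
                    a = max(Q[s], key=Q[s].get)
                s_next, r, terminated, truncated, _ = env.step(a)
                # bootstrap target from the sampled next state (0 at terminal states)
                best_next = 0.0 if terminated else max(Q[s_next].values(), default=0.0)
                Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
                s, done = s_next, terminated or truncated
        return Q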