W2 MDP & Tabular Value-based Flashcards
What is the 5-tuple of a Markov Decision Process?
(S, A, T_a, R_a, gamma)
S: state space
A: action space
T_a: transition function of the environment
R_a: reward function
gamma: discount factor that weighs future rewards relative to immediate rewards.
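A minimal sketch of how this 5-tuple could be stored for a small tabular problem; the class and field names below are illustrative, not from any particular library.

from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    n_states: int      # |S|
    n_actions: int     # |A|
    T: np.ndarray      # T[a, s, s'] = probability of reaching s' after action a in state s
    R: np.ndarray      # R[a, s] = expected reward for taking action a in state s
    gamma: float       # discount factor in [0, 1)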
Do value-based and policy-based methods work well for discrete or continuous action spaces?
Value-based: Discrete
Policy-based: Discrete & Continuous
In a tree, in which direction do action selection and reward learning go?
Actions: downward to the leaves
Rewards: upward, backpropagation to the root
What is a sequential decision problem?
the agent has to make a sequence of decisions in order to solve a problem
What is the Markov property?
the next state depends only on the current state and the actions available in it (no influence of historical memory of previous states)
What is a policy pi(a|s)?
a conditional probability distribution that for each possible state specifies the probability of each possible action.
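As an illustration (hypothetical names, not from the source), a tabular stochastic policy can be stored as a |S| x |A| matrix of probabilities, and acting means sampling from the row of the current state.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform policy: pi(a|s) = 1/|A|

def sample_action(pi, s):
    # draw action a with probability pi(a|s)
    return rng.choice(n_actions, p=pi[s])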
What's on-policy?
The learning takes place by consistently backing up the value of the selected action into the same behavior policy function that was used to select that action.
SARSA is on-policy
What's off-policy?
the learning takes place by backing up values of another action, not the one selected by the behavior policy
Q-learning is off-policy and it is greedy: backup the value of the best action
Convergence of off-policy learning can be slower, since older, non-current values are used.
The behavior policy and the target policy are different in Q-learning, but they are the same in SARSA.
SARSA update formula? (α = learning rate, γ = discount factor)
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ·Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
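A minimal sketch of a single tabular SARSA update, assuming Q is a NumPy array of shape [n_states, n_actions]; the function name and default hyperparameters are illustrative.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy backup: uses the action a_next that the behavior policy actually selected
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# example use: Q = np.zeros((n_states, n_actions)), then call sarsa_update once per step of an episode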
Q-learning update formula?
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
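The corresponding off-policy sketch (same assumptions as the SARSA sketch above): the backup uses the greedy max over next actions rather than the action the behavior policy selected.

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy backup: uses the value of the best next action, regardless of which action is executed next
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])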
In reinforcement learning the agent can choose which training examples are generated. Why is this beneficial? What is a potential problem?
We can generate an endless dataset ourselves through simulation. On the other hand, we don't have "gold standard" actions for a given state; nothing is labeled, so we have to derive the correct policy ourselves.
What is Grid world?
Grid worlds are among the simplest environments: a rectangular grid of squares with a start square and a goal square.
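A tiny sketch of a deterministic grid-world step function under assumed conventions (3x4 grid, goal in the top-right corner, reward 1 on reaching the goal); the layout and names are hypothetical.

N_ROWS, N_COLS = 3, 4
GOAL = (0, 3)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # 0=up, 1=right, 2=down, 3=left

def step(state, action):
    # moves that would leave the grid keep the agent in place
    r, c = state
    dr, dc = MOVES[action]
    r2 = min(max(r + dr, 0), N_ROWS - 1)
    c2 = min(max(c + dc, 0), N_COLS - 1)
    reward = 1.0 if (r2, c2) == GOAL else 0.0
    return (r2, c2), reward, (r2, c2) == GOAL   # next state, reward, done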
In a tree diagram, is successor selection of behavior up or down?
In a tree diagram, is learning values through backpropagation up or down?
Selection: down
Learning: up
What is τ?
What is π(s)?
What is V(s)?
What is Q(s, a)?
τ: a trace, a full rollout of a simulation.
π(s): the policy function; it answers the question of how the different actions a in state s should be chosen.
V(s): the expected cumulative discounted future reward of a state.
Q(s, a): the Q-value estimate, the estimated value of taking action a in state s.
What is dynamic programming?
In the context of RL, dynamic programming methods recursively traverse the state space, backing up the values of successor states to the current state. An example algorithm is value iteration.
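A compact value-iteration sketch for a tabular MDP, assuming a transition tensor T[a, s, s'] and reward array R[a, s] as in the 5-tuple card above; the names, shapes, and defaults are assumptions for illustration.

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    # T: [n_actions, n_states, n_states], R: [n_actions, n_states]
    n_states = T.shape[1]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: V(s) = max_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V(s') ]
        Q = R + gamma * (T @ V)      # shape [n_actions, n_states]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new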