Reinforcement Learning Flashcards
What is a trajectory?
A trajectory τ is a sequence of states and actions in an environment.
τ = (s0, a0, s1, a1, …)
What are other names for a trajectory?
An episode or a rollout.
What is a reward function?
The reward function R of an environment measures how good a state-action pair is:
rt = R(st, at)
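A minimal sketch of how a trajectory and its per-step rewards rt = R(st, at) might be collected. The environment here (a walk on the integers), the names `step`, `reward_fn`, `rollout`, and `toward_origin`, and the reward values are all hypothetical, chosen only for illustration.

```python
def step(state, action):
    """Hypothetical deterministic transition: move left (-1) or right (+1)."""
    return state + action


def reward_fn(state, action):
    """Hypothetical reward R(s_t, a_t): +1 if the move gets closer to 0, else -1."""
    return 1.0 if abs(state + action) < abs(state) else -1.0


def rollout(policy, s0, horizon):
    """Collect (s_t, a_t, r_t) tuples for `horizon` steps starting from s0."""
    trajectory, state = [], s0
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action, reward_fn(state, action)))
        state = step(state, action)
    return trajectory


def toward_origin(state):
    """Hypothetical policy: step left when above 0, otherwise step right."""
    return -1 if state > 0 else 1


traj = rollout(toward_origin, s0=2, horizon=3)
print(traj)  # [(2, -1, 1.0), (1, -1, 1.0), (0, 1, -1.0)]
```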
What is the return of a trajectory?
The return R(τ) is the cumulative reward collected along the trajectory.
What is the finite horizon undiscounted sum method of calculating return?
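Finite horizon undiscounted sum of rewards: the sum of the rewards collected over a fixed window of steps, up to time T:
R(τ) = r0 + r1 + … + rT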
What is the infinite horizon discounted sum method of calculating return?
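Infinite horizon discounted sum of rewards: the sum of all rewards ever collected, each discounted by how far in the future it is obtained, with discount factor γ between 0 and 1:
R(τ) = r0 + γ r1 + γ² r2 + …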
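A small sketch of both return calculations, assuming the rewards of a trajectory are available as a Python list. The function names and the sample rewards are illustrative. An infinite-horizon sum cannot be computed directly, so the discounted version below is applied to a finite list of observed rewards, which only approximates the infinite sum.

```python
def finite_horizon_return(rewards):
    """Undiscounted return: the plain sum of the rewards in a finite trajectory."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted return: the sum of gamma**t * r_t over the given rewards."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r0, r1, r2
print(finite_horizon_return(rewards))   # 3.0
print(discounted_return(rewards, 0.5))  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

With γ = 1 the discounted version reduces to the undiscounted sum, which is one way to check the two against each other.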
What is a policy?
A policy π is a rule for selecting actions. It can be stochastic or deterministic.
What is a stochastic policy?
A stochastic policy gives a probability distribution over actions, and actions are selected randomly based on that distribution.
What is a deterministic policy?
A deterministic policy maps each state directly to an action: at = π(st).
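The difference between the two kinds of policy can be sketched as follows. The two-action set, the state convention (a signed number), and the probabilities are assumptions made up for this example.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set


def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return "right" if state >= 0 else "left"


def stochastic_policy(state, rng=random):
    """Samples an action from a state-dependent probability distribution."""
    p_right = 0.9 if state >= 0 else 0.1  # assumed probabilities
    return rng.choices(ACTIONS, weights=[1.0 - p_right, p_right], k=1)[0]


print(deterministic_policy(3))  # right
print(stochastic_policy(3))     # usually "right", occasionally "left"
```

Calling the deterministic policy twice on the same state always gives the same action, while repeated calls to the stochastic policy can differ.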
What is the goal of reinforcement learning?
To learn a policy which maximizes expected return.
What is the optimal policy π*?
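The optimal policy π* is the policy that maximizes expected return:
π* = argmax over π of E[R(τ)], where the trajectory τ is generated by following π.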
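As a rough illustration of "the policy with the highest expected return", the sketch below estimates expected return by averaging sampled returns, then picks the best of three candidate policies. The task (reward +1 each time the action "right" is chosen), the candidate policies, and all names are hypothetical. Real problems have far too many policies to enumerate this way.

```python
import random


def sample_return(p_right, rng, horizon=10):
    """Return of one rollout in a hypothetical task: +1 per 'right', 0 per 'left'."""
    return sum(1.0 for _ in range(horizon) if rng.random() < p_right)


def estimate_expected_return(p_right, rng, n_rollouts=500):
    """Monte Carlo estimate of E[R(tau)]: the average of sampled returns."""
    returns = [sample_return(p_right, rng) for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts


rng = random.Random(0)
# Each candidate policy is described by its probability of choosing "right".
candidates = {"mostly_left": 0.1, "uniform": 0.5, "mostly_right": 0.9}
scores = {name: estimate_expected_return(p, rng) for name, p in candidates.items()}
best = max(scores, key=scores.get)  # argmax over the candidate policies
print(best)  # mostly_right
```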
What are the two main approaches for solving the optimal policy problem?
- Policy Optimization
- Q-Learning